rocksdict
Abstract
This package enables users to store, query, and delete a large number of key-value pairs on disk.
This is especially useful when the data cannot fit into RAM. If you have hundreds of GBs or many TBs of key-value data to store and query from, this is the package for you.
Installation
This package is built for macOS (amd64/arm64), Windows, and Linux (amd64/arm64).
It can be installed from PyPI with pip install rocksdict.
Introduction
Below is a code example that shows how to do the following:
- Create Rdict
- Store something on disk
- Close Rdict
- Open Rdict again
- Check Rdict elements
- Iterate from Rdict
- Batch get
- Delete storage
Examples:
::
    from rocksdict import Rdict, Options

    path = str("./test_dict")

    # create a Rdict with default options at `path`
    db = Rdict(path)

    # storing numbers
    db[1.0] = 1
    db[1] = 1.0
    db["huge integer"] = 2343546543243564534233536434567543
    db["good"] = True
    db["bad"] = False
    db["bytes"] = b"bytes"
    db["this is a list"] = [1, 2, 3]
    db["store a dict"] = {0: 1}

    # for example numpy array
    import numpy as np
    import pandas as pd
    db[b"numpy"] = np.array([1, 2, 3])
    db["a table"] = pd.DataFrame({"a": [1, 2], "b": [2, 1]})

    # close Rdict
    db.close()

    # reopen Rdict from disk
    db = Rdict(path)
    assert db[1.0] == 1
    assert db[1] == 1.0
    assert db["huge integer"] == 2343546543243564534233536434567543
    assert db["good"] == True
    assert db["bad"] == False
    assert db["bytes"] == b"bytes"
    assert db["this is a list"] == [1, 2, 3]
    assert db["store a dict"] == {0: 1}
    assert np.all(db[b"numpy"] == np.array([1, 2, 3]))
    assert np.all(db["a table"] == pd.DataFrame({"a": [1, 2], "b": [2, 1]}))

    # iterate through all elements
    for k, v in db.items():
        print(f"{k} -> {v}")

    # batch get:
    print(db[["good", "bad", 1.0]])  # [True, False, 1]

    # delete Rdict from disk
    db.close()
    Rdict.destroy(path)
Supported types:
- key:
int, float, bool, str, bytes
- value:
int, float, bool, str, bytes
and anything that supports pickle.
from .rocksdict import *

__doc__ = rocksdict.__doc__

__all__ = [
    "Rdict", "WriteBatch", "SstFileWriter", "AccessType", "WriteOptions",
    "Snapshot", "RdictIter", "Options", "ReadOptions", "ColumnFamily",
    "IngestExternalFileOptions", "DBPath", "MemtableFactory", "BlockBasedOptions",
    "PlainTableFactoryOptions", "CuckooTableOptions", "UniversalCompactOptions",
    "UniversalCompactionStopStyle", "SliceTransform", "DataBlockIndexType",
    "BlockBasedIndexType", "Cache", "ChecksumType", "DBCompactionStyle",
    "DBCompressionType", "DBRecoveryMode", "Env", "FifoCompactOptions",
    "CompactOptions", "BottommostLevelCompaction", "KeyEncodingType",
    "DbClosedError", "WriteBufferManager", "Checkpoint",
]

Rdict.__enter__ = lambda self: self
Rdict.__exit__ = lambda self, exc_type, exc_val, exc_tb: self.close()
A persistent on-disk dictionary. Supports str, int, float, bool, and bytes as keys and values.
Example:
::
    from rocksdict import Rdict

    db = Rdict("./test_dir")
    db[0] = 1

    db = None
    db = Rdict("./test_dir")
    assert(db[0] == 1)
Opening a DB created by another language is easy:
you don't need to manually configure Options and Column
Families. Just use db = Rdict("./db_path").
It will automatically open the db with the right Options and
Column Families for you, in RAW MODE.
Arguments:
- path (str): path to the database
- options (Options): Options object
- column_families (dict): (name, options) pairs; these Options must have the same raw_mode argument as the main Options. A column family called 'default' is always created.
- access_type (AccessType): there are four access types: ReadWrite, ReadOnly, WithTTL, and Secondary; use the AccessType class to create them.
Optionally disable WAL or sync for this write.
Example:
::
    from rocksdict import Rdict, Options, WriteBatch, WriteOptions

    path = "_path_for_rocksdb_storageY1"
    db = Rdict(path)

    # set write options
    write_options = WriteOptions()
    write_options.set_sync(False)
    write_options.disable_wal(True)
    db.set_write_options(write_options)

    # write to db
    db["my key"] = "my value"
    db["key2"] = "value2"
    db["key3"] = "value3"

    # remove db
    del db
    Rdict.destroy(path)
Get value from key or a list of keys.
Arguments:
- key: a single key or list of keys.
- default: the default value to return if key not found.
- read_opt: override preset read options (or use Rdict.set_read_options to preset the read options used by default).
Returns:
The value (or a list of values for a list of keys); None or the default value if the key does not exist.
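As a hedged illustration of these arguments (the positional default and the batch form are assumed from the description above):
::
    from rocksdict import Rdict

    db = Rdict("./test_dict")
    db["a"] = 1
    db["b"] = 2

    assert db.get("a") == 1              # single key
    assert db.get("missing") is None     # missing key -> None
    assert db.get("missing", 0) == 0     # missing key -> default
    assert db.get(["a", "b"]) == [1, 2]  # list of keys -> list of values

    db.close()
    Rdict.destroy("./test_dict")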
Get a wide-column from a key.
Arguments:
- key: a single key or list of keys.
- default: the default value to return if key not found.
- read_opt: override preset read options (or use Rdict.set_read_options to preset the read options used by default).
Returns:
A list of (name, value) tuples. If the value is not an entity, returns a single column with the default column name (empty bytes/string). Returns None or the default value if the key does not exist.
Insert key value into database.
Arguments:
- key: the key.
- value: the value.
- write_opt: override preset write options (or use Rdict.set_write_options to preset the write options used by default).
Insert a wide-column.
The length of names and values must be the same.
Arguments:
- key: the key.
- names: the names of the columns.
- values: the values of the columns.
- write_opt: override preset write options (or use Rdict.set_write_options to preset the write options used by default).
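A minimal sketch of writing and reading a wide-column entity, assuming the keyword names names/values from the argument list above and the return shape described for get_entity:
::
    from rocksdict import Rdict

    db = Rdict("./test_dict")

    # store one key with two named columns
    db.put_entity("user:1", names=["name", "age"], values=["alice", 30])

    # read it back as a list of (name, value) tuples
    print(db.get_entity("user:1"))

    # a plain value comes back as a single column with the default (empty) name
    db["plain"] = 1
    print(db.get_entity("plain"))

    db.close()
    Rdict.destroy("./test_dict")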
Check if a key may exist without doing any IO.
Notes:
If the key definitely does not exist in the database, then this method returns False, else True. If the caller wants to obtain value when the key is found in memory, fetch should be set to True. This check is potentially lighter-weight than invoking DB::get(). One way to make this lighter weight is to avoid doing any IOs.
The API follows the following principle:
- True, and value found => the key must exist.
- True => the key may or may not exist.
- False => the key definitely does not exist.
Flip it around:
- key exists => must return True, but value may or may not be found.
- key doesn't exist => might still return True.
Arguments:
- key: Key to check
- read_opt: ReadOptions
Returns:
If fetch = False, returning True implies that the key may exist, and returning False implies that the key definitely does not exist.
If fetch = True, returning (True, value) implies that the key is found and definitely exists, returning (False, None) implies that the key definitely does not exist, and returning (True, None) implies that the key may exist.
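A hedged sketch of the two modes described above (the fetch keyword is assumed from the notes):
::
    from rocksdict import Rdict

    db = Rdict("./test_dict")
    db["present"] = 1

    # cheap negative check: False means the key definitely does not exist
    if not db.key_may_exist("missing"):
        print("definitely not in the database")

    # with fetch=True, the value is also returned when it is found in memory
    found, value = db.key_may_exist("present", fetch=True)
    if found and value is not None:
        print(f"found without extra IO: {value}")

    db.close()
    Rdict.destroy("./test_dict")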
Delete entry from the database.
Arguments:
- key: the key.
- write_opt: override preset write options (or use Rdict.set_write_options to preset the write options used by default).
Reversible for iterating over keys and values.
Examples:
::
    from rocksdict import Rdict, Options, ReadOptions

    path = "_path_for_rocksdb_storage5"
    db = Rdict(path)

    for i in range(50):
        db[i] = i ** 2

    iter = db.iter()
    iter.seek_to_first()

    j = 0
    while iter.valid():
        assert iter.key() == j
        assert iter.value() == j ** 2
        print(f"{iter.key()} {iter.value()}")
        iter.next()
        j += 1

    iter.seek_to_first()
    assert iter.key() == 0
    assert iter.value() == 0
    print(f"{iter.key()} {iter.value()}")

    iter.seek(25)
    assert iter.key() == 25
    assert iter.value() == 625
    print(f"{iter.key()} {iter.value()}")

    del iter, db
    Rdict.destroy(path)
Arguments:
- read_opt: ReadOptions
Returns: Reversible
Iterate through all keys and values pairs.
Examples:
::
    for k, v in db.items():
        print(f"{k} -> {v}")
Arguments:
- backwards: iteration direction, forward if False.
- from_key: iterate from key; first seek to this key or the nearest next key for iteration (depending on iteration direction).
- read_opt: ReadOptions
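A hedged sketch of the backwards and from_key arguments described above:
::
    from rocksdict import Rdict

    db = Rdict("./test_dict")
    for i in range(10):
        db[i] = i * i

    # iterate over all pairs in reverse order
    for k, v in db.items(backwards=True):
        print(f"{k} -> {v}")

    # start from key 5 (or the nearest next key)
    for k, v in db.items(from_key=5):
        print(f"{k} -> {v}")

    db.close()
    Rdict.destroy("./test_dict")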
Iterate through all keys.
Examples:
::
all_keys = [k for k in db.keys()]
Arguments:
- backwards: iteration direction, forward if False.
- from_key: iterate from key; first seek to this key or the nearest next key for iteration (depending on iteration direction).
- read_opt: ReadOptions
Iterate through all values.
Examples:
::
all_values = [v for v in db.values()]
Arguments:
- backwards: iteration direction, forward if False.
- from_key: iterate from key; first seek to this key or the nearest next key for iteration (depending on iteration direction).
- read_opt: ReadOptions, must have the same raw_mode argument.
Iterate through all values as wide-columns.
Examples:
::
all_entities = [v for v in db.columns()]
Arguments:
- backwards: iteration direction, forward if False.
- from_key: iterate from key; first seek to this key or the nearest next key for iteration (depending on iteration direction).
- read_opt: ReadOptions, must have the same raw_mode argument.
Iterate through all keys and entities pairs.
Examples:
::
    for k, v in db.entities():
        print(f"{k} -> {v}")
Arguments:
- backwards: iteration direction, forward if False.
- from_key: iterate from key; first seek to this key or the nearest next key for iteration (depending on iteration direction).
- read_opt: ReadOptions
Manually flush the current column family.
Notes:
Manually call mem-table flush. It is recommended to call flush() or close() before stopping the python program, to ensure that all written key-value pairs have been flushed to the disk.
Arguments:
- wait (bool): whether to wait for the flush to finish.
Flushes the WAL buffer. If sync is set to true, also syncs the data to disk.
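A minimal sketch of flushing before shutdown, assuming the wait and sync parameters documented above:
::
    from rocksdict import Rdict

    db = Rdict("./test_dict")
    db["key"] = "value"

    # persist the memtable before the program stops
    db.flush(wait=True)

    # flush the WAL buffer and sync it to disk
    db.flush_wal(sync=True)

    db.close()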
Creates column family with given name and options.
Arguments:
- name: name of this column family
- options: Rdict Options for this column family
Return:
the newly created column family
Get a column family Rdict
Arguments:
- name: name of this column family
- options: Rdict Options for this column family
Return:
the column family Rdict of this name
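A hedged sketch of creating and reopening a column family with these two methods (exact signatures assumed; the Options raw_mode must match the main Options as noted above):
::
    from rocksdict import Rdict, Options

    db = Rdict("./test_dict")

    # create a new column family and use it like a regular Rdict
    cf = db.create_column_family("logs", Options())
    cf[0] = "hello"

    # later, fetch the same column family by name
    cf_again = db.get_column_family("logs")
    assert cf_again[0] == "hello"

    # drop all associated instances to actually shut down RocksDB
    del cf, cf_again
    db.close()
    Rdict.destroy("./test_dict")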
Use this method to obtain a ColumnFamily instance, which can be used in WriteBatch.
The name of the default column family is "default".
Example:
::
    wb = WriteBatch()
    for i in range(100):
        wb.put(i, i**2, db.get_column_family_handle(cf_name_1))
    db.write(wb)

    wb = WriteBatch()
    wb.set_default_column_family(db.get_column_family_handle(cf_name_2))
    for i in range(100, 200):
        wb[i] = i**2
    db.write(wb)
A snapshot of the current column family.
Examples:
::
    from rocksdict import Rdict

    db = Rdict("tmp")
    for i in range(100):
        db[i] = i

    # take a snapshot
    snapshot = db.snapshot()

    for i in range(90):
        del db[i]

    # 0-89 are no longer in db
    for k, v in db.items():
        print(f"{k} -> {v}")

    # but they are still in the snapshot
    for i in range(100):
        assert snapshot[i] == i

    # drop the snapshot
    del snapshot, db
    Rdict.destroy("tmp")
Loads a list of external SST files created with SstFileWriter into the current column family.
Arguments:
- paths: a list of paths
- opts: IngestExternalFileOptions instance
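A hedged sketch of building an SST file with SstFileWriter and ingesting it (keys must be written in ascending order; exact signatures assumed):
::
    from rocksdict import Rdict, SstFileWriter

    # write keys in ascending order into an external SST file
    writer = SstFileWriter()
    writer.open("./file1.sst")
    for i in range(1000):
        writer[i] = i ** 2
    writer.finish()

    # ingest the file into an existing database
    db = Rdict("./test_dict")
    db.ingest_external_file(["./file1.sst"])
    assert db[999] == 999 ** 2
    db.close()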
Tries to catch up with the primary by reading as much as possible from the log files.
Request stopping background work; if wait is true, wait until it's done.
WriteBatch
Notes:
This WriteBatch does not write to the current column family.
Arguments:
- write_batch: WriteBatch instance. This instance will be consumed.
- write_opt: use default value if not provided.
Removes the database entries in the range ["from", "to") of the current column family.
Arguments:
- begin: included
- end: excluded
- write_opt: WriteOptions
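A hedged sketch of the half-open range semantics described above:
::
    from rocksdict import Rdict

    db = Rdict("./test_dict")
    for i in range(100):
        db[i] = i

    # remove keys 0..49; the end key 50 is excluded
    db.delete_range(0, 50)

    assert db.get(0) is None
    assert db.get(50) == 50

    db.close()
    Rdict.destroy("./test_dict")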
Flush memory to disk, and drop the current column family.
Notes:
Calling db.close() is nearly equivalent to first calling db.flush() and then del db. However, db.close() does not guarantee that the underlying RocksDB is actually closed. Other column family Rdict instances, ColumnFamily (cf handle) instances, and iterator instances such as RdictIter, RdictItems, RdictKeys, and RdictValues can all keep RocksDB alive. del or close all associated instances mentioned above to actually shut down RocksDB.
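Because the package assigns __enter__/__exit__ on Rdict (see the __init__ shown earlier), the database can also be closed automatically with a with block; a minimal sketch:
::
    from rocksdict import Rdict

    with Rdict("./test_dict") as db:
        db["key"] = "value"
    # __exit__ has called db.close() here

    Rdict.destroy("./test_dict")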
Runs a manual compaction on the Range of keys given for the current Column Family.
Retrieves a RocksDB property by name, for the current column family.
Retrieves a RocksDB property and casts it to an integer (for the current column family).
A full list of properties that return int values can be found here.
Delete the database.
Arguments:
- path (str): path to this database
- options (rocksdict.Options): Rocksdb options object
Repair the database.
Arguments:
- path (str): path to this database
- options (rocksdict.Options): Rocksdb options object
WriteBatch class. Use db.write() to ingest WriteBatch.
Notes:
A WriteBatch instance can only be ingested once, otherwise an Exception will be raised.
Arguments:
- raw_mode (bool): make sure that this is consistent with the Rdict.
Set the default column family for the a[i] = j and del a[i] syntax.
You can also use put(key, value, column_family) to explicitly choose the column family.
Arguments:
- column_family (ColumnFamily | None): column family descriptor or None (for the default family).
Insert a value into the database under the given key.
Arguments:
- column_family: override the default column family set by set_default_column_family
Insert a wide-column.
The length of names and values must be the same.
Arguments:
- key: the key.
- names: the names of the columns.
- values: the values of the columns.
Removes the database entry for key. Does nothing if the key was not found.
Arguments:
- column_family: override the default column family set by set_default_column_family
Remove database entries in column family from start key to end key.
Notes:
Removes the database entries in the range ["begin_key", "end_key"), i.e., including "begin_key" and excluding "end_key". It is not an error if no keys exist in the range ["begin_key", "end_key").
Arguments:
- begin: begin key
- end: end key
- column_family: override the default column family set by set_default_column_family
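A hedged sketch combining put, delete, and delete_range in a single batch (the default column family name "default" is taken from the note above):
::
    from rocksdict import Rdict, WriteBatch

    db = Rdict("./test_dict")

    wb = WriteBatch()
    wb.set_default_column_family(db.get_column_family_handle("default"))
    for i in range(100):
        wb[i] = i ** 2
    wb.delete(0)             # remove a single key
    wb.delete_range(10, 20)  # remove keys 10..19 (20 excluded)
    db.write(wb)             # a WriteBatch can only be ingested once

    db.close()
    Rdict.destroy("./test_dict")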
SstFileWriter is used to create SST files that can be added to the database later. All keys in files generated by SstFileWriter will have sequence number = 0.
Arguments:
- options: these Options must have the same raw_mode as the Rdict DB.
Define DB Access Types.
Notes:
There are four access types:
- ReadWrite: default value
- ReadOnly
- WithTTL
- Secondary
Examples:
::
    from rocksdict import Rdict, AccessType

    # open with 24 hours ttl
    db = Rdict("./main_path", access_type=AccessType.with_ttl(24 * 3600))

    # open as read_only
    db = Rdict("./main_path", access_type=AccessType.read_only())

    # open as secondary
    db = Rdict("./main_path", access_type=AccessType.secondary("./secondary_path"))
Optionally disable WAL or sync for this write.
Example:
::
    from rocksdict import Rdict, Options, WriteBatch, WriteOptions

    path = "_path_for_rocksdb_storageY1"
    db = Rdict(path, Options())

    # set write options
    write_options = WriteOptions()
    write_options.set_sync(False)
    write_options.disable_wal(True)
    db.set_write_options(write_options)

    # write to db
    db["my key"] = "my value"
    db["key2"] = "value2"
    db["key3"] = "value3"

    # remove db
    del db
    Rdict.destroy(path, Options())
If true, this write request is of lower priority if compaction is behind. In that case, if no_slowdown is also true, the request will be cancelled immediately with Status::Incomplete() returned; otherwise, it will be slowed down. The slowdown value is determined by RocksDB to guarantee that it introduces minimum impact on high-priority writes.
Default: false
Sets the sync mode. If true, the write will be flushed from the operating system buffer cache before the write is considered complete. If this flag is true, writes will be slower.
Default: false
If true and we need to wait or sleep for the write request, fails immediately with Status::Incomplete().
Default: false
If true, writebatch will maintain the last insert positions of each memtable as hints in concurrent write. It can improve write performance in concurrent writes if keys in one writebatch are sequential. In non-concurrent writes (when concurrent_memtable_writes is false) this option will be ignored.
Default: false
A consistent view of the database at the point of creation.
Examples:
::
    from rocksdict import Rdict

    db = Rdict("tmp")
    for i in range(100):
        db[i] = i

    # take a snapshot
    snapshot = db.snapshot()

    for i in range(90):
        del db[i]

    # 0-89 are no longer in db
    for k, v in db.items():
        print(f"{k} -> {v}")

    # but they are still in the snapshot
    for i in range(100):
        assert snapshot[i] == i

    # drop the snapshot
    del snapshot, db
    Rdict.destroy("tmp")
Creates an iterator over the data in this snapshot under the given column family, using the default read options.
Arguments:
- read_opt: ReadOptions, must have the same raw_mode argument.
Iterate through all keys and values pairs.
Arguments:
- backwards: iteration direction, forward if False.
- from_key: iterate from key; first seek to this key or the nearest next key for iteration (depending on iteration direction).
- read_opt: ReadOptions, must have the same raw_mode argument.
Iterate through all keys.
Arguments:
- backwards: iteration direction, forward if False.
- from_key: iterate from key; first seek to this key or the nearest next key for iteration (depending on iteration direction).
- read_opt: ReadOptions, must have the same raw_mode argument.
Iterate through all values.
Arguments:
- backwards: iteration direction, forward if False.
- from_key: iterate from key; first seek to this key or the nearest next key for iteration (depending on iteration direction).
- read_opt: ReadOptions, must have the same raw_mode argument.
Returns true if the iterator is valid. An iterator is invalidated when it reaches the end of its defined range, or when it encounters an error.
To check whether the iterator encountered an error after valid has returned false, use the status method. status will never return an error when valid is true.
Returns an error Result if the iterator has encountered an error during operation. When an error is encountered, the iterator is invalidated and valid will return false when called.
Performing a seek will discard the current status.
Seeks to the first key in the database.
Example:
::
    from rocksdict import Rdict, Options, ReadOptions

    path = "_path_for_rocksdb_storage5"
    db = Rdict(path, Options())
    iter = db.iter(ReadOptions())

    # Iterate all keys from the start in lexicographic order
    iter.seek_to_first()
    while iter.valid():
        print(f"{iter.key()} {iter.value()}")
        iter.next()

    # Read just the first key
    iter.seek_to_first()
    print(f"{iter.key()} {iter.value()}")

    del iter, db
    Rdict.destroy(path, Options())
Seeks to the last key in the database.
Example:
::
    from rocksdict import Rdict, Options, ReadOptions

    path = "_path_for_rocksdb_storage6"
    db = Rdict(path, Options())
    iter = db.iter(ReadOptions())

    # Iterate all keys from the end in reverse lexicographic order
    iter.seek_to_last()
    while iter.valid():
        print(f"{iter.key()} {iter.value()}")
        iter.prev()

    # Read just the last key
    iter.seek_to_last()
    print(f"{iter.key()} {iter.value()}")

    del iter, db
    Rdict.destroy(path, Options())
Seeks to the specified key or the first key that lexicographically follows it.
This method will attempt to seek to the specified key. If that key does not exist, it will find and seek to the key that lexicographically follows it instead.
Example:
::
    from rocksdict import Rdict, Options, ReadOptions

    path = "_path_for_rocksdb_storage6"
    db = Rdict(path, Options())
    iter = db.iter(ReadOptions())

    # Read the first string key that starts with 'a'
    iter.seek("a")
    print(f"{iter.key()} {iter.value()}")

    del iter, db
    Rdict.destroy(path, Options())
Seeks to the specified key, or the first key that lexicographically precedes it.
Like .seek(), this method will attempt to seek to the specified key.
The difference with .seek() is that if the specified key does not exist, this method will seek to the key that lexicographically precedes it instead.
Example:
::
    from rocksdict import Rdict, Options, ReadOptions

    path = "_path_for_rocksdb_storage6"
    db = Rdict(path, Options())
    iter = db.iter(ReadOptions())

    # Read the last key that starts with 'a'
    iter.seek_for_prev("b")
    print(f"{iter.key()} {iter.value()}")

    del iter, db
    Rdict.destroy(path, Options())
Database-wide options around performance and behavior.
Please read the official tuning guide and, most importantly, measure performance under realistic workloads with realistic hardware.
Example:
::
    from rocksdict import Options, Rdict, DBCompactionStyle

    def badly_tuned_for_somebody_elses_disk():
        path = "path/for/rocksdb/storageX"
        opts = Options()
        opts.create_if_missing(True)
        opts.set_max_open_files(10000)
        opts.set_use_fsync(False)
        opts.set_bytes_per_sync(8388608)
        opts.optimize_for_point_lookup(1024)
        opts.set_table_cache_num_shard_bits(6)
        opts.set_max_write_buffer_number(32)
        opts.set_write_buffer_size(536870912)
        opts.set_target_file_size_base(1073741824)
        opts.set_min_write_buffer_number_to_merge(4)
        opts.set_level_zero_stop_writes_trigger(2000)
        opts.set_level_zero_slowdown_writes_trigger(0)
        opts.set_compaction_style(DBCompactionStyle.universal())
        opts.set_disable_auto_compactions(True)

        return Rdict(path, opts)
Arguments:
- raw_mode (bool): set this to True to operate in raw mode (i.e. it will only allow bytes as key-value pairs, and is compatible with other RocksDB databases).
Load latest options from the rocksdb path.
Returns a tuple, where the first item is Options and the second item is a Dict of column families.
By default, RocksDB uses only one background thread for flush and compaction. Calling this function will set it up such that a total of total_threads is used. A good value for total_threads is the number of cores. You almost definitely want to call this function if your system is bottlenecked by RocksDB.
Optimize level style compaction.
Default values for some parameters in Options are not optimized for heavy workloads and big datasets, which means you might observe write stalls under some conditions.
This can be used as one of the starting points for tuning RocksDB options in such cases.
Internally, it sets write_buffer_size, min_write_buffer_number_to_merge, max_write_buffer_number, level0_file_num_compaction_trigger, target_file_size_base, and max_bytes_for_level_base, so it can override those parameters if they were set before.
It sets buffer sizes so that memory consumption is constrained by memtable_memory_budget.
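A minimal sketch, assuming memtable_memory_budget is given in bytes:
::
    from rocksdict import Options, Rdict

    opts = Options()
    # budget roughly 512 MiB of memtable memory for a write-heavy workload
    opts.optimize_level_style_compaction(512 * 1024 * 1024)
    db = Rdict("./test_dict", opts)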
Optimize universal style compaction.
Default values for some parameters in Options are not optimized for heavy workloads and big datasets, which means you might observe write stalls under some conditions.
This can be used as one of the starting points for tuning RocksDB options in such cases.
Internally, it sets write_buffer_size, min_write_buffer_number_to_merge, max_write_buffer_number, level0_file_num_compaction_trigger, target_file_size_base, and max_bytes_for_level_base, so it can override those parameters if they were set before.
It sets buffer sizes so that memory consumption is constrained by memtable_memory_budget.
If true, the database will be created if it is missing.
Default: true
If true, any column families that didn't exist when opening the database will be created.
Default: false
Specifies whether an error should be raised if the database already exists.
Default: false
Enable/disable paranoid checks.
If true, the implementation will do aggressive checking of the data it is processing and will stop early if it detects any errors. This may have unforeseen ramifications: for example, a corruption of one DB entry may cause a large number of entries to become unreadable or for the entire DB to become unopenable. If any of the writes to the database fails (Put, Delete, Merge, Write), the database will switch to read-only mode and fail all other Write operations.
Default: false
A list of paths where SST files can be put into, with its target size. Newer data is placed into paths specified earlier in the vector while older data gradually moves to paths specified later in the vector.
For example, you have a flash device with 10GB allocated for the DB, as well as a hard drive of 2TB, you should config it to be: [{"/flash_path", 10GB}, {"/hard_drive", 2TB}]
The system will try to guarantee data under each path is close to but not larger than the target size. But current and future file sizes used by determining where to place a file are based on best-effort estimation, which means there is a chance that the actual size under the directory is slightly more than target size under some workloads. User should give some buffer room for those cases.
If none of the paths has sufficient room to place a file, the file will be placed to the last path anyway, despite to the target size.
Placing newer data to earlier paths is also best-efforts. User should expect user files to be placed in higher levels in some extreme cases.
If left empty, only one path will be used, which is the path passed when opening the DB.
Default: empty
Example:
::
    from rocksdict import Options, DBPath

    opt = Options()
    flash_path = DBPath("/flash_path", 10 * 1024 * 1024 * 1024)        # 10 GB
    hard_drive = DBPath("/hard_drive", 2 * 1024 * 1024 * 1024 * 1024)  # 2 TB
    opt.set_db_paths([flash_path, hard_drive])
Use the specified object to interact with the environment, e.g. to read/write files, schedule background work, etc. In the near future, support for doing storage operations such as read/write files through env will be deprecated in favor of file_system.
Sets the compression algorithm that will be used for compressing blocks.
Default: DBCompressionType::Snappy (DBCompressionType::None if the snappy feature is not enabled).
Example:
::
    from rocksdict import Options, DBCompressionType

    opts = Options()
    opts.set_compression_type(DBCompressionType.snappy())
Different levels can have different compression policies. There are cases where most lower levels would like to use quick compression algorithms while the higher levels (which have more data) use compression algorithms that have better compression but could be slower. This array, if non-empty, should have an entry for each level of the database; these override the value specified in the previous field 'compression'.
Example:
::
    from rocksdict import Options, DBCompressionType

    opts = Options()
    opts.set_compression_per_level([
        DBCompressionType.none(),
        DBCompressionType.none(),
        DBCompressionType.snappy(),
        DBCompressionType.snappy(),
        DBCompressionType.snappy(),
    ])
Maximum size of dictionaries used to prime the compression library. Enabling dictionary can improve compression ratios when there are repetitions across data blocks.
The dictionary is created by sampling the SST file data. If zstd_max_train_bytes is nonzero, the samples are passed through zstd's dictionary generator. Otherwise, the random samples are used directly as the dictionary.
When compression dictionary is disabled, we compress and write each block before buffering data for the next one. When compression dictionary is enabled, we buffer all SST file data in-memory so we can sample it, as data can only be compressed and written after the dictionary has been finalized. So users of this feature may see increased memory usage.
Default: 0
Sets the maximum size of training data passed to zstd's dictionary trainer. Using zstd's dictionary trainer can achieve even better compression ratio improvements than using max_dict_bytes alone.
The training data will be used to generate a dictionary of max_dict_bytes.
Default: 0.
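A hedged sketch, assuming these two descriptions document set_compression_options(w_bits, level, strategy, max_dict_bytes) and set_zstd_max_train_bytes:
::
    from rocksdict import Options, DBCompressionType

    opts = Options()
    opts.set_compression_type(DBCompressionType.zstd())
    # window bits, level, strategy, max_dict_bytes (16 KiB dictionary)
    opts.set_compression_options(-14, 32767, 0, 16 * 1024)
    # feed up to 100x the dictionary size to zstd's trainer
    opts.set_zstd_max_train_bytes(100 * 16 * 1024)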
If non-zero, we perform bigger reads when doing compaction. If you're running RocksDB on spinning disks, you should set this to at least 2MB. That way RocksDB's compaction is doing sequential instead of random reads.
When non-zero, we also force new_table_reader_for_compaction_inputs to true.
Default: 0
Allow RocksDB to pick dynamic base of bytes for levels. With this feature turned on, RocksDB will automatically adjust max bytes for each level. The goal of this feature is to have lower bound on size amplification.
Default: false.
Sets the optimize_filters_for_hits flag.
Default: false
Sets the periodicity when obsolete files get deleted.
The files that get out of scope by the compaction process will still get automatically deleted on every compaction, regardless of this setting.
Default: 6 hours
Prepare the DB for bulk loading.
All data will be in level 0 without any automatic compaction. It's recommended to manually call CompactRange(NULL, NULL) before reading from the database, because otherwise the read can be very slow.
Sets the number of open files that can be used by the DB. You may need to increase this if your database has a large working set. Value -1 means files opened are always kept open. You can estimate the number of files based on target_file_size_base and target_file_size_multiplier for level-based compaction. For universal-style compaction, you can usually set it to -1.
Default: -1
If max_open_files is -1, DB will open all files on DB::Open(). You can use this option to increase the number of threads used to open the files. Default: 16
If true, then every store to stable storage will issue a fsync. If false, then every store to stable storage will issue a fdatasync. This parameter should be set to true while storing data to filesystem like ext3 that can lose files after a reboot.
Default: false
Specifies the absolute info LOG dir.
If it is empty, the log files will be in the same dir as data. If it is non empty, the log files will be in the specified dir, and the db data dir's absolute path will be used as the log file name's prefix.
Default: empty
Allows OS to incrementally sync files to disk while they are being written, asynchronously, in the background. This operation can be used to smooth out write I/Os over time. Users shouldn't rely on it for persistency guarantees.
Issue one request for every bytes_per_sync written. 0 turns it off.
Default: 0
You may consider using rate_limiter to regulate write rate to device. When rate limiter is enabled, it automatically enables bytes_per_sync to 1MB.
This option applies to table files
Same as bytes_per_sync, but applies to WAL files.
Default: 0, turned off
Dynamically changeable through SetDBOptions() API.
Sets the maximum buffer size that is used by WritableFileWriter.
On Windows, we need to maintain an aligned buffer for writes. We allow the buffer to grow until its size hits the limit in buffered IO, and fix the buffer size when using direct IO to ensure alignment of write requests if the logical sector size is unusual.
Default: 1024 * 1024 (1 MB)
Dynamically changeable through SetDBOptions() API.
If true, allow multi-writers to update mem tables in parallel. Only some memtable_factory-s support concurrent writes; currently it is implemented only for SkipListFactory. Concurrent memtable writes are not compatible with inplace_update_support or filter_deletes. It is strongly recommended to set enable_write_thread_adaptive_yield if you are going to use this feature.
Default: true
If true, threads synchronizing with the write batch group leader will wait for up to write_thread_max_yield_usec before blocking on a mutex. This can substantially improve throughput for concurrent workloads, regardless of whether allow_concurrent_memtable_write is enabled.
Default: true
Specifies whether an iteration->Next() sequentially skips over keys with the same user-key or not.
This number specifies the number of keys (with the same userkey) that will be sequentially skipped before a reseek is issued.
Default: 8
Enable direct I/O mode for reading. This may or may not improve performance depending on the use case.
Files will be opened in "direct I/O" mode, which means that data read from the disk will not be cached or buffered. The hardware buffer of the devices may however still be used. Memory mapped files are not impacted by these parameters.
Default: false
Enable direct I/O mode for flush and compaction. This may or may not improve performance depending on the use case.
Files will be opened in "direct I/O" mode, which means that data written to the disk will not be cached or buffered. The hardware buffer of the devices may however still be used. Memory mapped files are not impacted by these parameters.
Default: false
Enable/disable whether child processes inherit open files.
Default: true
Sets the number of shards used for table cache.
Default: 6
By default, target_file_size_multiplier is 1, which means files in different levels will have similar sizes.
Dynamically changeable through SetOptions() API
Sets the minimum number of write buffers that will be merged together before writing to storage. If set to 1, then all write buffers are flushed to L0 as individual files; this increases read amplification because a get request has to check all of these files. Also, an in-memory merge may result in writing less data to storage if there are duplicate records in each of these individual write buffers.
Default: 1
Sets the maximum number of write buffers that are built up in memory. The default and the minimum number is 2, so that when 1 write buffer is being flushed to storage, new writes can continue to the other write buffer. If max_write_buffer_number > 3, writing will be slowed down to options.delayed_write_rate if we are writing to the last write buffer allowed.
Default: 2
Sets the amount of data to build up in memory (backed by an unsorted log on disk) before converting to a sorted on-disk file.
Larger values increase performance, especially during bulk loads. Up to max_write_buffer_number write buffers may be held in memory at the same time, so you may wish to adjust this parameter to control memory usage. Also, a larger write buffer will result in a longer recovery time the next time the database is opened.
Note that write_buffer_size is enforced per column family. See db_write_buffer_size for sharing memory across column families.
Default: 0x4000000 (64MiB)
Dynamically changeable through SetOptions() API
Amount of data to build up in memtables across all column families before writing to disk.
This is distinct from write_buffer_size, which enforces a limit for a single memtable.
This feature is disabled by default. Specify a non-zero value to enable it.
Default: 0 (disabled)
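A hedged sketch of budgeting memtable memory with these options (set_db_write_buffer_size is assumed to be the setter for the DB-wide limit described above):
::
    from rocksdict import Options

    opts = Options()
    opts.set_write_buffer_size(64 * 1024 * 1024)      # 64 MiB per memtable
    opts.set_max_write_buffer_number(3)               # up to 3 memtables per column family
    opts.set_db_write_buffer_size(256 * 1024 * 1024)  # 256 MiB across all column families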
Control maximum total data size for a level. max_bytes_for_level_base is the max total for level-1. Maximum number of bytes for level L can be calculated as (max_bytes_for_level_base) * (max_bytes_for_level_multiplier ^ (L-1)) For example, if max_bytes_for_level_base is 200MB, and if max_bytes_for_level_multiplier is 10, total data size for level-1 will be 200MB, total file size for level-2 will be 2GB, and total file size for level-3 will be 20GB.
Default: 0x10000000 (256MiB).
Dynamically changeable through SetOptions() API
The manifest file is rolled over on reaching this limit. The older manifest file will be deleted. The default value is MAX_INT so that roll-over does not take place.
Sets the target file size for compaction. target_file_size_base is per-file size for level-1. Target file size for level L can be calculated by target_file_size_base * (target_file_size_multiplier ^ (L-1)) For example, if target_file_size_base is 2MB and target_file_size_multiplier is 10, then each file on level-1 will be 2MB, and each file on level 2 will be 20MB, and each file on level-3 will be 200MB.
Default: 0x4000000 (64MiB)
Dynamically changeable through SetOptions() API
Sets the minimum number of write buffers that will be merged together before writing to storage. If set to 1, then all write buffers are flushed to L0 as individual files; this increases read amplification because a get request has to check all of these files. Also, an in-memory merge may result in writing less data to storage if there are duplicate records in each of these individual write buffers.
Default: 1
Sets the number of files to trigger level-0 compaction. A value < 0 means that level-0 compaction will not be triggered by number of files at all.
Default: 4
Dynamically changeable through SetOptions() API
Sets the soft limit on number of level-0 files. We start slowing down writes at this point. A value < 0 means that no write slowdown will be triggered by the number of files in level-0.
Default: 20
Dynamically changeable through SetOptions() API
Sets the maximum number of level-0 files. We stop writes at this point.
Default: 24
Dynamically changeable through SetOptions() API
Sets the compaction style.
Default: DBCompactionStyle.level()
Sets the options needed to support Universal Style compactions.
Sets unordered_write to true trades higher write throughput with relaxing the immutability guarantee of snapshots. This violates the repeatability one expects from ::Get from a snapshot, as well as :MultiGet and Iterator's consistent-point-in-time view property. If the application cannot tolerate the relaxed guarantees, it can implement its own mechanisms to work around that and yet benefit from the higher throughput. Using TransactionDB with WRITE_PREPARED write policy and two_write_queues=true is one way to achieve immutable snapshots despite unordered_write.
By default, i.e., when it is false, rocksdb does not advance the sequence number for new snapshots unless all the writes with lower sequence numbers are already finished. This provides the immutability that we expect from snapshots. Moreover, since Iterator and MultiGet internally depend on snapshots, the snapshot immutability results in Iterator and MultiGet offering a consistent-point-in-time view. If set to true, although the Read-Your-Own-Write property is still provided, the snapshot immutability property is relaxed: the writes issued after the snapshot is obtained (with larger sequence numbers) will still not be visible to reads from that snapshot; however, there might still be pending writes (with lower sequence numbers) that will change the state visible to the snapshot after they are landed in the memtable.
Default: false
Sets maximum number of threads that will concurrently perform a compaction job by breaking it into multiple, smaller ones that are run simultaneously.
Default: 1 (i.e. no subcompactions)
Sets maximum number of concurrent background jobs (compactions and flushes).
Default: 2
Dynamically changeable through SetDBOptions() API.
Disables automatic compactions. Manual compactions can still be issued on this column family.
Default: false
Dynamically changeable through SetOptions() API
SetMemtableHugePageSize sets the page size for huge page for arena used by the memtable. If <=0, it won't allocate from huge page but from malloc. Users are responsible to reserve huge pages for it to be allocated. For example: sysctl -w vm.nr_hugepages=20 See linux doc Documentation/vm/hugetlbpage.txt If there isn't enough free huge page available, it will fall back to malloc.
Dynamically changeable through SetOptions() API
Sets the maximum number of successive merge operations on a key in the memtable.
When a merge operation is added to the memtable and the maximum number of successive merges is reached, the value of the key will be calculated and inserted into the memtable instead of the merge operation. This will ensure that there are never more than max_successive_merges merge operations in the memtable.
Default: 0 (disabled)
Control locality of bloom filter probes to improve cache miss rate. This option only applies to memtable prefix bloom and plaintable prefix bloom. It essentially limits the max number of cache lines each bloom filter check can touch.
This optimization is turned off when set to 0. The number should never be greater than number of probes. This option can boost performance for in-memory workload but should use with care since it can cause higher false positive rate.
Default: 0
Enable/disable thread-safe inplace updates.
Requires updates if
- key exists in current memtable
- new sizeof(new_value) <= sizeof(old_value)
- old_value for that key is a put i.e. kTypeValue
Default: false.
Sets the number of locks used for inplace update.
Default: 10000 when inplace_update_support = true, otherwise 0.
Different max-size multipliers for different levels. These are multiplied by max_bytes_for_level_multiplier to arrive at the max-size of each level.
Default: 1
Dynamically changeable through SetOptions() API
If true, then DB::Open() will not fetch and check sizes of all sst files. This may significantly speed up startup if there are many sst files, especially when using non-default Env with expensive GetFileSize(). We'll still check that all required sst files exist. If paranoid_checks is false, this option is ignored, and sst files are not checked at all.
Default: false
The total maximum size(bytes) of write buffers to maintain in memory including copies of buffers that have already been flushed. This parameter only affects trimming of flushed buffers and does not affect flushing. This controls the maximum amount of write history that will be available in memory for conflict checking when Transactions are used. The actual size of write history (flushed Memtables) might be higher than this limit if further trimming will reduce write history total size below this limit. For example, if max_write_buffer_size_to_maintain is set to 64MB, and there are three flushed Memtables, with sizes of 32MB, 20MB, 20MB. Because trimming the next Memtable of size 20MB will reduce total memory usage to 52MB which is below the limit, RocksDB will stop trimming.
When using an OptimisticTransactionDB: If this value is too low, some transactions may fail at commit time due to not being able to determine whether there were any write conflicts.
When using a TransactionDB: If Transaction::SetSnapshot is used, TransactionDB will read either in-memory write buffers or SST files to do write-conflict checking. Increasing this value can reduce the number of reads to SST files done for conflict detection.
Setting this value to 0 will cause write buffers to be freed immediately after they are flushed. If this value is set to -1, 'max_write_buffer_number * write_buffer_size' will be used.
Default: If using a TransactionDB/OptimisticTransactionDB, the default value will be set to the value of 'max_write_buffer_number * write_buffer_size' if it is not explicitly set by the user. Otherwise, the default is 0.
By default, a single write thread queue is maintained. The thread that gets to the head of the queue becomes the write batch group leader and is responsible for writing to the WAL and memtable for the batch group.
If enable_pipelined_write is true, separate write thread queues are maintained for WAL writes and memtable writes. A write thread first enters the WAL writer queue and then the memtable writer queue. A pending thread on the WAL writer queue thus only has to wait for previous writers to finish their WAL writing but not the memtable writing. Enabling this feature may improve write throughput and reduce latency of the prepare phase of two-phase commit.
Default: false
Defines the underlying memtable implementation. See official wiki for more information. Defaults to using a skiplist.
Example:
::
    from rocksdict import Options, MemtableFactory

    opts = Options()
    factory = MemtableFactory.hash_skip_list(
        bucket_count=1_000_000,
        height=4,
        branching_factor=4,
    )

    opts.set_allow_concurrent_memtable_write(False)
    opts.set_memtable_factory(factory)
Sets the table factory to a CuckooTableFactory (the default table factory is a block-based table factory that provides a default implementation of TableBuilder and TableReader with default BlockBasedTableOptions). See official wiki for more information on this table format.
Example:
::
    from rocksdict import Options, CuckooTableOptions

    opts = Options()
    factory_opts = CuckooTableOptions()
    factory_opts.set_hash_ratio(0.8)
    factory_opts.set_max_search_depth(20)
    factory_opts.set_cuckoo_block_size(10)
    factory_opts.set_identity_as_first_hash(True)
    factory_opts.set_use_module_hash(False)

    opts.set_cuckoo_table_factory(factory_opts)
This is a factory that provides TableFactory objects. Default: a block-based table factory that provides a default implementation of TableBuilder and TableReader with default BlockBasedTableOptions. Sets the factory as plain table. See official wiki for more information.
Example:
::
    from rocksdict import Options, PlainTableFactoryOptions

    opts = Options()
    factory_opts = PlainTableFactoryOptions()
    factory_opts.user_key_length = 0
    factory_opts.bloom_bits_per_key = 20
    factory_opts.hash_table_ratio = 0.75
    factory_opts.index_sparseness = 16

    opts.set_plain_table_factory(factory_opts)
Measure IO stats in compactions and flushes, if true.
Default: false
Once write-ahead logs exceed this size, we will start forcing the flush of column families whose memtables are backed by the oldest live WAL file (i.e. the ones that are causing all the space amplification).
Default: 0
Recovery mode to control the consistency while replaying WAL.
Default: DBRecoveryMode::PointInTime
If not zero, dump rocksdb.stats to LOG every stats_dump_period_sec.
Default: 600 (10 mins)
If not zero, dump rocksdb.stats to RocksDB every stats_persist_period_sec.
Default: 600 (10 mins)
When set to true, reading SST files will opt out of the filesystem's readahead. Setting this to false may improve sequential iteration performance.
Default: true
Enable/disable adaptive mutex, which spins in the user space before resorting to kernel.
This could reduce context switch when the mutex is not heavily contended. However, if the mutex is hot, we could end up wasting spin time.
Default: false
When a prefix_extractor is defined through opts.set_prefix_extractor, this creates a prefix bloom filter for each memtable with the size of write_buffer_size * memtable_prefix_bloom_ratio (capped at 0.25).
Default: 0
Sets the maximum number of bytes in all compacted files. We try to limit number of bytes in one compaction to be lower than this threshold. But it's not guaranteed.
Value 0 will be sanitized.
Default: target_file_size_base * 25
Specifies the absolute path of the directory the write-ahead log (WAL) should be written to.
Default: same directory as the database
Sets the WAL ttl in seconds.
The following two options affect how archived logs will be deleted.
- If both set to 0, logs will be deleted asap and will not get into the archive.
- If wal_ttl_seconds is 0 and wal_size_limit_mb is not 0, WAL files will be checked every 10 min and if the total size is greater than wal_size_limit_mb, they will be deleted starting with the earliest until size_limit is met. All empty files will be deleted.
- If wal_ttl_seconds is not 0 and wal_size_limit_mb is 0, then WAL files will be checked every wal_ttl_seconds / 2 and those that are older than wal_ttl_seconds will be deleted.
- If both are not 0, WAL files will be checked every 10 min and both checks will be performed with ttl being first.
Default: 0
Sets the WAL size limit in MB.
If the total size of WAL files is greater than wal_size_limit_mb, they will be deleted starting with the earliest until size_limit is met.
Default: 0
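A hedged sketch (the setter names set_wal_ttl_seconds and set_wal_size_limit_mb are assumed from the option names above):
::
    from rocksdict import Options

    opts = Options()
    opts.set_wal_ttl_seconds(3600)    # keep archived WAL files for at most one hour
    opts.set_wal_size_limit_mb(1024)  # and cap their total size at 1024 MB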
Sets the number of bytes to preallocate (via fallocate) the manifest files.
Default is 4MB, which is reasonable to reduce random IO as well as prevent overallocation for mounts that preallocate large amounts of data (such as xfs's allocsize option).
If true, then DB::Open() will not update the statistics used to optimize compaction decision by loading table properties from many files. Turning off this feature will improve DBOpen time especially in disk environment.
Default: false
Specify the maximal number of info log files to be kept.
Default: 1000
Allow the OS to mmap file for writing.
Default: false
Allow the OS to mmap file for reading sst tables.
Default: false
Guarantee that all column families are flushed together atomically.
This option applies to both manual flushes (db.flush()) and automatic background flushes caused when memtables are filled.
Note that this is only useful when the WAL is disabled. When using the WAL, writes are always consistent across column families.
Default: false
Sets global cache for table-level rows. Cache must outlive DB instance which uses it.
Default: null (disabled) Not supported in ROCKSDB_LITE mode!
Use to control write rate of flush and compaction. Flush has higher priority than compaction. If rate limiter is enabled, bytes_per_sync is set to 1MB by default.
Default: disable
Sets the maximal size of the info log file.
If the log file is larger than max_log_file_size, a new info log file will be created. If max_log_file_size is equal to zero, all logs will be written to one log file.
Default: 0
Example:
::
    from rocksdict import Options

    options = Options()
    options.set_max_log_file_size(0)
Sets the time for the info log file to roll (in seconds).
If specified with a non-zero value, the log file will be rolled if it has been active longer than log_file_time_to_roll.
Default: 0 (disabled)
Controls the recycling of log files.
If non-zero, previously written log files will be reused for new logs, overwriting the old data. The value indicates how many such files we will keep around at any point in time for later use. This is more efficient because the blocks are already allocated and fdatasync does not need to update the inode after each write.
Default: 0
Example:
::
    from rocksdict import Options

    options = Options()
    options.set_recycle_log_file_num(5)
Sets the threshold at which all writes will be slowed down to at least delayed_write_rate if the estimated bytes needed for compaction exceed this threshold.
Default: 64GB
Sets the bytes threshold at which all writes are stopped if the estimated bytes needed for compaction exceed this threshold.
Default: 256GB
Sets the size of one block in arena memory allocation.
If <= 0, a proper value is automatically calculated (usually 1/10 of write_buffer_size).
Default: 0
If true, then print malloc stats together with rocksdb.stats when printing to LOG.
Default: false
Enable whole key bloom filter in memtable. Note this will only take effect if memtable_prefix_bloom_size_ratio is not 0. Enabling whole key filtering can potentially reduce CPU usage for point-look-ups.
Default: false (disable)
Dynamically changeable through SetOptions() API
Enable the use of key-value separation.
More details can be found here: Integrated BlobDB.
Default: false (disable)
Dynamically changeable through SetOptions() API
Sets the minimum threshold value at or above which will be written to blob files during flush or compaction.
Dynamically changeable through SetOptions() API
Sets the size limit for blob files.
Dynamically changeable through SetOptions() API
Sets the blob compression type. All blob files use the same compression type.
Dynamically changeable through SetOptions() API
If this is set to true RocksDB will actively relocate valid blobs from the oldest blob files as they are encountered during compaction.
Dynamically changeable through SetOptions() API
Sets the threshold that the GC logic uses to determine which blob files should be considered “old.”
For example, the default value of 0.25 signals to RocksDB that blobs residing in the oldest 25% of blob files should be relocated by GC. This parameter can be tuned to adjust the trade-off between write amplification and space amplification.
Dynamically changeable through SetOptions() API
Sets the blob GC force threshold.
Dynamically changeable through SetOptions() API
Sets the blob compaction read ahead size.
Dynamically changeable through SetOptions() API
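A hedged sketch of enabling the integrated BlobDB described above (setter names such as set_enable_blob_files, set_min_blob_size, set_blob_file_size, set_blob_compression_type, set_enable_blob_gc, and set_blob_gc_age_cutoff are assumed to mirror the underlying RocksDB options):
::
    from rocksdict import Options, DBCompressionType

    opts = Options()
    opts.set_enable_blob_files(True)            # turn on key-value separation
    opts.set_min_blob_size(4096)                # values >= 4 KiB go into blob files
    opts.set_blob_file_size(256 * 1024 * 1024)  # 256 MiB blob files
    opts.set_blob_compression_type(DBCompressionType.lz4())
    opts.set_enable_blob_gc(True)               # relocate valid blobs during compaction
    opts.set_blob_gc_age_cutoff(0.25)           # treat the oldest 25% of blob files as "old"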
Set this option to true during creation of database if you want
to be able to ingest behind (call IngestExternalFile() skipping keys
that already exist, rather than overwriting matching keys).
Setting this option to true has the following effects:
1) Disable some internal optimizations around SST file compression.
2) Reserve the last level for ingested files only.
3) Compaction will not include any file from the last level.
Note that only Universal Compaction supports allow_ingest_behind.
num_levels should be >= 3 if this option is turned on.
DEFAULT: false. Immutable.
A factory of a table property collector that marks an SST file as need-compaction when it observe at least "D" deletion entries in any "N" consecutive entries, or the ratio of tombstone entries >= deletion_ratio.
- window_size: the sliding window size "N"
- num_dels_trigger: the deletion trigger "D"
- deletion_ratio: if <= 0 or > 1, disable triggering compaction based on deletion ratio
Write buffer manager helps users control the total memory used by memtables across multiple column families and/or DB instances.
https://github.com/facebook/rocksdb/wiki/Write-Buffer-Manager
Users can enable this control in two ways:
1) Limit the total memtable usage across multiple column families and DBs under a threshold.
2) Cost the memtable memory usage to block cache so that memory of RocksDB can be capped by the single limit.
The usage of a write buffer manager is similar to rate_limiter and sst_file_manager. Users can create one write buffer manager object and pass it to all the options of column families or DBs whose memtable size they want to be controlled by this object.
If true, working thread may avoid doing unnecessary and long-latency operation (such as deleting obsolete files directly or deleting memtable) and will instead schedule a background job to do it.
Use it if you're latency-sensitive.
Default: false (disabled)
ReadOptions allows setting iterator bounds and so on.
Arguments:
- raw_mode (bool): this must be the same as the Options raw_mode argument.
Specify whether the "data block"/"index block"/"filter block" read for this iteration should be cached in memory. Callers may wish to set this field to false for bulk scans.
Default: true
Enforce that the iterator only iterates over the same prefix as the seek. This option is effective only for prefix seeks, i.e. prefix_extractor is non-null for the column family and total_order_seek is false. Unlike iterate_upper_bound, prefix_same_as_start only works within a prefix but in both directions.
Default: false
Enable a total order seek regardless of index format (e.g. hash index) used in the table. Some table format (e.g. plain table) may not support this option.
If true when calling Get(), we also skip prefix bloom when reading from block based table. It provides a way to read existing data after changing implementation of prefix extractor.
Sets a threshold for the number of keys that can be skipped before failing an iterator seek as incomplete. The default value of 0 should be used to never fail a request as incomplete, even on skipping too many keys.
Default: 0
If true, when PurgeObsoleteFile is called in CleanupIteratorState, we schedule a background job in the flush job queue and delete obsolete files in background.
Default: false
If true, keys deleted using the DeleteRange() API will be visible to readers until they are naturally deleted during compaction. This improves read performance in DBs with many range deletions.
Default: false
If true, all data read from underlying storage will be verified against corresponding checksums.
Default: true
If non-zero, an iterator will create a new table reader which performs reads of the given size. Using a large size (> 2MB) can improve the performance of forward iteration on spinning disks. Default: 0
Example:
::
    from rocksdict import ReadOptions

    opts = ReadOptions()
    opts.set_readahead_size(4_194_304)  # 4 MB
If true, create a tailing iterator. Note that tailing iterators only support moving in the forward direction. Iterating in reverse or seek_to_last are not supported.
Specifies the value of "pin_data". If true, it keeps the blocks loaded by the iterator pinned in memory as long as the iterator is not deleted. If used when reading from tables created with BlockBasedTableOptions::use_delta_encoding = false, the iterator's property "rocksdb.iterator.is-key-pinned" is guaranteed to return 1.
Default: false
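A short sketch of passing a configured ReadOptions to an iterator; it assumes Rdict.iter accepts a ReadOptions and that RdictIter exposes seek_to_first/valid/key/value/next, so treat those names as assumptions:
::

    from rocksdict import Rdict, ReadOptions

    db = Rdict("./test_read_opts")
    db["a"] = 1
    db["b"] = 2

    opts = ReadOptions()
    opts.set_readahead_size(4_194_304)  # 4 MB readahead for a forward bulk scan

    it = db.iter(opts)  # assumed: Rdict.iter(read_opt) returns an RdictIter
    it.seek_to_first()
    while it.valid():
        print(it.key(), "->", it.value())
        it.next()

    db.close()
    Rdict.destroy("./test_read_opts")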
Column family handle. This can be used in WriteBatch to specify Column Family.
If set to false, keys of an ingested file could appear in existing snapshots that were created before the file was ingested.
If set to false, IngestExternalFile() will fail if the file key range overlaps with existing keys or tombstones in the DB.
If set to false and the file key range overlaps with the memtable key range (memtable flush required), IngestExternalFile will fail.
Set to true if you would like duplicate keys in the file being ingested to be skipped rather than overwriting existing data under that key. Use case: back-fill of some historical data in the database without over-writing an existing newer version of the data. This option can only be used if the DB has been running with allow_ingest_behind=true since the dawn of time. All files will be ingested at the bottommost level with seqno=0.
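A sketch of the ingestion workflow using SstFileWriter together with external-file ingestion; the exact Rdict.ingest_external_file signature may differ between versions, so treat it as an assumption:
::

    from rocksdict import Rdict, Options, SstFileWriter

    # write a sorted batch of key-value pairs into an external SST file
    writer = SstFileWriter(options=Options())
    writer.open("external.sst")
    for i in range(1000):
        writer[i] = i * i  # keys must be written in sorted order
    writer.finish()

    # ingest the file into a database
    db = Rdict("./ingest_demo")
    db.ingest_external_file(["external.sst"])
    assert db[10] == 100

    db.close()
    Rdict.destroy("./ingest_demo")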
Defines the underlying memtable implementation. See official wiki for more information.
For configuring block-based file storage.
Approximate size of user data packed per block. Note that the block size specified here corresponds to uncompressed data. The actual size of the unit read from disk may be smaller if compression is enabled. This parameter can be changed dynamically.
Block size for partitioned metadata. Currently applied to indexes when kTwoLevelIndexSearch is used and to filters when partition_filters is used. Note: Since in the current implementation the filters and index partitions are aligned, an index/filter block is created when either index or filter block size reaches the specified limit.
Note: this limit is currently applied to only index blocks; a filter partition is cut right after an index block is cut.
Note: currently this option requires kTwoLevelIndexSearch to be set as well.
Use partitioned full filters for each SST file. This option is incompatible with block-based filters.
Sets global cache for blocks (user data is stored in a set of blocks, and a block is the unit of reading from disk). Cache must outlive DB instance which uses it.
If set, use the specified cache for blocks. By default, rocksdb will automatically create and use an 8MB internal cache.
Sets the filter policy to reduce disk reads.
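A hedged sketch of attaching a shared block cache and a bloom filter; Cache(capacity), set_block_cache, and set_bloom_filter follow the underlying RocksDB bindings and are assumptions here:
::

    from rocksdict import Rdict, Options, BlockBasedOptions, Cache

    block_opts = BlockBasedOptions()
    cache = Cache(512 * 1024 * 1024)          # assumed: 512 MB LRU block cache
    block_opts.set_block_cache(cache)         # assumed setter name
    block_opts.set_bloom_filter(10.0, False)  # ~10 bits per key, full (not block-based) filter

    opts = Options()
    opts.set_block_based_table_factory(block_opts)
    db = Rdict("./cached_db", opts)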
Defines the index type to be used for SS-table lookups.
Example:
::
    from rocksdict import BlockBasedOptions, BlockBasedIndexType, Options

    opts = Options()
    block_opts = BlockBasedOptions()
    block_opts.set_index_type(BlockBasedIndexType.hash_search())
    opts.set_block_based_table_factory(block_opts)
If cache_index_and_filter_blocks is true and the below is true, then filter and index blocks are stored in the cache, but a reference is held in the "table reader" object so the blocks are pinned and only evicted from cache when the table reader is freed.
Default: false.
If cache_index_and_filter_blocks is true and the below is true, then the top-level index of partitioned filter and index blocks are stored in the cache, but a reference is held in the "table reader" object so the blocks are pinned and only evicted from cache when the table reader is freed. This is not limited to l0 in LSM tree.
Default: false.
Format version, reserved for backward compatibility.
See full list of the supported versions.
Default: 2.
Number of keys between restart points for delta encoding of keys. This parameter can be changed dynamically. Most clients should leave this parameter alone. The minimum value allowed is 1. Any smaller value will be silently overwritten with 1.
Default: 16.
Same as block_restart_interval but used for the index block. If you don't plan to run RocksDB before version 5.16 and you are using `index_block_restart_interval` > 1, you should probably set the `format_version` to >= 4 as it would reduce the index size.
Default: 1.
Set the data block index type for point lookups:
- DataBlockIndexType::BinarySearch: use binary search within the data block.
- DataBlockIndexType::BinaryAndHash: use the data block hash index in combination with the normal binary search.
The hash table utilization ratio is adjustable using `set_data_block_hash_ratio`, which is valid only when using DataBlockIndexType::BinaryAndHash.
Default: BinarySearch
Example:
::
    from rocksdict import BlockBasedOptions, DataBlockIndexType, Options

    opts = Options()
    block_opts = BlockBasedOptions()
    block_opts.set_data_block_index_type(DataBlockIndexType.binary_and_hash())
    block_opts.set_data_block_hash_ratio(0.85)
    opts.set_block_based_table_factory(block_opts)
Set the data block hash index utilization ratio.
The smaller the utilization ratio, the less hash collisions happen, and so reduce the risk for a point lookup to fall back to binary search due to the collisions. A small ratio means faster lookup at the price of more space overhead.
Default: 0.75
Used with DBOptions::set_plain_table_factory. See official wiki for more information.
Defaults:
- user_key_length: 0 (variable length)
- bloom_bits_per_key: 10
- hash_table_ratio: 0.75
- index_sparseness: 16
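A sketch of enabling the plain table format, assuming PlainTableFactoryOptions exposes the fields above as writable attributes and that Options provides set_plain_table_factory (both assumptions):
::

    from rocksdict import Rdict, Options, PlainTableFactoryOptions

    p_opts = PlainTableFactoryOptions()
    # assumed attribute names, matching the defaults listed above
    p_opts.user_key_length = 0      # variable-length keys
    p_opts.bloom_bits_per_key = 10
    p_opts.hash_table_ratio = 0.75
    p_opts.index_sparseness = 16

    opts = Options()
    opts.set_plain_table_factory(p_opts)  # assumed setter name
    db = Rdict("./plain_table_db", opts)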
Configuration of cuckoo-based storage.
Determines the utilization of hash tables. Smaller values result in larger hash tables with fewer collisions. Default: 0.9
A property used by the builder to determine the depth to go to when searching for a path to displace elements in case of collision. See the Builder.MakeSpaceForKey method. Higher values result in more efficient hash tables with fewer lookups but take more time to build. Default: 100
In case of collision while inserting, the builder attempts to insert in the next cuckoo_block_size locations before skipping over to the next Cuckoo hash function. This makes lookups more cache friendly in case of collisions. Default: 5
If this option is enabled, the user key is treated as uint64_t and its value is used as the hash value directly. This option changes the builder's behavior. Readers ignore this option and behave according to what is specified in the table property. Default: false
If this option is set to true, modulo is used during hash calculation. This often yields better space efficiency at the cost of performance. If this option is set to false, the number of entries in the table is constrained to be a power of two, and bitwise AND is used to calculate the hash, which is faster in general. Default: true
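A hedged sketch of enabling cuckoo table storage; the setter names (set_hash_ratio, set_max_search_depth, set_cuckoo_block_size) and Options.set_cuckoo_table_factory mirror the underlying bindings and are assumptions:
::

    from rocksdict import Rdict, Options, CuckooTableOptions

    c_opts = CuckooTableOptions()
    c_opts.set_hash_ratio(0.75)      # larger hash table, fewer collisions
    c_opts.set_max_search_depth(100)
    c_opts.set_cuckoo_block_size(5)

    opts = Options()
    opts.set_cuckoo_table_factory(c_opts)  # assumed setter name
    db = Rdict("./cuckoo_db", opts)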
Sets the algorithm used to stop picking files into a single compaction run.
Default: Total
Sets the size amplification.
It is defined as the amount (in percentage) of additional storage needed to store a single byte of data in the database. For example, a size amplification of 2% means that a database that contains 100 bytes of user-data may occupy up to 102 bytes of physical storage. By this definition, a fully compacted database has a size amplification of 0%. RocksDB uses the following heuristic to calculate size amplification: it assumes that all files excluding the earliest file contribute to the size amplification.
Default: 200, which means that a 100 byte database could require up to 300 bytes of storage.
Sets the percentage flexibility while comparing file size. If the candidate file(s) size is 1% smaller than the next file's size, then include next file into this candidate set.
Default: 1
Sets the percentage of compression size.
If this option is set to be -1, all the output files will follow compression type specified.
If this option is not negative, we will try to make sure the compressed size is just above this value. In normal cases, at least this percentage of data will be compressed. When we are compacting to a new file, here is the criterion for whether it needs to be compressed: assuming the files sorted by generation time are A1...An B1...Bm C1...Ct, where A1 is the newest and Ct is the oldest, and we are going to compact B1...Bm, we calculate the total size of all the files as total_size, as well as the total size of C1...Ct as total_C; the compaction output file will be compressed iff total_C / total_size < this percentage.
Default: -1
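A sketch of tuning universal compaction with these knobs; the setter names below and Options.set_universal_compaction_options are assumptions modelled on the underlying bindings:
::

    from rocksdict import (
        Options,
        UniversalCompactOptions,
        UniversalCompactionStopStyle,
        DBCompactionStyle,
    )

    uni = UniversalCompactOptions()
    uni.set_max_size_amplification_percent(200)  # assumed setter names
    uni.set_size_ratio(1)
    uni.set_compression_size_percent(-1)
    uni.set_stop_style(UniversalCompactionStopStyle.total())

    opts = Options()
    opts.set_compaction_style(DBCompactionStyle.universal())
    opts.set_universal_compaction_options(uni)   # assumed setter name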
Creates a HyperClockCache with capacity in bytes.
`estimated_entry_charge` is an important tuning parameter. The optimal choice at any given time is `(cache.get_usage() - 64 * cache.get_table_address_count()) / cache.get_occupancy_count()`, or approximately `cache.get_usage() / cache.get_occupancy_count()`.
However, the value cannot be changed dynamically, so as the cache composition changes at runtime, the following tradeoffs apply:
- If the estimate is substantially too high (e.g., 25% higher), the cache may have to evict entries to prevent load factors that would dramatically affect lookup times.
- If the estimate is substantially too low (e.g., less than half), then meta data space overhead is substantially higher.
The latter is generally preferable, and picking the larger of block size and meta data block size is a reasonable choice that errs towards this side.
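A sketch of creating such a cache and handing it to the block-based table options; the constructor name Cache.new_hyper_clock_cache(capacity, estimated_entry_charge) and set_block_cache are assumptions based on the description above:
::

    from rocksdict import Options, BlockBasedOptions, Cache

    # assumed constructor: 1 GB cache with ~8 KiB estimated charge per entry
    cache = Cache.new_hyper_clock_cache(1024 * 1024 * 1024, 8 * 1024)

    block_opts = BlockBasedOptions()
    block_opts.set_block_cache(cache)  # assumed setter name

    opts = Options()
    opts.set_block_based_table_factory(block_opts)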
Used by BlockBasedOptions::set_checksum_type.
Call the corresponding functions of each to get one of the following.
- NoChecksum
- CRC32c
- XXHash
- XXHash64
- XXH3
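For instance, to select one of these for block-based tables (the snake_case factory name and set_checksum_type are assumptions):
::

    from rocksdict import Options, BlockBasedOptions, ChecksumType

    block_opts = BlockBasedOptions()
    block_opts.set_checksum_type(ChecksumType.xxh3())  # assumed factory name

    opts = Options()
    opts.set_block_based_table_factory(block_opts)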
This is to be treated as an enum.
Call the corresponding functions of each to get one of the following.
- Level
- Universal
- Fifo
Below is an example to set compaction style to Fifo.
Example:
::
    from rocksdict import Options, DBCompactionStyle

    opt = Options()
    opt.set_compaction_style(DBCompactionStyle.fifo())
This is to be treated as an enum.
Call the corresponding functions of each to get one of the following.
- None
- Snappy
- Zlib
- Bz2
- Lz4
- Lz4hc
- Zstd
Below is an example to set compression type to Snappy.
Example:
::
    from rocksdict import Options, DBCompressionType

    opt = Options()
    opt.set_compression_type(DBCompressionType.snappy())
This is to be treated as an enum.
Call the corresponding functions of each to get one of the following.
- TolerateCorruptedTailRecords
- AbsoluteConsistency
- PointInTime
- SkipAnyCorruptedRecord
Below is an example to set recovery mode to PointInTime.
Example:
::
    from rocksdict import Options, DBRecoveryMode

    opt = Options()
    opt.set_wal_recovery_mode(DBRecoveryMode.point_in_time())
Returns a new environment that stores its data in memory and delegates all non-file-storage tasks to base_env.
Sets the number of background worker threads of a specific thread pool for this environment. `LOW` is the default pool.
Default: 1
Sets the size of the high priority thread pool that can be used to prevent compactions from stalling memtable flushes.
Sets the size of the low priority thread pool that can be used to prevent compactions from stalling memtable flushes.
Sets the size of the bottom priority thread pool that can be used to prevent compactions from stalling memtable flushes.
Lowering IO priority for threads from the specified pool.
Lowering IO priority for high priority thread pool.
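A hedged sketch of tuning the environment thread pools; the Env() constructor, the thread-pool method names, and Options.set_env are all assumptions modelled on the underlying bindings:
::

    from rocksdict import Options, Env

    env = Env()                                  # assumed default constructor
    env.set_background_threads(4)                # LOW pool
    env.set_high_priority_background_threads(2)  # pool used for memtable flushes
    env.lower_thread_pool_io_priority()          # lower IO priority of the LOW pool

    opts = Options()
    opts.set_env(env)                            # assumed setter name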
If more than one thread calls manual compaction, only one will actually schedule it while the other threads will simply wait for the scheduled manual compaction to complete. If exclusive_manual_compaction is set to true, the call will disable scheduling of automatic compaction jobs and wait for existing automatic compaction jobs to finish.
Used in PlainTableFactoryOptions.
Raised when accessing a closed database instance.
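For example, operations after close() raise this error and can be caught like any other exception (a minimal sketch):
::

    from rocksdict import Rdict, DbClosedError

    db = Rdict("./closed_demo")
    db["k"] = "v"
    db.close()

    try:
        _ = db["k"]
    except DbClosedError:
        print("database is already closed")

    Rdict.destroy("./closed_demo")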
https://github.com/facebook/rocksdb/wiki/Write-Buffer-Manager
Write buffer manager helps users control the total memory used by memtables across multiple column families and/or DB instances.
Users can enable this control in two ways:
1. Limit the total memtable usage across multiple column families and DBs under a threshold.
2. Cost the memtable memory usage to block cache so that memory of RocksDB can be capped by the single limit.
The usage of a write buffer manager is similar to rate_limiter and sst_file_manager. Users can create one write buffer manager object and pass it to all the options of column families or DBs whose memtable size they want to be controlled by this object.
A memory limit is given when creating the write buffer manager object. RocksDB will try to limit the total memory to under this limit.
If the mutable memtable size exceeds about 90% of the limit, a flush will be triggered on one column family of the DB you are inserting to. If the total memory is over the limit, a more aggressive flush may also be triggered, but only if the mutable memtable size also exceeds 50% of the limit. Both checks are needed because if more than half of the memory is already being flushed, triggering more flushes may not help.
The total memory is counted as total memory allocated in the arena, even if some of that may not yet be used by memtable.
Arguments:
- buffer_size: the memory limit in bytes.
- allow_stall: if set to true, stalls all writers when memory usage exceeds buffer_size (soft limit); writers wait for flushes to complete and memory usage to drop.
Users can set up RocksDB to cost memory used by memtables to block cache. This can happen no matter whether you enable memtable memory limit or not. This option is added to manage memory (memtables + block cache) under a single limit.
Arguments:
- buffer_size: the memory limit in bytes.
- allow_stall: if set to true, stalls all writers when memory usage exceeds buffer_size (soft limit); writers wait for flushes to complete and memory usage to drop.
- cache: the block cache instance.
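A hedged sketch of wiring a write buffer manager into the options; the constructor names (new_write_buffer_manager, new_write_buffer_manager_with_cache), Cache(capacity), and Options.set_write_buffer_manager are assumptions based on the arguments documented above:
::

    from rocksdict import Rdict, Options, Cache, WriteBufferManager

    # cap total memtable memory at 256 MB and charge it to a shared block cache
    cache = Cache(512 * 1024 * 1024)  # assumed: Cache(capacity) builds an LRU cache
    wbm = WriteBufferManager.new_write_buffer_manager_with_cache(
        256 * 1024 * 1024,  # buffer_size
        False,              # allow_stall
        cache,
    )

    opts = Options()
    opts.set_write_buffer_manager(wbm)  # assumed setter name
    db = Rdict("./wbm_db", opts)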
Database's checkpoint object. Used to create checkpoints of the specified DB from time to time.
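A minimal sketch, assuming the checkpoint object wraps an open Rdict and writes snapshots with create_checkpoint (both names are assumptions):
::

    from rocksdict import Rdict, Checkpoint

    db = Rdict("./live_db")
    db["k"] = "v"

    # take a consistent on-disk snapshot of the live database
    ckpt = Checkpoint(db)                       # assumed constructor
    ckpt.create_checkpoint("./live_db_ckpt_1")  # assumed method; target dir must not exist

    # a checkpoint can later be opened like any other Rdict
    backup = Rdict("./live_db_ckpt_1")
    assert backup["k"] == "v"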