How can I improve the readrandom performance?


韩光阳

Oct 12, 2022, 5:20:16 AM
to rocksdb
Hi! I am aiming to use RocksDB to manage an SSD as a read cache for backend HDDs. The most important operation is Get(), so I am curious how to tune the options to improve readrandom performance.

On an SSD that does 190K IOPS of 4K random reads, the highest speed I got is 12w ops for 16B keys + 32B values, which is too slow. Can anyone give me some advice on improving this? Thanks a lot!

BTW, another setting we are considering is 16B keys + 4KiB values with BlobDB enabled; I would appreciate it if anyone could share a configuration that improves reads.

Thanks again!

Mark Callaghan

Oct 13, 2022, 1:27:51 PM
to rocksdb
What is "12w ops"?
How many client threads were used in your benchmark?
What was your RocksDB configuration?

韩光阳

Oct 20, 2022, 10:30:29 PM
to rocksdb
Sorry, 12w is a Chinese way of saying 120K. The commands I ran are listed below:
----------------------------------------------------------------------------------------
-- for the write (load) phase:
   ./db_bench \
  -benchmarks=fillrandom,stats \
  -num=10000000 \
  -threads=1 \
  -db=/root/db/data \
  -duration=600 \
  -key_size=16 \
  -value_size=4096 \
  -compression_type=none \
  -enable_pipelined_write=true \
  -enable_blob_files=true \
  -min_blob_size=1024 \
  -enable_blob_garbage_collection=true \
  -blob_garbage_collection_age_cutoff=0.300000 \
  -blob_garbage_collection_force_threshold=0.100000 \
  -pin_l0_filter_and_index_blocks_in_cache=false \
  -disable_wal=true \
  -write_buffer_size=536870912 \
  -blob_file_size=536870912 \
  -target_file_size_base=67108864 \
  -max_bytes_for_level_base=671088640 \
  -cache_size=0 \
  -use_direct_io_for_flush_and_compaction \
  -max_background_flushes=8 \
  -max_background_compactions=32 \
  -subcompactions=16 \
  -max_write_buffer_number=3 \
  -level0_file_num_compaction_trigger=4 \
  -level0_slowdown_writes_trigger=16 \
  -level0_stop_writes_trigger=24 \
  -num_levels=4 \
  -max_bytes_for_level_multiplier=10 \
  -seed=1665573037454110
-------------------------------------------------------------------------------------
-- for the random read phase:
  ./db_bench \
  -benchmarks=readrandom,stats \
  -db=/root/db/data \
  -threads=64 \
  -duration=300 \
  -enable_blob_files=true \
  -min_blob_size=1024 \
  -enable_blob_garbage_collection=true \
  -blob_garbage_collection_age_cutoff=0.30000 \
  -blob_garbage_collection_force_threshold=0.100000 \
  -pin_l0_filter_and_index_blocks_in_cache=false \
  -write_buffer_size=536870912 \
  -blob_file_size=536870912 \
  -target_file_size_base=67108864 \
  -max_bytes_for_level_base=671088640 \
  -num_levels=4 \
  -max_bytes_for_level_multiplier=10 \
  -disable_wal=true \
  -use_existing_db=1 \
  -use_existing_keys=1 \
  -cache_size=0 \
  -use_direct_io_for_flush_and_compaction \
  -use_direct_reads \
  -seed=1056573037454101
--------------------------------------------------------------------------------------------
I first run the fill benchmark to create the DB, then run readrandom on it.
The write performance is pretty good; we can get 800+ MB/s. But readrandom is not so good: only 60K ops (250MB/s).
A fio direct 4K random read shows 190K IOPS on the SSD volume, which is composed of two Intel 960GB SATA SSDs in RAID 0.
By comparison, RocksDB's utilization of the device's read capability seems too low.


韩光阳

Oct 20, 2022, 10:39:00 PM
to rocksdb
Thanks for your response.
As you can see from the commands above, I used 1 thread for writes and 64 threads for reads, which is the best configuration I have tested.

I also used BlobDB for my large-value KV entries; however, the read performance is similar with or without BlobDB.
250MB/s of random reads (16B key + 4KiB value) seems to be the best performance I have gotten.

On Friday, October 14, 2022 at 01:27:51 UTC+8, <mdca...@gmail.com> wrote:

Siying Dong

Oct 21, 2022, 1:06:07 PM
to 韩光阳, rocksdb

Mark, I believe by “12w” he means the Chinese “12万”, which is 120K. However, that doesn’t feel like “too slow” to me, given that the drive can do 190K IOPS. So is that a typo for 12K, I guess?

 


韩光阳

Oct 21, 2022, 3:00:02 PM
to rocksdb
Yes, “12w” means the Chinese “12万”, i.e. 120K ops for the RocksDB readrandom 16B+32B bench. For the 16B+4KiB bench, the best result I have ever gotten is 60K ops.

You said it is not slow for a 190K IOPS SSD. I believe that's true, but I need more evidence to convince my leader T_T..., so:
  • What kind of relationship should there be between SSD fio 4K readrandom performance and RocksDB readrandom performance?

  • Could you share some results from your platform, such as your SSD IOPS and the best RocksDB readrandom results?

For 16B+4KiB performance, one way to explain it: to get a KV entry, RocksDB needs one 4K page read to get the (key + pointer) from the LSM tree,
and two 4K page reads to get the (key_len + key + value_len + value) from BlobDB, since (key_len + key + value_len + value) exceeds one 4K page.
So 60K RocksDB ops means nearly 60K * 3 = 180K IOPS on the SSD, which is close to the 190K limit of the SSD. Am I right?

Siying Dong

Oct 21, 2022, 4:12:19 PM
to 韩光阳, rocksdb

It depends on the configuration. That’s why Mark asked for your RocksDB configuration.

 

If you use the default configuration with 4KB blocks, your analysis is correct: some block reads will span two pages and do two 4KB I/Os. There is an option to do aligned I/O (https://github.com/facebook/rocksdb/blob/main/include/rocksdb/table.h#L528-L529); with it you can avoid reading two 4KB pages per request.
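
For reference, a minimal C++ sketch of turning on that option (the block size and surrounding options here are illustrative, not a recommendation):

  #include <rocksdb/options.h>
  #include <rocksdb/table.h>

  rocksdb::BlockBasedTableOptions table_options;
  // Align data blocks to the lesser of page size and block size, so a
  // 4KB data block never straddles two 4KB pages on a direct read.
  // Only effective when blocks are not compressed.
  table_options.block_align = true;
  table_options.block_size = 4 * 1024;

  rocksdb::Options options;
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));
  options.compression = rocksdb::kNoCompression;
  options.use_direct_reads = true;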

 

It’s hard to speculate about what happens in the BlobDB case; providing the full configuration would be helpful. Since there is no way to align a 4KB blob to a 4KB page (if you don’t compress it, or it is not compressible), you would almost always need to read 8KB per blob.
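
For context, the blob-related db_bench flags used earlier in this thread map to these Options fields of the integrated BlobDB; this is a sketch using the thread's values, not a tuned configuration:

  rocksdb::Options blob_options;
  // Integrated BlobDB: values >= min_blob_size go to blob files, and the
  // LSM tree stores only (key, blob pointer).
  blob_options.enable_blob_files = true;
  blob_options.min_blob_size = 1024;
  blob_options.blob_file_size = 512 * 1024 * 1024;
  blob_options.enable_blob_garbage_collection = true;
  blob_options.blob_garbage_collection_age_cutoff = 0.30;
  blob_options.blob_garbage_collection_force_threshold = 0.10;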

 

Thanks,

 

Siying

 


MARK CALLAGHAN

Oct 22, 2022, 9:47:09 PM
to Siying Dong, 韩光阳, rocksdb
I am running tests and learning new things while trying to get an answer. It will take a few days to have a real answer.

For now:
1) From the command line you shared above, RocksDB was configured not to cache the leveled LSM tree. So each query will do 2 storage reads -- one from the LSM tree, one from a blob file. If you change RocksDB to cache the LSM tree then you reduce that to one storage read per query and probably double query throughput.

To do that:
  --cache_index_and_filter_blocks=true \
  --cache_high_pri_pool_ratio=0.5 \
  --cache_low_pri_pool_ratio=0 \
  --pin_l0_filter_and_index_blocks_in_cache=true \
  --cache_size=$something-larger-than-zero

With 10M key-value pairs, a cache_size of 300M might be sufficient to cache the tree.
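
If you are configuring this through the C++ API rather than db_bench, a rough equivalent sketch (the 300MB capacity follows the estimate above; the 0.5 high-priority ratio matches --cache_high_pri_pool_ratio):

  #include <rocksdb/cache.h>
  #include <rocksdb/table.h>

  rocksdb::BlockBasedTableOptions table_options;
  // Keep index and filter blocks in the block cache, and pin the L0
  // ones so point lookups do not re-read them from storage.
  table_options.cache_index_and_filter_blocks = true;
  table_options.pin_l0_filter_and_index_blocks_in_cache = true;
  // 300MB LRU cache with half the capacity reserved for high-priority
  // (index/filter) blocks.
  table_options.block_cache = rocksdb::NewLRUCache(
      300 * 1024 * 1024, /*num_shard_bits=*/-1,
      /*strict_capacity_limit=*/false, /*high_pri_pool_ratio=*/0.5);

  rocksdb::Options options;
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));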

2) How many CPU cores per server do you expect to use for this? Is that with or without hyperthreading enabled?



--
Mark Callaghan
mdca...@gmail.com

韩光阳

Oct 23, 2022, 3:25:38 AM
to rocksdb
Mark, Siying, thank you very much for your patience and joint efforts.

There are two reasons why I set the block cache to zero:
1) I want to see to what extent RocksDB can take advantage of the SSD's read and write capability. fio tests show that the SSD can achieve 900+ MB/s of sequential writes and 190K IOPS of 4K random reads.
    I was also monitoring the disk read/write speed reported by the "iostat" command while running the RocksDB benchmarks.
    For the fill bench, iostat shows 850+ MB/s, which matches RocksDB's reported write throughput and is close to the SSD's maximum write speed.
    But for the readrandom bench, iostat shows the SSD read speed staying at 400~500MB/s, indicating that the SSD's read performance is not fully utilized.
    Both benchmarks bypassed the OS page cache, with "-use_direct_io_for_flush_and_compaction" for the fill and "-use_direct_reads" for readrandom.

2) We want to implement a caching system based on RocksDB, and there are three alternatives: RocksDB manages only the cache metadata (key + metadata, 16B+32B), or RocksDB directly manages the entire cache (key + page data, 16B+4KiB page), with or without BlobDB. The size of the LSM tree differs a lot between these cases, so the same cache size means different things. To be fair, we disabled the RocksDB cache to test the lower-bound performance.

The cache should always be transparent. RocksDB currently cannot fully exploit the SSD's direct-read capability, so I am curious how the settings could be adjusted to improve read performance.

Looking forward to your good news! 

MARK CALLAGHAN

Oct 23, 2022, 10:32:14 AM
to 韩光阳, rocksdb
I asked about the number of CPU cores on this server because you need CPU to drive IO. Perhaps RocksDB uses too much CPU per request, or perhaps it just uses more CPU than we want it to, and it will certainly use more CPU per operation than fio does -- but that can be a debate and experiment for another day. Today I can tell you that RocksDB can certainly saturate storage devices given enough CPU. And all of this is true even with compression disabled; with compression enabled, the CPU-per-IO overhead increases by another 1 to 5 microseconds.

Browsing the fio manpage quickly, it isn't clear that fio does anything that would consume CPU per read -- checksum verification is optional via the verify option, and isn't done for random read tests. So fio uses as little CPU as possible per read.




--
Mark Callaghan
mdca...@gmail.com

韩光阳

Oct 23, 2022, 11:05:13 PM
to rocksdb
Sorry, the CPU in the server is 32 * Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz. One service occupies three cores, leaving 29 cores available, so processor resources are unlikely to be the bottleneck.
When running readrandom with 64 threads, the total CPU used by the db_bench process is around 200% (two cores). There are no restrictions on CPU usage.

MARK CALLAGHAN

Oct 24, 2022, 1:01:05 AM
to 韩光阳, rocksdb
Thanks. One more question: what version of RocksDB have you been using? I am using 7.7.3 in my attempts to understand the problem.



--
Mark Callaghan
mdca...@gmail.com

韩光阳

Oct 24, 2022, 2:08:04 AM
to rocksdb
I used the latest version in the beginning, tested 16+32, 16+4096, and 16+4096 with BlobDB, and got the results above.
Now I am using the 6.15.fb branch with the latest SPDK, but the improvement is tiny.
The SSD is a RAID 0 array composed of 2 * INTEL SSDSC2KB960G8 drives (each one does 95K+ IOPS of random reads individually).

BTW, it must be late at night on your end. Please take care of yourself. Thanks.

MARK CALLAGHAN

Oct 24, 2022, 12:01:21 PM
to 韩光阳, rocksdb
For your CPU, is that 32 cores with hyperthreading enabled or disabled?

I have begun doing performance work for the new, integrated BlobDB. For leveled and universal compaction I include 4.1, 5.1, all of the 6.x releases, and of course all of the 7.x releases. For integrated BlobDB I limit myself to the 7.x releases because the feature isn't as old and there has been a lot of improvement in 7.x.



--
Mark Callaghan
mdca...@gmail.com

韩光阳

Oct 25, 2022, 1:40:59 AM
to rocksdb
The 6226R is a 16-core/32-thread CPU.

I checked the official performance page: for the LSM tree part, 5.x, 6.x, and 7.x perform similarly, but 7.x adds a lot of optimizations for BlobDB. Am I right?

Looking forward to the results of your performance work.

Mark Callaghan

Oct 27, 2022, 1:36:02 PM
to rocksdb
I'm not suggesting you upgrade for perf reasons, but I pay more attention to the recent releases.
I also need to start updating the wiki page with per-release benchmark results, but I have been busy with other tasks.

Changes in 7.x:
* BlobDB can share the RocksDB block cache, and memory accounting for BlobDB is more accurate. But I think you don't want to cache blobs.
* Worst-case write stalls with leveled compaction are greatly reduced starting in 7.5.3.
* Write-heavy throughput can be 1.5X better starting in 7.6.0 because of a change that reduces mutex contention (between concurrent writes, and between writes and compaction).

Mark Callaghan

Feb 22, 2023, 10:42:22 AM
to rocksdb
My answer took longer than expected, but I learned a bit. As with many storage engines, the answer to "can RocksDB saturate my IO device?" is: it depends.

http://smalldatum.blogspot.com/2023/02/can-rocksdb-use-all-of-iops-from-fast.html