High read latencies


Rahul Arora

Oct 14, 2023, 6:39:23 AM
to rocksdb
Context:

Hi, my RocksDB workload is entirely read-based.

The data is generated on a single server, and the DB is then replicated to all the replicas, which bootstrap from it.

There are roughly 20M keys and the data size is around 15G.

Each value is a serialised POJO associated with its ID.

Problem:

My read latencies are far too high, so I have tried a number of things to optimise them:

I changed the filter block bits, pinned L0 blocks in the cache, and increased the block cache size to more than the DB size.

Even after all this, my data block cache misses are almost equal to the block cache hits. Weirdly, my LRU cache usage is still only 16G out of 20G. What should I do?

One more observation: even increasing the block size has no impact on the read latencies, even though the cache is not fully utilised.

The reads, which are all random in nature, somehow always increase the disk utilisation. However, if the DB is compacted before reading, no disk utilisation is observed and the read latencies are within limits.
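The compaction observation is consistent with how LSM point lookups work: a Get may probe every L0 file (they can overlap in key range) plus at most one file per deeper non-empty level, so an uncompacted DB touches more SST files, and therefore more disk, per read. A rough model, with hypothetical file counts:

```python
def max_sst_files_probed(num_l0_files: int, nonempty_deeper_levels: int) -> int:
    # Each L0 file may overlap the looked-up key, so every one is a
    # candidate; levels L1+ are sorted and non-overlapping, so at most
    # one file per non-empty level is consulted.
    return num_l0_files + nonempty_deeper_levels

# Hypothetical uncompacted shape: 4 L0 files, L1-L3 populated.
print(max_sst_files_probed(4, 3))   # 7 files may be read per Get
# After a full compaction, all data sits in one bottommost level.
print(max_sst_files_probed(0, 1))   # 1 file
```

This is why fully compacting before serving a read-only workload reduces disk utilisation: each lookup has a single sorted run to consult.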

Also, my cache miss metric is almost equal to the cache hit metric for the data blocks. I'm unable to understand why this could happen.

Mark Callaghan

Oct 14, 2023, 7:06:29 PM
to rocksdb
If you don't do full compaction is it possible that compaction is in progress during your tests?

Rahul Arora

Oct 14, 2023, 7:14:35 PM
to rocksdb
Hey, valid point, but I checked the logs and found no ongoing compaction at that time.

Attaching the LOG file as well, in case you want to have a look. The stats dump was taken every 30 seconds. Towards the end of the LOG file, you'll notice that the block cache contains all the DB data but cache misses are still happening.
LOG

Rahul Arora

Oct 15, 2023, 4:52:04 AM
to rocksdb

I was looking at the LRUCache docs here https://github.com/facebook/rocksdb/wiki/Block-Cache#lru-cache

It says that a lock is acquired even for a read from a shard. And I remember reading somewhere that if fetching data from the cache exceeds a certain time threshold (because of lock contention, etc.), that counts as a cache miss and the data is fetched from disk.

Is this observation correct?

If yes, how do I fix it? Should I increase the num_shard_bits?
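For context on the num_shard_bits question: LRUCache splits its capacity across 2^num_shard_bits shards, each guarded by its own mutex, so the trade-off is lock contention versus per-shard capacity. A quick sketch of per-shard sizes for a 20G cache (the candidate bit values here are illustrative, not a recommendation):

```python
def shard_capacity_bytes(capacity_bytes: int, num_shard_bits: int) -> int:
    # More shards means less lock contention, but each shard has its own
    # LRU list over a smaller budget, so hot blocks can be evicted sooner.
    return capacity_bytes // (1 << num_shard_bits)

cache = 20 * 1024**3  # 20 GiB, matching the cache size discussed above
for bits in (4, 6, 8):
    print(bits, shard_capacity_bytes(cache, bits))
```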

------------

I'm now quite convinced this problem is caused by an inappropriate block cache config, because the results are very erratic: sometimes I'm able to fetch 5M keys/min and sometimes it's just 500K/min, with identical configuration in each run.

But how do I find the right config for my cache? What factors should be considered?

@jowlyzhang can you help, please?

Mark Callaghan

Oct 15, 2023, 4:40:01 PM
to Rahul Arora, rocksdb
On Sun, Oct 15, 2023 at 1:52 AM Rahul Arora <coder...@gmail.com> wrote:

I was looking at the LRUCache docs here https://github.com/facebook/rocksdb/wiki/Block-Cache#lru-cache

This says that a lock is acquired even during a read from a shard. And, I remember reading somewhere that if the time taken to fetch the data from the cache exceeds a certain threshold (because of lock etc) then that constitutes a cache miss, and the data is fetched from the disk.

Is this observation correct?

I am not aware of that behavior.
 

------------

I'm quite convinced now that this problem is happening because of the inappropriate block cache config because the results are very random. Sometimes I'm able to fetch 5M keys/min and sometimes it's just 500K/min. All the configurations are the same in each run.

But, how do I find the right config for my cache? What factors are there to be considered?

From the LOG you provided earlier in the thread I see:
    capacity : 8388608
    num_shard_bits : 4

So that is an 8MB block cache with 16 (2^4) shards. While 16 shards is probably OK, an 8MB block cache is small and you will probably be better off with a larger one. This setup also means each shard gets only 2^19 bytes (8MB / 16), and there are problems that can happen when shards are small -- http://smalldatum.blogspot.com/2023/07/one-source-write-stalls-with-rocksdb.html

But elsewhere in the LOG I see a ~14G block cache with 2^6 (64) shards, which is better:
    capacity : 15032385536
    num_shard_bits : 6

Elsewhere in the LOG I see:
2023/10/15-00:03:32.729977 140258010806016          Options.compression: Snappy
2023/10/15-00:03:32.729979 140258010806016                  Options.bottommost_compression: Disabled

And I also see that the database is ~15G on disk. If that is 15G with Snappy compression then it will probably be larger than 15G when uncompressed in the block cache.
Is the database compressed?
 
All of the block cache eviction counters have a 0 count, grep for "evict COUNT : 0"

It is hard for me to guess much from data.hit. It could be that these counters just reflect the population of an empty cache. The value of block_size was 4096, so data blocks should be roughly 4096 bytes, and data.miss * 4096 ~= 20GB, so it might be that the data is ~20G uncompressed (in the block cache) and ~15G on disk.
rocksdb.block.cache.data.miss COUNT : 5250132
rocksdb.block.cache.data.hit COUNT : 5229701
rocksdb.block.cache.data.add COUNT : 5250132
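The size estimate above can be checked directly from the counters (block_size and the miss count are taken from the LOG; treating every data block as exactly 4096 bytes is an approximation):

```python
block_size = 4096
data_miss = 5_250_132  # rocksdb.block.cache.data.miss from the LOG

# If every miss loaded a distinct ~4KB data block, this approximates
# the uncompressed data size that passed through the block cache.
approx_uncompressed = data_miss * block_size
print(approx_uncompressed)            # ~21.5 billion bytes
print(approx_uncompressed / 1024**3)  # ~20 GiB uncompressed vs ~15G on disk
```

That gap between ~20G uncompressed and ~15G on disk is consistent with Snappy compression being enabled.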

From this, the storage device(s) might be overloaded. Summing the buckets from (380000, 570000] upward, roughly 2-3% of the reads take 380ms or longer, and there are many outliers exceeding 1 second.

 ** File Read Latency Histogram By Level [f::s::n::l] **
** Level 3 read latency histogram (micros):

(   22000,   33000 ]    55733   1.092%  71.443%
(   33000,   50000 ]   233936   4.583%  76.026% #
(   50000,   75000 ]   449437   8.804%  84.830% ##
(   75000,  110000 ]   398808   7.812%  92.643% ##
(  110000,  170000 ]   198621   3.891%  96.534% #
(  170000,  250000 ]    54475   1.067%  97.601%
(  250000,  380000 ]    27719   0.543%  98.144%
(  380000,  570000 ]    18829   0.369%  98.513%
(  570000,  860000 ]    17670   0.346%  98.859%
(  860000, 1200000 ]    46012   0.901%  99.760%
( 1200000, 1900000 ]    50130   0.982% 100.742%
( 1900000, 2900000 ]      133   0.003% 100.745%
( 2900000, 4300000 ]       25   0.000% 100.745%
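The slow-read tail can be quantified by summing the bucket counts above; a sketch using those counts (the total read count is inferred from the printed per-bucket percentages, so the final figure is approximate):

```python
# (upper_bound_micros, count) for the histogram buckets at or above 380ms
tail = [(570_000, 18_829), (860_000, 17_670), (1_200_000, 46_012),
        (1_900_000, 50_130), (2_900_000, 133), (4_300_000, 25)]
tail_count = sum(count for _, count in tail)

# The (22000, 33000] line says 55733 reads were 1.092% of the total.
approx_total = round(55_733 / 0.01092)

print(tail_count)                       # 132799 reads took >= 380ms
print(100 * tail_count / approx_total)  # roughly 2-3% of all reads
```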
--
Mark Callaghan
mdca...@gmail.com