Block Cache Freeing Taking Too Long


Will Morgan

Nov 19, 2024, 11:34:04 PM
to rocksdb
Hi,

We're running our RocksDB instance on fairly large machines (128 cores, 1 TB of RAM), with a block cache anywhere from 256 GB to 512 GB.

What we've noticed is that at this size, the block cache seems to take on the order of minutes (3-5) to free itself when we restart the process. Is this normal? Is there anything I can do to optimize it?

Thanks

Ted Mostly

Nov 20, 2024, 3:04:09 AM
to rocksdb
Restart the process, meaning both stopping and starting it?

If the stop is slow, it may be a slow WAL flush or a slow compaction; you could manually flush the WAL and disable compactions, then stop.

If the start is slow, enable SYNC_POINT and check the logs.
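The quiesce-before-stop idea above can be sketched roughly like this. This is illustrative only: `QuiesceBeforeShutdown` and the `dbs` collection are hypothetical names, though `SetOptions`, `FlushWAL`, and `Flush` are real RocksDB C++ API calls (check your version's headers):

#include <rocksdb/db.h>
#include <vector>

// Illustrative sketch: quiesce each DB before shutdown so closing it
// has less background work to wait on.
void QuiesceBeforeShutdown(const std::vector<rocksdb::DB*>& dbs) {
  for (rocksdb::DB* db : dbs) {
    // Stop background compactions so shutdown doesn't wait on them.
    db->SetOptions({{"disable_auto_compactions", "true"}});
    // Persist the WAL, then flush memtables up front.
    db->FlushWAL(/*sync=*/true);
    db->Flush(rocksdb::FlushOptions());
  }
}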

Will Morgan

Nov 20, 2024, 10:26:31 AM
to rocksdb
So we have a process that uses multiple RocksDB databases, all sharing one block cache. When we upgrade the service that owns the DBs, we need to restart it. Part of that restart is essentially calling:
block_cache_.reset();

But this call hangs, and I can see via htop that the memory is reclaimed by the OS at only about 2 GB/s (which seems incredibly slow).

We initialize the cache by doing the following:

rocksdb::LRUCacheOptions lru_block_cache_opts;
// Use a 64-bit literal: a plain 1024*1024*1024*256 overflows int.
lru_block_cache_opts.capacity = 256ULL * 1024 * 1024 * 1024;  // 256 GiB
lru_block_cache_opts.strict_capacity_limit = false;
lru_block_cache_opts.high_pri_pool_ratio = 0.5;
block_cache_ = rocksdb::NewLRUCache(lru_block_cache_opts);

I'm curious if anyone else has had a similar issue and how they were able to fix it.

Thanks!

Will

malik hou

Nov 21, 2024, 5:51:47 AM
to rocksdb
There is no good way around it; this is mostly a limitation of how quickly the operating system can reclaim memory.

Mark Callaghan

Nov 21, 2024, 12:11:15 PM
to rocksdb
Which malloc implementation do you use -- glibc, jemalloc or tcmalloc?

I have not noticed this, but I also don't look closely for the problem and my servers don't have more than 256G of RAM.

It would help to see a few stack traces or a CPU profile (perf record) from the slow shutdown. But I am not asking you to risk problems in production to collect them.

We don't talk much about the speed at which memory can be allocated from or returned to the OS. But with big memory servers it is something that needs more attention.

Long ago we might have made MySQL shutdown faster by skipping some of the code that frees memory. But in that case the process was ending, and it isn't clear whether you are restarting a process or just restarting RocksDB within a process. While freeing all memory is good for memory-leak tests, it isn't that useful beyond that.
This blog post was inspired by our MySQL issue -- https://dom.as/2009/12/10/best-free-is-exit/

Will Morgan

Nov 21, 2024, 12:27:06 PM
to rocksdb
We use the latest release of jemalloc with default settings. I'll see if I can get a profile without leaking anything; otherwise I'll summarize it here.

Will Morgan

Nov 24, 2024, 3:51:34 PM
to rocksdb
So I found block_cache_->DisownData(), which seems to do exactly what I want:

// Call this on shutdown if you want to speed it up. Cache will disown
// any underlying data and will not free it on delete. This call will leak
// memory - call this only if you're shutting down the process.
// Any attempts of using cache after this call will fail terribly.
// Always delete the DB object before calling this method!
virtual void DisownData()

I will be going down the track of leaking the block cache on process shutdown as that seems to solve my issue. Thanks for the help!
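For anyone finding this later, the ordering the header comment insists on matters. A minimal sketch of the shutdown path, assuming the DBs happen to be held in a vector of unique_ptrs (the function name and ownership model are made up; `DisownData` is the real RocksDB Cache method quoted above):

#include <rocksdb/cache.h>
#include <rocksdb/db.h>
#include <memory>
#include <vector>

// Illustrative sketch of a fast shutdown path.
void FastShutdown(std::vector<std::unique_ptr<rocksdb::DB>>& dbs,
                  std::shared_ptr<rocksdb::Cache>& block_cache) {
  dbs.clear();                // delete every DB object BEFORE DisownData()
  block_cache->DisownData();  // cache entries are deliberately leaked
  block_cache.reset();        // destructor is now cheap; OS reclaims at exit
}

Only safe when the whole process is exiting right after, since the leaked memory is never returned until then.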