WriteBufferManager Stalls Indefinitely

Henning Lohse

May 24, 2024, 7:11:30 AM
to rocksdb
Hi everyone,

we are using multiple instances of RocksDB, capping their memory usage via the WriteBufferManager (WBM) with a cost-to-cache configuration for memtables.

One test scenario led to the WBM stalling writes indefinitely: the total memory usage of the active memtables exceeds the WBM limit, but no further flushes are performed, so the situation never resolves itself because the memory usage never drops.

Am I understanding correctly that there is no mechanism in RocksDB and/or the WBM to actively look for memtables to flush while the limit is exceeded, as long as no other flushes are already pending?
And that we would then need to trigger flushes manually on the application side?
There seems to be no leak (e.g. iterators that remain open); this was verified in long-running tests and with jemalloc's heap profiling capabilities.

Thanks for any help!

More data points regarding the test scenario (a rough configuration sketch follows the list):

- rocksdbjni 9.1.1
- WBM: 1.1 GB LRU block cache, with 748 MB bufferSizeBytes (for memtables)
- Max. 4 MB memtables (Options#setWriteBufferSize)
- Default 2 max. write buffers (Options#setMaxWriteBufferNumber)
- Tried Options#setLevel0FileNumCompactionTrigger = 1 to compact immediately after flush
- Max. 4 background threads (vCPU count, Options#setMaxBackgroundJobs)
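
In case it helps to reproduce, here is a minimal sketch of roughly how the configuration above maps onto the rocksdbjni API. The DB path and class name are just for illustration, this shows a single instance, and in our setup the cache and WBM objects are shared across all instances:

import org.rocksdb.LRUCache;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.WriteBufferManager;

public class WbmSetupSketch {
  public static void main(String[] args) throws RocksDBException {
    RocksDB.loadLibrary();

    // 1.1 GB LRU block cache, shared across all RocksDB instances.
    final LRUCache blockCache = new LRUCache(1_100L * 1024 * 1024);

    // Cost-to-cache WBM: memtable memory (748 MB budget) is charged against
    // the block cache; the third argument enables write stalls at the limit.
    final WriteBufferManager wbm =
        new WriteBufferManager(748L * 1024 * 1024, blockCache, true);

    final Options options = new Options()
        .setCreateIfMissing(true)
        .setWriteBufferManager(wbm)
        .setWriteBufferSize(4L * 1024 * 1024)    // max. 4 MB memtables
        .setMaxWriteBufferNumber(2)              // default 2 write buffers
        .setLevel0FileNumCompactionTrigger(1)    // compact right after flush
        .setMaxBackgroundJobs(4);                // vCPU count

    try (final RocksDB db = RocksDB.open(options, "/tmp/wbm-test-db")) {
      // ... writes reproducing the stall ...
    }
  }
}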

The test scenario creates a sudden influx of 32 MB values, which obviously exceed the 4 MB memtable limit and trigger some flushes and compactions. But in the end, the total memtable memory usage exceeds the WBM's bufferSizeBytes and stays there.
I can see in the metrics that there are no immutable memtables, neither flushed nor unflushed, and no pending flushes or compactions.
So according to the metrics and the WBM, almost all "pinned" memory (834 MB) stems from the active memtables (768 and 800 MB), exceeding the limit of 748 MB.
Still, no more flushes get triggered.
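
(For reference, those numbers come from the standard RocksDB properties; a small illustrative helper to dump them per instance could look like the following. The method name is made up:)

static void dumpMemtableStats(final org.rocksdb.RocksDB db) throws org.rocksdb.RocksDBException {
  // Size of the active (mutable) memtables of this instance.
  System.out.println("active memtables:     " + db.getLongProperty("rocksdb.cur-size-active-mem-table"));
  // Active plus unflushed immutable memtables.
  System.out.println("active + unflushed:   " + db.getLongProperty("rocksdb.cur-size-all-mem-tables"));
  // Number of immutable memtables that have not been flushed yet.
  System.out.println("unflushed immutables: " + db.getLongProperty("rocksdb.num-immutable-mem-table"));
  // 1 if a flush resp. compaction is pending, 0 otherwise.
  System.out.println("flush pending:        " + db.getLongProperty("rocksdb.mem-table-flush-pending"));
  System.out.println("compaction pending:   " + db.getLongProperty("rocksdb.compaction-pending"));
}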

I can see via gdb that our application's JNI calls to e.g. RocksDB#put are stalled:
(gdb) bt
#0  0x00007ffa81aef377 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007ffa0c4e232d in rocksdb::port::CondVar::Wait() () from /tmp/librocksdbjni13129105919697093945.so
#2  0x00007ffa0c2d0278 in rocksdb::DBImpl::WriteBufferManagerStallWrites() () from /tmp/librocksdbjni13129105919697093945.so
#3  0x00007ffa0c2d5792 in rocksdb::DBImpl::PreprocessWrite(rocksdb::WriteOptions const&, rocksdb::DBImpl::LogContext*, rocksdb::DBImpl::WriteContext*) ()
   from /tmp/librocksdbjni13129105919697093945.so
#4  0x00007ffa0c2d931c in rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool, unsigned long*, unsigned long, rocksdb::PreReleaseCallback*, rocksdb::PostMemTableCallback*) () from /tmp/librocksdbjni13129105919697093945.so
#5  0x00007ffa0c2da6dc in rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*) () from /tmp/librocksdbjni13129105919697093945.so
#6  0x00007ffa0c2da8ff in rocksdb::DB::Put(rocksdb::WriteOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::Slice const&) ()
   from /tmp/librocksdbjni13129105919697093945.so
#7  0x00007ffa0c2daa15 in rocksdb::DBImpl::Put(rocksdb::WriteOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::Slice const&) ()
   from /tmp/librocksdbjni13129105919697093945.so
#8  0x00007ffa0c1199b5 in rocksdb::DB::Put(rocksdb::WriteOptions const&, rocksdb::Slice const&, rocksdb::Slice const&) () from /tmp/librocksdbjni13129105919697093945.so
#9  0x00007ffa0c108e55 in Java_org_rocksdb_RocksDB_put__J_3BII_3BII () from /tmp/librocksdbjni13129105919697093945.so
[...]

But the background threads that I assume should handle flushes and compactions are just waiting to be signalled:
(gdb) bt
#0  0x00007ffa81aef377 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007ffa0bb27b0c in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib64/libstdc++.so.6
#2  0x00007ffa0c613917 in rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long) () from /tmp/librocksdbjni13129105919697093945.so
#3  0x00007ffa0c613c62 in rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*) () from /tmp/librocksdbjni13129105919697093945.so
#4  0x00007ffa0c83cb1f in ?? () from /tmp/librocksdbjni13129105919697093945.so
#5  0x00007ffa81ae944b in start_thread () from /lib64/libpthread.so.0
#6  0x00007ffa8162052f in clone () from /lib64/libc.so.6

It is possible that the RocksDB#put operation in this case was on an instance with an (almost) empty memtable.
From my understanding, this then does not necessarily lead to a flush of this memtable, nor to a flush of another RocksDB instance's larger memtable.

There is another thread, quoting:
https://groups.google.com/g/rocksdb/c/5A78MPk9xWM/m/9Gu-VgNRAAAJ
"With rocksdb , you better of if you make sure that the write buffer manager will never get to a flush/stall conditions.  I suggest strongly to update to Speedb 2.5.1"
I briefly tried Speedb 2.8.0 but ran into the same situation.

But I was wondering whether such a WBM stall requires "external"/application-side resolution by triggering flushes manually?
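
To make the question concrete, the workaround I have in mind would look roughly like the sketch below, run periodically from a monitoring thread in the application. This is an application-side idea, not an existing RocksDB/WBM mechanism; the helper name and the policy of flushing the largest memtable are purely illustrative:

import java.util.List;
import org.rocksdb.FlushOptions;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

// If the summed memtable size across instances exceeds the WBM budget,
// force a manual flush of the instance holding the most memtable memory.
static void flushLargestIfOverBudget(final List<RocksDB> instances,
                                     final long wbmBudgetBytes) throws RocksDBException {
  long total = 0;
  RocksDB largest = null;
  long largestSize = -1;
  for (final RocksDB db : instances) {
    final long size = db.getLongProperty("rocksdb.cur-size-all-mem-tables");
    total += size;
    if (size > largestSize) {
      largestSize = size;
      largest = db;
    }
  }
  if (total > wbmBudgetBytes && largest != null) {
    try (final FlushOptions fo = new FlushOptions().setWaitForFlush(false)) {
      largest.flush(fo);  // manual flush, frees memtable memory once it completes
    }
  }
}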

Thanks and regards,
Henning

Henning Lohse

May 24, 2024, 7:21:50 AM
to rocksdb
Dumping some screenshots from Grafana, highlighting that the active memtables' memory exceeds the WBM limit while there are no more immutable memtables to flush, and no pending flushes or compactions.

Screenshot 2024-05-24 at 13.18.01.png
Screenshot 2024-05-24 at 13.01.16.png