Compaction gets stuck in DBImpl::WaitForFlushMemTables

Jan Steemann

May 18, 2022, 8:18:58 AM

to rocksdb

Hi everyone,

we sometimes see some of our threads getting stuck in db->CompactRange() calls for a very long time.

The RocksDB version is 7.2, and the exact commit in use is 2b5df21e95096fbfc25e8aac33b2153302e710e9.

Here is an example backtrace of such a thread:

Thread 10 (LWP 143252):

#0 __syscall_cp_asm () at src/thread/aarch64/syscall_cp.s:28

#1 0x0000000003ceeedc in __syscall_cp_c (nr=98, u=<optimized out>, v=<optimized out>, w=<optimized out>, x=<optimized out>, y=<optimized out>, z=<optimized out>) at src/thread/pthread_cancel.c:33

#2 0x0000000003cf908c in __futex4_cp (to=0x0, val=2, op=128, addr=0xffff783f9154) at src/thread/__timedwait.c:52

#3 __timedwait_cp (addr=addr@entry=0xffff783f9154, val=val@entry=2, clk=clk@entry=0, at=at@entry=0x0, priv=128, priv@entry=1) at src/thread/__timedwait.c:52

#4 0x0000000003cef3b0 in __pthread_cond_timedwait (c=0xffff81639250, m=0xffff81638f00, ts=0x0) at src/thread/pthread_cond_timedwait.c:100

#5 0x0000000002229350 in rocksdb::port::CondVar::Wait () at /work/ArangoDB/3rdParty/rocksdb/port/port_posix.cc:122

#6 0x0000000002100ee8 in rocksdb::InstrumentedCondVar::WaitInternal () at /work/ArangoDB/3rdParty/rocksdb/monitoring/instrumented_mutex.cc:52

#7 rocksdb::InstrumentedCondVar::Wait () at /work/ArangoDB/3rdParty/rocksdb/monitoring/instrumented_mutex.cc:45

#8 0x0000000001f9b1c0 in rocksdb::DBImpl::WaitForFlushMemTables () at /work/ArangoDB/3rdParty/rocksdb/db/db_impl/db_impl_compaction_flush.cc:2352

#9 0x0000000001f9fd68 in rocksdb::DBImpl::FlushMemTable () at /work/ArangoDB/3rdParty/rocksdb/db/db_impl/db_impl_compaction_flush.cc:2101

#10 0x0000000001fa976c in rocksdb::DBImpl::CompactRangeInternal () at /work/ArangoDB/3rdParty/rocksdb/db/db_impl/db_impl_compaction_flush.cc:1023

#11 0x0000000001fa9c2c in rocksdb::DBImpl::CompactRange () at /work/ArangoDB/3rdParty/rocksdb/db/db_impl/db_impl_compaction_flush.cc:904

#12 0x0000000001913f7c in rocksdb::StackableDB::CompactRange () at /work/ArangoDB/3rdParty/rocksdb/include/rocksdb/utilities/stackable_db.h:271

No other threads are doing relevant work when it gets stuck here.

The thread is waiting in DBImpl::WaitForFlushMemTables, and doesn't make any progress. There is no background error, and no shutdown happening (i.e. db->Close() wasn't called yet).

The compaction options are:

rocksdb::CompactRangeOptions opts;
opts.exclusive_manual_compaction = false;
opts.allow_write_stall = true;
opts.canceled = &::cancelCompactions;

We use cancelable compactions, and the compaction in question should have been canceled already. The cancelation check, however, happens only at the beginning of a compaction run, not after the run has started.

Would it be an option to pass an optional pointer to the cancelation variable into WaitForFlushMemTables, and check from its while loop if the waiting should be canceled? That cancelation variable could be fed in from compactions that trigger flushes, and could be omitted from other callers.
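
To make the idea concrete, here is a rough standalone sketch of such a cancelable wait loop. It is not RocksDB code: `WaitWithCancel` and `WaitResult` are made-up names, and it uses `std::condition_variable` in place of `InstrumentedCondVar`. It also assumes that the thread setting the cancelation flag does not signal the condition variable, so the loop polls with a short timed wait.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical result type; real code would report this through a Status.
enum class WaitResult { kDone, kCanceled };

// Waits until done() returns true or *canceled becomes true.
// canceled may be nullptr for callers that do not support cancelation.
// The caller must hold lock on entry, as with std::condition_variable::wait.
template <typename Predicate>
WaitResult WaitWithCancel(std::condition_variable& cv,
                          std::unique_lock<std::mutex>& lock,
                          Predicate done,
                          const std::atomic<bool>* canceled) {
  assert(lock.owns_lock());
  while (!done()) {
    if (canceled != nullptr && canceled->load(std::memory_order_acquire)) {
      return WaitResult::kCanceled;
    }
    // Wake up periodically, because setting the flag does not
    // necessarily notify the condition variable.
    cv.wait_for(lock, std::chrono::milliseconds(10));
  }
  return WaitResult::kDone;
}
```

A compaction-triggered flush would pass the same pointer that is stored in `opts.canceled`, while all other callers would pass `nullptr` and keep today's behavior apart from the periodic wake-ups.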

If you think this is a good way forward, I am happy to work on a PR with the change.

Thanks

J

Yanqin Jin

May 18, 2022, 11:20:03 AM

to Jan Steemann, rocksdb

The code you are using includes a prior commit, 6d2577e56, which, when writes are stopped by the db (full write stalls), causes unprotected concurrent accesses to some of ColumnFamilyData's members and/or other shared data structures. When this happens, I think these data structures can end up in an undefined state.

This has been fixed by b58a1a035, which reverts the former commit. The issue that the former commit tried to address still exists, and we are fixing it in https://github.com/facebook/rocksdb/pull/10001.

Jan Steemann

May 18, 2022, 11:58:26 AM

to Yanqin Jin, rocksdb

Hi Yanqin,

thanks a lot for the very informative reply.

Indeed, 7.2 includes that commit you mentioned, which by now has been reverted in upstream RocksDB.

I will give this a try.

I have to say, though, that I have seen CompactRange() calls getting stuck before that, in older versions.

We have one branch of our application which is using RocksDB 6.29, and that also had sporadic problems with CompactRange calls hanging at the same code location.

RocksDB 6.29 predates the commit 6d2577e5672a7abe7b41a67f1cccce3a6601b30e, so I am wondering if there are also some other issues contributing to the problem.

But I will try reverting that one commit as a test.

Thanks

J

Yanqin Jin

May 18, 2022, 12:17:40 PM

to Jan Steemann, rocksdb

I think that's possible, because the commit mentioned above is trying to fix an issue of flush being blocked forever. 🙁

Jan Steemann

May 19, 2022, 10:40:52 AM

to rocksdb

Hi,

I don't think that the open PR https://github.com/facebook/rocksdb/pull/10001 that you mentioned will fix the issue of stuck compaction threads.

The PR will change DBImpl::FlushMemTable() to return status TryAgain early if writes have stopped, but _only_ if FlushOptions.wait is not set.

The problem is that DBImpl::CompactRangeInternal() will still call it with FlushOptions that have their "wait" member set to true (as true is the default value for that member in FlushOptions).

Here is the calling code:

  if (s.ok() && flush_needed) {
    FlushOptions fo;
    fo.allow_write_stall = options.allow_write_stall;
    if (immutable_db_options_.atomic_flush) {
      autovector<ColumnFamilyData*> cfds;
      mutex_.Lock();
      SelectColumnFamiliesForAtomicFlush(&cfds);
      mutex_.Unlock();
      s = AtomicFlushMemTables(cfds, fo, FlushReason::kManualCompaction,
                               false /* writes_stopped */);
    } else {
      s = FlushMemTable(cfd, fo, FlushReason::kManualCompaction,
                        false /* writes_stopped*/);
    }

So the code that would exit early in FlushMemTable() would not be executed, due to the mismatching value in FlushOptions.wait.
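
To illustrate the mismatch, here is a toy model of the behavior as I understand it from the PR description. None of this is RocksDB code: `FlushOptionsModel`, `FlushOutcome`, `FlushModel` and `OptionsFromCompactRange` are made-up names, and the real condition in the PR may differ in detail.

```cpp
#include <cassert>

// Toy stand-in for FlushOptions; only the two members discussed here.
struct FlushOptionsModel {
  bool wait = true;  // true is the default, as noted above
  bool allow_write_stall = false;
};

enum class FlushOutcome { kTryAgain, kWaitsForFlush };

// Models the early exit described for the PR: TryAgain is returned only
// when writes are stopped AND the caller did not ask to wait.
inline FlushOutcome FlushModel(const FlushOptionsModel& fo,
                               bool writes_stopped) {
  if (writes_stopped && !fo.wait) {
    return FlushOutcome::kTryAgain;
  }
  // Otherwise the caller goes on to wait for the flush to finish.
  return FlushOutcome::kWaitsForFlush;
}

// Builds options the way the CompactRangeInternal() excerpt does:
// only allow_write_stall is set, so wait keeps its default of true.
inline FlushOptionsModel OptionsFromCompactRange(bool allow_write_stall) {
  FlushOptionsModel fo;
  fo.allow_write_stall = allow_write_stall;
  return fo;
}

// Checks the two cases: the compaction path still waits, while an
// explicit wait=false caller gets TryAgain.
inline void CheckCompactRangeCase() {
  assert(FlushModel(OptionsFromCompactRange(true), true) ==
         FlushOutcome::kWaitsForFlush);
  FlushOptionsModel no_wait;
  no_wait.wait = false;
  assert(FlushModel(no_wait, true) == FlushOutcome::kTryAgain);
}
```

In this model the manual compaction path can never reach the TryAgain branch, regardless of `allow_write_stall`, because nothing on that path sets `wait` to false.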