Intra-L0 compaction causes big write stall

你爸爸

Oct 26, 2022, 11:20:40 AM
to rocksdb
Hi, recently I was testing some insert and read workloads with RocksDB. I found that leveled compaction in RocksDB uses an intra-L0 compaction mechanism, but this often causes a big write stall.

There is a SetupInitialFiles() function in LevelCompactionBuilder with the following logic:

  if (PickFileToCompact()) {
    ...
  } else {
    ...
    if (PickIntraL0Compaction()) {
      ...
    }
  }

The meaning of this logic is: if no level 0 to other-level compaction can be selected this time, then try an intra-L0 compaction to merge multiple L0 SSTs at once.

However, I noticed that one intra-L0 compaction sets being_compacted to true on multiple SSTs. When a new SST is flushed to L0, PickFileToCompact() tries to select a level 0 to other-level compaction. Because the previously selected intra-L0 compaction has not finished, being_compacted is still true for the SSTs involved, so that level 0 to other-level compaction cannot proceed. It stays blocked until the number of L0 SSTs almost reaches level0_stop_writes_trigger, by which point L0 has far more files than level0_file_num_compaction_trigger. The eventual level 0 to other-level compaction (e.g. to level 1) then involves too many files, resulting in a very long write stall (even 50 s in my test).

Is there any mechanism that can avoid this? Or could an option be added so users can choose whether intra-L0 compaction should run without restriction?

Looking forward to your reply!

Mark Callaghan

Oct 27, 2022, 8:21:07 PM
to rocksdb
Thank you for the detailed explanation of the stall. AFAIK there is no way to disable intra-L0 today. You can set max_compaction_bytes to limit the max amount of work done by an intra-L0 compaction, but that will also limit the max amount of work done by a regular (not intra-L0) compaction.
http://smalldatum.blogspot.com/2022/01/rocksdb-internals-intra-l0-compaction.html
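For reference, a minimal sketch of setting that option, assuming RocksDB's C++ Options API; the database path and byte values here are arbitrary illustration, and when max_compaction_bytes is left at 0 it defaults to target_file_size_base * 25:

```cpp
#include <rocksdb/db.h>
#include <rocksdb/options.h>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.target_file_size_base = 64ull << 20;  // 64 MB output files
  // Cap the total bytes one compaction may touch (applies to both
  // intra-L0 and regular compactions).
  options.max_compaction_bytes = 512ull << 20;  // 512 MB

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/testdb", &db);
  if (s.ok()) delete db;
  return 0;
}
```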

With leveled compaction the worst-case write stalls are greatly reduced in RocksDB version 7.5.3, although I am not sure whether the changes in 7.5.3 help with the problem you describe.

Qian Wang

Oct 28, 2022, 12:29:06 AM
to rocksdb
Thank you for your reply.

I read the code that applies max_compaction_bytes at the other levels. Its effect is: in a compaction, if the size of the SST currently being built plus its overlap with the grandparent level's SSTs exceeds max_compaction_bytes, construction of that SST is stopped early. This essentially avoids a large compaction in the future.

But I think this is fundamentally different from its role at level 0. A level n to level n+1 compaction, even a large one, does not block other level n to level n+1 compactions as long as their key ranges do not overlap. But because level 0 files are unordered, a compaction in progress at level 0 blocks all level 0 to level 1 compactions.

Another point is that this parameter is hardly ever exercised in compactions at other levels. Its use there is as follows:
  if (grandparent_file_switched &&
      overlapped_bytes_ + current_output_file_size_ >
          compaction_->max_compaction_bytes()) {
    // Too much overlap for current output; start new output
    overlapped_bytes_ = 0;
    return true;
  }
But before that, there is this check:
  if (current_output_file_size_ >= compaction_->max_output_file_size()) {
    return true;
  }
When an SST is being constructed, it will very likely exit on this earlier check, because the SST itself reaches the target file size long before the max_compaction_bytes check matters.

But at level 0, this parameter is closely tied to whether intra-L0 compaction runs. The total size of level 0 often has to reach max_compaction_bytes before intra-L0 compaction stops being picked and a level 0 to level 1 compaction can proceed. At level 0 its role is closer to an upper bound on the total file size of level 0, which is essentially different from its role in compactions at other levels.

Maybe we could split this option into two, one for level 0 and one for the other levels?

MARK CALLAGHAN

Oct 28, 2022, 11:25:20 AM
to Qian Wang, rocksdb
I appreciate your careful reading of the code. Changing it upstream isn't my decision, but if you need the change, I hope doing it in your own branch or fork is acceptable. For some history on this behavior see:
https://github.com/facebook/rocksdb/issues/9371
https://github.com/facebook/rocksdb/issues/6889
https://github.com/facebook/rocksdb/pull/5299

--
Mark Callaghan
mdca...@gmail.com

Qian Wang

Oct 29, 2022, 1:00:00 AM
to rocksdb

Thank you! I read the issues and pull request you posted in detail, and found that others have encountered similar problems before.

But as I said in my last email, max_compaction_bytes behaves differently at level 0 and at other levels, and different options should control the two. The fix should be quite simple: just add an option. I haven't contributed to the RocksDB community before, so how should I propose this upstream? Should I open an issue, or a pull request?

MARK CALLAGHAN

Oct 29, 2022, 2:07:53 PM
to Qian Wang, rocksdb
Create an issue and/or file a PR. In either case perf results from the problem would be useful, and if you have a PR or proof-of-concept, then provide perf results for that as well -- for example, worst-case write response time with the problem and the fix for a write-heavy workload.

I also asked a RocksDB developer to review this thread, but I can't guarantee that will happen.


Qian Wang

Oct 30, 2022, 2:04:58 AM
to rocksdb
Thank you for your patience in answering! I have created an issue: https://github.com/facebook/rocksdb/issues/10903