Is it normal for a db to keep compacting for days after no activity during last db connection


Joseph Cavani

Jan 16, 2023, 5:43:27 PM
to rocksdb
Hello,

I've embedded a RocksDB 7.7.3 database into an application and ingested 30 billion entries, racking up about 14 TB on an SSD. When ingestion finished, I considered the job done.

A couple of days later, when I opened the database again, the log showed compaction activity using a single CPU. It's been days, and it seems the compaction is still going on.

I don't remember whether this db behaved the same way immediately after ingestion, when I opened and closed it a few times, or whether the elapsed days had anything to do with it (some staleness factor?). Could that happen, and are there time-based rules governing compaction behavior?

As the database still keeps compacting automatically, is it safe to close the db by calling db->Close()? And further, would there be data loss if the main application gets a segfault from somewhere else and exits?

Thanks,
Joe

Mark Callaghan

Jan 16, 2023, 7:55:01 PM
to rocksdb
Congrats, 14T is a lot of data.

Compaction state is crash-safe and AFAIK you have to modify source code to make it not crash safe.

Can you share details on the RocksDB configuration? If there is just one thread doing a long compaction, my guess is that you are using universal compaction.

Joseph Cavani

Jan 17, 2023, 9:18:15 AM
to rocksdb
It should be the default leveled compaction. During ingestion I do see multiple threads working, around 8.

I think it finally finished.

In the FAQ there is this:

Q: Can I close the DB when a manual compaction is in progress?

A: No, it's not safe to do that. However, you can call CancelAllBackgroundWork(db, true) in another thread to abort the running compactions, so that you can close the DB sooner. Since 6.5, you can also speed it up using DB::DisableManualCompaction().
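A shutdown sequence along the lines the FAQ describes might look like the sketch below. This is an illustration, not the FAQ's code; it assumes an open `rocksdb::DB* db` and the stock RocksDB C++ API:

```
#include <rocksdb/convenience.h>
#include <rocksdb/db.h>

// Sketch: stop background work before closing the DB.
void shutdown_db(rocksdb::DB* db) {
  // Abort running compactions/flushes; wait=true blocks until they drain.
  rocksdb::CancelAllBackgroundWork(db, /*wait=*/true);
  // Since 6.5: abort an in-progress manual compaction explicitly.
  db->DisableManualCompaction();
  rocksdb::Status s = db->Close();
  delete db;
}
```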

Would this be the exception to the rule of crash safety?

Here are my configs. Do they look OK?

```
db_options.create_if_missing = true;
db_options.create_missing_column_families = true;
db_options.unordered_write = true;
auto num_threads = 32;
db_options.env->SetBackgroundThreads(num_threads, rocksdb::Env::Priority::HIGH);
db_options.env->SetBackgroundThreads(num_threads, rocksdb::Env::Priority::LOW);
db_options.max_background_jobs = num_threads;
db_options.bytes_per_sync = 1048576; // 1MB

std::vector<rocksdb::ColumnFamilyDescriptor> column_families;

rocksdb::ColumnFamilyOptions cf_options;
cf_options.compression = rocksdb::CompressionType::kLZ4Compression;
cf_options.bottommost_compression = rocksdb::CompressionType::kZSTD;

cf_options.write_buffer_size = 64 << 20;
cf_options.max_write_buffer_number = 4;
cf_options.min_write_buffer_number_to_merge = 1;
cf_options.level_compaction_dynamic_level_bytes = true;
// table_options.index_type = rocksdb::BlockBasedTableOptions::kHashSearch; // maybe good for prefix db

rocksdb::BlockBasedTableOptions table_options;
table_options.block_cache = rocksdb::NewLRUCache(128 << 20);
table_options.filter_policy.reset(rocksdb::NewBloomFilterPolicy(10, false));
table_options.optimize_filters_for_memory = true;
table_options.block_size = 16 * 1024;
table_options.cache_index_and_filter_blocks = true;
table_options.pin_l0_filter_and_index_blocks_in_cache = true;
cf_options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_options));

// prefix
cf_options.prefix_extractor.reset(rocksdb::NewFixedPrefixTransform(sizeof(uint32_t)));
cf_options.memtable_prefix_bloom_size_ratio = 0.1;
cf_options.memtable_whole_key_filtering = true;

column_families.push_back(rocksdb::ColumnFamilyDescriptor(rocksdb::kDefaultColumnFamilyName, cf_options));
```

Thanks!

Mark Callaghan

Jan 17, 2023, 11:51:11 AM
to rocksdb
When compaction took a long time to run:
* was it a manual or a normal compaction?
* how many threads appeared to be in progress?
* do you have any compaction IO stats in the RocksDB LOG? (grep for L0; the stats are formatted as a table)

Too bad the docs don't elaborate on what is meant by "not safe". Compaction state should be crash-safe, and I would consider it a bug otherwise. But I have little experience with manual compaction: long ago it was single-threaded, so I stayed away from it given the poor performance.

Options look OK for the most part.
1) I don't know much about prefix bloom filters, so I won't answer about that
2) I filed a bug to improve the docs for SetBackgroundThreads, https://github.com/facebook/rocksdb/issues/11097
3) I am not sure about your usage of SetBackgroundThreads

You have:

```
auto num_threads = 32;
db_options.env->SetBackgroundThreads(num_threads, rocksdb::Env::Priority::HIGH);
db_options.env->SetBackgroundThreads(num_threads, rocksdb::Env::Priority::LOW);
db_options.max_background_jobs = num_threads;
```

I prefer:

```
auto num_threads = 32;
auto num_flushes = num_threads / 4;
auto num_compactions = num_threads - num_flushes;
db_options.env->SetBackgroundThreads(num_compactions, rocksdb::Env::Priority::HIGH);
db_options.env->SetBackgroundThreads(num_flushes, rocksdb::Env::Priority::LOW);
db_options.max_background_jobs = num_threads;
```

Joseph Cavani

Jan 17, 2023, 4:30:08 PM
to rocksdb
It was a non-manual compaction, starting as soon as the db was opened. It seemed to be 1 thread.

I did see in the logs that a few levels previously held data, and now only L6 holds data. So it must have done the right thing.

To be clear, do you know what "safe" means in that context?

Thank you for filing that issue. I'll keep an eye on it.

Matt

Jan 30, 2023, 10:01:01 PM
to Joseph Cavani, rocksdb
Hi Joseph, were you able to figure this out? I'm seeing it on my end as well. I have levels 0/1 set to 1 GB, lots of smaller .sst files (about 70, around 700 MB each), a few 1 GB .sst files, and one large (55 GB) .sst file, 75 .sst files in total. It appears to be constantly recompacting the 55 GB file, which gets a little bigger each time (maybe about 500 GB). Running Java RocksDB 7.7.3.


Dan Carfas

Jan 31, 2023, 10:20:27 AM
to rocksdb
Hi Joe and Matt,

The problem you are facing is a known issue with universal compaction at large scale, and it may get worse as time goes by. Universal compaction is much better than leveled compaction in terms of write amplification, but it requires significantly more space and eventually needs to run a full compaction of the entire database.
Using leveled compaction at such a scale is also not feasible due to the huge write amplification of random updates.
There is a lot of research work on trying to achieve a better balance between write amplification and space amplification. Much of it is described in these two surveys: (1) "LSM-based Storage Techniques: A Survey" and (2) "Constructing and Analyzing the LSM Compaction Design Space". For example, the Dostoevsky and Spooky papers (SIGMOD 2018 & VLDB 2022) propose lazier and finer-grained compaction policies that better balance write and space amplification. However, there is no public code for these, and they might still suffer from various problems such as high tail latency.
At Speedb, on top of our open source, we have an enterprise version built specifically for scale and performance-at-scale use cases.
In the enterprise version, we have designed a hybrid compaction mechanism that achieves low write amplification and low space amplification at the same time.
Happy to follow up in a private thread, or join us on Discord.
Dan

Matt

Jan 31, 2023, 11:10:40 AM
to Dan Carfas, rocksdb
Thanks Dan. I think what may be occurring in my case is that I somehow ended up with a lot of fragmented .sst files: many small ones (around 60-70 files of roughly 50 MB each) and one large one (about 50 GB). Once compaction kicks in, it appears to consolidate small files with very large files, and since there are many small files, the merge sort takes a really long time. If the small files could be consolidated first, it should dramatically reduce the time needed to compact the entire set. Not sure if this is what Joe experienced, but I've seen this at least twice in our installation.

Mark Callaghan

Jan 31, 2023, 2:13:50 PM
to rocksdb
"Using leveled compaction on such a high scale is also not feasible due to the huge write amplification of random updates."

Not feasible? Should I turn off the web-scale deployments of MyRocks that are running with leveled compaction?

Having supported leveled compaction at scale, the write amp ...
* is larger than what you get from universal/tiered, but leveled does much better on space-amp
* is much smaller than what you get from a world-class b-tree like InnoDB
* isn't as bad as the occasional claims I read in conference papers, alas too many of these claims are poorly documented

Tiered has a big problem -- single-threaded compaction for a large SST is too slow. I look forward to the solutions that have arrived (in Speedb, in ScyllaDB, not sure where else). Otherwise, sharding is the workaround.

Mark Callaghan

Jan 31, 2023, 2:26:47 PM
to rocksdb
For one example of real numbers for write efficiency, see Table 1 in Section 5.1: the IO write rate with InnoDB was ~4x larger than with MyRocks.
https://vldb.org/pvldb/vol13/p3217-matsunobu.pdf

Dan Carfas

Feb 1, 2023, 1:40:12 AM
to rocksdb
Reply by Hilik, Speedb's co-founder and chief scientist:
'Matt, large files are problematic in many respects (compaction, and huge indexes & filters). We would very much like to understand the reason for these files. Do you use universal compaction? Can you also look at the values of a couple of options that determine file size (target_file_size_base, target_file_size_multiplier)? Do you ingest external SST files?
P.S. If you look for the table_file_creation event in the RocksDB LOG, it has a data_size field; this will let you find the job that created this huge file.'

Join us on Discord; link to the discussion
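For reference, a sketch of the two file-size options Hilik mentions, using what I understand to be RocksDB's usual leveled-compaction defaults (treat the values as illustrative assumptions and check your build's defaults):

```
// Sketch: options that bound SST file size under leveled compaction.
cf_options.target_file_size_base = 64 << 20;  // ~64 MB target for L1 files (the usual default)
cf_options.target_file_size_multiplier = 1;   // deeper levels use base * multiplier^(level-1)
```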

Dan Carfas

Feb 1, 2023, 11:56:43 AM
to rocksdb
Another comment from Hilik, Speedb's co-founder and chief scientist, in continuation of Mark's last reply:
Mark, I think we agree... We would be happy to review with you the solution we have in Speedb that allows a TB-scale single-shard database. Our hybrid compaction is adaptive, practically a combination of universal and leveled. It always uses very small compaction steps. It also tries to strike a balance between read and write amplification (using as many levels as needed so that writes will flow, but no more).

Mark Callaghan

Feb 1, 2023, 6:11:45 PM
to rocksdb
Your hybrid solution sounds great. A nice side effect of publishing occasionally imperfect open source DBMS software is that smart people can come along and improve on it.

Dan Carfas

Feb 2, 2023, 2:31:14 AM
to joseph....@gmail.com, rocksdb
Reply by Hilik:
'Joe, allow me to guess what happened. You did a load of 14 TB while compaction was disabled (preparing for bulk load?) and then reopened the database, so all the L0 files now needed to be compacted together. Assuming you have enough disk space (you need approximately twice the size of the data to do this), this will work fine and takes about a week. There are ways to make this process shorter if you are interested... If the data is now in read-only mode, you can ignore the type of compaction.'


Joseph Cavani

Mar 3, 2023, 2:03:50 PM
to rocksdb
Hey Dan,

It wasn't my case. I was using leveled compaction, and compaction was enabled while bulk-ingesting the data. It just took a while before a db open triggered a final compaction of all data into L6.

Joseph Cavani

Mar 3, 2023, 2:17:53 PM
to rocksdb
Incidentally, I created a smaller version of this db with the same number of entries, but each entry was much smaller (1/5 of the original 0.5 KB per entry, with 30 billion entries). Yet the query time for the same number of entries went up! I was confused, but suspected that the DB not being completely compacted to L6 led to more disk seeks. Note that each of my queries does 5000 small prefixed range-iterator seeks (each range seek going through all elements within the same prefix), racking up ~15000 IOPS.

To test my hypothesis, I would like to manually compact everything to L6, like what happened "accidentally" when I reported the behavior in my first post of this thread. This is where one of my other dbs currently stands, and I would like to trigger a manual compaction to L6:

```
** Compaction Stats [default] **
Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
L0 1/0 10.91 MB 0.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0 0.000 0 0 0.0 0.0
L2 4/0 207.88 MB 0.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0 0.000 0 0 0.0 0.0
L3 49/0 2.48 GB 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0 0.000 0 0 0.0 0.0
L4 475/0 24.74 GB 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0 0.000 0 0 0.0 0.0
L5 4761/0 247.75 GB 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0 0.000 0 0 0.0 0.0
L6 41774/0 2.42 TB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0 0.000 0 0 0.0 0.0
Sum 47064/0 2.69 TB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0 0.000 0 0 0.0 0.0
Int 0/0 0.00 KB 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.00 0 0.000 0 0 0.0 0.0

** Compaction Stats [default] **
Priority Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
```
I have this function to trigger a manual compaction, but it returns immediately without doing anything. Did I do this wrong?

```
void manual_compaction() {
  rocksdb::CompactRangeOptions options;
  options.exclusive_manual_compaction = true;
  options.allow_write_stall = true;
  // Keep the key strings alive for the duration of the call:
  // rocksdb::Slice does not own its data, so constructing a Slice from a
  // temporary std::string would leave a dangling pointer.
  std::string begin_key = prefix_offset_to_string(0, 0);
  std::string end_key = prefix_offset_to_string(nprefix - 1, sizes[nprefix - 1] - 1);
  rocksdb::Slice begin(begin_key);
  rocksdb::Slice end(end_key);
  rocksdb::Status s = db->CompactRange(options, &begin, &end);
  spdlog::info("Manual compaction done: {}.", s.ToString());
  FAISS_THROW_IF_NOT_FMT(s.ok(), "%s", s.ToString().c_str());
}
```

There are 500M prefixes, and the number of elements in each prefix is stored in the variable `sizes`.

Joseph Cavani

Mar 9, 2023, 10:27:18 PM
to rocksdb
Update: Thanks to Hilik from the Discord chat, I followed his cue into the logs and found the "reason" was "ttl". The default 30-day rule (cf_options.ttl) kicked in and compacted all SSTs from other levels into L6. Likewise, the cf_options.periodic_compaction_seconds option recompacted the L6 files as well after 30 days.
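The two options named above can also be set explicitly; a sketch, assuming the 30-day values observed here:

```
// Sketch: time-based compaction triggers (30 days, in seconds).
cf_options.ttl = 30 * 24 * 60 * 60;                          // compact SSTs older than this
cf_options.periodic_compaction_seconds = 30 * 24 * 60 * 60;  // periodically rewrite even bottommost files
```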

I was able to trigger a manual range compaction with the correct settings (change_level / target_level / max_subcompactions) that actually got a lot done.
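A sketch of a CompactRange call with those settings, assuming an open `rocksdb::DB* db`; target level 6 matches the stats above, and the subcompaction count is an illustrative value:

```
// Sketch: manually compact the whole key range down to L6.
rocksdb::CompactRangeOptions options;
options.change_level = true;      // move the output down rather than leaving it in place
options.target_level = 6;         // bottommost level in this db
options.max_subcompactions = 8;   // let the manual compaction run in parallel
// nullptr begin/end means the entire key range.
rocksdb::Status s = db->CompactRange(options, nullptr, nullptr);
```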