
Steve Newman

Apr 13, 2013, 11:09:12 AM
to lev...@googlegroups.com
This will be a long post. Summary: our use case involves writing large amounts of data that are later deleted, resulting in a keyspace region with gigabytes of deleted data but no live values. It appears that (some of?) the deleted data is never compacted away. As a result, disk space is not reclaimed; also, attempting to iterate across this point in the keyspace causes the iterator to hang. I'm posting this partly to share our experience for others who might be in a similar situation, and partly to highlight the problem in case it might warrant a tweak to the compaction logic.

Details

We have built a replication system on top of LevelDB. This involves writing a serialized copy of each WriteBatch into a replication log contained within the database itself. For each WriteBatch, two cells are written:

replication_log/metadata/nnnnnnnnnnnn  (a small value containing metadata about this log entry)
replication_log/payload/nnnnnnnnnnnn  (the serialized WriteBatch; can be as large as 20MB or so)

Once the WriteBatch has been copied to all replicas, both cells are deleted. Over time, hundreds of GB have been written to the log, but in normal operation the undeleted portion of the log is very short.

Scans over the metadata portion of the log tend to hang for a long period of time, currently around 20 minutes. I recently tracked down the cause (I think). Internally, we have a thin convenience layer above LevelDB where we can say things like "give me an iterator from position X to position Y in the log". The convenience layer creates an iterator, seeks to position X, and then repeatedly calls Next() until it reaches a key greater than Y. If Y happens to be the last position in the log, then our loop won't terminate until it has reached an undeleted value after the replication_log/metadata key range. This means it will scan through all the deleted (but not scavenged) data in replication_log/payload. According to GetApproximateSizes, this "dead" region currently occupies around 25GB.
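For concreteness, the scan is roughly equivalent to this (a simplified sketch, not our actual code; 'start' and 'limit' play the roles of positions X and Y):

#include "leveldb/db.h"

// Sketch of the convenience-layer range scan. If 'limit' is the last live
// key in replication_log/metadata, the final Next() has to walk across every
// deleted-but-not-yet-compacted key that follows it before it can return.
void ScanLog(leveldb::DB* db, const leveldb::Slice& start,
             const leveldb::Slice& limit) {
  leveldb::Iterator* it = db->NewIterator(leveldb::ReadOptions());
  for (it->Seek(start); it->Valid() && it->key().compare(limit) <= 0;
       it->Next()) {
    // ... hand it->key() / it->value() to the caller ...
  }
  delete it;
}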

Once we realized what was happening, we wrote an empty "sentinel" value at replication_log/payload. When our iterator scans past the end of the replication_log/metadata/nnnnnnnnnnnn region, it now encounters the sentinel and halts; we no longer observe stuck iterators.

This leads to the question, why are the deleted payloads not eventually compacted away? I don't completely understand the compaction logic, but it seems that if there are never any writes to a particular key range, and the segments in that key range don't overlap, then those segments will never be visited by the compactor. Compaction::IsTrivialMove() will return true, and the segment will be pushed intact down to the next level. It looks like even a manual compaction wouldn't help (though I haven't tried it yet) -- if I'm reading DBImpl::CompactRange correctly, it will not touch segments at the bottom level. I see that this is a useful optimization in most cases, but it breaks down here. As a fix, perhaps a segment could be annotated with some metadata indicating how much self-masking (multiple timestamps at the same key) it contains? Then the IsTrivialMove optimization could be disabled for segments with significant self-masking.

It's possible that my picture of what's going on under the hood is not entirely accurate. The database must contain both old payloads, and the tombstones that mask them. I'm assuming that an old compaction left all this data in a single level, meaning the payloads and tombstones are interleaved in a single segment, but the payloads weren't deleted at that point. We don't hold onto snapshots for more than a few seconds, so the timing for this scenario seems tight, but it's the only way I can explain the observed behavior.

Advice if you're in a similar situation

The critical ingredient seems to be writing data at increasing keys, and then deleting that data, such that there are large ranges of deleted data with no intervening live data and no subsequent writes. A rolling log will tend to create this situation.

Symptoms are missing disk space (the database starts occupying significantly more space than the live data size would account for), and iterators running slowly or hanging. When an iterator "hangs", it will be spinning busily, with user+iowait CPU time totaling 100%. Use DB::GetApproximateSizes to verify that you're in this scenario: if it reports a value much larger than the live data size in that key range, this is your problem.
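For example (a sketch; the bounds below are just the payload range from our setup -- substitute your own):

#include <stdint.h>
#include "leveldb/db.h"

// Approximate on-disk size of a key range, e.g.
// ["replication_log/payload/", "replication_log/payload0").
uint64_t ApproximateRangeSize(leveldb::DB* db, const leveldb::Slice& start,
                              const leveldb::Slice& limit) {
  leveldb::Range r(start, limit);
  uint64_t size = 0;
  db->GetApproximateSizes(&r, 1, &size);
  return size;  // much larger than the live data in the range? this is it
}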

I suspect we can force the dead data to be compacted by writing a scattering of tombstones through the same key range, but haven't tried this yet. We solved the hung-iterator problem by avoiding any scans through the dead region, in our case by writing a sentinel value just before the dead region.

Steve

Dhruba Borthakur

Apr 13, 2013, 2:23:23 PM
to lev...@googlegroups.com
If you are using the default leveldb settings (7 levels, L0 size 10 MB), then the total capacity of the database is roughly 1 TB. So compaction might not trigger at the right times to remove deleted values, especially when newly inserted keys do not overlap existing keys.

If your data set is supposed to stay within 10 GB, you should see quicker removal of dead data with 5 levels.
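(For reference, the default size targets grow 10x per level -- a sketch of the geometry, roughly what VersionSet::MaxBytesForLevel computes, so treat the exact numbers as approximate:)

// L1 = 10 MB, L2 = 100 MB, L3 = 1 GB, L4 = 10 GB, L5 = 100 GB, L6 = 1 TB.
// The level count itself is a compile-time constant (config::kNumLevels).
double MaxBytesForLevel(int level) {
  double result = 10.0 * 1048576.0;  // 10 MB target for level 1
  while (level > 1) {
    result *= 10;
    level--;
  }
  return result;
}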

Also, a manual compaction should have removed the dead data -- have you tried it?

If you can write a sample program that reproduces this behavior, I can run it on my machine and provide fixes (if any).

Thanks
Dhruba

Sent from my iPhone

Sanjay Ghemawat

Apr 15, 2013, 7:25:01 PM
to lev...@googlegroups.com
> This will be a long post. Summary: our use case involves writing large amounts of data that are later deleted, resulting in a keyspace region with gigabytes of deleted data but no live values. It appears that (some of?) the deleted data is never compacted away. As a result, disk space is not reclaimed; also, attempting to iterate across this point in the keyspace causes the iterator to hang. I'm posting this partly to share our experience for others who might be in a similar situation, and partly to highlight the problem in case it might warrant a tweak to the compaction logic.
>
> Details
>
> We have built a replication system on top of LevelDB. This involves writing a serialized copy of each WriteBatch into a replication log contained within the database itself. For each WriteBatch, two cells are written:
>
> replication_log/metadata/nnnnnnnnnnnn  (a small value containing metadata about this log entry)
> replication_log/payload/nnnnnnnnnnnn  (the serialized WriteBatch; can be as large as 20MB or so)
>
> Once the WriteBatch has been copied to all replicas, both cells are deleted. Over time, hundreds of GB have been written to the log, but in normal operation the undeleted portion of the log is very short.
>
> Scans over the metadata portion of the log tend to hang for a long period of time, currently around 20 minutes. I recently tracked down the cause (I think). Internally, we have a thin convenience layer above LevelDB where we can say things like "give me an iterator from position X to position Y in the log". The convenience layer creates an iterator, seeks to position X, and then repeatedly calls Next() until it reaches a key greater than Y. If Y happens to be the last position in the log, then our loop won't terminate until it has reached an undeleted value after the replication_log/metadata key range. This means it will scan through all the deleted (but not scavenged) data in replication_log/payload. According to GetApproximateSizes, this "dead" region currently occupies around 25GB.
>
> Once we realized what was happening, we wrote an empty "sentinel" value at replication_log/payload. When our iterator scans past the end of the replication_log/metadata/nnnnnnnnnnnn region, it now encounters the sentinel and halts; we no longer observe stuck iterators.
>
> This leads to the question, why are the deleted payloads not eventually compacted away? I don't completely understand the compaction logic, but it seems that if there are never any writes to a particular key range, and the segments in that key range don't overlap, then those segments will never be visited by the compactor. Compaction::IsTrivialMove() will return true, and the segment will be pushed intact down to the next level. It looks like even a manual compaction wouldn't help (though I haven't tried it yet) -- if I'm reading DBImpl::CompactRange correctly, it will not touch segments at the bottom level. I see that this is a useful optimization in most cases, but it breaks down here. As a fix, perhaps a segment could be annotated with some metadata indicating how much self-masking (multiple timestamps at the same key) it contains? Then the IsTrivialMove optimization could be disabled for segments with significant self-masking.

Your analysis is right.  Though note that the compaction code will avoid writing a deleted key to an sstable if there are no sstables that overlap the key at a higher level:
      } else if (ikey.type == kTypeDeletion &&
                 ikey.sequence <= compact->smallest_snapshot &&
                 compact->compaction->IsBaseLevelForKey(ikey.user_key)) {
        // For this user key:
        // (1) there is no data in higher levels
        // (2) data in lower levels will have larger sequence numbers
        // (3) data in layers that are being compacted here and have
        //     smaller sequence numbers will be dropped in the next
        //     few iterations of this loop (by rule (A) above).
        // Therefore this deletion marker is obsolete and can be dropped.
        drop = true;
 
> It's possible that my picture of what's going on under the hood is not entirely accurate. The database must contain both old payloads, and the tombstones that mask them. I'm assuming that an old compaction left all this data in a single level, meaning the payloads and tombstones are interleaved in a single segment, but the payloads weren't deleted at that point. We don't hold onto snapshots for more than a few seconds, so the timing for this scenario seems tight, but it's the only way I can explain the observed behavior.

Suppose there is a 5 second old snapshot when a compaction starts. compact->smallest_snapshot will be set to the sequence# corresponding to when this snapshot was created.  Every log entry deletion in this 5 second window will have a larger sequence number than compact->smallest_snapshot.  So it will be emitted by the preceding code.  So every compaction will build up 5 seconds worth of log deletions in a base sstable.  Over time, these can accumulate without bound and could explain what you are seeing.  If you can reproduce this problem, could you try running "leveldbutil dump" to try and verify this hypothesis?

 
> Advice if you're in a similar situation
>
> The critical ingredient seems to be writing data at increasing keys, and then deleting that data, such that there are large ranges of deleted data with no intervening live data and no subsequent writes. A rolling log will tend to create this situation.

Yep.  So will bulk-deleting and re-creating sub-databases prefixed with a new database key.

Also, some of the downsides will also happen if a particular region of the database goes idle: overwrites and deletions in that region may never get compacted.

> Symptoms are missing disk space (the database starts occupying significantly more space than the live data size would account for), and iterators running slowly or hanging. When an iterator "hangs", it will be spinning busily, with user+iowait CPU time totaling 100%. Use DB::GetApproximateSizes to verify that you're in this scenario: if it reports a value much larger than the live data size in that key range, this is your problem.
>
> I suspect we can force the dead data to be compacted by writing a scattering of tombstones through the same key range, but haven't tried this yet. We solved the hung-iterator problem by avoiding any scans through the dead region, in our case by writing a sentinel value just before the dead region.

I would love to hear about some simple heuristics that will allow the leveldb implementation to cheaply detect that there is a lot of useless entries in a particular sstable and do the corresponding compaction.  I have thought off and on about such things (like maintaining samples), but haven't been able to come up with a simple enough scheme that I was happy with.

 

Steve Newman

Apr 15, 2013, 11:37:21 PM
to lev...@googlegroups.com
Thanks for the responses. I'll follow up with dump results and (hopefully) a simple reproduction when I can. Meanwhile, one quick response:

On Mon, Apr 15, 2013 at 4:25 PM, Sanjay Ghemawat <san...@google.com> wrote:
> I would love to hear about some simple heuristics that will allow the leveldb implementation to cheaply detect that there is a lot of useless entries in a particular sstable and do the corresponding compaction.  I have thought off and on about such things (like maintaining samples), but haven't been able to come up with a simple enough scheme that I was happy with.

It seems to me that it would be sufficient to detect intra-table useless entries -- that is, entries which are masked by another entry in the same segment/table. If there are values in one level and deletions in another level, eventually the compactor will bring them together into the same segment. (Unless, I suppose, we have the bad luck to have a segment boundary fall precisely between a value and its tombstone -- that's a weakness in the scheme I'm about to propose.)

My thought was that TableBuilder could notice when a new key has the same user_key as the previous key in the table. The segment metadata would record some summary of these intra-table collisions -- e.g. the number of collisions, or the total size of the masked values. That would enable a variety of heuristics. For instance, the IsTrivialMove optimization could be disabled if both of the following conditions are met:

1. The segment contains at least one internal collision.
2. The newest timestamp in the segment is older than the oldest open snapshot.

In this case, I believe we know that processing the segment will result in at least one entry being discarded. That's an extreme heuristic; for a milder approach, modify clause 1 to require that the total size of self-masked values be at least X% of the segment size. Perhaps manual compactions could allow specifying the threshold.

A few other tweaks might also be needed, I'm not sure I have the full picture. For instance, the compactor might need to process segments at the lowest level if they contain self-masked values.
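To make that concrete, here's a rough sketch of the bookkeeping I have in mind (the names are made up and this isn't actual leveldb code; TableBuilder::Add would call something like NoteEntry as a side effect):

#include <stddef.h>
#include <stdint.h>
#include <string>
#include "leveldb/slice.h"

// Within one sstable, entries for the same user_key are adjacent and sorted
// newest-first, so any entry whose user_key matches the previous entry's is
// masked by it -- dead weight inside this table.
struct SelfMaskStats {
  uint64_t collisions;
  uint64_t masked_bytes;
  SelfMaskStats() : collisions(0), masked_bytes(0) {}
};

void NoteEntry(const leveldb::Slice& user_key, size_t entry_bytes,
               std::string* prev_user_key, SelfMaskStats* stats) {
  if (!prev_user_key->empty() &&
      leveldb::Slice(*prev_user_key).compare(user_key) == 0) {
    stats->collisions++;
    stats->masked_bytes += entry_bytes;
  }
  prev_user_key->assign(user_key.data(), user_key.size());
}

The accumulated totals could then be recorded in the table's metadata (or its manifest entry), where the compaction picker could consult them before taking the IsTrivialMove shortcut.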

Steve

Dhruba Borthakur

Apr 16, 2013, 12:24:36 AM
to lev...@googlegroups.com
Hi Steve,

Do you have snapshots on the db?

Thanks
Dhruba

Sent from my iPhone

Hiram Chirino

Apr 16, 2013, 6:59:25 PM
to lev...@googlegroups.com
I can confirm I've also seen this before.  In case anyone wants to reproduce it, here's a link to a little test which makes this happen: https://gist.github.com/chirino/5400336

It's written in Java and compiles against leveldbjni [1], but you should be able to create a similar test in your language of choice and get the same results.  On my system the output looks like: https://gist.github.com/chirino/5400375
Notice that at the end of the test there are no live keys in the database, yet it's using about 1.8 GB of disk space.

--

Hiram Chirino

Engineering | Red Hat, Inc.

hchi...@redhat.com | fusesource.com | redhat.com

skype: hiramchirino | twitter: @hiramchirino

blog: Hiram Chirino's Bit Mojo

Hiram Chirino

Apr 17, 2013, 5:09:09 PM
to lev...@googlegroups.com
I've converted that Java test case so that it can be run as part of the db_test.  You can find the patch at:

Steve Newman

Apr 22, 2013, 8:24:05 PM
to lev...@googlegroups.com
On Mon, Apr 15, 2013 at 4:25 PM, Sanjay Ghemawat <san...@google.com> wrote:
> > It's possible that my picture of what's going on under the hood is not entirely accurate. The database must contain both old payloads, and the tombstones that mask them. I'm assuming that an old compaction left all this data in a single level, meaning the payloads and tombstones are interleaved in a single segment, but the payloads weren't deleted at that point. We don't hold onto snapshots for more than a few seconds, so the timing for this scenario seems tight, but it's the only way I can explain the observed behavior.
>
> Suppose there is a 5 second old snapshot when a compaction starts. compact->smallest_snapshot will be set to the sequence# corresponding to when this snapshot was created.  Every log entry deletion in this 5 second window will have a larger sequence number than compact->smallest_snapshot.  So it will be emitted by the preceding code.  So every compaction will build up 5 seconds worth of log deletions in a base sstable.  Over time, these can accumulate without bound and could explain what you are seeing.  If you can reproduce this problem, could you try running "leveldbutil dump" to try and verify this hypothesis?

I searched the manifest for .sst files in the appropriate key range and dumped a couple. Each is somewhere over 2MB (as expected) and happens to have a filesystem mod date of Dec. 19th. (Note that this is a production database and is in continuous use, so these files have been sitting unmolested through much database activity.) One file contains nothing but tombstones, here is a snippet:

'_repl_staging_payload_000000309832' @ 120910779 : del => ''
'_repl_staging_payload_000000309833' @ 120910780 : del => ''
'_repl_staging_payload_000000309834' @ 120910781 : del => ''
'_repl_staging_payload_000000309835' @ 120910782 : del => ''

The other looks like pairs of values and tombstones; here's a typical pair:

'_repl_staging_payload_000002131672' @ 232111207 : del => ''
'_repl_staging_payload_000002131672' @ 232111046 : val => '...much data...'

Here are the manifest lines that mention each file -- the tombstones file:

  DeleteFile: 4 155031
  AddFile: 5 155031 2120464 '_repl_staging_payload_000000182696' @ 117474353 : 0 .. '_repl_staging_payload_000000483425' @ 131064837 : 0

and the value/tombstone pairs file:

  DeleteFile: 4 154710
  AddFile: 5 154710 2679066 '_repl_staging_payload_000002131672' @ 232111207 : 0 .. '_repl_staging_payload_000002131942' @ 232129075 : 1

I'm not sure how much light this sheds. The first file contains around 300,000 tombstones, seemingly all with adjacent keys. We average only a few replication log records per second; while there might be spikes, I can't imagine that the tombstones in this file were all generated in less than an hour. We explicitly monitor the age of our oldest snapshot (to detect leaks), and while we've had one or two incidents, there has never or almost never been a snapshot that old.

The second file contains 271 value / tombstone pairs. Those could have been generated in a minute or so, but still this is unlikely to fall within the lifetime of a single snapshot for us. If you like, I can look at more files to see whether the pattern holds.

Hiram, thanks for posting the test case. It looks like the same issue I'm describing. Did you double-check whether the problem persists in your test case even after allowing more time for compactions?

Incidentally, I discovered that my workaround of writing a sentinel value before the deleted range was not actually helpful. I expect the problem is that the sentinel exists in only one level, so MergingIterator still winds up blocking in other levels... oops on my part. Next, I'm going to try adding code to occasionally scatter deletions throughout the already-empty keyspace.
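Roughly along these lines (a sketch; the stride, bounds, and exact key format are placeholders I'd still need to tune):

#include <stdint.h>
#include <stdio.h>
#include "leveldb/db.h"
#include "leveldb/write_batch.h"

// Re-delete a sample of keys across the dead region so that fresh tombstones
// land in new sstables overlapping the old ones, giving the compactor a
// reason to revisit them.
void ScatterTombstones(leveldb::DB* db, uint64_t first, uint64_t last,
                       uint64_t stride) {
  leveldb::WriteBatch batch;
  for (uint64_t n = first; n <= last; n += stride) {
    char key[64];
    snprintf(key, sizeof(key), "replication_log/payload/%012llu",
             (unsigned long long)n);
    batch.Delete(key);
  }
  leveldb::Status s = db->Write(leveldb::WriteOptions(), &batch);
  // check s in real code
}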

Steve

Hiram Chirino

Apr 23, 2013, 12:23:55 PM
to lev...@googlegroups.com
Hi Steve,


> Hiram, thanks for posting the test case. It looks like the same issue I'm describing. Did you double-check whether the problem persists in your test case even after allowing more time for compactions?

Yeah, the longer you run it the more disk space is used up. Try it out: just modify the 'totalKeys' variable in the test to make it take longer:

diff --git a/db/db_test.cc b/db/db_test.cc
index b71068f..5042b7d 100644
--- a/db/db_test.cc
+++ b/db/db_test.cc
@@ -1032,7 +1032,7 @@ static bool Between(uint64_t val, uint64_t low, uint64_t high) {
 
 TEST(DBTest, CompactionBug_Issue77) {
   int liveKeys = (1024*1024*40)/(6+1024+12);
-  int totalKeys = liveKeys * 50;
+  int totalKeys = liveKeys * 50*2;
 
   Random rnd(301);
   for( int i=0; i < totalKeys; i++) {


Sanjay, any chance you could let us know if the test case seems valid to you?

Steve Newman

Apr 23, 2013, 12:35:48 PM
to lev...@googlegroups.com
On Tue, Apr 23, 2013 at 9:23 AM, Hiram Chirino <hi...@hiramchirino.com> wrote:
> > Hiram, thanks for posting the test case. It looks like the same issue I'm describing. Did you double-check whether the problem persists in your test case even after allowing more time for compactions?
>
> Yeah, the longer you run it the more disk space is used up. Try it out: just modify the 'totalKeys' variable in the test to make it take longer

What happens if you stop writing new keys, but leave the program running? (Meaning, no further activity except for the background compaction thread.) It would be a useful check to leave it running for an extra hour, with no further writes, and periodically log disk usage. Given the current compaction heuristics, my guess is that things will settle down very quickly, but it would be good to confirm.

Steve

Hiram Chirino

Apr 23, 2013, 1:29:20 PM
to lev...@googlegroups.com
Even if I add the following after all keys have been deleted:

for( int i=0; i<1000; i++ ) {
  env_->SleepForMicroseconds(1000000);
  printf("Using: %.2f MB\n", (Size("", Key(totalKeys))/(1024.0*1024)));
}

The 'Using: 938.63 MB' just keeps repeating:
So no, the compaction thread is not collecting the dead entries.



Sanjay Ghemawat

Apr 23, 2013, 5:54:41 PM
to lev...@googlegroups.com
The test case looks reasonable to me. When I ran it, all of the
remaining space usage (about 62MB) was covered by deletions and manual
compaction does get rid of it.
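(For reference, by "manual compaction" I mean something like this -- a minimal sketch, with Steve's payload bounds standing in for whatever range holds the dead data:)

#include <stddef.h>
#include "leveldb/db.h"

// Compact everything; passing null Slices means "the whole key range".
void CompactAll(leveldb::DB* db) {
  db->CompactRange(NULL, NULL);
}

// Or just the dead region.
void CompactDeadRegion(leveldb::DB* db) {
  leveldb::Slice begin("replication_log/payload/");
  leveldb::Slice end("replication_log/payload0");
  db->CompactRange(&begin, &end);
}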

> Even if I add the following after all keys have been deleted:
>
> for( int i=0; i<1000; i++ ) {
> env_->SleepForMicroseconds(1000000);
> printf("Using: %.2f MB\n", (Size("", Key(totalKeys))/(1024.0*1024)));
> }
>
> The 'Using: 938.63 MB' just keeps repeating:
> So no, the compaction thread is not collecting the dead entries.

I would expect the sleep to do nothing since compaction is never
triggered based on time passing in an otherwise idle database.
Instead, activity is needed to trigger compactions:
(1) Writes that fill up memtable
(2) Overflowing the number of bytes that are allowed in a level
(3) Get() calls that need to touch more than one sstable

I am open to suggestions for simple heuristics that can help fix
things up. Ideally, they should also fix up the slow seek behaviour
documented in issue 77. What I am considering right now is a
time-triggered compaction that focuses on sstables that overlap with
sstables on the next level. It might be nice if this heuristic can
get more fine-grained information than overlapping ranges; e.g., if it
can tell that there are many common keys in the different levels.

Hiram Chirino

Jun 17, 2013, 5:13:08 PM
to lev...@googlegroups.com
Any updates on this issue?

Sanjay Ghemawat

Jun 17, 2013, 5:23:12 PM
to lev...@googlegroups.com
On Mon, Jun 17, 2013 at 2:13 PM, Hiram Chirino <chi...@gmail.com> wrote:
> Any updates on this issue?

Sorry, nothing to report yet, though Jeff and I have been discussing possible solutions off and on.
  

Steve Newman

Jun 17, 2013, 6:18:18 PM
to lev...@googlegroups.com
For what it's worth, I found a problem with my workaround, and a different workaround that works better for our use case.

As I described at the top of this thread, we'd tried inserting a "sentinel" value just before the keyspace region where our problematic deletions occurred. This turns out to help only some of the time. My current hypothesis is that the sentinel helps only when it happens to occupy the same level as the bulk of the deleted data. Otherwise our application's iterator gets stuck waiting for the nested iterator on that level (I'm guessing).

A more reliable workaround has been to replace the iterator with a series of get() calls. This is possible for us because we know exactly what keys to ask for. It's less efficient than scanning, but efficient enough for us. We still use iterators in other places where we know we won't be crashing into a large region of deleted data.
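In sketch form (simplified; MetadataKey() here is a stand-in for our real key-formatting helper):

#include <stdint.h>
#include <stdio.h>
#include <string>
#include "leveldb/db.h"

static std::string MetadataKey(uint64_t n) {
  char buf[64];
  snprintf(buf, sizeof(buf), "replication_log/metadata/%012llu",
           (unsigned long long)n);
  return std::string(buf);
}

// Read log positions [first, last] with point lookups instead of an
// iterator, so we never walk across the dead payload region.
void ReadLogRange(leveldb::DB* db, uint64_t first, uint64_t last) {
  leveldb::ReadOptions opts;
  for (uint64_t n = first; n <= last; ++n) {
    std::string value;
    leveldb::Status s = db->Get(opts, MetadataKey(n), &value);
    if (s.IsNotFound()) continue;  // already replicated and deleted
    if (!s.ok()) break;            // real error; handle appropriately
    // ... hand 'value' to the caller ...
  }
}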

Steve

Viktor Szathmáry

Aug 23, 2013, 11:38:55 AM
to lev...@googlegroups.com
Hi,

On Monday, June 17, 2013 11:23:12 PM UTC+2, sanjay wrote:

> Sorry, nothing to report yet, though Jeff and I have been discussing possible solutions off and on.

There's mention of a fix for this issue in the 1.13 release (https://code.google.com/p/leveldb/downloads/detail?name=leveldb-1.13.0.tar.gz), but issue #77 is not yet closed and there are no details on how this is supposed to work.

Could you provide an overview of the solution? At the moment we're doing explicit compactions after deleting a large key-range, and I would like to understand if this release would allow us to stop doing so :)

Thanks,
  Viktor
 

Sanjay Ghemawat

unread,
Aug 23, 2013, 11:49:26 AM8/23/13
to lev...@googlegroups.com
1.13 makes reads faster in the presence of such deleted ranges.  In particular, as data is read, we maintain stats and use them to automatically trigger compactions.  So if you have reads that are skipping over the large deleted ranges, the system will automatically compact those away.  So issue 77 is actually fixed (I will update the bug tracker).
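(Conceptually it works along these lines -- a simplified sketch rather than the actual source:)

#include <stdint.h>

// Each table file carries a seek "budget" roughly proportional to its size.
// Reads -- including periodically sampled iterator reads -- that have to
// consult a file without finding their key in it spend from the budget, and
// a file whose budget is exhausted is nominated for compaction the next time
// the background thread looks for work.
struct FileSeekStats {
  uint64_t file_size;
  int allowed_seeks;
};

void InitSeekBudget(FileSeekStats* f) {
  // One "free" seek per ~16KB of file, with a floor, the idea being that
  // compacting the file costs about as much as that many wasted seeks.
  f->allowed_seeks = static_cast<int>(f->file_size / 16384);
  if (f->allowed_seeks < 100) f->allowed_seeks = 100;
}

// Returns true once the file has absorbed enough useless seeks that
// compacting it is the cheaper option.
bool ChargeSeek(FileSeekStats* f) {
  return --f->allowed_seeks <= 0;
}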

The remaining shortcoming is that if you delete a range but never touch it again (i.e., no reads that skip over it), that data won't get compacted away.  I.e., some people might still see excessive disk space usage.  If your issue is mostly read costs, not disk space, you can get rid of your explicit compactions.  If you do so, let us know how it worked out.

Viktor Szathmáry

unread,
Aug 23, 2013, 12:02:36 PM8/23/13
to lev...@googlegroups.com
Hi Sanjay,

On Friday, August 23, 2013 5:49:26 PM UTC+2, sanjay wrote:

> 1.13 makes reads faster in the presence of such deleted ranges.  In particular, as data is read, we maintain stats and use them to automatically trigger compactions.  So if you have reads that are skipping over the large deleted ranges, the system will automatically compact those away.  So issue 77 is actually fixed (I will update the bug tracker).
>
> The remaining shortcoming is that if you delete a range but never touch it again (i.e., no reads that skip over it), that data won't get compacted away.  I.e., some people might still see excessive disk space usage.  If your issue is mostly read costs, not disk space, you can get rid of your explicit compactions.  If you do so, let us know how it worked out.

The problem for us was the extreme cost increase of iterating reads skipping over thousands (or potentially millions) of dead records after deleting large ranges. As we need relatively consistent low latency on the reader side, I don't believe this would work for us: as I understand it, the first several reads would incur the orders-of-magnitude higher cost, and only then would compaction kick in lazily based on the stats. We can, however, take on the cost of the compaction on the deletion use case, so I guess we will stick with explicit compactions for now...

Thanks,
  Viktor

Viktor Szathmáry

unread,
Aug 23, 2013, 12:07:23 PM8/23/13
to lev...@googlegroups.com

By the way, I would think that exposing a range delete operation could allow LevelDB to implement better heuristics about compaction after deletes (rather than trying to figure out later that millions of seemingly separate deletions actually freed up a whole range)...