This will be a long post. Summary: our use case involves writing large amounts of data that are later deleted, resulting in a keyspace region with gigabytes of deleted data but no live values. It appears that (some of?) the deleted data is never compacted away. As a result, disk space is not reclaimed; also, attempting to iterate across this point in the keyspace causes the iterator to hang. I'm posting this partly to share our experience for others who might be in a similar situation, and partly to highlight the problem in case it might warrant a tweak to the compaction logic.

Details

We have built a replication system on top of LevelDB. This involves writing a serialized copy of each WriteBatch into a replication log contained within the database itself. For each WriteBatch, two cells are written:

- replication_log/metadata/nnnnnnnnnnnn (a small value containing metadata about this log entry)
- replication_log/payload/nnnnnnnnnnnn (the serialized WriteBatch; can be as large as 20MB or so)

Once the WriteBatch has been copied to all replicas, both cells are deleted. Over time, hundreds of GB have been written to the log, but in normal operation the undeleted portion of the log is very short.

Scans over the metadata portion of the log tend to hang for a long period of time, currently around 20 minutes. I recently tracked down the cause (I think). Internally, we have a thin convenience layer above LevelDB where we can say things like "give me an iterator from position X to position Y in the log". The convenience layer creates an iterator, seeks to position X, and then repeatedly calls Next() until it reaches a key greater than Y. If Y happens to be the last position in the log, then our loop won't terminate until it has reached an undeleted value after the replication_log/metadata key range. This means it will scan through all the deleted (but not scavenged) data in replication_log/payload. According to GetApproximateSizes, this "dead" region currently occupies around 25GB.

Once we realized what was happening, we wrote an empty "sentinel" value at replication_log/payload. When our iterator scans past the end of the replication_log/metadata/nnnnnnnnnnnn region, it now encounters the sentinel and halts; we no longer observe stuck iterators.
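For concreteness, here is roughly what our convenience-layer scan and the sentinel write look like. This is a sketch rather than our actual code: the key names are illustrative and error handling is omitted.

#include <cassert>
#include <string>
#include "leveldb/db.h"

// Sketch of the convenience-layer scan described above: seek to X, then
// call Next() until the key exceeds Y. If Y is the last live log entry,
// the scan grinds through the huge deleted payload region that follows.
void ScanLog(leveldb::DB* db, const std::string& x, const std::string& y) {
  leveldb::Iterator* it = db->NewIterator(leveldb::ReadOptions());
  for (it->Seek(x); it->Valid() && it->key().ToString() <= y; it->Next()) {
    // ... process it->key() / it->value() ...
  }
  assert(it->status().ok());
  delete it;
}

// The workaround: an empty "sentinel" value placed just after the metadata
// range, so scans stop there instead of walking through deleted payloads.
void WriteSentinel(leveldb::DB* db) {
  leveldb::Status s =
      db->Put(leveldb::WriteOptions(), "replication_log/payload", "");
  assert(s.ok());
}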
This leads to the question: why are the deleted payloads not eventually compacted away? I don't completely understand the compaction logic, but it seems that if there are never any writes to a particular key range, and the segments in that key range don't overlap, then those segments will never be visited by the compactor. Compaction::IsTrivialMove() will return true, and the segment will be pushed intact down to the next level. It looks like even a manual compaction wouldn't help (though I haven't tried it yet) -- if I'm reading DBImpl::CompactRange correctly, it will not touch segments at the bottom level. I see that this is a useful optimization in most cases, but it breaks down here. As a fix, perhaps a segment could be annotated with some metadata indicating how much self-masking (multiple timestamps at the same key) it contains? Then the IsTrivialMove optimization could be disabled for segments with significant self-masking.
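For reference, the manual compaction I'm talking about would look something like this against the public API (key names illustrative). Again, I haven't tried it, and my reading is that segments already at the bottom level would be left alone.

#include "leveldb/db.h"

// Hypothetical manual compaction over the dead payload range. Per the
// caveat above, if DBImpl::CompactRange skips segments that are already
// at the bottom level, this may not reclaim the space.
void CompactDeadRange(leveldb::DB* db) {
  leveldb::Slice begin("replication_log/payload/");
  leveldb::Slice end("replication_log/payload0");  // just past the range
  db->CompactRange(&begin, &end);
}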
It's possible that my picture of what's going on under the hood is not entirely accurate. The database must contain both the old payloads and the tombstones that mask them. I'm assuming that an old compaction left all this data in a single level, meaning the payloads and tombstones are interleaved in a single segment, but the payloads weren't deleted at that point. We don't hold onto snapshots for more than a few seconds, so the timing for this scenario seems tight, but it's the only way I can explain the observed behavior.
Advice if you're in a similar situation

The critical ingredient seems to be writing data at increasing keys, and then deleting that data, such that there are large ranges of deleted data with no intervening live data and no subsequent writes. A rolling log will tend to create this situation.
Symptoms are missing disk space (the database starts occupying significantly more space than the live data size would account for), and iterators running slowly or hanging. When an iterator "hangs", it will be spinning busily, with user+iowait CPU time totaling 100%. Use DB::GetApproximateSizes to verify that you're in this scenario: if it reports a value much larger than the live data size in that key range, this is your problem.

I suspect we can force the dead data to be compacted by writing a scattering of tombstones through the same key range, but haven't tried this yet. We solved the hung-iterator problem by avoiding any scans through the dead region, in our case by writing a sentinel value just before the dead region.
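For what it's worth, the check we did with GetApproximateSizes amounts to something like the following (key names illustrative):

#include <cstdint>
#include <cstdio>
#include "leveldb/db.h"

// If the on-disk size of the "dead" range is far larger than the live data
// it should contain, deleted entries are not being compacted away.
void MeasureDeadRange(leveldb::DB* db) {
  leveldb::Range r("replication_log/payload/", "replication_log/payload0");
  uint64_t size = 0;
  db->GetApproximateSizes(&r, 1, &size);
  std::printf("approximate on-disk size of range: %llu bytes\n",
              (unsigned long long)size);
}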
I would love to hear about some simple heuristics that would allow the leveldb implementation to cheaply detect that there are a lot of useless entries in a particular sstable and do the corresponding compaction. I have thought off and on about such things (like maintaining samples), but haven't been able to come up with a simple enough scheme that I was happy with.
--
Hiram Chirino
Engineering | Red Hat, Inc.
hchi...@redhat.com | fusesource.com | redhat.com
skype: hiramchirino | twitter: @hiramchirino
blog: Hiram Chirino's Bit Mojo
> It's possible that my picture of what's going on under the hood is not entirely accurate. The database must contain both old payloads, and the tombstones that mask them. I'm assuming that an old compaction left all this data in a single level, meaning the payloads and tombstones are interleaved in a single segment, but the payloads weren't deleted at that point. We don't hold onto snapshots for more than a few seconds, so the timing for this scenario seems tight, but it's the only way I can explain the observed behavior.
Suppose there is a 5 second old snapshot when a compaction starts. compact->smallest_sequence will be set to the sequence number corresponding to when this snapshot was created. Every log entry deletion in this 5 second window will have a larger sequence number than compact->smallest_sequence, so it will be emitted (not dropped) by the compaction. So every compaction will build up 5 seconds' worth of log deletions in a base sstable. Over time, these can accumulate without bound, which could explain what you are seeing. If you can reproduce this problem, could you try running "leveldbutil dump" to verify this hypothesis?
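To make the rule concrete, here is a simplified standalone model of the drop decision I'm describing. This is not the actual leveldb source, just the shape of the logic:

#include <cstdint>

// Simplified model of the compaction drop rule. An entry can be dropped if
//  (a) it is shadowed by a newer entry for the same key whose sequence
//      number is at or below the oldest live snapshot, or
//  (b) it is a deletion tombstone at or below the oldest live snapshot and
//      there is no older data for the key in lower levels.
// A tombstone written inside the snapshot window fails both tests, so it is
// emitted into the output sstable and lingers there.
bool CanDrop(uint64_t entry_sequence, bool is_deletion,
             uint64_t last_sequence_for_key,  // newer entry for the same key
             uint64_t smallest_snapshot, bool is_base_level_for_key) {
  if (last_sequence_for_key <= smallest_snapshot) {
    return true;  // (a) hidden behind a newer entry older than all snapshots
  }
  if (is_deletion && entry_sequence <= smallest_snapshot &&
      is_base_level_for_key) {
    return true;  // (b) tombstone with nothing left underneath to mask
  }
  return false;
}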
> Hiram, thanks for posting the test case. It looks like the same issue I'm describing. Did you double-check whether the problem persists in your test case even after allowing more time for compactions?

Yeah, the longer you run it the more disk space is used up. Try it out; just modify the 'totalKeys' variable in the test to make it take longer.
Sorry, nothing to report yet, though Jeff and I have been discussing possible solutions off and on.
1.13 makes reads faster in the presence of such deleted ranges. In particular, as data is read, we maintain stats and use them to automatically trigger compactions. So if you have reads that are skipping over the large deleted ranges, the system will automatically compact those away. So issue 77 is actually fixed (I will update the bug tracker).

The remaining shortcoming is that if you delete a range but never touch it again (i.e., no reads that skip over it), that data won't get compacted away. I.e., some people might still see excessive disk space usage. If your issue is mostly read costs, not disk space, you can get rid of your explicit compactions. If you do so, let us know how it worked out.
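To be concrete about what "reads that skip over the range" means: with 1.13, a scan like the following (key names borrowed from earlier in the thread; just a sketch) should be enough to feed the read stats and let the deleted range get compacted away.

#include "leveldb/db.h"

// Read across the dead range once so the internal read stats can schedule
// a compaction of the files holding the deleted data; no explicit
// CompactRange call is needed if read cost is the main concern.
void TouchDeadRange(leveldb::DB* db) {
  leveldb::Iterator* it = db->NewIterator(leveldb::ReadOptions());
  for (it->Seek("replication_log/payload/");
       it->Valid() && it->key().ToString() < "replication_log/payload0";
       it->Next()) {
    // The scan itself is what matters; values need not be examined.
  }
  delete it;
}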