This will be a long post. Summary: our use case involves writing large amounts of data that are later deleted, resulting in a keyspace region with gigabytes of deleted data but no live values. It appears that (some of?) the deleted data is never compacted away. As a result, disk space is not reclaimed; also, attempting to iterate across this point in the keyspace causes the iterator to hang. I'm posting this partly to share our experience for others who might be in a similar situation, and partly to highlight the problem in case it might warrant a tweak to the compaction logic.
We have built a replication system on top of LevelDB. This involves writing a serialized copy of each WriteBatch into a replication log contained within the database itself. For each WriteBatch, two cells are written:
replication_log/metadata/nnnnnnnnnnnn (a small value containing metadata about this log entry)
replication_log/payload/nnnnnnnnnnnn (the serialized WriteBatch itself)
Once the WriteBatch has been copied to all replicas, both cells are deleted. Over time, hundreds of GB have been written to the log, but in normal operation the undeleted portion of the log is very short.
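For concreteness, here is a minimal C++ sketch of that write/delete lifecycle. The key layout is the one described above; the zero-padded sequence-number encoding, the helper names, and passing the already-serialized batch in as a string are illustrative assumptions, not our actual code.

```
#include <cstdint>
#include <cstdio>
#include <string>

#include "leveldb/db.h"
#include "leveldb/write_batch.h"

// Build "replication_log/<kind>/nnnnnnnnnnnn" for a given sequence number.
static std::string LogKey(const char* kind, uint64_t seq) {
  char buf[64];
  std::snprintf(buf, sizeof(buf), "replication_log/%s/%012llu", kind,
                (unsigned long long)seq);
  return std::string(buf);
}

// Append one log entry: a small metadata cell plus the serialized WriteBatch.
leveldb::Status AppendLogEntry(leveldb::DB* db, uint64_t seq,
                               const std::string& metadata,
                               const std::string& serialized_batch) {
  leveldb::WriteBatch batch;
  batch.Put(LogKey("metadata", seq), metadata);
  batch.Put(LogKey("payload", seq), serialized_batch);
  return db->Write(leveldb::WriteOptions(), &batch);
}

// Once every replica has entry `seq`, both cells are deleted.
leveldb::Status TrimLogEntry(leveldb::DB* db, uint64_t seq) {
  leveldb::WriteBatch batch;
  batch.Delete(LogKey("metadata", seq));
  batch.Delete(LogKey("payload", seq));
  return db->Write(leveldb::WriteOptions(), &batch);
}
```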
Scans over the metadata portion of the log tend to hang for a long period of time, currently around 20 minutes. I recently tracked down the cause (I think). Internally, we have a thin convenience layer above LevelDB where we can say things like "give me an iterator from position X to position Y in the log". The convenience layer creates an iterator, seeks to position X, and then repeatedly calls Next() until it reaches a key greater than Y. If Y happens to be the last position in the log, then our loop won't terminate until it has reached an undeleted value after the replication_log/metadata key range. This means it will scan through all the deleted (but not scavenged) data in replication_log/payload. According to GetApproximateSizes, this "dead" region currently occupies around 25GB.
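To make the failure mode concrete, this is roughly what the convenience layer's loop boils down to (the names here are illustrative, not our actual API). If end_key is the last live metadata key, the final Next() has to step over every deleted-but-not-compacted payload entry before it reaches a live key, and that is where the 20 minutes go.

```
#include <memory>
#include <string>

#include "leveldb/db.h"

// Visit every live entry with begin_key <= key <= end_key.
void ScanRange(leveldb::DB* db, const std::string& begin_key,
               const std::string& end_key,
               void (*visit)(const leveldb::Slice& key,
                             const leveldb::Slice& value)) {
  std::unique_ptr<leveldb::Iterator> it(
      db->NewIterator(leveldb::ReadOptions()));
  for (it->Seek(begin_key); it->Valid(); it->Next()) {
    // Only a *live* key can terminate the loop; tombstoned keys are skipped
    // internally by Next(), which is what takes so long in the dead region.
    if (it->key().compare(end_key) > 0) break;
    visit(it->key(), it->value());
  }
}
```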
Once we realized what was happening, we wrote an empty "sentinel" value at replication_log/payload. When our iterator scans past the end of the replication_log/metadata/nnnnnnnnnnnn region, it now encounters the sentinel and halts; we no longer observe stuck iterators.
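The workaround itself is a one-line write. The sentinel key works because "replication_log/payload" sorts after every metadata key but before every "replication_log/payload/nnnnnnnnnnnn" key:

```
// Empty sentinel value just past the metadata range. A scan of the metadata
// keys now hits this live key and stops instead of wading into the dead
// payload region.
leveldb::Status s = db->Put(leveldb::WriteOptions(),
                            "replication_log/payload", leveldb::Slice());
```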
This leads to the question: why are the deleted payloads never compacted away? I don't completely understand the compaction logic, but it appears that if there are never any writes to a particular key range, and a segment in that range doesn't overlap anything at the next level, then that segment is never actually rewritten by the compactor: Compaction::IsTrivialMove() returns true, and the segment is pushed intact down to the next level. It looks like even a manual compaction wouldn't help (though I haven't tried it yet) -- if I'm reading DBImpl::CompactRange correctly, it won't touch segments that are already at the bottom level. I can see that this is a useful optimization in most cases, but it breaks down here. As a fix, perhaps a segment could be annotated with metadata indicating how much self-masking it contains (multiple entries for the same key, e.g. a value and a later tombstone)? The IsTrivialMove optimization could then be disabled for segments with significant self-masking.
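For completeness, the manual compaction we considered (but haven't run) would look something like the following; the key bounds are just one way to cover the payload range, and per the reading of DBImpl::CompactRange above, it may still leave segments at the bottom level untouched.

```
// Ask LevelDB to compact just the dead payload range. The end key uses '0',
// which sorts immediately after '/', so it bounds every
// "replication_log/payload/..." key.
leveldb::Slice begin("replication_log/payload/");
leveldb::Slice end("replication_log/payload0");
db->CompactRange(&begin, &end);
```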
It's possible that my picture of what's going on under the hood is not entirely accurate. The database must contain both the old payloads and the tombstones that mask them. I'm assuming that an old compaction left all of this data in a single level, meaning the payloads and tombstones are interleaved within a single segment, but that the payloads couldn't be dropped at that point (presumably because a live snapshot still covered them). We don't hold onto snapshots for more than a few seconds, so the timing for this scenario seems tight, but it's the only way I can explain the observed behavior.
Advice if you're in a similar situation
The critical ingredient seems to be writing data at increasing keys, and then deleting that data, such that there are large ranges of deleted data with no intervening live data and no subsequent writes. A rolling log will tend to create this situation.
The symptoms are missing disk space (the database occupies significantly more space than the live data size would account for) and iterators that run slowly or hang. When an iterator "hangs", it is actually spinning busily, with user+iowait CPU time totaling 100%. Use DB::GetApproximateSizes to verify that you're in this scenario: if it reports a value much larger than the live data size in that key range, this is your problem.
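A quick way to check (this assumes the same key layout as above; substitute the bounds of your own suspect range):

```
// How many on-disk bytes does LevelDB attribute to the payload range? A value
// far larger than the live data you expect there points at un-reclaimed
// deleted data.
leveldb::Range ranges[1];
ranges[0] = leveldb::Range("replication_log/payload/",
                           "replication_log/payload0");
uint64_t sizes[1];
db->GetApproximateSizes(ranges, 1, sizes);
std::fprintf(stderr, "approximate size of payload range: %llu bytes\n",
             (unsigned long long)sizes[0]);
```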
I suspect we can force the dead data to be compacted by writing a scattering of tombstones through the same key range, but haven't tried this yet. We solved the hung-iterator problem by avoiding any scans through the dead region, in our case by writing a sentinel value just before the dead region.
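If we do try the tombstone-scattering approach, it would probably look something like the sketch below. The stride (every 1000th sequence number) and the key format are guesses; the point is just to land fresh writes in the dead range so its segments overlap newer data and have to be rewritten rather than trivially moved.

```
#include <cstdint>
#include <cstdio>
#include <string>

#include "leveldb/db.h"
#include "leveldb/write_batch.h"

// Write redundant tombstones spaced through the dead payload range, hoping to
// push the segments there back through a real (non-trivial-move) compaction.
leveldb::Status ScatterTombstones(leveldb::DB* db, uint64_t first_seq,
                                  uint64_t last_seq) {
  leveldb::WriteBatch batch;
  for (uint64_t seq = first_seq; seq <= last_seq; seq += 1000) {
    char key[64];
    std::snprintf(key, sizeof(key), "replication_log/payload/%012llu",
                  (unsigned long long)seq);
    batch.Delete(key);
  }
  return db->Write(leveldb::WriteOptions(), &batch);
}
```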