Hi,
I'd like to implement proper deletion for the diskpacked blobserver,
so that it can be used for more volatile use cases (for example, as the "loose" storage of blobpacked, not just the "packed" one).
Since diskpacked stores a bunch of blobs contiguously in one big file (a pack),
deletion is not straightforward. Currently we just overwrite the header (the hash)
with "0000" and the data with zeroes, and skip such blobs on reindex.
But this does not free up any space.
My idea is to re-append the remaining live blobs to the storage when some threshold of deleted blobs is reached in that particular .pack file, and then drop the old pack.
Questions:
1. When should we do such garbage collection?
In RemoveBlobs? Or only on Reindex?
2. What should be the threshold?
50% of pack file size seems acceptable, with a minimum of some tens of MiBs.
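In code, the threshold check from question 2 would be tiny. The 32 MiB floor is just my reading of "some tens of MiBs", not a settled constant:

```go
package main

// minDeletedBytes is the proposed floor below which compaction is not
// worth the rewrite; 32 MiB is an assumed value for "some tens of MiBs".
const minDeletedBytes = 32 << 20

// shouldGC reports whether a pack has accumulated enough dead bytes to
// be worth compacting: at least the floor, and at least 50% of the pack.
func shouldGC(packSize, deletedBytes int64) bool {
	if deletedBytes < minDeletedBytes {
		return false
	}
	return deletedBytes*2 >= packSize
}
```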
Problems:
1. Today the code assumes that the pack files are numbered sequentially, without gaps. Either
a) we leave a 0-length pack file in place of the garbage-collected one,
b) we rewrite the pack in place (a dance with a temp file and a rename),
c) or we rewrite the code to give up that assumption and allow holes in the numbering.
2. To know when to GC a pack file, we would have to index the deleted blobs' offsets and sizes, too.
Or at least maintain the deleted ratio per pack.
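For the cheaper variant (just the ratio), the bookkeeping could be as small as this hypothetical per-pack record; nothing like it exists in diskpacked today:

```go
package main

// packStats is a hypothetical per-pack record that would let us answer
// "how much of this pack is dead?" without rescanning the file.
type packStats struct {
	size    int64 // total bytes in the pack file
	deleted int64 // bytes occupied by tombstoned blobs
}

// noteDelete records that blobSize more bytes in the pack are now dead.
func (s *packStats) noteDelete(blobSize int64) {
	s.deleted += blobSize
}

// deletedRatio returns the dead fraction of the pack, in [0, 1].
func (s *packStats) deletedRatio() float64 {
	if s.size == 0 {
		return 0
	}
	return float64(s.deleted) / float64(s.size)
}
```

The record would have to be persisted next to the index (or rebuilt on reindex) to survive restarts.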
Ideas, suggestions, objections?
Thanks in advance,
Tamás