I have a problem that I imagine is not unique to me, but I still can't find a satisfying solution.
I need to handle very large binary files (>1 TiB) of fixed-size records that support the following operations:
- read bytes in a given range, usually between 1 KiB and 100 KiB in size; effectively this is relatively coarse-grained random read access
- append to the end of the file
Additionally, all metadata needs to live in said file, i.e. the file must be self-contained.
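To make the requirements concrete, this is roughly the interface I need (a sketch only; the names are purely illustrative):

```c
/* Hypothetical interface sketch; names are illustrative only. */
#include <stddef.h>
#include <stdint.h>

typedef struct big_file big_file; /* opaque handle; everything lives in one file */

/* Coarse-grained random read: copy `len` bytes starting at `offset` into `dst`.
 * `len` is typically between 1 KiB and 100 KiB. */
int big_file_read(big_file *f, uint64_t offset, void *dst, size_t len);

/* Append `len` bytes of fixed-size records to the end of the file. */
int big_file_append(big_file *f, const void *src, size_t len);
```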
Currently I'm thinking about rolling my own solution. It would involve blocks with an uncompressed size of at most 128 KiB or 256 KiB, an index kept at the end of the file (loaded fully into memory, serialized again when closing the file after changes were made; new writes initially overwrite the stored index), and a transparent cache for recently decompressed blocks.
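Roughly, the on-disk layout and in-memory index I have in mind would look like this (just a sketch; names and field widths are illustrative):

```c
/* Sketch of the self-contained layout and in-memory index (illustrative only):
 *
 *   [ header | compressed block 0 | compressed block 1 | ... | index | footer ]
 *
 * On open, the index is read fully into memory; appended blocks initially
 * overwrite the on-disk index region, and the index is serialized again at
 * the new end of the file when the file is closed. */
#include <stdint.h>

typedef struct {
    uint64_t file_offset;       /* where the compressed block starts in the file */
    uint32_t compressed_size;   /* bytes stored on disk */
    uint32_t uncompressed_size; /* at most 128 KiB or 256 KiB */
} block_entry;

typedef struct {
    uint64_t     nblocks;       /* tens of millions of entries for a >1 TiB file */
    block_entry *entries;       /* kept fully in memory while the file is open */
} block_index;
```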
Blosc1 didn't have anything that would help here, but the new features of Blosc2 seem like the right tool for this kind of job. So can this be implemented using Blosc2 features without much pain? (A rough sketch of the usage I have in mind follows the questions below.) In particular I want to know:
- Does accessing a random chunk within a superchunk have constant complexity? Would such access be HDD-friendly (i.e. ideally one read per access)?
- Can a superchunk handle tens of millions of chunks within it? What would be the memory and storage space overhead?
- Can appending a chunk to a (likely already large) frame-bound superchunk incur significant costs unrelated to the size of the appended chunk?
- Does a frame-bound superchunk use memory beyond what is necessary to perform the requested operations?
- Is it possible to have a superchunk with chunks of varying sizes (each chunk less than 256 KiB)? I ask because I have sometimes had problems with that: https://github.com/Blosc/c-blosc2/blob/73c1a4c6a052f467dc1ad877fbc5b4788495f730/blosc/schunk.c#L149
I could settle for a solution that requires padding smaller chunks.
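For reference, this is roughly how I imagine using a frame-bound superchunk here, based on my reading of blosc2.h; please correct me if I'm misusing the API or if the field names differ in the current version:

```c
/* Minimal sketch, assuming a contiguous (frame-bound) super-chunk stored on disk. */
#include <blosc2.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_BYTES (256 * 1024)   /* uncompressed chunk size I'm aiming for */

int main(void) {
    blosc2_init();

    blosc2_cparams cparams = BLOSC2_CPARAMS_DEFAULTS;
    cparams.typesize = 8;          /* fixed record size */
    blosc2_dparams dparams = BLOSC2_DPARAMS_DEFAULTS;

    blosc2_storage storage = BLOSC2_STORAGE_DEFAULTS;
    storage.contiguous = true;     /* single self-contained frame on disk */
    storage.urlpath = (char *)"records.b2frame";
    storage.cparams = &cparams;
    storage.dparams = &dparams;

    blosc2_schunk *schunk = blosc2_schunk_new(&storage);

    /* Append: hand over one chunk's worth of uncompressed records. */
    static uint8_t inbuf[CHUNK_BYTES];
    memset(inbuf, 0, sizeof(inbuf));
    blosc2_schunk_append_buffer(schunk, inbuf, sizeof(inbuf));

    /* Random read: decompress a single chunk by its index. */
    static uint8_t outbuf[CHUNK_BYTES];
    blosc2_schunk_decompress_chunk(schunk, 0, outbuf, sizeof(outbuf));

    blosc2_schunk_free(schunk);
    blosc2_destroy();
    return 0;
}
```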