> I once tested my own implementation of a length-prefixed integer format
> where the first few bits encode the length of the rest, in a single format
> that scales up to very long strings. I expected the smaller size to pay off
> in terms of disk I/O and memory bandwidth, but it was a tie due to the
> overhead of encoding/decoding. That was a few years ago, and I did not
> consider multi-core aspects.
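For concreteness, here is a minimal sketch of that kind of scheme — a hypothetical prefix-varint where the top two bits of the first byte select the total width (1, 2, 4, or 8 bytes). The tag layout and names are my own illustration, not the actual implementation discussed above:

```python
# Hypothetical prefix-varint: the top two bits of the first byte act as a
# tag selecting the total encoded width; the remaining bits hold the value.
WIDTHS = [1, 2, 4, 8]

def encode(n):
    for tag, width in enumerate(WIDTHS):
        bits = width * 8 - 2              # payload bits left after the 2-bit tag
        if n < (1 << bits):
            return ((tag << bits) | n).to_bytes(width, "big")
    raise ValueError("value too large for this sketch")

def decode(buf, pos=0):
    tag = buf[pos] >> 6                   # tag lives in the high bits
    width = WIDTHS[tag]
    raw = int.from_bytes(buf[pos:pos + width], "big")
    value = raw & ((1 << (width * 8 - 2)) - 1)
    return value, pos + width             # value plus position of next field
```

Unlike LEB128-style varints, the decoder learns the full width from a single byte, so decoding is branch-light — but as noted above, the encode/decode work can still cancel out the bandwidth savings.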
Interesting, that's another route I'm considering: use flatbuffers for
the light-weight framing and a custom encoding for the more spacious
structures.
> > I wonder what would happen if I use filesystems with transparent
> > compression, such as ZFS. While it wouldn't bring down the virtual memory
> > size, it may have a drastic effect on storage, plus some extra cost
> > incurred by the FS transparently decompressing blocks.
> >
>
> I think this could be a practical approach, but it depends on how well you
> can exploit locality of reference to avoid the cost of excessive
> decompression, and on the choice of compression. It could use a lot of RAM.
What do you have in mind when you say "it could use a lot of RAM"?
Because the OS keeps the decompressed blocks for the accessed pages in
an opaque cache?
> Another option, if you have vectors of data, is to use the Arrow format's
> Union type, which may or may not be supported in the on-disk Feather
> format. It uses flatbuffers for metadata only.
If I understand it correctly, Arrow makes most sense in a columnar
format. The example schema I attached is just one of many files. While
each one individually makes sense to represent in columnar fashion, what
happens in reality is an interleaved stream of various entries, each with
a different type. Each record/row has a unique ID, which a
secondary index spits out during access. The access pattern is random
and lookups are highly selective. Therefore, I was thinking that a
(row-order) flatbuffers format would be a good fit, because it provides
random-access and I/O costs would be proportional to the query
selectivity (assuming memory-mapping). While it would be possible to
maintain an Arrow struct for every type and then create a sparse
columnar representation, I would imagine it wouldn't yield the I/O
benefits. That said, I haven't measured it.
Switching to Arrow on the data plane is on my medium term agenda when
actual analytics on the data becomes of interest. Right now, my primary
use case is plain search and I'm trying to minimize the I/O cost through
random access and memory mapping. Hence the idea to go with flatbuffers.
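To illustrate the row-order idea, here is a toy sketch in plain Python rather than flatbuffers: records are length-prefixed blobs in one file, and a dict stands in for the secondary index mapping record ID to byte offset. All names and the layout are assumptions for illustration only:

```python
import mmap
import struct

def write_records(path, records):
    # Append each record as a 4-byte little-endian length prefix plus blob,
    # remembering its byte offset; the returned dict plays the role of the
    # secondary index described above.
    index = {}
    with open(path, "wb") as f:
        for rec_id, blob in records.items():
            index[rec_id] = f.tell()
            f.write(struct.pack("<I", len(blob)))
            f.write(blob)
    return index

def lookup(mm, index, rec_id):
    # Read one record through the memory map; only the pages holding this
    # record are faulted in, so I/O scales with query selectivity.
    off = index[rec_id]
    (length,) = struct.unpack_from("<I", mm, off)
    return bytes(mm[off + 4 : off + 4 + length])
```

The point of the sketch is the access pattern: a highly selective lookup touches a handful of pages regardless of file size, which is the property a row-order flatbuffers file over mmap would give as well.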
> Personally, I would be inclined to use a database, or compress offline data
> suitable chunks, then expand into memory mapped files (on disk or in
> memory) and mmap as needed in flatbuffer format - this gives you fast
> access and simplicity without paying excessive disk cost - ZFS might do
> this for you.
Yeah, the chunk-based approach is exactly what I'm currently doing
manually. The trade-off becomes touching more chunks than needed for a
lookup vs. the I/O gains you get from smaller data [1].
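A toy version of that trade-off, assuming zlib and a fixed chunk size (both purely illustrative): larger chunks compress better but force more wasted decompression per point lookup.

```python
import zlib

CHUNK = 64 * 1024  # bigger chunks -> better ratio, more waste per lookup

def compress_chunks(data, chunk=CHUNK):
    # Compress fixed-size chunks independently, so a point read only has
    # to decompress the chunk(s) it touches, not the whole file.
    return [zlib.compress(data[i:i + chunk])
            for i in range(0, len(data), chunk)]

def read_at(chunks, offset, size, chunk=CHUNK):
    # Decompress just the chunks overlapping [offset, offset + size).
    first = offset // chunk
    last = (offset + size - 1) // chunk
    out = b"".join(zlib.decompress(chunks[ci])
                   for ci in range(first, last + 1))
    start = offset - first * chunk
    return out[start:start + size]
```

Tuning `CHUNK` is exactly the knob from the trade-off above: at one extreme every record is its own chunk (no wasted decompression, poor ratio), at the other the whole file is one chunk (best ratio, every lookup decompresses everything).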
Matthias
[1]
https://www.percona.com/blog/2016/03/09/evaluating-database-compression-methods/