> > Realize though that it comes with limitations, and exchanging raw
> > compressed data with the outside world is inherently problematic.
>
> Could you expand a bit on this? What kind of problems may occur?
Well, the LZ4 block format cannot be decoded without additional metadata.
Typically, decoding requires knowing the compressed size of the block,
and an upper limit of the decompressed size.
Another variation is to know the decompressed size,
and an upper limit of the compressed size.
Obviously, having both values is even better.
This is the part which is not specified:
any system can select which sideband it uses to send this information.
For example, a compressed file system can imply that
the decompressed size is necessarily bounded by its cluster size,
and therefore doesn't need to send this information at all.
Many 3rd-party tools decide to send the decompressed size instead,
implying that the compressed size is the size of the file
(or the size of the packet for a transmission protocol).
Many use a 32-bit integer for that,
implying that it _should_ be little-endian
(hence big-endian platforms must be aware of the interpretation difference)
and implying that the maximum size is ~4 GB (which is fine most of the time).
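To make this concrete, here is a minimal sketch of one such convention: a 32-bit little-endian decompressed size placed in front of the raw block. The function names are hypothetical and do not come from any LZ4 library, and the compressed block is treated as opaque bytes. This is only one of many possible conventions, which is exactly the problem.

```python
import struct
from typing import Tuple

UINT32_MAX = 0xFFFFFFFF  # the ~4 GB ceiling implied by a 32-bit size field


def wrap_block(compressed: bytes, decompressed_size: int) -> bytes:
    """Prefix an opaque compressed block with its decompressed size.

    Hypothetical convention: unsigned 32-bit integer, little-endian.
    The compressed size is implied by the length of the whole payload.
    """
    if not 0 <= decompressed_size <= UINT32_MAX:
        raise ValueError("decompressed size does not fit in 32 bits")
    return struct.pack("<I", decompressed_size) + compressed


def unwrap_block(payload: bytes) -> Tuple[int, bytes]:
    """Split a payload produced by wrap_block into (decompressed_size, block)."""
    if len(payload) < 4:
        raise ValueError("payload too short to contain the size prefix")
    (decompressed_size,) = struct.unpack_from("<I", payload, 0)
    return decompressed_size, payload[4:]
```

A reader that assumes big-endian (`">I"`) instead would parse the same four bytes into a completely different size, without any error being raised at the metadata level.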
So, this is the level of freedom allowed by the block format.
Each application can select its own metadata, and how to send them.
Unfortunately, this level of freedom becomes a limitation
whenever different tools want to communicate around a common format.
Since each one can make its own choices, they become mutually unintelligible,
even though at the core, they are all able to read LZ4 compressed blocks:
they just do not talk the same language when it comes to metadata.
> Hmm... the frame format wraps the block format, right? If the frame
> format is interoperable, wouldn't the block format also be
> interoperable? Or are there any additional concerns with the block format?
The frame format fixes the above concerns by mandating the presence of metadata,
in a specified format, written for portability.
Since the frame format includes headers that the block format doesn't have,
the two differ right from the header.
A library expecting to read a frame will not be able to decode a "raw" block (without metadata)
nor a block with its own custom metadata.
Conversely, a library expecting to read its own metadata
will not be able to understand the frame headers.
They are effectively different formats,
even though the same compressed block might be present somewhere in the payload.
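As an illustration of how the two differ from the first bytes: an LZ4 frame starts with the magic number `0x184D2204`, stored little-endian, whereas a raw block or a custom size prefix carries no such signature. The helper below is a hypothetical sketch, not part of any LZ4 library, and it ignores legacy and skippable frames.

```python
import struct

LZ4_FRAME_MAGIC = 0x184D2204  # magic number of the LZ4 frame format


def looks_like_lz4_frame(data: bytes) -> bool:
    """Return True if data starts with the LZ4 frame magic number.

    This is only a heuristic: a raw block, or a custom size prefix,
    could start with the same four bytes by coincidence.
    """
    if len(data) < 4:
        return False
    (magic,) = struct.unpack_from("<I", data, 0)
    return magic == LZ4_FRAME_MAGIC
```

A payload that begins with a 32-bit decompressed size instead will, in almost all cases, fail this check, which is why a frame reader rejects it even though a valid LZ4 block follows.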
> I don't think a little size overhead would be concerning for Parquet,
> however the lz4-java problem I mentioned is a bit more worrying to us.
If you favor interoperability with 3rd-party tools and libraries,
prefer the frame format.
Especially for "open ended" ecosystems, this is the recommended setup.
However, make sure that the library you are trying to be compatible with,
`lz4-java`, is indeed able to read the frame format.
I haven't followed the situation closely recently,
but my understanding is that `lz4-java`, which is maintained by Rei Odaira,
is able to generate and read both frame and block formats (with its own metadata).
Since these formats are different, it's important to pay attention to which side of the API is being used.
Same story with Apache Commons' LZ4 Java implementation,
which is able to read and generate both block and frame formats.