Feedback on standardization of LZ4 in Parquet

Antoine Pitrou

Mar 10, 2021, 8:08:52 AM

to lz...@googlegroups.com

Hello,

We're trying to standardize proper use of LZ4 compression in the Parquet
format:
https://github.com/apache/parquet-format/pull/168

This draft is referencing the LZ4 block format, rather than the frame
format. The rationale is two-fold:

1) Parquet already stores the compressed and uncompressed data sizes
separately, so they needn't be repeated in the compressed payload

2) We encountered issues with the lz4-java implementation not being able
to read the LZ4 frame format produced by lz4-c:
https://mail-archives.apache.org/mod_mbox/arrow-dev/202101.mbox/%3CCAJPUwMAPSTrdbu4vw=GJiLy9ciJU0FvH_h...@mail.gmail.com%3E

I would welcome any feedback on this reasoning, to make sure we're not
making a wrong choice.

Best regards

Antoine.

Cyan

Mar 10, 2021, 1:21:46 PM

to LZ4c

Referencing the LZ4 block format is valid.

It's the way it was initially defined, and many applications do use this format.

Realize though that it comes with limitations, and exchanging raw compressed data with the outside world is inherently problematic.

This may not be a problem though: many applications keep the compressed format within their internals, and never communicate raw compressed data to external systems.

If interoperability is a topic though (and I mean, with other 3rd-party tools / software stacks), only the frame format is sufficiently specified to guarantee interoperability.

It comes at an increased cost though, both complexity and size, due to additional headers.

Antoine Pitrou

Mar 10, 2021, 1:33:37 PM

to lz...@googlegroups.com

Hi Yann,

Thanks for the feedback.

On 10/03/2021 at 19:21, Cyan wrote:
> Realize though that it comes with limitations, and exchanging raw
> compressed data with the outside world is inherently problematic.

Could you expand a bit on this? What kind of problems may occur?

> If interoperability is a topic though (and I mean, with other 3rd-party
> tools / software stacks),
> only the frame format is sufficiently specified to guarantee
> interoperability.

Hmm... the frame format wraps the block format, right? If the frame
format is interoperable, wouldn't the block format also be
interoperable? Or are there any additional concerns with the block format?

> It comes at an increased cost though, both complexity and size, due to
> additional headers.

I don't think a little size overhead would be concerning for Parquet,
however the lz4-java problem I mentioned is a bit more worrying to us.

Regards

Antoine.

Echo Roxas

Mar 10, 2021, 6:14:30 PM

to lz...@googlegroups.com

what is that for?

Cyan

Mar 10, 2021, 7:53:32 PM

to LZ4c

> > Realize though that it comes with limitations, and exchanging raw
> > compressed data with the outside world is inherently problematic.
>
> Could you expand a bit on this? What kind of problems may occur?

Well, the LZ4 block format is not decodable without additional metadata.

Typically, decoding requires knowing the compressed size of the block, and an upper limit on the decompressed size.

Another variation is to know the decompressed size, and an upper limit on the compressed size.

Obviously, having both values is even better.
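
To illustrate why these values matter, here is a minimal sketch of a raw block decoder in Python. It is not the reference implementation and it skips the end-of-block restrictions a strict decoder would check; the names `decode_lz4_block` and `max_out` are made up for this example. The point is that the loop can only stop because it knows `len(src)` (the compressed size), and it can only protect itself against oversized output because the caller supplies a bound.

```python
def decode_lz4_block(src: bytes, max_out: int) -> bytes:
    """Decode one raw LZ4 block (illustrative sketch, not the reference decoder).

    len(src) is the compressed size and max_out is the caller's upper bound
    on the decompressed size. Neither value is stored in the block itself.
    """
    out = bytearray()
    i, n = 0, len(src)

    def read_length(base: int) -> int:
        # A nibble value of 15 means extra length bytes follow; each one is
        # added to the total, and a byte below 255 ends the run.
        nonlocal i
        length = base
        if base == 15:
            while True:
                if i >= n:
                    raise ValueError("truncated length field")
                extra = src[i]
                i += 1
                length += extra
                if extra != 255:
                    break
        return length

    while i < n:
        token = src[i]
        i += 1
        literal_len = read_length(token >> 4)
        if i + literal_len > n:
            raise ValueError("literals run past the compressed size")
        if len(out) + literal_len > max_out:
            raise ValueError("output exceeds the decompressed size bound")
        out += src[i:i + literal_len]
        i += literal_len
        if i == n:
            # Only the compressed size says this was the last sequence:
            # the block format has no end-of-block marker.
            break
        if i + 2 > n:
            raise ValueError("truncated match offset")
        offset = src[i] | (src[i + 1] << 8)  # 2 bytes, little-endian
        i += 2
        if offset == 0 or offset > len(out):
            raise ValueError("invalid match offset")
        match_len = read_length(token & 0x0F) + 4  # minimum match length is 4
        if len(out) + match_len > max_out:
            raise ValueError("output exceeds the decompressed size bound")
        start = len(out) - offset
        for k in range(match_len):
            # Byte-by-byte copy, because a match may overlap its own output.
            out.append(out[start + k])
    return bytes(out)


# Hand-built block: 3 literals "abc", a 6-byte match at offset 3, 1 literal "!".
block = bytes([0x32]) + b"abc" + bytes([0x03, 0x00, 0x10]) + b"!"
decoded = decode_lz4_block(block, max_out=10)
```

With the correct compressed size the block above decodes to `abcabcabc!`. If the caller passes only the first 4 bytes, this sketch returns `abc` without any error, which is one way a wrong size in the sideband can go unnoticed.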

This is the part that is not specified: any system can select which sideband it uses to send this information.

For example, a compressed file system can imply that the decompressed size is necessarily bounded by its cluster size, and therefore doesn't need to send this information at all.

Many 3rd-party tools decide to send the decompressed size instead, implying that the compressed size is the size of the file (or the size of the packet for a transmission protocol).

Commonly, they use a 32-bit integer for that, implying that it _should_ be little-endian (hence big-endian platforms must be aware of the interpretation difference) and implying that the maximum size is ~ 4 GB (which is fine most of the time).
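
As a sketch of that common convention, the following Python prepends the decompressed size as a 32-bit little-endian integer to an opaque compressed block. The function names `wrap_block` and `unwrap_block` are hypothetical, and this is only one possible sideband, not something the LZ4 block format defines.

```python
import struct
from typing import Tuple

MAX_U32 = 0xFFFFFFFF  # a 32-bit prefix caps the size at about 4 GB


def wrap_block(compressed: bytes, decompressed_size: int) -> bytes:
    """Prefix a raw block with its decompressed size (32-bit little-endian)."""
    if not 0 <= decompressed_size <= MAX_U32:
        raise ValueError("decompressed size does not fit in 32 bits")
    return struct.pack("<I", decompressed_size) + compressed


def unwrap_block(blob: bytes) -> Tuple[int, bytes]:
    """Split a prefixed blob into (decompressed_size, raw_block)."""
    if len(blob) < 4:
        raise ValueError("blob is too short to hold the size prefix")
    (size,) = struct.unpack_from("<I", blob, 0)
    # The compressed size is implied: everything after the prefix.
    return size, blob[4:]


blob = wrap_block(b"\x10!", 1)
size, raw = unwrap_block(blob)

# A reader that assumes big-endian gets a very different number
# from the same four bytes.
(misread,) = struct.unpack_from(">I", blob, 0)
```

Here `size` is 1 while `misread` is 16777216, which is the interpretation difference a reader with a different byte-order assumption would run into.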

So, this is the level of freedom allowed by the block format. Each application can select its own metadata, and how to send it.

Unfortunately, this level of freedom becomes a limitation whenever different tools want to communicate around a common format.

Since each one can make its own choices, they become mutually unintelligible, even though at the core, they are all able to read LZ4 compressed blocks: they just do not speak the same language when it comes to metadata.

> Hmm... the frame format wraps the block format, right? If the frame
> format is interoperable, wouldn't the block format also be
> interoperable? Or are there any additional concerns with the block format?

The frame format fixes the above concerns by mandating the presence of metadata, in a specified format, written for portability.

Since the frame format presents some headers that the block format doesn't, they are distinct by their header.

A library expecting to read a frame will not be able to decode a "raw" block (without metadata) nor a block with its own custom metadata.

Conversely, a library expecting to read its own metadata will not be able to understand the frame headers.

They are effectively different formats, even though the same compressed block might be present somewhere in the payload.
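
One concrete difference is the magic number: per the LZ4 frame format description, a frame starts with the 4-byte value 0x184D2204 stored little-endian, which a raw block does not carry. A minimal Python sketch of that check follows; the function name `looks_like_lz4_frame` is made up here, and the check inspects only the first four bytes, so it does not validate the rest of the frame.

```python
import struct

LZ4_FRAME_MAGIC = 0x184D2204  # stored little-endian at the start of a frame


def looks_like_lz4_frame(data: bytes) -> bool:
    """Return True if data starts with the LZ4 frame magic number."""
    if len(data) < 4:
        return False
    (magic,) = struct.unpack_from("<I", data, 0)
    return magic == LZ4_FRAME_MAGIC


frame_start = bytes([0x04, 0x22, 0x4D, 0x18]) + b"rest of the frame"
raw_block = bytes([0x32]) + b"abc" + bytes([0x03, 0x00, 0x10]) + b"!"

is_frame = looks_like_lz4_frame(frame_start)
is_raw_a_frame = looks_like_lz4_frame(raw_block)
```

A reader that expects one format and finds the other can use such a check to fail early with a clear message rather than attempt to decode headers as compressed data.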

> I don't think a little size overhead would be concerning for Parquet,
> however the lz4-java problem I mentioned is a bit more worrying to us.

If you favor interoperability with 3rd-party tools and libraries, prefer the frame format. Especially for "open ended" ecosystems, this is the recommended setup.

However, make sure that the library you are trying to be compatible with, `lz4-java`, is indeed able to read the frame format.

I haven't followed the situation closely recently, but if I remember right, a Java LZ4 library like https://github.com/lz4/lz4-java , which is maintained by Rei Odaira, is able to generate and read both frame and block formats (with its own metadata).

Since these formats are different, it's important to pay attention to which side of the API is being used.

Same story with the Apache Commons Java implementation of LZ4, https://commons.apache.org/proper/commons-compress/javadocs/api-release/org/apache/commons/compress/compressors/lz4/package-summary.html , which is able to read and generate both block and frame formats.

Antoine Pitrou

Mar 11, 2021, 6:52:01 AM

to lz...@googlegroups.com

On 11/03/2021 at 01:53, Cyan wrote:
> > > Realize though that it comes with limitations, and exchanging raw
> > > compressed data with the outside world is inherently problematic.
> >
> > Could you expand a bit on this? What kind of problems may occur?
>
> Well, the LZ4 block format is not decodable without additional metadata.
> Typically, decoding requires knowing the compressed size of the block,
> and an upper limit on the decompressed size.
> Another variation is to know the decompressed size,
> and an upper limit on the compressed size.
> Obviously, having both values is even better.

Ok, thanks for the information.

As I said, the Parquet format records both the compressed and the
uncompressed size as separate metadata, so this wouldn't be a problem
for us.

> Unfortunately, this level of freedom becomes a limitation
> whenever different tools want to communicate around a common format.

In the Parquet case, we're talking about compressed data blocks embedded
inside a complex file format, so this is not a concern.

Thank you very much

Best regards

Antoine.
