lz4 encoding details

436 views
Skip to first unread message

stei...@scionics.de

unread,
Dec 6, 2017, 11:15:23 AM12/6/17
to LZ4c
[I already posted this question, but my post was deleted when I submitted it. :/ See attachment.]

I am trying to understand the lz4 encoding algorithm. The link to http://lz4.info that is reference on the frame format description is dead. So I post my question here. Am I right in saying that the block format based encoder in lz4.c scans 64kB of input buffer, creates a dictionary (lz77 style) from it and encodes the contents of these 64kB using that dictionary. I didn't really get what happens after that? Is the dictionary reused or purged?

If I understand correctly, the frame format allows for arbitrary chunks of input data to be encoded. Here I wonder, if the size of the dictionary which lz4 uses remains fixed throughout the encoding of one such frame or if the dictionary is arbitrary?

My apologies if this question is motivated based on my poor understanding of lz4 and it's ancestry. If there are good learning resources out there, that I can study, feel free to point me to them.

Best,
P
deleted_post.png

Cyan

unread,
Dec 8, 2017, 1:21:51 PM12/8/17
to LZ4c
Here is a link to the block compression format : https://github.com/lz4/lz4/blob/dev/doc/lz4_Block_format.md
And finally the project homepage : http://www.lz4.org

The compression algorithm is base on the concept of sliding window.
So it references previous data, up to 64KB in the past.
This is irrespective of the block size, which can be much larger (several MB) or smaller. The sliding window is always 64 KB.

The frame format makes it possible to encode multiple blocks together, generally useful to encode some large content.
It also makes it possible for a block to use past data from previous blocks to fill the sliding window at the beginning of a new block, though this capability is optional.

Peter Steinbach

unread,
Dec 8, 2017, 4:51:29 PM12/8/17
to lz...@googlegroups.com
Hi Yann,

Thanks for your reply.

Here is a link to the block compression format : https://github.com/lz4/lz4/blob/dev/doc/lz4_Block_format.md
And finally the project homepage : http://www.lz4.org

Yes, I know these documents. I've read them and still they couldn't tell me how the lz4 encoder works. As a matter of fact, the first link you suggested reads:
"This document describes only the block format, not how the compressor nor decompressor actually work. " :/


The compression algorithm is base on the concept of sliding window.
So it references previous data, up to 64KB in the past.
This is irrespective of the block size, which can be much larger (several MB) or smaller. The sliding window is always 64 KB.

So this confirms my findings. Thank you. I'll go back to the block format document now and will try to understand your comment about block size. To make my motivation more precise, I have uint16 data of 0.5-2GB in size. This data includes random noise (in bits 0 - 9). I am currently working on a filter to treat this data in a special way, but it's volume in memory stays the same. I currently found that lz4 gives the same compression ratio on the filtered and unfiltered data ... But lz4 is 3x faster on the filtered data. I want to understand why.

The frame format makes it possible to encode multiple blocks together, generally useful to encode some large content.

Can you give me an idea what you mean by large? Or rephrased a bit, what are examples where people use the frame format rather than the block format?

It also makes it possible for a block to use past data from previous blocks to fill the sliding window at the beginning of a new block, though this capability is optional.

Interesting. Can the latter feature in lz4 be used in a thread-safe manor?
Best,
P

Cyan

unread,
Dec 9, 2017, 2:30:40 PM12/9/17
to LZ4c
> To make my motivation more precise, I have uint16 data of 0.5-2GB in size.

a table of unit16 is a pretty bad input format for lz4 compression.
you can get some compression, but let's say it's hard for lz4.
I would rather recommend to pre-filter the data with a transposer like blosc .

> what are examples where people use the frame format rather than the block format?

Anything compressed with the `lz4` command line utility will use the frame format.
The frame format is "self-contained" : it also includes metadata, necessary to regenerate the content.
In contrast, the "block format" is only the compressed data, but does not specify how to transmit metadata (uncompressed size, size of each block, nb of blocks, etc.)

Peter Steinbach

unread,
Dec 10, 2017, 3:20:42 AM12/10/17
to lz...@googlegroups.com

> To make my motivation more precise, I have uint16 data of 0.5-2GB in size.

a table of unit16 is a pretty bad input format for lz4 compression.
you can get some compression, but let's say it's hard for lz4.
I would rather recommend to pre-filter the data with a transposer like blosc .


I am already doing that and a bit more. But back to my original question ... based on your replies so far, I guess I have to dig the code to find out how the lz4 encoder works and why it works faster on some inputs and why not on others given the same compression ratio.

> what are examples where people use the frame format rather than the block format?

Anything compressed with the `lz4` command line utility will use the frame format.
The frame format is "self-contained" : it also includes metadata, necessary to regenerate the content.
In contrast, the "block format" is only the compressed data, but does not specify how to transmit metadata (uncompressed size, size of each block, nb of blocks, etc.)

Interesting. Back to my question from my last message: is the frame API thread-safe in lz4?
I'll have a deeper look at how blosc and lz4-mt perform the encoding in a multi-thread context n the coming days. But if you know the answer already, feel free to help out.

Thanks so far for your replies.
P

Peter Steinbach

unread,
Jan 8, 2018, 5:58:05 AM1/8/18
to lz...@googlegroups.com
following up on the lz4 frame format, is the lz4 frame API thread safe?
in other words can I call LZ4F_compressUpdate from different threads? Or
do I need to revert to a modus where 1 thread compresses one frame at a
time?

Note: blosc uses the block API for multi-threading. and apparently lz4mt
hasn't seen any activity for years. So I am not sure whereelse to
redirect my question or check for a reference implementation.

Best,
Peter

Cyan

unread,
Jan 10, 2018, 12:52:47 PM1/10/18
to LZ4c
Each thread needs its own context (LZ4F_createCompressionContext)
https://github.com/lz4/lz4/blob/dev/lib/lz4frame.h#L232

If this rule is respected, there is no limit on the number of thread/contexts compressing in parallel.

A single frame though must be compressed by a single thread.
Reply all
Reply to author
Forward
0 new messages