Simple example of LZ4 in Delphi 7


Serge

Jan 5, 2016, 7:43:10 AM
to LZ4c
Hello Yann Collet,

I have some ideas to improve LZ4 compression, but first I want to understand your algorithm. Your explanation of LZ4 is hard to understand, so I wrote small test programs in Delphi 7 from your explanation. You can download both the packer and unpacker, with binaries, from my site: cronc.com/lzsimple.zip The sources have good comments.

lzsimple.exe compresses the given file into a result file named 'packed' and shows the compression ratio.
lzsimpleunp.exe uncompresses the given file into a result file named 'unpacked'.

I assume the data size is below 4 GB. I use an encoding similar to yours:
uuuummmm oooooooo oooooooo [xuuuuuuu]...[0uuuuuuu] [unpacked bytes] [xmmmmmmm]...[0mmmmmmm]
uuuu - count of unpacked bytes (0-14). If uuuu == 15, then additional byte(s) are used. x == 0/1.
mmmm - count of matched bytes, decreased by 4 (0-14). Same scheme.
o - offset to the previous match (0-65535).

There can be up to 5 additional count bytes. If bit 7 of an additional byte == 1, then another byte follows; if bit 7 == 0, then this byte is the last byte of the count.
Before encoding the additional count bytes, I decrease the count by 15.
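As I read this scheme, a count round-trip can be sketched as follows (Python for illustration only; the order of the 7-bit groups — least significant first — is my assumption, since the message doesn't state it):

```python
def encode_count(n):
    """Encode a count: a 4-bit nibble (0-14), with 15 signalling that
    additional bytes follow.  The remainder (n - 15) is stored in 7-bit
    groups, least significant group first (an assumption); bit 7 is set
    on every byte except the last."""
    if n < 15:
        return n, b""
    rest = n - 15
    out = bytearray()
    while True:
        group = rest & 0x7F
        rest >>= 7
        if rest:
            out.append(0x80 | group)   # bit 7 == 1: another byte follows
        else:
            out.append(group)          # bit 7 == 0: last byte of the count
            return 15, bytes(out)

def decode_count(nibble, extra):
    """Inverse of encode_count; `extra` holds the additional bytes."""
    if nibble < 15:
        return nibble
    n, shift = 15, 0
    for b in extra:
        n += (b & 0x7F) << shift
        shift += 7
        if not (b & 0x80):
            break
    return n
```

Five additional bytes give 35 payload bits, which comfortably covers counts for data below 4 GB.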

I use a simple hash, calculated over 4 bytes for each byte of the source data, so the offset always points to the nearest match to the left of the current 4 bytes. The hash table has 16M cells.
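For comparison, a minimal sketch of this kind of 4-byte hash in Python. The multiplier 2654435761 is the Knuth/Fibonacci prime that LZ4's reference source also uses; reading the 4 bytes as a little-endian integer is my assumption:

```python
HASH_BITS = 24          # 2**24 = 16M cells, as described above

def hash4(data, pos):
    """Multiplicative hash of the 4 bytes at `pos`, reduced to a
    HASH_BITS-bit table index."""
    v = int.from_bytes(data[pos:pos + 4], "little")
    return ((v * 2654435761) & 0xFFFFFFFF) >> (32 - HASH_BITS)

def insert_pos(table, data, pos):
    """Overwrite the cell with the newest position, so a lookup always
    returns the nearest match to the left of the current 4 bytes."""
    table[hash4(data, pos)] = pos
```

With a single position per cell, each insert discards the older candidate — which is exactly the trade-off discussed later in this thread.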

I've compared the compression ratio of my program with your LZ4.exe, which I found in the archive LZ4_Files.zip. My ratio is ~10% worse than yours. Why? What did I do wrong? Thanks in advance.

Sorry, I'm not a native English speaker, but I hope you understand what I wrote.

Happy New Year and good compression! :-)

Serge

Yann Collet

Jan 5, 2016, 8:53:12 AM
to LZ4c
Hi Serge


Difficult to say.
Your compression format is your own specific format, merely inspired by LZ4.

The similarities are strong, so I would expect both formats to fare roughly the same.
Maybe there are some inefficiencies somewhere in the implementation.
That's going to be difficult to pinpoint...


Rgds

Serge

Jan 6, 2016, 6:33:39 AM
to LZ4c
Does your program LZ4.exe use a hash chain?

Regards,
 Serge

Yann Collet

Jan 6, 2016, 6:35:02 AM
to LZ4c
Hash chains are used within lz4hc.

lz4hc is triggered at compression levels >= 3.
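A hedged sketch of the hash-chain idea (my own illustration, not lz4hc's actual code): a head table maps each hash to the most recent position, and a chain table links each inserted position back to the previous position with the same hash, so a search can examine several candidates and keep the longest match rather than only the nearest one.

```python
PRIME = 2654435761
HASH_BITS = 16

def h4(data, pos):
    """4-byte multiplicative hash (little-endian read is an assumption)."""
    v = int.from_bytes(data[pos:pos + 4], "little")
    return ((v * PRIME) & 0xFFFFFFFF) >> (32 - HASH_BITS)

def chain_insert(head, chain, data, pos):
    k = h4(data, pos)
    chain[pos] = head.get(k, -1)   # link to previous position with this hash
    head[k] = pos

def chain_search(head, chain, data, pos, max_attempts=16):
    """Walk the chain and return (best_length, best_offset)."""
    cand = head.get(h4(data, pos), -1)
    best_len = best_off = 0
    for _ in range(max_attempts):
        if cand < 0:
            break
        n = 0
        while pos + n < len(data) and data[cand + n] == data[pos + n]:
            n += 1
        if n > best_len:
            best_len, best_off = n, pos - cand
        cand = chain[cand]         # older candidate with the same hash
    return best_len, best_off
```

Capping `max_attempts` is what trades compression ratio for speed at the different compression levels.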

Serge

Jan 7, 2016, 5:05:55 AM
to LZ4c
I've fixed the algorithm for updating the hash table, but my compression ratio is still 1-2% worse than yours...

Could you please tell me what algorithm you use for updating the hash table when you use a simple hash (not a hash chain)?

Suppose you have this data to compress, beginning at address 1:

0,0,0,0,0,0,1,0,0,0,0,2,0,0,0,0,0,3,...

1. First we calculate the hash of the first 4 zeros and write address 1 into the hash table.
2. We calculate the hash of the next 4 zeros and find address 1 in the hash table. We compress these 5 zeros into 1 literal, match offset == 1 and match length == 5. Then we update the hash table, changing address 1 to 2 (but this is not good).
3. We calculate the hash of bytes 1,0,0,0 and write address 7 into the hash table.
4. We calculate the hash of the next 4 zeros and find address 2 in the hash table, which matches with length == 4. We compress the 4 zeros into 1 literal, offset == 6 and match length == 4. We update the hash table, changing address 2 to 8 (this is also not good).
5. So for the next 5 zeros we would find a match of only 4 previous zeros instead of 5, unless we avoid "corrupting" the data in the table.

Do you use some trick to avoid this "corruption" of the hash table in the case of a repeated chain of bytes (when the match offset <= the match length)?

Thanks,
 Serge

Yann Collet

Jan 7, 2016, 5:44:21 AM
to LZ4c
Actually, 1-2% is pretty good I believe.

The difference can now be explained by many little implementation details,
starting from the simple fact that your format is different from LZ4.

As an example, I'm not completely sure I understand the way you encode long lengths (>= 15);
it seems your scheme behaves differently, costing or saving a few bits depending on the data to compress.

There are other subtleties, such as how to update the hash table, or even what the exact hash formula is, which are a matter of trade-offs.
I don't see the point in arguing too much about these: they are mostly equivalent choices, which will fare better or worse depending on the sample data.
So just test and select.


There is no special treatment for overlap copies, although it could be a good idea to test one.
The issue is: the more special cases are added, the larger the code and the worse the performance (in general).
LZ4 tries to avoid that.
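For reference, the overlap case from the example above (a run of zeros, match offset 1) needs no special case on the decoding side either, if the match copy is done byte by byte. A sketch (my own illustration, not LZ4's code):

```python
def copy_match(out, offset, length):
    """Copy `length` bytes starting `offset` bytes back in `out`,
    one byte at a time.  When offset < length the source overlaps the
    bytes just written, and the byte-by-byte copy naturally replicates
    the repeating pattern."""
    src = len(out) - offset
    for i in range(length):
        out.append(out[src + i])
```

For example, with a single zero already decoded, a match of offset 1 and length 5 reproduces six zeros in total, because each copied byte immediately becomes a source byte for the next iteration.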


Regards