
GZIP with improved parsing


encode

Aug 13, 2007, 3:29:56 PM
Hi all!

Just made a small improvement to deflate.c - I changed the parsing
scheme to lazy matching with a 2-byte lookahead. Compression speed
is not much affected, and compression mostly improves a little
(a sketch of the idea follows the results below):

world95.txt:
gzip 1.2.4, -9: 862,955 bytes
gzip 1.2.4-hack, -9: 857,205 bytes

fp.log:
gzip 1.2.4, -9: 1,337,914 bytes
gzip 1.2.4-hack, -9: 1,319,851 bytes

bible.txt:
gzip 1.2.4, -9: 1,176,645 bytes
gzip 1.2.4-hack, -9: 1,168,229 bytes

3200.txt:
gzip 1.2.4, -9: 6,205,632 bytes
gzip 1.2.4-hack, -9: 6,183,702 bytes

ENWIK8:
gzip 1.2.4, -9: 36,445,248 bytes
gzip 1.2.4-hack, -9: 36,273,716 bytes

ENWIK9:
gzip 1.2.4, -9: 322,591,995 bytes
gzip 1.2.4-hack, -9: 321,050,648 bytes
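
In case someone wants the idea in code, here is a minimal sketch. It
is NOT the actual patch - longest_match(), emit_literal() and
emit_match() stand in for deflate's real machinery, and the acceptance
thresholds are only illustrative:

#define MIN_MATCH 3

static unsigned longest_match(unsigned pos, unsigned *dist); /* assumed */
static void emit_literal(unsigned pos);                      /* assumed */
static void emit_match(unsigned len, unsigned dist);         /* assumed */

static void deflate_lazy2(unsigned pos, unsigned end)
{
    while (pos < end) {
        unsigned d0, d1, d2;
        unsigned l0 = longest_match(pos,     &d0);
        unsigned l1 = longest_match(pos + 1, &d1); /* classic lazy check  */
        unsigned l2 = longest_match(pos + 2, &d2); /* extra 2nd-byte look */

        /* defer the match if starting one or two bytes later more than
           pays for the literal(s) emitted in between */
        if (l0 < MIN_MATCH || l1 > l0 || l2 > l0 + 1) {
            emit_literal(pos);
            pos += 1;
        } else {
            emit_match(l0, d0);
            pos += l0;
        }
    }
}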

Some testing results, including timings, can be found here:
http://www.cs.fit.edu/~mmahoney/compression/text.html

The "HACK"ed compile, plus modified "deflate.c" available here:
http://www.encode.ru/downloads/gzip124hack.zip (71 KB)

Mostly it's just a brief and lame hack, written in five minutes, but
it looks like it works! ;) Anyway, I made it for experiments - and
maybe GZIP/ZLIB will finally get something more advanced than lazy
matching with just a one-byte lookahead! ;)

Enjoy! :)

-- Ilia Muraviev

Phil Carmody

Aug 13, 2007, 4:36:13 PM
encode <enc...@narod.ru> writes:
> Hi all!
>
> Just made a small improvement to deflate.c - I changed the parsing
> scheme to lazy matching with a 2-byte lookahead. Compression speed
> is not much affected, and compression mostly improves a little:

Excellent - so this creates a compatible gzip stream, and only the
compression ratio is affected? I seem to remember someone mentioning
a hack similar to this a couple of years ago, but they never got
round to implementing it. Up to 1.5% space improvement, by the looks
of it - good job!

Phil
--
Dear aunt, let's set so double the killer delete select all.
-- Microsoft voice recognition live demonstration

encode

Aug 13, 2007, 5:11:07 PM
> Excellent - so this creates a compatible gzip stream, and only the
> compression ratio is affected?

Yes, it's fully compatible. The most important thing is that we don't
lose too much time with this improved compression. For example, the
optimized Deflate used in 7-Zip, KZIP, etc. is dead slow. With the
2-byte lookahead we get around the same compression speed! I created
just a brief implementation - many things can be done more cleverly.
However, it shows how it can be done... ;)

-- Ilia Muraviev

Phil Carmody

Aug 13, 2007, 5:17:44 PM

I did check the times but I didn't want to sound negative in what
should have been a positive post! The 3:2 ratio of times on Matt's
table is a bit of a hit (compress only), but it's still a fast
compressor compared to the real crunchers.

Perhaps the '-number' switches can be rejigged, and this be made
the new number 9?

Who are you? I am the new number 2. You are number 6.

I am not a number; I am a free man!

Phil, somewhere on planet wingnut.

Mark Adler

Aug 14, 2007, 12:23:44 AM
On Aug 13, 12:29 pm, encode <enc...@narod.ru> wrote:
> Just made a small improvement to deflate.c - I changed the parsing
> scheme to lazy matching with a 2-byte lookahead. Compression speed
> is not much affected, and compression mostly improves a little:

Cool! I'll take a look at the patch.

So the obvious question is, what happens with a three-byte lookahead?

Mark

encode

Aug 14, 2007, 3:27:16 AM
> I did check the times but I didn't want to sound negative in what
> should have been a positive post! The 3:2 ratio of times on Matt's
> table is a bit of a hit (compress only), but it's still a fast
> compressor compared to the real crunchers.
>
> Perhaps the '-number' switches can be rejigged, and this be made
> the new number 9?

The "-9" switch means max compression, right? So we can spend a little
bit extra time. With other switches we use the same compression. Also
note that I believe there is more clever way to do the same thing - so
the speed can be greater!

> So the obvious question is, what happens with a three-byte lookahead?

My brief tests showed that nothing happens! :(

The older versions of LZPM (my ROLZ-based file compressor) used lazy
matching with a 2-byte lookahead instead of just one. With LZPM,
especially on text files, this gives a really nice compression gain
with only a slight speed impact. That's why I tried the same thing
with Deflate. Unfortunately, with Deflate the compression gain from
such parsing is smaller than expected. I think it's not only due to
the tiny dictionary (LZPM has a 16 MB dictionary vs. Deflate's 32
KB), but also the weak literal encoder (LZPM has an order-1
bit-oriented arithmetic encoder vs. Deflate's order-0 Huffman).
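
To show what I mean by order-1 (an illustrative sketch only, not
LZPM's actual code): each bit of a literal is predicted from the
previous byte plus the bits already coded, and the probability drives
the arithmetic coder. Deflate's order-0 Huffman has no equivalent of
the 'prev' context:

/* one adaptive probability per (previous byte, partial-bit context)
   pair; entries start at 2048, i.e. one half */
static unsigned short p1[256][256]; /* P(bit==1), scaled to 0..4095 */

static void update_bit(unsigned prev, unsigned ctx, int bit)
{
    if (bit) p1[prev][ctx] += (4095 - p1[prev][ctx]) >> 5;
    else     p1[prev][ctx] -= p1[prev][ctx] >> 5;
}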

With LZPM, a 3-byte lookahead can give some gain, but naturally only
with text files. With Deflate it only hurts.

Anyway, here are some alternatives for improved parsing in Deflate:

1. Lazy matching with an N-byte lookahead. If N is greater than
MIN_MATCH, try to find a match of this length at a smaller distance
and use it instead.

2. Flexible parsing - it should not be so efficient with LZ77, but
with schemes like ROLZ (LZPM) it is the only way I know of to perform
near-optimal parsing:

/* get the sum of the current match length and the length of the
   match that follows it */
max_value = match_length + get_match(cur_pos + match_length);

lim = match_length; /* how far we can go */

for (i = 1; i < lim; i++) {
    /* sum for a match starting at (cur_pos + i) plus the match
       that follows it */
    tmp = i + get_match(cur_pos + i);

    if (tmp > max_value) {
        max_value = tmp;
        match_length = i;
    }
}

if ((match_length < lim) && (match_length >= MIN_MATCH)) {
    /* try to replace the current distance with a smaller one */
}

This is just brief pseudocode. In practice, we can use dozens of
heuristics to improve compression further. For example, in the
not-yet-released LZPM 0.09, when calculating "max_value" I take into
account:
+ the actual distances of the current and following matches
+ the position within the match - i.e. in many cases dropping a
literal is cheaper
+ smaller matches at smaller distances
So "max_value" becomes not just the sum of two match lengths but a
kind of score for each choice.
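
For illustration, a hypothetical scoring function in the spirit of
these heuristics (the weights are made up - this is not LZPM's actual
code):

static int ilog2(unsigned v) { int n = 0; while (v >>= 1) n++; return n; }

/* longer is better, nearby is cheaper, and very short matches at
   long distances are penalized */
static int score(int len, int dist, int next_len, int next_dist)
{
    int s = len + next_len;          /* the basic two-match sum      */
    s -= ilog2((unsigned)dist) / 4;  /* nearer distances cost less   */
    s -= ilog2((unsigned)next_dist) / 4;
    if (len == MIN_MATCH && dist > 4096)
        s--;                         /* short far matches rarely pay */
    return s;
}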

3. Optimal parsing - as in LZMA and LZX. It might be very slow, but
with a small dictionary, and if we use constants plus heuristics
instead of direct price calculation, it can be useful.
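
To make 3. concrete, here is a minimal shortest-path sketch of what
optimal parsing means (illustrative only - LZMA and LZX use real bit
prices from their coders, not the made-up constants below; ilog2() is
as in the previous sketch):

#define WIN 32768                       /* deflate's window size           */
#define LIT_PRICE      9                /* made-up constant prices...      */
#define MATCH_PRICE(d) (15 + ilog2(d))  /* ...real coders price length too */

static unsigned longest_match(unsigned pos, unsigned *dist); /* assumed */
static unsigned cost[WIN + 1], from[WIN + 1];

static void optimal_parse(unsigned n)   /* parse the first n <= WIN bytes */
{
    unsigned i, l, len, dist;

    cost[0] = 0;
    for (i = 1; i <= n; i++) cost[i] = ~0u;

    for (i = 0; i < n; i++) {
        if (cost[i] == ~0u) continue;            /* unreachable position */
        if (cost[i] + LIT_PRICE < cost[i + 1]) {          /* literal edge */
            cost[i + 1] = cost[i] + LIT_PRICE;
            from[i + 1] = i;
        }
        len = longest_match(i, &dist);
        for (l = MIN_MATCH; l <= len && i + l <= n; l++)
            if (cost[i] + MATCH_PRICE(dist) < cost[i + l]) { /* match edge */
                cost[i + l] = cost[i] + MATCH_PRICE(dist);
                from[i + l] = i;
            }
    }
    /* walk from[] backwards from n to recover the cheapest parse */
}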

-- Ilia Muraviev

Jim Leonard

Aug 14, 2007, 12:41:09 PM
On Aug 13, 4:17 pm, Phil Carmody <thefatphil_demun...@yahoo.co.uk>
wrote:

> Perhaps the '-number' switches can be rejigged, and this be made
> the new number 9?

I'd like to see this. In fact, I'm surprised that optimal parsing
hasn't been added to gzip/zlib already; it doesn't change the
compatibility of the output. Mark, is there a reason this hasn't been
added in the last decade? (Not pointing fingers, just curious whether
there's a reason I'm not thinking of not to touch the 1.2.4 code
proper.)

Mark Adler

Aug 14, 2007, 7:24:54 PM
On Aug 14, 9:41 am, Jim Leonard <MobyGa...@gmail.com> wrote:
> In fact, I'm surprised that optimal parsing
> hasn't been added to gzip/zlib already; it doesn't change the
> compatibility of the output. Mark, is there a reason this hasn't been
> added in the last decade?

That would be due in part to the fact that Jean-loup, the deflate
author, hasn't worked on it for many years, and in part to the fact
that the intent of gzip (or zlib) was never to eke out the last bit,
but rather to provide good compression vs. speed performance. For
eking out the last bit, there are obviously other compression methods
that can do much better than the deflate format, no matter how
optimal the deflate algorithm is.

Having said that, the two-byte lookahead is easily within the realm
of good compression vs. speed performance at only 50% slower, and so
I think it should, and likely will, be provided in zlib and gzip.

A truly optimal parser for deflate would, in my estimation, be slower
than a compressor using a different method with significantly more
compression. So that would be very low on the to-do list for zlib.

Mark

collec...@googlemail.com

Aug 15, 2007, 4:08:01 PM
Hi Ilia,

What happens if you increase the dictionary size for GZIP? Is this
possible? Is it an easy hack to do, or a hard one? I think that if it
is possible, the decoder shouldn't be affected, as gzip is just
offset+length based and needs no dictionary.

Mark Adler

Aug 15, 2007, 7:49:17 PM

No, the format, and hence the decoder, has to be modified for a larger
dictionary. The deflate format can encode distances between 1 and
32768 (inclusive), and no others. There is a deflate64 format that
can encode distances up to, of course, 64K. However that hardly seems
worth the trouble since the gain is rather small.
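
For reference, those limits fall out of RFC 1951's 30 distance codes:
codes 0..3 each cover a single distance, and after that the codes
come in pairs sharing code/2 - 1 extra bits, ending at code 29 for
distances 24577..32768. zlib uses a lookup table, but an illustrative
way to compute a distance's code is:

static int ilog2(unsigned v) { int n = 0; while (v >>= 1) n++; return n; }

static int dist_code(unsigned dist)  /* dist in 1..32768 */
{
    unsigned v = dist - 1;
    int hb;
    if (dist <= 4)
        return (int)v;               /* codes 0..3: one distance each */
    hb = ilog2(v);                   /* highest set bit of dist-1     */
    return 2 * hb + (int)((v >> (hb - 1)) & 1);
}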

What you really want is a completely different format for large
dictionaries (1 MB or more). And while you're at it, you might as
well use arithmetic or range coding instead of Huffman.

Well, it's already been done. Probably several times. One that comes
to mind is LZMA, used in 7-zip.

Mark

cr88192

Aug 15, 2007, 9:47:31 PM

"Mark Adler" <mad...@alumni.caltech.edu> wrote in message
news:1187221757.8...@i13g2000prf.googlegroups.com...

> On Aug 15, 1:08 pm, collectio...@googlemail.com wrote:
>> what happens if you increase the dictionary size for GZIP? Is this
>> possible? Is it an easy hack to do, or a hard one? I think if it is
>> possible, the decoder shouldn't be affected, as gzip is just offset
>> +length based and needs no dictionary.
>
> No, the format, and hence the decoder, has to be modified for a larger
> dictionary. The deflate format can encode distances between 1 and
> 32768 (inclusive), and no others. There is a deflate64 format that
> can encode distances up to, of course, 64K. However that hardly seems
> worth the trouble since the gain is rather small.
>

well, luckily, deflate64 is just the same as deflate, only with 64k
offsets. a single decoder can be written that decodes both without
much concern for which it is.

the same encoder can also do both; by default, I just limit it to 32k
for compatibility...
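
as I understand deflate64 (worth double-checking against PKWARE's
appnote), the decode-side differences are small - two extra distance
codes plus a re-purposed length code 285, something like:

/* deflate64 decode-side deltas, as I understand them: */
static const struct { unsigned base; int extra; } dist64_tail[2] = {
    { 32769, 14 },  /* distance code 30: 32769..49152 */
    { 49153, 14 },  /* distance code 31: 49153..65536 */
};
/* ...and length code 285 becomes base 3 with 16 extra bits
   (lengths 3..65538) instead of the fixed length 258 */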


> What you really want is a completely different format for large
> dictionaries (1 MB or more). And while you're at it, you might as
> well use arithmetic or range coding instead of Huffman.
>
> Well, it's already been done. Probably several times. One that comes
> to mind is LZMA, used in 7-zip.
>

yes.


I once made a hack, a modified version of deflate supporting larger matches
and dictionaries (it switched over to using a different block format...).
wasn't really worth it though.


hypothetical (and likely impractical):

actually, if one wanted, they could make a format bitstream-identical
to deflate, but supporting very large dictionaries, by "redefining"
the contents of the bitstream (in which case, existing unaware
decoders would just produce garbage...).

another possibility is this:
doing something sort of like deflate64, except that distances between
48k and 64k mean something very different. it switches to an
LZP-style model, where '49153' does not mean '49152+1 bytes
backwards', but rather the first index in this hash chain past the
48kB mark.
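
in code, the idea might look something like this (purely hypothetical -
hash_at() and hash_chain_entry() are assumed helpers that hash the
current context and walk its chain):

static unsigned hash_at(unsigned pos);                 /* assumed */
static unsigned hash_chain_entry(unsigned h,
                                 unsigned n,
                                 unsigned limit);      /* assumed */

/* resolve a decoded distance under the LZP-style reinterpretation */
static unsigned resolve_dist(unsigned dist, unsigned cur)
{
    if (dist <= 49152)
        return cur - dist;        /* a plain backwards distance */
    /* otherwise: the n-th entry in the current context's hash chain
       whose position lies more than 48kB back */
    return hash_chain_entry(hash_at(cur), dist - 49152, cur - 49152);
}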

a more drastic hack could allow extending the length field; for
example, certain short lengths could be made special/invalid for the
'long distance' case. encoding such a length would indicate that the
length is done differently - for example, encoded as a suffix (using
the same encoding as for distances):

Len=3 Dist=51021 Auxlen=3838

in truth encoding the match:

Len=4096 Dist=4953765

the bitstream would need to contain something to indicate this
encoding, though (a special escape header):

~X## (the decoder sees '~', BTYPE==3)

with ## giving the log2 of the dictionary size, in ascii...
this also serves to blow up existing inflaters...

'~X24' indicates the stream is using a 16MB dictionary.
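
parsing the escape could be as trivial as (hypothetical, of course):

/* hypothetical check for the '~X##' escape header described above */
if (buf[0] == '~' && buf[1] == 'X') {
    int lg = (buf[2] - '0') * 10 + (buf[3] - '0'); /* "24" -> 24 */
    unsigned long dict_size = 1UL << lg;           /* 16MB for ~X24 */
    /* ...switch the inflater into large-dictionary mode... */
}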


of course, a completely new codec is probably a better option.

about the only advantage this would have is that it would allow backwards
compatibility with deflated bitstreams (much like deflate64). and a
multi-way compatible encoder could also be written.



collec...@googlemail.com

Aug 16, 2007, 7:41:26 AM
On Aug 16, 12:49 am, Mark Adler <mad...@alumni.caltech.edu> wrote:
> On Aug 15, 1:08 pm, collectio...@googlemail.com wrote:
>
> > what happens if you increase the dictionary size for GZIP? Is this
> > possible? Is it an easy hack to do, or a hard one? I think if it is
> > possible, the decoder shouldn't be affected, as gzip is just offset
> > +length based and needs no dictionary.
>
> No, the format, and hence the decoder, has to be modified for a larger
> dictionary. The deflate format can encode distances between 1 and
> 32768 (inclusive), and no others. There is a deflate64 format that
> can encode distances up to, of course, 64K. However that hardly seems
> worth the trouble since the gain is rather small.

I see now - I misinterpreted (through lack of knowledge in the
field ;)) what he meant by dictionary size.

To me, dictionary size is the number of words in a dictionary. He
must have meant the maximum range that a dictionary reference can
reach.

For example, the (tentative) definition for a compressor I hope to be
working on later will have a 128-slot dictionary with infinite range.
That's for non-offset-based "words". For offset-based words, it has
an infinite number within a 16K range. Although I suppose you could
say that it only has 16K maximum words... but even that's false, as a
word would have to be about 4 bytes long to be worth storing as an
offset. So it might be a 4K dictionary.

Of course I know nothing about making good compression, so it might
turn out that 16K is too small... Do you think 1MB is a better range
size?
Roy

Oct 1, 2007, 3:08:42 AM
On Aug 14, 3:29, encode <enc...@narod.ru> wrote:
> Just made a small improvement to deflate.c - I changed the parsing
> scheme to lazy matching with a 2-byte lookahead. Compression speed
> is not much affected, and compression mostly improves a little:
> [...]

I adapted this patch for zlib a month ago. Feel free to use it. ;)

The patch for zlib is here:
http://three.fsphost.com/rtfreesoft/zlib_deflate_lc2.patch
