LZ77 Idea

zeroone

unread,

Jun 29, 2005, 11:19:50 AM6/29/05

to

Hi everyone,

at the moment, I am working on an LZ77 (more specific: LZSS) compressor.
Usually a constant amount of bits are used for codewords, consisting of
offset and length of matches.

Since working bytewise is easier and often also faster, I chose 3 Byte
(24 bits) codewords to be the best choice for storing offset and length.
My idea (I don't know if someone has already used this):
Instead using always 3 byte codewords I also include 2 and even 1 byte
codewords. So if no 4 byte match is found (minimum match length when
using 3 byte codewords, to save at least 1 byte) the compressor looks
for 3 byte matches to be encoded using 2 bytes.

The first bit of the codewords is used as a flag to indicate, if the
codeword has 2 or 3 bytes length. Of course, matchoffset und matchlength
lengths suffer from less bits, but at least there is a chance to
compress also 3 byte matches. Up to here everything works fine
(theoretically, since I did not implement this algo by now).

What if also no 3 byte matches are found? Usually the compressor has to
output a literal now, BUT what if we search for 2 byte matches to be
encoded using 1 byte codewords. Of cource, again, offset and length is
very limited, but small compression is better that no compression at all ;)

My problem: I don't know how to indicate 3 different codeword lengths.
One flagbit can only handle two different states. Using a second flagbit
would hurt offset and length (especially when using 1 and 2 byte
codewords) very bad.

So does anyone have any idea how to solve this? Any help is appreciated.

Big thanks,
zeroone

zeroone

unread,

Jun 29, 2005, 6:43:51 PM6/29/05

to

Jim Leonard wrote:

> zeroone wrote:
>
>>So does anyone have any idea how to solve this? Any help is appreciated.
>
>

> A few random thoughts spring to mind:
>
> - Add a codeword so that the extra bit isn't "wasted". Does your
> sliding window support sizes greater than 16777216? If so, consider a
> 32-bit code.
>
> - Limit yourself to two codeword types (short and long) so that you
> don't need an extra bit (my 8088 compressor does this, but also because
> my sliding window is short enough to support it; yours may not).
>
> - Use Golomb codes so that your codewords are variable (slows things
> down as you no longer have byte-wise codes)
>
> Hope this helps...
>

Thank you for your reply.

As I want to do everything byte-wise I won't use variable length codes.

What do you mean with 32-bit codes? 32 bits long codewords? Spending
more than 24 bits for codewords does not make sense in my compression,
since offset and length can perfectly be stored within only 3 bytes (of
course considering a max. 65k sliding window).

Using only two codeword types is possible without any problems.
But if possible I'll try to include all three types. Coding even 2 byte
matches would achieve great compression. That's the reason I am asking
for help, maybe there is a solution for this issue. If I am on the wrong
track I have to implement the two type version..

Hopefully anyone has some advise or just 'thoughts' :)

Jim Leonard

unread,

Jun 29, 2005, 8:00:13 PM6/29/05

to

zeroone wrote:

> What do you mean with 32-bit codes? 32 bits long codewords? Spending
> more than 24 bits for codewords does not make sense in my compression,
> since offset and length can perfectly be stored within only 3 bytes (of
> course considering a max. 65k sliding window).

Exactly. If you increase the size of your window, you can expand to a
4th 32-bit code.

Jacek Wawrzaszek

unread,

Jun 29, 2005, 9:07:09 PM6/29/05

to

"zeroone" wrote:
>
> As I want to do everything byte-wise I won't use variable length codes.
>

You can use variable length code to encode type of codeword and still
do it byte-wise:

0 -> 7-bit codeword
10 -> 14-bit codeword
11 -> 22-bit codeword

>
> Using only two codeword types is possible without any problems.
> But if possible I'll try to include all three types. Coding even 2 byte
> matches would achieve great compression. That's the reason I am asking
> for help, maybe there is a solution for this issue. If I am on the wrong
> track I have to implement the two type version..
>

If you have many 2 byte matches, you might try to use all 7 bits of the
shortest codeword as offset and encode with it only 2 byte strings.

J.

Jim Leonard

unread,

Jun 29, 2005, 3:03:48 PM6/29/05

to

zeroone wrote:
> So does anyone have any idea how to solve this? Any help is appreciated.

A few random thoughts spring to mind:

zeroone

unread,

Jun 30, 2005, 5:32:28 AM6/30/05

to

Jacek Wawrzaszek wrote:
> "zeroone" wrote:
>
>>
>> As I want to do everything byte-wise I won't use variable length codes.
>>
>
> You can use variable length code to encode type of codeword and still do
> it byte-wise:
>
> 0 -> 7-bit codeword
> 10 -> 14-bit codeword
> 11 -> 22-bit codeword

That seems to be a great solution. Although I loose some bits all three
types can be used in only one function. Thank you!

Now I am wondering, if it would be better to split it up into two
distinct functions:

The first encodes big matches down to 3 bytes:
1 -> 23-bit codeword (offset+length) if the match is > 3 bytes
0 -> 15-bit codeword (*only*offset) if matches are exactly 3 bytes long.
Length is not needed any more, because this type encodes only 3 byte
matches.

The second function encodes only matches of 2 bytes, since all bigger
matches have been previously encoded:
8-bit codeword (*only* offset). Again length isn't needed, because it is
constant.

How does this idea sound? Would it achieve good compression? Are there
still any improvements?

Up to here I favor the first method (one big function), since the second
short-match function has to compensate the flagbytes that are stored in
each 8th byte. So there must be at least one 2-byte-match each 8th byte.

Big thanks,
zeroone

cr88192

unread,

Jun 30, 2005, 8:00:20 AM6/30/05

to

"zeroone" <no...@no-spam.de> wrote in message
news:d9ue6n$cu7$03$1...@news.t-online.com...

> Hi everyone,
>
> at the moment, I am working on an LZ77 (more specific: LZSS) compressor.
> Usually a constant amount of bits are used for codewords, consisting of
> offset and length of matches.
>

<snip>

>
> My problem: I don't know how to indicate 3 different codeword lengths. One
> flagbit can only handle two different states. Using a second flagbit would
> hurt offset and length (especially when using 1 and 2 byte codewords) very
> bad.
>
> So does anyone have any idea how to solve this? Any help is appreciated.
>

if you are using a byte-packing scheme (vs my usual approach of codespace
mangling...), consider possibly using 2 bits instead 1 for indicating the
run (note: as described, it sounded like you were considering something
like:
1 bit + (0).3.12 and 1 bit + (1).7.16 or something?).

eg: 0=literal byte; 1=2 byte run (eg: a 4.12 or 6.10 representation); 2=3
byte run; 3=something or another (eg: a 4 byte window, 8.24 or 12.20, or
reserved, ...).

this does not hurt the runs themselves, but does increase the total overhead
(eg: worst case 25% inflation instead of 12.5%).

at least:
it would have more range than plain 1 bit+4.12 lzss, it risks less run
overhead than 1 bit+8.16 would.

note: I have found that, most commonly, matches are between about 8-32
chars. the limit of 16 or 18 often imposed by lzss hurts (due to the small
limit and dictionary), when it comes to longer runs (eg: >64 chars) they are
rare enough to justify a little extra cost.
I have noted that locality is also often pretty high, my testing has shown
that once one starts getting past about 32 or 64kB on typical data, payback
starts diminishing quickly vs the costs (such as that of performing the
lookup, and that of encoding the longer references).

also: literals seem to be fairly rare vs. runs for much well, compressible
data, which I have not noticed until more recently.

so, I would vote for, 2 bit codes:
0=literal; 1=6.10; 2=8.16; 3=whatever.
or maybe: 0=1 literal; 1=2 literals; 2=6.10; 3=8.16.

yet also possible, the packbits could be a kind of limited variable-length
scheme: 0=literal; 00=6.10; 01=8.16. though possible, this could lead to a
more complicated decoder.

the first above would be simplest probably.

any of this at all interesting/helpful?...

maybe I could try this as well, it would be a break from my beating against
(and not making much progress on) writing a physics simulator (stupid,
collision, normals...).
me starting on adding convex hulls as, me realizing, apparently the algo for
comming up with good obb normals is not that much simpler. at least (I
think) I know a good enough approach for hull normals. otherwise, aabbs and
spheres as about the only well behaved primitives is kind of lame.

or such...

zeroone

unread,

Jul 1, 2005, 8:02:33 AM7/1/05

to

So you are referring to the flagbytes? Actually I am using seperate
flag*bytes* instead using flagbits straight before the primary units. So
using 2 bits per unit would mean 2 flagbytes per 8 units. Are you sure
it is worth the cost? That would enlarge the code by 2 bytes each 8
literals/codewords, meaning up to 25% more bytes in the worst case.

At the moment I favor either using 1 or 2 bits of the codewords
theirselves to support all three types, or using just 2 types (long and
short) and -again- using no additional flagbytes, but the codewords
theirselves to store the type (in this case their first bit).

I really don't know what's better..

cr88192

unread,

Jul 1, 2005, 10:44:10 AM7/1/05

to

"zeroone" <no...@no-spam.de> wrote in message

news:da3bcp$i9$02$1...@news.t-online.com...

I mentioned this, the worst case is worse, but the best case is potentially
not as bad. I justify this by mentioning that, in my experience with common
data, literals are rare vs. runs.

I had more imagined though: 1 flagbyte for every 4 units, but 2 for 8 is
effectively the same ratio (it more depends on whether they are done 8 or 16
bits at a time...).

trial and error would be needed to determine whether the cost is actually
justified (note: some of my bytewise lz77 schemes risk equal or worse
inflation in the worst case, but the results are not allways negative). ok,
in my case, the costs may have been reduced given ways I have typically
managed the codespace, but oh well.

> At the moment I favor either using 1 or 2 bits of the codewords
> theirselves to support all three types, or using just 2 types (long and
> short) and -again- using no additional flagbytes, but the codewords
> theirselves to store the type (in this case their first bit).
>

yes, one either needs a smaller length or offset in this case.
3.12, likely too poor to be useful.
4.11, maybe ok. 5.10, maybe better, maybe worse.

24 bit case, one either has 7.16 or 8.15, so either a run limit of 128, or a
window limit of 32kB.

> I really don't know what's better..

well, trial and error would probably be needed to determine this.

imo this is a decent way to try things, but is limited. apparently trial and
error based approaches have more limited success against more complex
problems (mathy stuff, or stuff requiring much "design"). results with
compression have been acceptable though imo...

I have not yet gotten to trying this, more fiddling with a physics engine
and other stuff...

cr88192

unread,

Jul 5, 2005, 1:25:47 AM7/5/05

to

"cr88192" <cr8...@NOSPAM.hotmail.com> wrote in message
news:c5043$42c556c7$ca801623$80...@saipan.com...

>
> "zeroone" <no...@no-spam.de> wrote in message
> news:da3bcp$i9$02$1...@news.t-online.com...
>> cr88192 wrote:

<snip>

>
>> At the moment I favor either using 1 or 2 bits of the codewords
>> theirselves to support all three types, or using just 2 types (long and
>> short) and -again- using no additional flagbytes, but the codewords
>> theirselves to store the type (in this case their first bit).
>>
> yes, one either needs a smaller length or offset in this case.
> 3.12, likely too poor to be useful.
> 4.11, maybe ok. 5.10, maybe better, maybe worse.
>
> 24 bit case, one either has 7.16 or 8.15, so either a run limit of 128, or
> a window limit of 32kB.
>
>> I really don't know what's better..
>
> well, trial and error would probably be needed to determine this.
>

I finally got around to it. in case anyone cares, here is what I found from
testing:
2 flagbits hurts compression more than using a single bit in the run;
3.12 and 5.10 lead to worse compression, 4.11 gave the best results;
7.16 is better than 8.15 in my tests.

results are pretty close to those of my 'lzbss' algo, just without the whole
codespace mangling thing (allowing a simpler and possibly faster decoder).

rfc3261: lzbss 195kB, this 199kB, gzip 174kB.
calagary corpus (as a tarball): lzbss 1.21MB, this 1.23MB, gzip 1.03MB.
I guess this is probably reasonable for a bytewise lz77 variant.

so? I don't know...

encoding (for now at least):
window is "flat", offsets are relative to the current position,
offsets>65536-256 are not allowed (so that using a circular buffer for
decode is possible).
flagbits are packed 8 at a time into a byte going from low to high.

flagbits 0=run, 1=literal.
0: 16 bit run
MSB=0, low 7=high bits of relative offset
high 4=low bits of offset, low 4=length-2
1: 24 bit run
MSB=1, low 7=length
high 8 bits of offset
low 8 bits of offset

may add some sort of file-header or something (I just change algos too damn
often...).

or whatever...