Have a look at http://schnaader.info/coding/precomp/precomp.html
The current version is v0.3. If you like, test it and mail me any bugs
you find.
Greetings,
Christian "schnaader" Schneider
---
http://schnaader.info
Damn kids. They're all alike.
Could you describe the algorithm that you have used? I have attempted
something similar in the past, but am curious about your approach. It
seems rather good!
Thanks,
Malcolm
I agree with that:
4,526,946 FlashMX.pdf
21,845,869 FlashMX.pcf (3 min 25 sec, decompression only 3 sec)
2,111,696 FlashMX.paq8igcc -6 (38 min 12 sec, 2x JPEG 641x291, 2x JPEG 481x218)
It seems like such an obvious idea in hindsight, but bravo for thinking
of it!
Carry around zlib, and with a few good guesses (trial and error?) you
can probably reproduce the exact encoding done when the original deflate
stream was created. Then all you need for side channel data are the
arguments to pass to zlib.
It should work for all PNG files, gzip files, swf, pdf and a lot of zip
files. It won't work on optimised zip files, such as ones created by 7zip,
WinRK or WinRAR (and possibly others). Still, since zlib has become a de
facto standard in a lot of places, it should be applicable to a lot of files.
Malcolm
Yours,
Stephan Busch
That's even better than what I expected for paq8i. Thanks for testing!
Malcolm Taylor wrote:
> Hi again,
>
> It seems like such an obvious idea in hindsight, but bravo for thinking
> of it!
It is really obvious if you think about it, and I was very surprised
that I didn't find anything like this. Actually, that was my biggest
motivation when I began coding Precomp.
> Carry around zlib, and with a few good guesses (trial and error?) you
> can probably reproduce the exact encoding done when the original deflate
> stream was created. Then all you need for side channel data are the
> arguments to pass to zlib.
The "trial and error" part is what slows down Precomp. There are 9
different compression levels and 9 different memory levels, and they
don't appear in the zLib header, so the program has to try
recompression up to 81 times per stream. But there are already some
optimizations that speed that part up a lot.
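A minimal sketch of that trial loop in Python, using only the standard zlib module (the function name `find_parameters` is illustrative, not Precomp's actual code, and this assumes the stream has already been located and starts with a zlib header):

```python
import zlib

def find_parameters(stream):
    """Try all 81 (level, memLevel) pairs until recompressing the
    decompressed data reproduces the original stream byte for byte."""
    data = zlib.decompress(stream)
    for level in range(1, 10):
        for mem_level in range(1, 10):
            co = zlib.compressobj(level, zlib.DEFLATED, 15, mem_level)
            if co.compress(data) + co.flush() == stream:
                return level, mem_level
    return None  # stream was not produced by a stock zlib deflate
```

If a match is found, the decompressed data plus those two small parameters are all the side-channel information needed to restore the stream losslessly.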
> It should work for all PNG files, gzip files, swf, pdf and a lot of zip
> files. It won't work on optimised zip files, such as ones created by 7zip,
> WinRK or WinRAR (and possibly others). Still, since zlib has become a de
> facto standard in a lot of places, it should be applicable to a lot of files.
Actually, gzip files are not supported, but they will be soon.
Stephan Busch (Squeeze Chart 2005) wrote:
> This program is more than just good - in my tests it outperformed all
> existing SWF, SymbianOS (SIS) file compressors and scored well at PDF
> compression; however my PDF was compressed better using Multivalent,
> and it does not support deflate64 ZIP files.
Multivalent often scores better because of its half-lossy nature, most
notably at maximum settings. That's the price Precomp pays for being
100% lossless. Another difference is that Precomp can't process all PDF
stream types, for example ASCII85Decode streams, LZWDecode or other
variants in use.
Nevertheless, I am working on some additional processing of PDF files,
which could lead to better results.
Also, if you haven't already, try running Precomp several times on the
files, like PDF -> PCF -> PCF2 -> PCF3... this sometimes leads to better
results. Precomp will tell you when a pass didn't lead to further savings.
Deflate64 is not supported by zLib, so I'll have to use an
implementation from a different source, like the 7-Zip source code. This
will follow after I've added gzip support.
Thank you all for the positive reactions,
Unfortunately precomp doesn't work on ohs.doc with paq8igcc, I think
because it interferes with the jpeg model. ohs.doc has 3 embedded
jpegs, the first of which is very large (3 MB) and highly redundant.
It grows by 200 bytes in ohs.pcf. When I run paq8igcc, it speeds up
when the jpeg is detected, as it should, but then slows down after
about 120 KB, which means the jpeg model probably detected an error and
fell back to normal compression.
Also, does precomp work on zip files? I tried alice29.zip created with
pkzip 2.0.4 and 7zip 4.4.2 -tzip but precomp didn't recognize either
one. It works OK with alice29.jar though.
C:\res\maxcomp>precomp ohs.doc
Precomp v0.3 - ALPHA version - USE FOR TESTING ONLY
Free for non-commercial use - Copyright 2006 by Christian Schneider
Input file: ohs.doc
Output file: ohs.pcf
100.0% - New size: 4172444 instead of 4168192
Done.
Time: 21546 ms
Decompressable streams: 116
Recompressed streams: 4
You can speed up Precomp for THIS FILE with this parameters:
-c46 -m18
C:\res\maxcomp>start /b /belownormal \res\compress\paq8i\paq8igcc -6 x1
ohs.doc
C:\res\maxcomp>4168192 ohs.doc: JPEG 3506x2155 JPEG 514x663 -> 553010
4168192 -> 553039 (1.0614 bpc) in 119.28 sec (34.944 KB/sec)
C:\res\maxcomp>start /b /belownormal \res\compress\paq8i\paq8igcc -6 x2
ohs.pcf
C:\res\maxcomp>4172444 ohs.pcf: JPEG 3506x2155 JPEG 514x663 -> 678469
4172444 -> 678498 (1.3009 bpc) in 203.50 sec (20.503 KB/sec)
It's still second place, though.
http://maximumcompression.com/data/doc.php
-- Matt Mahoney
schn...@gmx.de wrote:
> The "trial and error" part is what slows down Precomp. There are 9
> different compression levels and 9 different memory levels, and they
> don't appear in the zLib header, so the program has to try
> recompression up to 81 times per stream. But there are already some
> optimizations that speed that part up a lot.
Might I suggest gathering statistics while decompressing? Most likely
the deflate block size and the maximum distance the matches reach
will help narrow down the choices (or at least help choose the most
likely ones first).
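One coarse hint is already sitting in the two-byte zlib header: the 2-bit FLEVEL field records which bucket the compression level fell into (zlib's deflate.c maps levels 0-1, 2-5, 6 and 7-9 to FLEVEL 0-3). It can't pin down the exact level, but it can order the trials so the most likely candidates come first. A sketch (`likely_levels` is an illustrative name; the ordering inside each bucket is an arbitrary choice):

```python
import zlib

# zlib's deflate.c encodes a coarse level hint in FLEVEL (bits 6-7 of
# the second header byte): levels 0-1 -> 0, 2-5 -> 1, 6 -> 2, 7-9 -> 3.
FLEVEL_BUCKETS = {0: [1, 0], 1: [5, 4, 3, 2], 2: [6], 3: [9, 8, 7]}

def likely_levels(stream):
    """Order candidate compression levels, most plausible first."""
    hinted = FLEVEL_BUCKETS[stream[1] >> 6]
    rest = [lvl for lvl in range(1, 10) if lvl not in hinted]
    return hinted + rest
```

With level 6 (the zlib default) the hint is even exact, since level 6 has a FLEVEL bucket to itself.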
> Also, if you didn't, try Precomp several times on the files, like PDF
> -> PCF -> PCF2 -> PCF3... this will lead to better results sometimes.
> Precomp will tell you when this didn't lead to further savings.
You could probably make this automatic. Whenever you decompress a
stream, you could look for valid deflate streams within it and recurse...
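A rough sketch of that recursion, using the standard zlib header check (low nibble of the first byte is 8 for deflate, and the two header bytes read big-endian are divisible by 31) to find candidate offsets; names are hypothetical, not Precomp's code:

```python
import zlib

def looks_like_zlib_header(b0, b1):
    """CM = 8 (deflate) and the header bytes pass the FCHECK test."""
    return (b0 & 0x0F) == 8 and ((b0 << 8) | b1) % 31 == 0

def expand_recursive(buf, depth=0, max_depth=4):
    """Decompress every complete zlib stream found in buf, then recurse
    into the decompressed data to catch streams nested inside streams."""
    found = []
    for off in range(len(buf) - 1):
        if not looks_like_zlib_header(buf[off], buf[off + 1]):
            continue
        d = zlib.decompressobj()
        try:
            data = d.decompress(buf[off:])
        except zlib.error:
            continue
        if d.eof and data:
            found.append((depth, off, len(data)))
            if depth < max_depth:
                found.extend(expand_recursive(data, depth + 1, max_depth))
    return found
```

The depth limit is just a guard against pathological nesting; Precomp's PCF -> PCF2 -> PCF3 passes achieve the same effect one level at a time.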
> Deflate64 is not supported by zLib, and I'll have to use an
> implementation from a different source like the 7-Zip source code. This
> will follow when I added gzip support.
This is less likely to be of use, since to the best of my knowledge
there are several different deflate64 implementations: WinRK's is
custom-built, 7zip's is too, and I know nothing about WinRAR's.
Malcolm
PS. Thanks, it is always fun to see something like this emerge! :) I
just might have to fit the idea into WinRK... (if I can find the time)
> Unfortunately precomp doesn't work on ohs.doc with paq8igcc, I think
> because it interferes with the jpeg model. ohs.doc has 3 embedded
> jpegs, the first of which is very large (3 MB) and highly redundant.
IIRC, ohs.doc also has a few PNG files (very small), which precomp
should find. Obviously it is getting confused by the jpeg image.
A suggestion for Christian is to look for the JPEG SOI and EOI markers
and ignore anything in between (I have used a similar technique to parse
ohs.doc in the past for manual analysis).
Malcolm
Or avoid transforming small segments. It appears there are several
embedded in the large jpeg. A problem with using SOI and EOI markers
is that they are 2 bytes each and can appear in random data.
-- Matt Mahoney
> Or avoid transforming small segments. It appears there are several
> embedded in the large jpeg. A problem with using SOI and EOI markers
> is that they are 2 bytes each and can appear in random data.
True of the SOI marker, but IIRC I had some form of heuristic to
determine the true image start markers from the false ones... I'd have
to look back at my code to remember just what I did though :)
Malcolm
When using verbose mode (parameter -v), piping the output to a file, and
searching for "Best match", the problem becomes clearer:
---
Possible zLib-Stream found at position 461673, windowbits = 15
Best match with compression level 6: 638 bytes, decompressed to 2641 bytes
Possible zLib-Stream found at position 666881, windowbits = 15
Best match with compression level 4: 31 bytes, decompressed to 42 bytes
Possible zLib-Stream found at position 3358560, windowbits = 15
Best match with compression level 4: 19 bytes, decompressed to 178 bytes
Possible zLib-Stream found at position 3857585, windowbits = 15
Best match with compression level 6: 638 bytes, decompressed to 2641 bytes
---
The first and the last match are the 2 PNG files in ohs.doc; the other
two matches are positions inside the JPEG file that are mistaken for
zLib streams.
The following solutions come to mind:
1. Parsing JPEG files (looking for SOI and JFIF, possibly parsing the
other blocks, and looking for EOI) and skipping them. This would lead to
a new parameter, for example "jpegignore".
2. Ignoring small segments, like Matt suggested. This would lead to a
new parameter, for example "minsize". At the moment, I'm ignoring small
segments, but only if they're 4 bytes in size or smaller, so this is
easy to implement.
3. Ignoring user-chosen streams, so for ohs.doc you could use something
like "precomp ohs.doc -ignore666881 -ignore3358560" where the numbers
after "ignore" are the positions of the streams to ignore. Another way
would be to assign consecutive numbers to the streams and add them to
the verbose output, so you would just have to ignore stream #2 and #3
with "precomp ohs.doc -ignorenr2 -ignorenr3".
All 3 solutions seem good to me, so I will possibly use all of them in
the next version.
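Solution 1 could look roughly like this sketch (not Precomp's code; it only handles the baseline marker layout, requires an APPn segment right after SOI to cut down false positives, and leans on the fact that byte stuffing and restart markers inside the entropy-coded data never produce the 0xFFD9 EOI pattern):

```python
def find_jpeg_ranges(buf):
    """Return (start, end) ranges of likely JPEG files so the zlib
    scanner can skip them."""
    ranges, i, n = [], 0, len(buf)
    while i < n - 3:
        # SOI alone is only two bytes, so also require that an APPn
        # segment (JFIF/Exif etc.) follows immediately.
        if (buf[i] == 0xFF and buf[i + 1] == 0xD8 and
                buf[i + 2] == 0xFF and (buf[i + 3] & 0xF0) == 0xE0):
            end = _walk_to_eoi(buf, i)
            if end is not None:
                ranges.append((i, end))
                i = end
                continue
        i += 1
    return ranges

def _walk_to_eoi(buf, start):
    pos, n = start + 2, len(buf)        # skip SOI
    while pos + 4 <= n and buf[pos] == 0xFF:
        marker = buf[pos + 1]
        length = (buf[pos + 2] << 8) | buf[pos + 3]
        pos += 2 + length               # skip marker + segment payload
        if marker == 0xDA:              # SOS: entropy-coded data follows
            while pos + 1 < n:          # scan for the EOI marker
                if buf[pos] == 0xFF and buf[pos + 1] == 0xD9:
                    return pos + 2
                pos += 1
            return None                 # truncated image
    return None                         # malformed segment chain
```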
Matt Mahoney wrote:
> Also, does precomp work on zip files? I tried alice29.zip created with
> pkzip 2.0.4 and 7zip 4.4.2 -tzip but precomp didn't recognize either
> one. It works OK with alice29.jar though.
Not every ZIP variant is supported, and it seems that there is much
work left to do. The files you mention seem to be optimised zip files.
I guess they don't use Deflate or don't add a zLib header, so Precomp
can't recognize the streams.
In a later version, there will be a parameter '-brute' to "recognize"
Deflate streams without a header. As the name says, it is more like
brute force: every byte in the input file is assumed to be the beginning
of a stream, and Precomp tries to recompress from there. This slows
Precomp down extremely and leads to many more incorrectly detected
streams, but it will detect streams that have no zLib/ZIP/gzip header or
no header at all, so it can be used to add new header types to Precomp.
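A naive version of that brute-force scan: try a raw (headerless) inflate at every offset and keep the ones that decode cleanly. The `min_out` threshold is an arbitrary choice to discard tiny false positives, since short runs of random bytes often happen to be valid deflate data.

```python
import zlib

def brute_scan(buf, min_out=32):
    """Attempt a raw deflate decode at every offset; keep offsets where
    a complete stream of reasonable size ends inside the buffer."""
    hits = []
    for off in range(len(buf)):
        d = zlib.decompressobj(wbits=-15)   # negative wbits = raw deflate
        try:
            out = d.decompress(buf[off:])
        except zlib.error:
            continue                        # not a valid stream here
        if d.eof and len(out) >= min_out:
            hits.append((off, len(out)))
    return hits
```

Requiring `d.eof` means the deflate stream must actually terminate within the buffer, which filters out most of the accidental matches.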
I just found out that both mistaken streams in ohs.doc use compression
level 4 and memory level 1, while the other streams use compression
level 6 and memory level 8, so a workaround is to use the parameters
'-c6 -m8' with ohs.doc instead of the recommended '-c46 -m18'.
The result will be 200 bytes smaller than before (4172244 bytes) and
can hopefully be compressed correctly with paq8igcc (I haven't tested
it yet).
That worked.
paq8i -6 ohs.doc = 553,039 (#1 on maximumcompression.com)
precomp -c6 -m8 ohs.doc | paq8i -6 ohs.pcf = 552,690
-- Matt Mahoney
Some more results on FlashMX.pcf. These all beat 3,549,078, the best
on http://maximumcompression.com/data/pdf.php
bzip2 -9 = 3,155,076
7zip -m0=ppmd:mem=768m:o16 = 2,799,673
ppmonstr -m800 -o16 = 2,446,754
paq8igcc -7 (900 MB memory) = 2,109,667
Yeah, it's pretty good.
-- Matt Mahoney