
how can i programmatically decompress lzw data (that's from pdfs)?


ben

Sep 15, 2003, 1:38:39 PM
what can i use or how can i decompress lzw data from a pdf file?

i'm writing code to read pdfs, and i've just attempted to use the unix
utility 'uncompress' from within my code to decompress an lzw pdf data
stream, but it fails - my code takes the lzw compressed data out of the
pdf and saves just that data into a temporary file, then attempts to
use 'uncompress' on that temporary file, but it errors with
"Inappropriate file type or format"

this is a section from the pdf that shows the start of the data (after
the word stream):

<<
/Length 7186
/Filter /LZWDecode
>>
stream
匿沫P哄1.............

after the newline that comes after 'stream' is where my temporary file
starts from and continues for 7186 bytes.
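
(just to show what i mean - this is only a rough sketch, not my actual
code, and the helper name and arguments are made up: the step that pulls
the raw stream bytes out of the pdf, which i then write to the temp file,
looks roughly like this)

#include <stdlib.h>
#include <string.h>

/* p is assumed to point at the "stream" keyword; length is the parsed
   /Length value (7186 in the example above) - a sketch only */
static unsigned char *copy_stream_data(const unsigned char *p, size_t length)
{
    p += 6;                      /* skip over the word "stream"          */
    if (*p == '\r') p++;         /* pdf allows CR LF or just LF after it */
    if (*p == '\n') p++;

    unsigned char *data = malloc(length);
    if (data != NULL)
        memcpy(data, p, length); /* the raw, still-compressed lzw bytes  */
    return data;                 /* caller frees; NULL if malloc failed  */
}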

so i'm either preparing the temp file incorrectly? but i don't *think*
so, or maybe 'uncompress' will not work with this type of lzw data? if
not, what could i use? how can i uncompress lzw pdf data?

any info much appreciated.

i'm writing the code in c (also objective-c) on a mac, os x, which is
unix based.

thanks, ben

Logan Shaw

Sep 15, 2003, 2:04:21 PM
ben wrote:
> so i'm either preparing the temp file incorrectly? but i don't *think*
> so, or maybe 'uncompress' will not work with this type of lzw data? if
> not, what could i use? how can i uncompress lzw pdf data?

This page

http://www.prepressure.com/pdf/info/compression.htm

says that it uses ZIP compression, which is "a somewhat smarter version
of LZW compression". I don't think you'll be able to use a command-line
utility to decompress using ZIP's algorithm (since ZIP includes extra
data, namely a table of contents of files). But you may be able to
grab source code from, say, Info-ZIP and pull out the appropriate sections.
Try http://www.info-zip.org/pub/infozip/UnZip.html .

- Logan

ben

Sep 15, 2003, 5:18:21 PM
In article <Fan9b.37113$jV1....@twister.austin.rr.com>, Logan Shaw
<lshaw-...@austin.rr.com> wrote:

thanks very much for your reply.

zip de/compression is not mentioned in adobe's pdf 1.5 specifications
at all - not a single mention of it, and it does go into some detail
about all the compressions that pdf documents can and do use. if zip
rather than lzw were required i'm sure it'd be mentioned in the pdf
specs. so i don't think using something that's geared to lzw but not
zip decompression is what's stopping me decompressing successfully. it
doesn't look like zip does lzw either. the data i'm attempting to
decompress is definitely lzw.


it now looks like the unix utility i'm using 'uncompress' does not in
fact handle lzw - i was led to believe that it did.

so if anyone's got any pointers how i could go about decompressing some
lzw data programmatically it'd be great.

thanks, ben

Marco Schmidt

Sep 15, 2003, 5:52:26 PM
ben:

[...]

>zip de/compression is not mentioned in adobe's pdf 1.5 specifications
>at all - not a single mention of it, and it does go into some detail
>about all the compressions that pdf documents can and do use. if zip
>rather than lzw were required i'm sure it'd be mentioned in the pdf
>specs.

He probably means Deflate. It's used in Zip and PDF.

[...]

>it now looks like the unix utility i'm using 'uncompress' does not in
>fact handle lzw - i was led to believe that it did.

It probably does. I think you falsely assume that all implementations
of LZW create exactly the same type of bitstream when in fact there
are all kinds of differences. Maximum code size, bit order within a
byte, etc.
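
For illustration only (this sketch isn't taken from any particular
decoder): PDF's LZWDecode packs its 9- to 12-bit codes into bytes
most-significant-bit first, while Unix compress packs codes
least-significant-bit first and prepends a 0x1f 0x9d magic header, so a
decoder written for one bitstream layout misreads the other even though
both are "LZW". The two styles of code reader look roughly like this:

#include <stddef.h>

typedef struct {
    const unsigned char *buf;  /* compressed bytes                   */
    size_t len, pos;           /* input length and read position     */
    unsigned long bitbuf;      /* bits fetched but not yet consumed  */
    int bitcount;
} BitReader;

/* PDF/TIFF style: codes come out of the high end of each byte */
static int get_code_msb(BitReader *r, int width)
{
    while (r->bitcount < width) {
        if (r->pos >= r->len) return -1;
        r->bitbuf = (r->bitbuf << 8) | r->buf[r->pos++];
        r->bitcount += 8;
    }
    r->bitcount -= width;
    return (int)((r->bitbuf >> r->bitcount) & ((1UL << width) - 1));
}

/* compress(1) style: codes come out of the low end of each byte */
static int get_code_lsb(BitReader *r, int width)
{
    int code;
    while (r->bitcount < width) {
        if (r->pos >= r->len) return -1;
        r->bitbuf |= (unsigned long)r->buf[r->pos++] << r->bitcount;
        r->bitcount += 8;
    }
    code = (int)(r->bitbuf & ((1UL << width) - 1));
    r->bitbuf >>= width;
    r->bitcount -= width;
    return code;
}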

>so if anyone's got any pointers how i could go about decompressing some
>lzw data programmatically it'd be great.

Isn't that described in the PDF specification?

If not, search for existing PDF decoder implementations like xpdf (C)
or the Multivalent browser (Java).

By the way, there is a newsgroup for PDF: comp.text.pdf. There you
will probably get more in-depth information on that format.

Regards,
Marco

ben

Sep 15, 2003, 8:07:08 PM
In article <bpccmv026n2m0mt4a...@4ax.com>, Marco Schmidt
<marcos...@geocities.com> wrote:


> > it now looks like the unix utility i'm using 'uncompress' does not in
> > fact handle lzw - i was led to believe that it did.

> It probably does.

right, that's very interesting. you think there's a chance that
'uncompress' does handle lzw, but this happens to be a slightly different
version of it. hmm.

> I think you falsely assume that all implementations
> of LZW create exactly the same type of bitstream when in fact there
> are all kinds of differences. Maximum code size, bit order within a
> byte, etc.

yup, good point - you're right there. i am unsure of the particular
format nuances and that's been something continually worrying me about
this. it wasn't so much assuming identical formats though, more
blindly/vigorously ignoring that worry :) good to have that pointed
out.

> >so if anyone's got any pointers how i could go about decompressing some
> >lzw data programmatically it'd be great.
>
> Isn't that described in the PDF specification?

well yes, but it's complicated. i was really hoping that the actual
lzw decompression algorithm itself, the nub of it, was something that i
could lift from elsewhere rather than implement myself.

> If not, search for existing PDF decoder implementations like xpdf (C)
> or the Multivalent browser (Java).

unfortunately xpdf, at least pdf2text (which is part of xpdf) is
actually in c++ which i do not know. i only know c. shame. i was
looking at that earlier on.

> By the way, there is a newsgroup for PDF: comp.text.pdf. There you
> will probably get more in-depth information on that format.

yes, i have also asked on there. nothing fruitful as yet.

ok, thanks very much for the info and coments - most helpful.

ben.

Stuart Caie

Sep 16, 2003, 4:59:29 AM
ben wrote:
> right, that's very interesting. you think there's a chance that
> 'uncompress' does handle lzw, but this happens to be a slightly different
> version of it. hmm.

Earlier versions of xpdf used to modify the LZW data as appropriate and
shell out to the uncompress command (or the compress decoder in gzip) before it
was deemed acceptable to use an internal decoder. Let me tell you how SLOW
that was. It was so SLOW, it was like a SNAIL trudging through TREACLE.

Modifying data, writing it to disk, forking, executing a 60kb executable,
letting it process its command line arguments, etc., reading and writing
data to disk, recovering from child process exit, reading data from disk...
for EVERY LZWStream object? Because it's "easier"? Surely you jest.

> > unfortunately xpdf, at least pdf2text (which is part of xpdf) is
> actually in c++ which i do not know. i only know c. shame. i was
> looking at that earlier on.

I had a look at the source of Stream.cc xpdf-2.02pl1, it doesn't use very
much C++, it's mostly C.

If the stream decoder sees "LZWDecode", it sets up an LZW decoder with 5
parameters: pred, columns, colors, bits, early. The default values for
these are 1, 1, 1, 8 and 1, and are replaced with the values in dictionary
entries Predictor, Columns, Colors, BitsPerComponent and EarlyChange
respectively.

If pred != 1, a StreamPredictor is created and that provides all the decoded
data instead of LZW. Outside the scope of this explanation :)

Only the early parameter is used by the LZW decoder. The dictionary table is
cleared with clearTable(), eof = false, inputBits = 0.

We call getChar() continuously. getChar() returns the next char from a
'sequence' array. When we've run out of bytes in the 'sequence', we call
processNextCode(), which creates our next sequence, or sets EOF.

processNextCode() calls getCode(), which is a simple bitstream reader.
processNextCode() then works through the rest of the LZW algorithm. A single
byte "literal" is represented by putting the literal in seqBuf[0] and
setting seqLength to 1. An n-byte "match" is represented by putting the
match in seqBuf[0] through seqBuf[n-1] and setting seqLength to n. As
explained, getChar() then dishes out these 'sequences' one byte at a time.

I'll let you take it from there.
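
If it helps, here is roughly what that structure boils down to in plain C.
This is only a sketch along the lines of the description above, not xpdf's
code - the function name, table layout and error handling are all mine -
but the clear/EOD codes (256/257), the 9- to 12-bit MSB-first codes and the
EarlyChange handling are as the PDF spec describes them:

#include <string.h>                 /* memcpy */

#define LZW_CLEAR 256               /* resets the dictionary            */
#define LZW_EOD   257               /* end-of-data marker               */
#define MAX_CODES 4096              /* codes are at most 12 bits wide   */

typedef struct { int prefix; unsigned char suffix; int length; } Entry;

/* Decode a PDF /LZWDecode stream (no Predictor) into out[].
   early is the EarlyChange value (1 by default).
   Returns the number of decoded bytes, or -1 on error. */
static long lzw_decode_pdf(const unsigned char *in, size_t inlen,
                           unsigned char *out, size_t outcap, int early)
{
    Entry table[MAX_CODES];
    unsigned char seq[MAX_CODES];   /* scratch space for one sequence   */
    unsigned long bitbuf = 0;
    int bitcount = 0, width = 9, next = 258, prev = -1, code, len, c, i;
    size_t inpos = 0;
    long outlen = 0;

    for (code = 0; code < 256; code++) {        /* single-byte roots    */
        table[code].prefix = -1;
        table[code].suffix = (unsigned char)code;
        table[code].length = 1;
    }

    for (;;) {
        /* the getCode() part: next `width`-bit code, MSB first */
        while (bitcount < width) {
            if (inpos >= inlen)
                return outlen;                  /* ran out of input     */
            bitbuf = (bitbuf << 8) | in[inpos++];
            bitcount += 8;
        }
        bitcount -= width;
        code = (int)((bitbuf >> bitcount) & ((1UL << width) - 1));

        if (code == LZW_EOD)
            break;
        if (code == LZW_CLEAR) {                /* clearTable()         */
            width = 9; next = 258; prev = -1;
            continue;
        }

        /* the processNextCode() part: rebuild this code's sequence */
        if (code < next) {                      /* known code           */
            len = table[code].length;
            for (c = code, i = len - 1; i >= 0; i--) {
                seq[i] = table[c].suffix;
                c = table[c].prefix;
            }
        } else if (code == next && prev >= 0) { /* the "KwKwK" case     */
            len = table[prev].length + 1;
            for (c = prev, i = len - 2; i >= 0; i--) {
                seq[i] = table[c].suffix;
                c = table[c].prefix;
            }
            seq[len - 1] = seq[0];
        } else {
            return -1;                          /* corrupt stream       */
        }

        if (outlen + len > (long)outcap)
            return -1;                          /* output buffer full   */
        memcpy(out + outlen, seq, (size_t)len); /* what getChar() hands out */
        outlen += len;

        /* add "previous sequence + first byte of this one" to the table */
        if (prev >= 0 && next < MAX_CODES) {
            table[next].prefix = prev;
            table[next].suffix = seq[0];
            table[next].length = table[prev].length + 1;
            next++;
        }
        prev = code;

        /* EarlyChange makes the code width grow one code earlier */
        if (next + early >= (1 << width) && width < 12)
            width++;
    }
    return outlen;
}

Called with the raw bytes between "stream" and "endstream" and a big enough
output buffer, that covers the default EarlyChange=1, no-Predictor case.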

Regards
Stuart

ben

Sep 16, 2003, 7:49:29 AM
In article <3f66d122$0$272$cc9e...@news.dial.pipex.com>, Stuart Caie
<ky...@4u.net> wrote:

> ben wrote:
> > right, that's very interesting. you think there's a chance that
> > 'uncompress' does handle lzw, but this happens to be a slightly different
> > version of it. hmm.
>
> Earlier versions of xpdf used to modify the LZW data as appropriate and
> shell out to the uncompress command

yup, i've just realised this. 'uncompress' definitely does lzw. so as
Marco had suggested and you also say there, it's a case of the
format/version of lzw - differences between the various lzw variants.


> (or the compress decoder in gzip) before it
> was deemed acceptable to use an internal decoder. Let me tell you how SLOW
> that was. It was so SLOW, it was like a SNAIL trudging through TREACLE.
>
> Modifying data, writing it to disk, forking, executing a 60kb executable,
> letting it process its command line arguments, etc., reading and writing
> data to disk, recovering from child process exit, reading data from disk...
> for EVERY LZWStream object? Because it's "easier"? Surely you jest.

yes, i can imagine doing that being fairly slow.


> > unfortunately xpdf, at least pdf2text (which is part of xpdf) is
> > actually in c++ which i do not know. i only know c. shame. i was
> > looking at that earlier on.
>
> I had a look at the source of Stream.cc xpdf-2.02pl1, it doesn't use very
> much C++, it's mostly C.
>
> If the stream decoder sees "LZWDecode", it sets up an LZW decoder with 5
> parameters: pred, columns, colors, bits, early. The default values for
> these are 1, 1, 1, 8 and 1, and are replaced with the values in dictionary
> entries Predictor, Columns, Colors, BitsPerComponent and EarlyChange
> respectively.
>
> If pred != 1, a StreamPredictor is created and that provides all the decoded
> data instead of LZW. Outside the scope of this explanation :)
>
> Only the early parameter is used by the LZW decoder. The dictionary table is
> cleared with clearTable(), eof = false, inputBits = 0.
>
> We call getChar() continuously. getChar() returns the next char from a
> 'sequence' array. When we've run out of bytes in the 'sequence', we call
> processNextCode(), which creates our next sequence, or sets EOF.
>
> processNextCode() calls getCode(), which is a simple bitstream reader.
> processNextCode() then works through the rest of the LZW algorithm. A single
> byte "literal" is represented by putting the literal in seqBuf[0] and
> setting seqLength to 1. An n-byte "match" is represented by putting the
> match in seqBuf[0] through seqBuf[n-1] and setting seqLength to n. As
> explained, getChar() then dishes out these 'sequences' one byte at a time.
>
> I'll let you take it from there.

well i'm *really* in two minds now :) yes dishing out the data to an
external utility for every little bit of lzw data is going to be
inefficient, but believe it or not, in this case that doesn't really
matter. if it takes 1 second to read the whole pdf or 1 minute -
doesn't really matter too much. it's something that's going to take
place in the background so the user will not be waiting for the
immediate result - just so long as it gets done sometime soon.
(although saying that obviously 1 sec would be preferable to 1 min :))

so it's largely a question of what'll be easiest to implement. using
'uncompress' still needs extra work as i don't know what modifications
are needed for the data. or there's using xpdf and your description and
doing it like that. i have no idea which. i'm going to decide now.....

.....ah! i can't :) i'm finding that a hard decision. need to think
about that and look into it some more. just not sure.

thanks very much for your description of xpdf decoding lzw, and the
other info. much 'preciated.

ben.

Marco Schmidt

Sep 16, 2003, 8:50:32 AM
ben:

[...]

>so it's largely a question of what'll be easiest to implement. using
>'uncompress' still needs extra work as i don't know what modifications
>are needed for the data. or there's using xpdf and your description and
>doing it like that. i have no idea which. i'm going to decide now.....

Why not reuse existing code? Isn't there a single library that would
work for you?
<http://directory.google.com/Top/Computers/Programming/Libraries/PDF_Related/>
lists some, plus further links. For Java I've collected links at
<http://www.geocities.com/marcoschmidt.geo/java-libraries-pdf.html>,
and there are probably more.

If you were simply interested in the theory, that would be another
matter, but it seems that you just need a working solution.

Regards,
Marco

ben

Sep 16, 2003, 4:49:58 PM
In article <6i1emvcv7rlfurl99...@4ax.com>, Marco Schmidt
<marcos...@geocities.com> wrote:

i'm already a reasonable way along with writing code for reading pdfs -
not sure if there is something that will do what i'm after - pdf2text,
part of xpdf, was the closest i found, but i'm looking to extract not
just the text but also the metadata and any other useful info i can
find. for example, since starting i've
realised i can look at the "deleted" pdf objects sometimes, so now i'm
going to compare the latest version with previous version(s) and if
there's anything, give access to them, also the dates of those changes.

but i was looking for a drop-in, already-working solution just for the
lzw decompression step. i am very interested in compression in
general, but right now for this, i'd just like to do what i've done
with zlib for flate compression - link to it and use it with minimum
effort. thanks for the links though - i'm always interested in things
to do with pdfs.
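
(for comparison, here's roughly what the flate case looks like - a sketch
from memory rather than a paste from my code, so the function name and
arguments are made up; the raw stream bytes go straight into zlib's
uncompress() because a /FlateDecode stream is a standard zlib stream, and
that's the level of effort i was hoping for with lzw)

#include <stdlib.h>
#include <zlib.h>                   /* link with -lz */

/* in/inlen: the raw bytes between "stream" and "endstream";
   expected: a guess at the decoded size - grow and retry if too small */
static unsigned char *flate_decode(const unsigned char *in, size_t inlen,
                                   size_t expected, size_t *outlen)
{
    unsigned char *out = malloc(expected);
    uLongf destlen = (uLongf)expected;

    if (out == NULL)
        return NULL;
    if (uncompress(out, &destlen, in, (uLong)inlen) != Z_OK) {
        free(out);     /* Z_BUF_ERROR here means expected was too small */
        return NULL;
    }
    *outlen = (size_t)destlen;
    return out;        /* caller frees the decoded data */
}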

thanks, ben.
