
LPF, NY Times, GNU and compression algorithms


RICARDO NASSIF

Mar 6, 1994, 5:37:32 AM

After reading a piece on software patents in today's New York
Times (_Will Users Be the Big Losers in Software Patent Battles?_,
Sunday, 6 Mar 94, Business section, p. 7, signed by L.M. Fisher), which
centers on the dispute (Stac vs. Microsoft) over the most widely used
compression algorithm, I began wondering (again) whether compression
algorithms are *this* difficult to come by.
This is probably a naive question from a non-hacker, but what
is the reason you can't find (well, *I* can't) any GPL'ed or
public-domain alternatives for on-the-fly compression? Does this have
to do with the complexity of the algorithm itself? Or (unlikely) lack
of demand? Or not enough manpower to carry on such a project?
Something else?

-----
Oh, both Mr. Gordon Irlam of the LPF and Mr. Richard Stallman of the
LPF and GNU are quoted in the article.

--
.rn


ricardo nassif
rxn...@cac.psu.edu
rxn106@psuvm

Miguel de Icaza

Mar 7, 1994, 12:22:53 PM
> This is probably a naive question from a non-hacker, but what
> is the reason you can't find (well, *I* can't) any GPL'ed or
> public-domain alternatives for on-the-fly compression? Does this have

Linux has double.

Miguel.


Heikki Suonsivu

Mar 7, 1994, 1:43:25 PM

In article <RXN106.94...@wilbur.cac.psu.edu> rxn...@wilbur.cac.psu.edu (RICARDO NASSIF) writes:

> Microsoft) most widely used I began wondering (again) whether
> compression algorithms are *this* difficult to come by.

(g)zip is probably the only one which isn't patented. Or, at least,
there are no known patents on it.

> is the reason why you can't find (well, *I* can't) any GPL'ed or
> public domain alternatives to on-the-fly compression? Does this have
> to do with the complexity of the algorithm itself? Or (unlikely) lack

No. I would guess it would take anywhere from one night for a very
capable hacker to one month for an average person to do it well,
provided you have the source code for the OS and some knowledge of
operating systems. The result depends on how fancy you wish to get,
and whether you do it at block level or per-file.

> of demand? Or not enough manpower to carry on such a project?

Actually, I think the primary reason is that disk space is cheap.
Nobody bothers; they just go to the nearest computer store and buy a
couple of gigabytes more. I was wondering why no one has ever patched
news to store articles compressed. I even went through innd 1.4 and
replaced all the opens, rm's, stats, etc. with hook functions (took me
one night), but I never had enough time to write the code to do the
actual decompression. Our system is still using those hooks, but they
are empty :-).

--
Heikki Suonsivu, Täysikuu 10 C 83/02210 Espoo/FINLAND,
h...@cs.hut.fi home +358-0-8031121 work -4513377 fax -4555276 riippu SN

David Chase

Mar 7, 1994, 3:03:07 PM
rxn...@wilbur.cac.psu.edu (RICARDO NASSIF) writes:
|> After reading a piece on software patents in today's New York
|> Times (_Will Users Be the Big Losers in Software Patent Battles?_,
|> Sunday, 6 Mar 94, Business section, p. 7, signed by L.M. Fisher) which
|> centers on the dispute over the compression algorithm (Stac vs.
|> Microsoft) most widely used I began wondering (again) whether
|> compression algorithms are *this* difficult to come by.

Most likely (I only read what you read, so I don't know much) the patents in
question cover something like "on-the-fly compression as part of a file
input-output system", and may be independent of the exact compression
algorithm used. That is, this is an application of compression, and not
a compression algorithm itself, and substituting a different compression
algorithm may not make a difference.

It may also be the case that this is a result of randomness on the part
of the jury. I rather doubt that anyone who expressed strong sentiments
with respect to patents (let alone software patents) made it onto the
jury.

Note that I'm not saying this is what happened -- this is just what I
can figure out, based on the limited information I've so far seen. Note
that Microsoft countersued with something that looks suspiciously like
an interface patent (something about automatically loading software into
the running OS).

David Chase, speaking for myself
Thinking Machines Corp.

Ozan S. Yigit

Mar 7, 1994, 3:25:04 PM
Heikki Suonsivu:

> (g)zip is probably the only one which isn't patented. Or, at least there
> is no known patents on it.

there are others. comp.compression FAQ lists various patents and some of
the packages that are affected by them.

oz

Marc Auslander

Mar 8, 1994, 9:06:31 AM
From the gzip description:

Literals or match lengths are compressed with one Huffman tree, and
match distances are compressed with another tree. The trees are stored
in a compact form at the start of each block. The blocks can have any
size (except that the compressed data for one block must fit in
available memory). A block is terminated when zip determines that it
would be useful to start another block with fresh trees. (This is
somewhat similar to compress.)

Duplicated strings are found using a hash table. All input strings of
length 3 are inserted in the hash table. A hash index is computed for
the next 3 bytes. If the hash chain for this index is not empty, all
strings in the chain are compared with the current input string, and
the longest match is selected.
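[The hash-chain search described above can be sketched as follows. This
is a toy illustration in Python, not gzip's actual C code; the function
and variable names are my own invention.]

```python
# Toy sketch of gzip's duplicated-string search: index every 3-byte
# string in a hash table, walk the chain of earlier positions with the
# same key, and keep the longest match found.

def longest_match(data, pos, table):
    """Return (length, distance) of the longest earlier match at pos."""
    key = data[pos:pos + 3]              # every 3-byte string is indexed
    best_len, best_dist = 0, 0
    for prev in table.get(key, []):      # walk the whole hash chain
        length = 0
        while (pos + length < len(data)
               and data[prev + length] == data[pos + length]):
            length += 1
        if length > best_len:
            best_len, best_dist = length, pos - prev
    table.setdefault(key, []).append(pos)  # insert current position
    return best_len, best_dist
```

[For example, on `b"abcabcabcx"` the string at position 3 matches the
string at position 0 for 6 bytes, giving (length 6, distance 3).]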

Two Stac patents in the case are 4701745 and 5016009.

From 4601745:

the data processing means including circuit means operable to check
whether a sequence of successive bytes to be processed identical with
a sequence of bytes already processed, and including a hash generating
means responsive to the application of a predetermined number of bytes ...

From 5016009

in order to perform the hashing function, a data compression system
includes certain hash data structures including a history array
pointer ...

also:

... is found ..., encoding said matching string ... a variable length
indicator ..., said predetermined strategy ensuring that a matching
string of two characters of said input data stream is compressed to
less than said two characters of said input data stream.

So - on the face of it, why isn't gzip covered by these patents? (I
don't have time to enter the whole thing - get copies!) Also, I
myself cannot say what's covered by what - I'm not a patent attorney,
I'm a programmer!

Please distinguish between "it's not covered" and "the patent isn't
valid". These are the Stac patents, and Stac just won an infringement
suit.

--


Marc Auslander <ma...@watson.ibm.com> 914 784-6699 (Tieline 863 Fax x6306)

Ozan S. Yigit

Mar 8, 1994, 11:03:29 AM
Marc Auslander:

[gzip description and excerpts from stac patents elided]

> So - on the face of it, why isn't gzip covered by these patents?

oh, no! compress [which has become something of a symbol in this
newsgroup] strikes back! :-]

maybe time to take a second look at Y coding...

Jean-loup Gailly

Mar 10, 1994, 5:10:03 AM
Marc Auslander <ma...@watson.ibm.com> writes:

> Two Stac patents in the case are 4701745 and 5016009.
>

> From 4601745: [typo, meant: 4701745]


>
> the data processing means including circuit means operable to check
> whether a sequence of successive bytes to be processed identical with
> a sequence of bytes already processed, and including a hash generating
> means responsive to the application of a predetermined number of bytes ...
>
> From 5016009
>
> in order to perform the hashing function, a data compression system
> includes certain hash data structures including a history array
> pointer ...
>
> also:
>
> ... is found ..., encoding said matching string ... a variable length
> indicator ..., said predetermined strategy ensuring that a matching
> string of two characters of said input data stream is compressed to
> less than said two characters of said input data stream.
>
> So - on the face of it, why isn't gzip covered by these patents?

Let's take each patent in turn. A clarification first: the Stac patent
4,701,745 was not invented by Stac. It was initially owned by Ferranti
(UK) and only recently bought by Stac. This was a very profitable
acquisition (assuming it cost Stac far less than the $120 million they
won by using this patent against Microsoft).

(a) 4,701,745

This algorithm is now known as LZRW1, because Ross Williams reinvented
it independently later and posted it on comp.compression on April 22,
1991. Exactly the same algorithm has also been patented by Gibson and
Graybill (5,049,881). The patent office failed to recognize that
it was the same algorithm.

This algorithm uses LZ77 with hashing but no collision chains and
outputs unmatched bytes directly without further compression. gzip
uses collision chains of arbitrary length, and uses Huffman encoding
on the unmatched bytes:

- Claim 1 of the patent is restricted to (emphasis added by me):

output means operable to APPLY to a transfer medium each byte of data
not forming part of such an identical sequence; and

ENCODING means responsive to the identification of such a sequence
to APPLY to the transfer medium an identification signal which
identifies both the location in the input store of the previous
occurrence of the sequence of bytes and the number of bytes
contained in the sequence.

The claim thus makes a clear distinction between "encoding" and
"applying to the transfer medium". A system which compresses the
unmatched bytes does not infringe this patent.

- The description of the patent and claim 2 make clear that the check
for identity of the sequences of bytes is to be made only once (no
hash collision chains). Gzip performs an arbitrary number of such
checks. The "means" enumerated in claim 1 specify what the hash
table consists of, and this does not include any means for storing
hash collision chains.

- Claim 2 also requires that *all* bytes participating in the hash
function should be compared:

A system as claimed in claim 1 in which the circuit means also
includes check means operable to check for identity between EACH
of the said predetermined number of bytes in sequence and EACH of
a similar sequence of bytes contained in the input store at a
location defined by a pointer read out from the temporary store at
said address

[in plain English, this is the check for hash collision]

and to check whether identity exists between succeeding bytes in
each sequence of bytes, and a byte counter operable to count the
number of identical bytes in each sequence.

[this is the determination of the match length]

Gzip never checks for equality of the third byte used in the hash
function. The hash function is such that on a hash hit with equality
of the first two bytes, the third byte necessarily matches.
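[Why the third byte necessarily matches: gzip folds three bytes into
the hash with a shift small enough that every bit of the third byte
survives in the hash value. The shift and mask below are assumptions
modeled on a 15-bit table, not copied from gzip's source, but the
argument is the same: a quick brute-force check confirms it.]

```python
import random

def h3(c0, c1, c2):
    # Fold three bytes into 15 bits with a 5-bit shift (assumed constants).
    return ((c0 << 10) ^ (c1 << 5) ^ c2) & 0x7FFF

# With the first two bytes fixed, their XOR contributions cancel, so a
# hash collision forces the third bytes to be equal bit for bit.
for _ in range(10_000):
    c0, c1 = random.randrange(256), random.randrange(256)
    c2, d2 = random.randrange(256), random.randrange(256)
    if h3(c0, c1, c2) == h3(c0, c1, d2):
        assert c2 == d2
```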

- In addition, gzip uses a "lazy" evaluation of string matches. Even
when a match is found, gzip may encode (with Huffman coding) a single
unmatched byte. This is done when gzip determines that it is more
beneficial to parse the input string differently because a longer
match follows. In the Waterworth patent, a string match is always
encoded as a (length, pointer) pair.
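[The lazy-evaluation idea can be illustrated with a toy parser. This is
my own sketch, not gzip's algorithm in full: `match_len` stands in for
the real match finder, and the minimum match length of 3 mirrors
deflate's convention.]

```python
def parse(data, match_len):
    """Toy lazy parser: match_len(i) gives the longest match length at i.
    Returns a list of 'L' (literal) and ('M', length) tokens."""
    out, i = [], 0
    while i < len(data):
        here = match_len(i)
        # Defer when the match starting one byte later is longer:
        # emit a single literal now and take the longer match next.
        if here >= 3 and (i + 1 >= len(data) or match_len(i + 1) <= here):
            out.append(('M', here))
            i += here
        else:
            out.append('L')
            i += 1
    return out
```

[A greedy parser would take the length-3 match at position 0; the lazy
parser emits one literal and captures the longer match at position 1.]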

All other claims of the patent are dependent on claim 1
("a system as claimed in claim 1 in which ..."). Since gzip
does not infringe claim 1 it does not infringe the other claims.
In particular, claim 6 explicitly states that unmatched strings
are not compressed:

A system as claimed in claim 5 in which the data receiving means
includes decoder means operable to separate UNCOMPRESSED bytes of
data from identification signals.

The gzip decoder never receives uncompressed bytes since all input is
compressed with Huffman coding [both literals and (length, offset) pairs].


The only "invention" in the Waterworth patent is the absence of hash
collision chains. The original description of the LZ77 algorithm
required searching for the true longest match, not just checking the
length of one particular match. Using hashing for string searching
was very well known at the time of the patent application (March 86).
The patent description specifies explicitly that "Hash techniques are
well known and many differents forms of hash will be suitable".

The --fast option of gzip was deliberately made slower than it could
be, precisely because of the existence of the Waterworth patent.
See in particular the following part of the gzip TODO file:

Add a super-fast compression method, suitable for implementing
file systems with transparent compression. One problem is that the
best candidate (lzrw1) is patented twice (Waterworth 4,701,745
and Gibson & Graybill 5,049,881).


(b) 5,016,009

This is standard LZ77 with hashing, and collisions resolved using
linked lists. There are several important restrictions which let gzip
escape from the patent:

- the linked lists are implemented only with offsets. The end of
a chain is detected by adding together all offsets, until the
sum becomes greater than the size of the history buffer. gzip uses
direct indices, and so detects the end of the chains differently.
The exact wording of claim 1 of the patent is:

... said data compression system comprising ... an offset array means
... said method comprising the steps of ...

calculating a difference between said history array pointer
and said pointer obtained from said hash table means,
storing said difference into said offset array means entry
pointed to by said history array pointer, ...

gzip never calculates such a difference and does not have any offset
array.

- unmatched strings are emitted as literal bytes without any
compression. gzip uses Huffman encoding on the unmatched strings.
This is the same argument as for the Waterworth patent.

- unmatched strings are preceded by
... a "raw" data tag indicating that no matching data string was found

gzip does not use such a tag because it uses a single Huffman table for
both string literals and match lengths. It is only the prefix
property of Huffman codes which allows the decoder to distinguish
the two cases. So there is not a unique "raw" tag preceding all
literals. This is not a minor point; it is one of the reasons gzip
achieves better compression than the Stac algorithm.
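[The prefix-property argument can be made concrete with a toy code. The
codewords below are invented for illustration; gzip's real deflate
tables are far larger, but the principle is identical.]

```python
# Toy prefix code mixing literal bytes and match lengths in ONE alphabet.
# No "raw" tag is needed: since no codeword is a prefix of another, the
# decoder knows which kind of symbol it read as soon as a code completes.
CODES = {
    ('lit', 'a'): '0',
    ('lit', 'b'): '10',
    ('len', 3):   '110',
    ('len', 4):   '111',
}
DECODE = {bits: sym for sym, bits in CODES.items()}

def decode(bitstring):
    out, cur = [], ''
    for b in bitstring:
        cur += b
        if cur in DECODE:        # first complete codeword is unambiguous
            out.append(DECODE[cur])
            cur = ''
    return out
```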

- a string match is always encoded as a (length, pointer) pair.
Gzip uses a "lazy" evaluation of string matches as described
above for the Waterworth patent.

All other claims of the patent are dependent on claim 1 ("the method
of claim 1 wherein ..."). Since gzip does not infringe claim 1 it does
not infringe the other claims. In any case, I have studied in detail
all the 77 claims to make sure that gzip does not infringe.

Unrelated note: this Stac patent is the only one in which I found an
original and non-obvious idea. The hash table is refreshed using an
incremental mechanism, so that the refresh overhead is distributed
among all input bytes. This allows the real-time response necessary in
disk compressors such as Stacker (the Stac product). gzip does not use
this idea, and refreshes the hash table in a straightforward manner
every 32K bytes.
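[The straightforward refresh can be sketched like this. The data layout
is an assumption on my part, not gzip's C code: when the 32K window
slides, every hash-table entry is rebased in one pass instead of
amortizing the work over each input byte as the patent does.]

```python
WSIZE = 32 * 1024   # window size, as in gzip

def slide_hash(head):
    """Rebase all hash heads after the window slides by WSIZE bytes.
    Entries that pointed into the discarded half become 0 ("empty")."""
    return [m - WSIZE if m >= WSIZE else 0 for m in head]
```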


One final comment: I estimate that I have now spent more time studying
data compression patents than actually implementing data compression
algorithms. I have a partial list of 318 data compression patents,
even though for the moment I restrict myself mostly to lossless
algorithms (ignoring most image compression patents). Richard
Stallman has been *extremely* careful before accepting gzip as the GNU
compressor. I continue to study new patents regularly. I would of
course very much prefer spending what's left of my spare time
improving the gzip compression algorithm instead of studying patents.
Some improvements that I would have liked to put in gzip have not been
incorporated because of patents.

In short, every possible precaution has been taken to make sure that
gzip isn't covered by patents.

Jean-loup Gailly, author of gzip.
jl...@chorus.fr
