Words as well as characters as encoding units

Mok-Kong Shen

unread,

May 5, 2012, 4:26:03 AM5/5/12

to

If I don't err, the Huffman compression is certainly applicable to
words instead of characters as units of encoding. Now, in
http://en.wikipedia.org/wiki/Most_common_words_in_English one reads:

The Reading Teachers Book of Lists claims that the first 25 words
make up about one-third of all printed material in English, and
that the first 100 make up about one-half of all written material.

While this could be some over-estimation (I don't know), wouldn't
this anyway imply that the optimal Huffman compression would have
encoding units that consist partly of some words, with the rest
being characters?

Thanks in advance.

M. K. Shen

James Dow Allen

unread,

May 5, 2012, 5:41:22 AM5/5/12

to

On May 5, 3:26 pm, Mok-Kong Shen <mok-kong.s...@t-online.de> wrote:
> ... that the optimal Huffman compression would have

> encoding units that consist partly of some words, with the rest
> being characters?

Ignoring the mention of "Huffman" -- irrelevant in this
context -- you ask about the use of many-to-many encoding
for streams like natural language.

In fact, Lempel-Ziv coding achieves what you seek, though,
at least in its purest forms, requires no pre-set codebook:
it builds its own codebook dynamically from the stream itself.

James

Mok-Kong Shen

unread,

May 5, 2012, 9:06:07 AM5/5/12

to

Thanks. I have a question: If, given a piece of text with a
certain (say, approximately known) frequency of distribution
of words in it, would the LZ scheme in the form you mentioned
be able to exploit that fact? If yes, would you kindly tell
a bit of how or else give some references?

M. K. Shen

Thomas Richter

unread,

May 5, 2012, 11:05:02 AM5/5/12

to

On 05.05.2012 15:06, Mok-Kong Shen wrote:

>> In fact, Lempel-Ziv coding achieves what you seek, though,
>> at least in its purest forms, requires no pre-set codebook:
>> it builds its own codebook dynamically from the stream itself.
>
> Thanks. I have a question: If, given a piece of text with a
> certain (say, approximately known) frequency of distribution
> of words in it, would the LZ scheme in the form you mentioned
> be able to exploit that fact?

Yes indeed it does, however not in the sense that you need to provide
the model parameters (like frequencies) explicitly. Rather, LZ77 all the
variants thereof (LZ78, LZW, LZX) build up their own model, including a
dictionary.

> If yes, would you kindly tell
> a bit of how or else give some references?

LZ-type algorithms keep a dictionary of words that is dynamically
build-up; that is, the next 'symbol' they work on is formed by the
shortest word (or block) that is not yet in the dictionary, after which
this block gets added to the dictionary, and a reference to the known
prefix and the extending character are encoded. By this method, words
(or prefixes, as LZ does not really care about 'words' as such) end up
in the dictionary and get shorter encodings by just referencing them in
future occurrences. By this mechanism, the dictionary will sooner or
later include the words 't', 'th' 'the' 'the ' where each of these
words, except the first, will consist of a reference to the previous
word ('t', 'th', ...) plus the extension character ('h', 'e', ' '). Any
future word 'the ', for example as in 'the quick brown fox...' would
then be encoded as a reference to 'the ' plus the extension 'q'. Of
course 'the q' is not a word in the usual sense, though still becomes
part of the LZ dictionary. If the text goes on '...the lazy dog.', then
again the word 'the l' is not yet in the dictionary, though 'the ' is,
and thus it is encoded by a combined symbol of the reference to 'the '
plus the extension 'l'. Thus, the output code becomes shorter.

A reference to this is of course the classic text by Lempel and Ziv
themselves (I believe back then at the Technion in Haifa):

http://www.cs.duke.edu/courses/spring03/cps296.5/papers/ziv_lempel_1977_universal_algorithm.pdf

I also seem to remember, though text compression is not field (I
compress images) that there are a couple of specialized algorithms that
boost performance by using a pre-occupied dictionary, i.e. with frequent
English (or any other language) words.

Greetings,
Thomas

Mok-Kong Shen

unread,

May 5, 2012, 1:42:37 PM5/5/12

to

Am 05.05.2012 17:05, schrieb Thomas Richter:
> On 05.05.2012 15:06, Mok-Kong Shen wrote:

> James Dow Allen wrote:
>
>>> In fact, Lempel-Ziv coding achieves what you seek, though,
>>> at least in its purest forms, requires no pre-set codebook:
>>> it builds its own codebook dynamically from the stream itself.
>>
>> Thanks. I have a question: If, given a piece of text with a
>> certain (say, approximately known) frequency of distribution
>> of words in it, would the LZ scheme in the form you mentioned
>> be able to exploit that fact?
>
> Yes indeed it does, however not in the sense that you need to provide
> the model parameters (like frequencies) explicitly. Rather, LZ77 all the
> variants thereof (LZ78, LZW, LZX) build up their own model, including a
> dictionary.

[skip]

I don't exclude my misunderstanding due to my poor knowledge. But let
me explain my difficulty: For discussion purposes, suppose that all the
texts contain the same 100 different words. If I don't err, LZ will,
after processing a certain preamble, have build up a dictionary
containing all these words and use that to encode what follows the
preamble. Now consider text 1 and text 2, having the same preamble but
subsequently very different frequency distributions of the 100 words
after that. I couldn't yet imagine how LZ could exploit the difference
in distributions in the two cases appropriately. For the Huffman scheme,
one could assume either that the frequency distributions are known and
hence used or else one uses the adaptive variant.

M. K. Shen

glen herrmannsfeldt

unread,

May 5, 2012, 2:33:26 PM5/5/12

to

Mok-Kong Shen <mok-ko...@t-online.de> wrote:

(snip, someone wrote)

>>>> In fact, Lempel-Ziv coding achieves what you seek, though,
>>>> at least in its purest forms, requires no pre-set codebook:
>>>> it builds its own codebook dynamically from the stream itself.

>>> Thanks. I have a question: If, given a piece of text with a
>>> certain (say, approximately known) frequency of distribution
>>> of words in it, would the LZ scheme in the form you mentioned
>>> be able to exploit that fact?

You mean a frequency distribution of words, but with no
correlation between the placement of the words?

>> Yes indeed it does, however not in the sense that you need to provide
>> the model parameters (like frequencies) explicitly. Rather, LZ77 all the
>> variants thereof (LZ78, LZW, LZX) build up their own model, including a
>> dictionary.
> [skip]

> I don't exclude my misunderstanding due to my poor knowledge.
> But let me explain my difficulty: For discussion purposes,
> suppose that all the texts contain the same 100 different words.
> If I don't err, LZ will, after processing a certain preamble,
> have build up a dictionary containing all these words and
> use that to encode what follows the preamble.

> Now consider text 1 and text 2, having the same preamble but
> subsequently very different frequency distributions of the 100 words
> after that. I couldn't yet imagine how LZ could exploit the
> difference in distributions in the two cases appropriately.

LZ doesn't know about "words", but sequences of characters.
Even if the words are uncorrelated, it will start putting in
table entries for combinations of words. If you have words
(or even just one) that occurs very often, there will be
entries combining those (or that) word with others.

> For the Huffman scheme, one could assume either that the
> frequency distributions are known and hence used or else
> one uses the adaptive variant.

For a list of words with no correlation, that is likely about
as good, or maybe slightly better, than LZW. If the words form
sentences in any language, though, the word correlations will
help LZW, but not Huffman.

-- glen

Mok-Kong Shen

unread,

May 5, 2012, 3:22:47 PM5/5/12

to

Let's perhaps discuss on a level that is imagined and non-real:
Suppose all words are three characters long. In the Huffman case,
one knows (by assumption) the frequency distributions of these
100 words in the different texts and hence could optimally
compress accordingly (essentially by doing the same as in the
normal case of treating texts based on individual characters).
How would LZ do the compression in the said cases and "without"
employing the evidently valuable informations of the frequency
distributions? I suppose common sense would indicate that here
LZ couldn't perform as well as Huffman, no?

M. K. Shen

Thomas Richter

unread,

May 5, 2012, 3:31:10 PM5/5/12

to

On 05.05.2012 21:22, Mok-Kong Shen wrote:

> Let's perhaps discuss on a level that is imagined and non-real:
> Suppose all words are three characters long. In the Huffman case,
> one knows (by assumption) the frequency distributions of these
> 100 words in the different texts and hence could optimally
> compress accordingly (essentially by doing the same as in the
> normal case of treating texts based on individual characters).
> How would LZ do the compression in the said cases and "without"
> employing the evidently valuable informations of the frequency
> distributions? I suppose common sense would indicate that here
> LZ couldn't perform as well as Huffman, no?

The problem you see is possibly that you believe that LZ is an "all
inclusive" compression scheme that generates its binary output from
nothing else but the dictionary plus some static Huffman behind, but
this is not quite the case. LZ is just one out of a chain of processing
steps that make up a regular compressor, in the same sense as BWT does
not by itself compress, but prepares the data to be more compressible.

That said, that an LZ can be or would be "frequency adaptive" or "learns
to adapt to the frequency" is also archived by its entropy coding
backend. An adaptive Huffman or adaptive AC coder will learn how
frequent some dictionary entries are. In the simplest possible
implementation, the dictionary could include a move-to-front list and
the Huffman coder would give the front entries shorter codes.

Some of the less smart LZ algorithms, however, do not even do that (for
example, the LZ in TIFF is a rather "dump" version).

So long,
Thomas

glen herrmannsfeldt

unread,

May 5, 2012, 4:17:49 PM5/5/12

to

Mok-Kong Shen <mok-ko...@t-online.de> wrote:

(snip, someone wrote)

>>>>> Thanks. I have a question: If, given a piece of text with a
>>>>> certain (say, approximately known) frequency of distribution
>>>>> of words in it, would the LZ scheme in the form you mentioned
>>>>> be able to exploit that fact?

Well, you could find a Huffman and LZW program and try it on
some generated files with known distributions.

(snip, I wrote)

>> LZ doesn't know about "words", but sequences of characters.
>> Even if the words are uncorrelated, it will start putting in
>> table entries for combinations of words. If you have words
>> (or even just one) that occurs very often, there will be
>> entries combining those (or that) word with others.

(snip)

> Let's perhaps discuss on a level that is imagined and non-real:
> Suppose all words are three characters long. In the Huffman case,
> one knows (by assumption) the frequency distributions of these
> 100 words in the different texts and hence could optimally
> compress accordingly (essentially by doing the same as in the
> normal case of treating texts based on individual characters).

In that case, you would need to supply the table (or enough to
generate it) along with the compressed data. For short files,
that will be significant, for longer ones, not so much.

But first consider the case of N different three letter words of
equal probability. Huffman will need, on average, log2(N) bits
for each. (Not counting the table that you also need.)

LZW will build up a table of the words it sees, along with the
one and two letter prefixes of the words. In most cases, the
prefixes will be a small enough fraction not to worry too much
about. Once all the words are in the table, each takes about
log2(N) bits, or maybe a little more.

> How would LZ do the compression in the said cases and "without"
> employing the evidently valuable informations of the frequency
> distributions? I suppose common sense would indicate that here
> LZ couldn't perform as well as Huffman, no?

It does it by adding table entries for combinations of words.

As word frequency increases, so does the frequency of that
word in combination with other words, and itself.

Say, for example, you have N/2 words that appear with probability
1/N, and one word with probability 1/2. In addition to the table
entries for the N/2 words, there will quickly be entries for the
common word preceding (or following) each word. That doubles the
table size, so one additional bit is needed for each such
combination, just as you would expect for Huffman.

If there is no interword correlation, and in the limit that the
table (specified with Huffman) is an insignificant fraction, then,
yes, Huffman probably does better.

If the distribution changes, though, after not so long LZW throws
away the table and starts a new one. Huffman doesn't do that.

And if there is interword correlation, LZW will use it.

-- glen

Mok-Kong Shen

unread,

May 5, 2012, 4:19:41 PM5/5/12

to

But at least some of the textbooks seem to place LZ and Huffman on the
same level, i.e. seemingly as competive to, rather than coorperating
with each other. Could you recommend some literatures where the real
practical cooperations of diverse compression algorithms are described
in real-life scenarios as you sketched above? Thanks in advance.

M. K. Shen

Mok-Kong Shen

unread,

May 5, 2012, 4:36:47 PM5/5/12

to

Am 05.05.2012 22:17, schrieb glen herrmannsfeldt:
[snip]

> If the distribution changes, though, after not so long LZW throws
> away the table and starts a new one. Huffman doesn't do that.

In a couple of textbooks I have, there is no mention of thowing
away of LZ table with change of distribution. Could you please
recommend some better literatures? I don't properly understand your
last sentence in this context. (I mean there is the adaptive Huffman.)

M. K. Shen

glen herrmannsfeldt

unread,

May 5, 2012, 5:29:35 PM5/5/12

to

Mok-Kong Shen <mok-ko...@t-online.de> wrote:
> Am 05.05.2012 22:17, schrieb glen herrmannsfeldt:
> [snip]
>> If the distribution changes, though, after not so long LZW throws
>> away the table and starts a new one. Huffman doesn't do that.

> In a couple of textbooks I have, there is no mention of thowing
> away of LZ table with change of distribution. Could you please
> recommend some better literatures?

I don't know of any specific reference. The usual LZW uses 16 bit
codes, for a 65536 entry table. When the table is full, it throws
away all the table entries and starts a new one.

One commonly considered case is the .tgz file, compressing
the output of tar. Since it is pretty much a concatenation of
files, different files could have very different distributions.
I believe that there is a code which will dump the table early,
but normally it runs until it fills, then starts over.

> I don't properly understand your last sentence in this context.
> (I mean there is the adaptive Huffman.)

Yes, I meant non-adaptive. Huffman is, for example, used for
group 3 fax, with a fixed table based on statistical analyis
of documents. That means no table to send, and on the average
isn't too bad.

-- glen

Mok-Kong Shen

unread,

May 5, 2012, 5:52:43 PM5/5/12

to

Am 05.05.2012 23:29, schrieb glen herrmannsfeldt:

> I don't know of any specific reference. The usual LZW uses 16 bit
> codes, for a 65536 entry table. When the table is full, it throws
> away all the table entries and starts a new one.

My layman's impression could certainly be wrong. But to me this seems
to be that LZW normally attempts to find as much correlations between
words as possible, which to a large part might actually not be
profitable work in practice.

M. K. Shen

Niels Fröhling

unread,

May 8, 2012, 2:42:21 AM5/8/12

to

> My layman's impression could certainly be wrong. But to me this seems
> to be that LZW normally attempts to find as much correlations between
> words as possible, which to a large part might actually not be
> profitable work in practice.

Can you name a language without correlations? Can you even point at digital data
which is uncorrelated? (I'd exclude SETI captures here :^), that is likely
uncorrelated data)

glen herrmannsfeldt

unread,

May 8, 2012, 4:19:46 AM5/8/12

to

The question is what might be in a file that one wants to compress,
and that might not be in a language. For example, one might be
working with GPS systems, and have files full of latitude and
longitude values. You can consider those "words", but might not
have any correlation.

-- glen

Niels Fröhling

unread,

May 9, 2012, 5:14:31 PM5/9/12

to

> The question is what might be in a file that one wants to compress,
> and that might not be in a language. For example, one might be
> working with GPS systems, and have files full of latitude and
> longitude values. You can consider those "words", but might not
> have any correlation.

A format-type correlation is often enough for exploitation. The file "geo" of
the Canterbury Corpus is the shining example, it's an array of floats,
compressing it as bit-planes (knowing how the values correlate) yields a few
tenths of percent of compression gain - because all the floats are in a specific
correlating range AFAIR 0.0 to 1.0.

There are no uncorrelated data-sets besides total random data. Sometimes there
may not be strong enough a correlation, and then any high-order compressors
compress bad/worst, but exploiting correlations is their business, not plain
static stationary order-0.

Ciao
Niels

James Dow Allen

unread,

May 10, 2012, 1:17:23 AM5/10/12

to

On May 10, 4:14 am, Niels Fröhling <spamt...@adsignum.com> wrote:
> A format-type correlation is often enough for exploitation. The file "geo" of

> ...

> compressing it as bit-planes (knowing how the values correlate) yields a few

> tenths of percent of compression gain ...

Interesting and important point.
But frankly, "a few tenths of percent" is not enough to get
one's heart racing with exhiliration.

There was an interesting post in this ng two years
ago by John Tromp which I thought might be a spring-off
for research. A Google Groups link is
http://groups.google.com/group/comp.compression/browse_frm/thread/9c9676663073ca39

Header includes
> Subject: compressor blind spots
> Date: Tue, 27 Apr 2010 10:17:16 -0700 (PDT)
> Message-ID: <239f5097-d5f0-49e0...@k36g2000yqb.googlegroups.com>

The "exhilirating" sentence was:
> I didn't expect bzip2 to do much better since it should have already
> noticed the repetitive tails
> of strings, but imagine my shock when seeing the result:
> ... over 12 times smaller!

James