>>>>> Thanks. I have a question: If, given a piece of text with a
>>>>> certain (say, approximately known) frequency of distribution
>>>>> of words in it, would the LZ scheme in the form you mentioned
>>>>> be able to exploit that fact?
Well, you could find a Huffman and LZW program and try it on
some generated files with known distributions.
(snip, I wrote)
>> LZ doesn't know about "words", but sequences of characters.
>> Even if the words are uncorrelated, it will start putting in
>> table entries for combinations of words. If you have words
>> (or even just one) that occurs very often, there will be
>> entries combining those (or that) word with others.
(snip)
> Let's perhaps discuss on a level that is imagined and non-real:
> Suppose all words are three characters long. In the Huffman case,
> one knows (by assumption) the frequency distributions of these
> 100 words in the different texts and hence could optimally
> compress accordingly (essentially by doing the same as in the
> normal case of treating texts based on individual characters).
In that case, you would need to supply the table (or enough to
generate it) along with the compressed data. For short files,
that will be significant, for longer ones, not so much.
But first consider the case of N different three letter words of
equal probability. Huffman will need, on average, log2(N) bits
for each. (Not counting the table that you also need.)
LZW will build up a table of the words it sees, along with the
one and two letter prefixes of the words. In most cases, the
prefixes will be a small enough fraction not to worry too much
about. Once all the words are in the table, each takes about
log2(N) bits, or maybe a little more.
> How would LZ do the compression in the said cases and "without"
> employing the evidently valuable informations of the frequency
> distributions? I suppose common sense would indicate that here
> LZ couldn't perform as well as Huffman, no?
It does it by adding table entries for combinations of words.
As word frequency increases, so does the frequency of that
word in combination with other words, and itself.
Say, for example, you have N/2 words that appear with probability
1/N, and one word with probability 1/2. In addition to the table
entries for the N/2 words, there will quickly be entries for the
common word preceding (or following) each word. That doubles the
table size, so one additional bit is needed for each such
combination, just as you would expect for Huffman.
If there is no interword correlation, and in the limit that the
table (specified with Huffman) is an insignificant fraction, then,
yes, Huffman probably does better.
If the distribution changes, though, after not so long LZW throws
away the table and starts a new one. Huffman doesn't do that.
And if there is interword correlation, LZW will use it.
-- glen