
Analysis of dataset in the large text compression benchmark


Matt Mahoney

Jun 10, 2006, 4:48:29 PM
One problem in the large text compression benchmark is that the entropy
of the data set is not known. Shannon and others estimated the entropy
of English text by having human subjects guess successive characters in
text reduced to a 27 character alphabet (a-z and space). The problem
with the data set (enwik9) is that there is a lot of "junk" like XML
markup, tables, links, formatting codes, etc. This doesn't matter as
long as the extra data is not hard to model; the best natural language
model will still win. However, it would still be nice to know where
the finish line is.

I wrote a program to filter out the junk and generate readable English
text using a 27 character alphabet (as readable as you can get without
punctuation). Essentially I removed any text that would not be visible
on the Wikipedia website. For example, I removed hypertext links but
retained the anchor text. I also removed non-English text, tables and
markup, but retained image captions. When this is reduced to only
letters, digits, and nonconsecutive spaces, the resulting text is about
70% of the original size. This is expanded to 74% after spelling out
digits (a common technique in speech models).
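In case anyone wants to reproduce something like this, here is a minimal
sketch of just the final alphabet-reduction step (not the filter I
actually used, which also strips the markup first; the digit spelling
and space handling here are simplified):

    // filter27.cpp - reduce text on stdin to a 27 character alphabet
    // (a-z and space), spelling out digits.  Illustrative only.
    #include <cstdio>
    #include <cctype>

    int main() {
        const char* digits[10] = {"zero","one","two","three","four",
                                  "five","six","seven","eight","nine"};
        int c, lastWasSpace = 1;
        while ((c = getchar()) != EOF) {
            if (isdigit(c)) {
                if (!lastWasSpace) putchar(' ');     // separate the spelled digit
                fputs(digits[c - '0'], stdout);
                putchar(' ');
                lastWasSpace = 1;
            } else if (isalpha(c)) {
                putchar(tolower(c));
                lastWasSpace = 0;
            } else if (!lastWasSpace) {
                putchar(' ');                        // collapse everything else
                lastWasSpace = 1;
            }
        }
        return 0;
    }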

Then I compared compression ratios for 100 MB of clean text with enwik8
(also 100 MB) using 25 different compressors. Popular compressors like
zip, gzip and bzip2 compress the clean text about 10% smaller than the
original text. High-end compressors like paq8h, durilca, slim, and
ppmonstr compress it about 1% to 5% smaller. (I did not test WinRK.)
Details are here:
http://cs.fit.edu/~mmahoney/compression/text.html#data

Now we have a rough estimate of the entropy of enwik9. Shannon
estimated 0.6 to 1.3 bpc for 27 character English. Cover and King
estimated an upper bound of 1.3 bpc. It is likely that the entropy of
enwik9 is at most 5% more. I plan to test on the "cleaned" enwik9 to
see if the results hold for larger files.

The uncertainty in Shannon's estimate is still a problem because the
top compressors are now in the range of uncertainty. paq8h compresses
enwik9 to 1.18 bpc. The uncertainty is due to the fact that different
probability distributions can lead to the same observed sequence of
guesses in the Shannon game (guessing successive characters). Cover
and King tried to rectify this problem using a gambling game to assign
probabilities, but this method is tedious and they were only able to
measure a single sentence. Years ago I did simulations of the Shannon
game with known models, which suggest the true entropy is in the middle
of the Shannon bounds, around 0.9 to 1.1 bpc, but this is speculative.
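For reference, both bounds can be computed from the guess statistics
alone. A small sketch of the computation (the q[] values below are made
up for illustration; they are not Shannon's data):

    // shannon_bounds.cpp - compute the upper and lower entropy bounds
    // from guessing-game statistics.  q[i] = fraction of characters the
    // subject guessed correctly on attempt i+1 (q must sum to 1).
    #include <cstdio>
    #include <cmath>

    int main() {
        const int N = 27;                 // 27 character alphabet
        double q[N + 1] = {0};            // q[i] = P(correct on guess i+1), q[N] = 0
        q[0] = 0.79; q[1] = 0.08; q[2] = 0.03; q[3] = 0.02; q[4] = 0.02;
        for (int i = 5; i < N; ++i) q[i] = 0.06 / 22;   // spread the remaining mass

        double lower = 0, upper = 0;
        for (int i = 0; i < N; ++i) {
            // lower bound: sum over guess number g of g*(q[g]-q[g+1])*log2(g)
            lower += (i + 1) * (q[i] - q[i + 1]) * log2((double)(i + 1));
            // upper bound: entropy of the guess-number distribution
            if (q[i] > 0) upper -= q[i] * log2(q[i]);
        }
        printf("lower bound %.3f bpc, upper bound %.3f bpc\n", lower, upper);
        return 0;
    }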

-- Matt Mahoney

Alexandru

Jun 11, 2006, 11:35:21 AM
I wrote a similar program to remove punctuation and XML tags. I split
the enwik9 file into two files: one with the punctuation (and xml/xhtml
tags) and another one with the real text (keeping enough information to
reconstruct the initial file). I got an overall improvement of 5% in
the compression ratio (around 10 MB). The file with the real text,
"text.txt", was a big list of words with only a single space between
consecutive words.

I also wrote a program that guesses words (~21% of the words on enwik9,
and ~17% on enwik8) in order to reduce the size of the "text.txt" file.
Unfortunately, the compression ratio (of bzip and chile) did not improve
(both programs use BWT). Anyway, if anyone wants to test my idea I can
give him the necessary files (3 source files written in C++, very
simple to use within another program just by including a header).

I want to do something like T9 on mobile phones: replacing letters with
digits. But I haven't figured out a clever way to group letters into
digits so that words can be guessed from their digit codes with a very
high success rate on the first try.
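As an illustration of what I mean, here is a sketch that uses the
standard phone keypad grouping (probably not the optimal grouping) and
simply counts how many distinct words get a code of their own:

    // t9_collisions.cpp - map each word to a digit string using the
    // standard phone keypad grouping and count how many words do NOT
    // share their code with any other word.  The grouping itself is
    // only an example; the point is to search for a better one.
    #include <cstdio>
    #include <cctype>
    #include <map>
    #include <set>
    #include <string>

    int main() {
        // letter -> digit, standard keypad: abc=2 def=3 ghi=4 jkl=5
        // mno=6 pqrs=7 tuv=8 wxyz=9
        const char* groups = "22233344455566677778889999";
        std::map<std::string, std::set<std::string> > codes;
        char buf[1000];
        while (scanf("%999s", buf) == 1) {
            std::string word, code;
            for (char* p = buf; *p; ++p) {
                if (isalpha((unsigned char)*p)) {
                    char c = (char)tolower((unsigned char)*p);
                    word += c;
                    code += groups[c - 'a'];
                }
            }
            if (!word.empty()) codes[code].insert(word);
        }
        long uniq = 0, total = 0;
        for (std::map<std::string, std::set<std::string> >::iterator it =
                 codes.begin(); it != codes.end(); ++it) {
            total += (long)it->second.size();
            if (it->second.size() == 1) ++uniq;   // no other word shares the code
        }
        printf("%ld of %ld distinct words have a code of their own\n", uniq, total);
        return 0;
    }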

Some other problems with the test:

1. A human is well "trained" by the age of 21, yet he has processed
much less than one gigabyte of language. A child who has just learned
to read and write can also do very well on Shannon's tests.

2. If you give someone a very big book written in an unknown language
he/she won't be able to learn that language and understand the book's
meaning (even if the book is written using a known alphabet).

3. Do not remove punctuation. A sentence ending in "?" is different
from a sentence ending in "." (eg. the distribution of the first word
in the sentence is different, or the order of words may be different).

4. Humans do have memory! Shannon didn't include humans' memory in his
tests, only their capability to guess letters in different contexts
(humans do not train online). Your benchmark also includes the
compressor size. I suggest you estimate humans' memory and allow
compressors to have other files (like dictionaries) up to that limit.

Matt Mahoney

Jun 11, 2006, 5:39:01 PM
Alexandru wrote:
> I wrote a similar program to remove punctuation and XML tags. I split
> the enwik9 file into two files: one with the punctuation (and xml/xhtml
> tags) and another one with the real text (keeping enough information to
> reconstruct the initial file). I got an overall improvement of 5% in
> the compression ratio (around 10 MB). The file with the real text,
> "text.txt", was a big list of words with only a single space between
> consecutive words.
>
> I also wrote a program that guesses words (~21% of the words on enwik9,
> and ~17% on enwik8) in order to reduce the size of the "text.txt" file.
> Unfortunately, the compression ratio (of bzip and chile) did not improve
> (both programs use BWT). Anyway, if anyone wants to test my idea I can
> give him the necessary files (3 source files written in C++, very
> simple to use within another program just by including a header).

If you post the programs, I will add them to the benchmark.
Preprocessors (like xml-wrt) combined with existing compressors are
useful.

> Some other problems with the test:
>
> 1. A human is well "trained" by the age of 21, yet he has processed
> much less than one gigabyte of language. A child who has just learned
> to read and write can also do very well on Shannon's tests.

True, 1 GB was just a rough guess. Has anyone measured how well
children do in Shannon's tests compared to adults?

> 2. If you give someone a very big book written in an unknown language
> he/she won't be able to learn that language and understand the book's
> meaning (even if the book is written using a known alphabet).

Also true, because people can't learn a language without grounding the
meanings of words to their other senses. People only learn what is
relevant. However, I think that language has properties that allow
machines to do this. Words with similar meanings or similar
grammatical roles will appear in similar context. A machine can learn
those groupings by analyzing the statistics of the text. A human
cannot because he or she would be overwhelmed by attempting to read a
foreign text. It takes people many years to learn such associations.
Machines can learn them in minutes.

> 3. Do not remove punctuation. A sentence ending in "?" is different
> from a sentence ending in "." (eg. the distribution of the first word
> in the sentence is different, or the order of words may be different).

I agree this would be better. I only removed punctuation because
Shannon did. It would be more accurate to measure the entropy using
more realistic text. I am considering ideas to improve Shannon's tests
to measure the entropy of enwik9, but I am unsure how to do it. One
idea I have is to have a human-assisted computer language model. The
computer would make its best guesses about the next word or character
and present these to the user. The user could either accept the
machine's guesses as is, or substitute his own guess either from the
list or by typing it in.

I would like to do this in a way that minimizes the uncertainty in
Shannon's method, which is due to different probability distributions
leading to the same series of observed guesses by the human. I would
also like to make the test easy enough that a large number of samples
could be measured. (The tests by Cover and King were very tedious. It
took several people several hours to measure one sentence).

I suspect that the distribution of entropy in enwik9 and in text in
general is nonuniform and possibly fractal in nature. This will
complicate how I pick samples to measure. I will need to study how
compression varies across the file.

> 4. Humans do have memory! Shannon didn't include humans' memory in his
> tests, only their capability to guess letters in different contexts
> (humans do not train online). Your benchmark also includes the
> compressor size. I suggest you estimate humans' memory and allow
> compressors to have other files (like dictionaries) up to that limit.

The purpose of including the size of the decompressor is to prevent
information from being moved from the compressed file to the
decompressor in the form of dictionaries or prebuilt models. Otherwise
it is trivial to compress any file in most benchmarks to 1 byte. Of
course, humans can predict text using vast knowledge learned over a
lifetime in addition to the data itself. However, humans have limited
memory. What I want to know is just how much knowledge that is, in
bits. How much memory is needed to understand language, not including
the knowledge needed to ground the meanings of words to other senses?
How much knowledge is needed to build a language model that can match
human performance, as measured by prediction or recognition?

I have an update on this study. I tested the filtered enwik9 (715 MB)
on 17 compressors. It turns out that compressing the 100 MB text8 file
underestimates the compression ratio on the full filtered data set by
about 2% to 4%. The compression ratio is in fact about the same on
both the raw and filtered text, within 3% in most cases. In fact, on
the filtered text, paq8h compression is about 3% worse than on the raw
text, and would take 2nd place behind xml-wrt|ppmonstr when the
decompressor size is taken into account.

-- Matt Mahoney

Matt Mahoney

Jun 11, 2006, 11:38:20 PM
Matt Mahoney wrote:
> I have an update on this study. I tested the filtered enwik9 (715 MB)
> on 17 compressors. It turns out that compressing the 100 MB text8 file
> underestimates the compression ratio on the full filtered data set by
> about 2% to 4%. The compression ratio is in fact about the same on
> both the raw and filtered text, within 3% in most cases. In fact, on
> the filtered text, paq8h compression is about 3% worse than on the raw
> text, and would take 2nd place behind xml-wrt|ppmonstr when the
> decompressor size is taken into account.

Another update. I graphed the incremental compression of enwik9 with
ppmd (modifying the source code to print the x,y coordinates I needed).
The graph is here:
http://cs.fit.edu/~mmahoney/compression/text.html#data (scroll down a
bit). It turns out the entropy is not uniform, as might be expected.
There is a large dip in the middle where the incremental compression
ratio drops from 0.22 to about 0.08 for about 150 MB. It seems like
there are several thousand articles on U.S. towns that may have been
automatically generated from a table of census data, so the articles
have a regular structure. This probably explains why most compressors
did better on enwik9 than enwik8, even those without much memory.

-- Matt Mahoney

Ted Dunning

Jun 12, 2006, 5:13:08 PM

This sort of thing is actually relatively common in real text. Sports
stories are much more predictable than other genres, and you find times
when lots of new stuff happens, which increases the entropy temporarily
(think of 9/11, for example).

Matt Mahoney wrote:
> Matt Mahoney wrote:
...

Alexandru

Jun 14, 2006, 3:17:23 PM
I have a suggestion for the data set:

Assign each word in the text a unique 24-bit number and then transform
the text into a list of integers (a binary file). This way you'll force
people to think about grammar and word relations and not try to predict
individual letters (or bits). Since the file will be much smaller (I
expect half of the current size, i.e. ~400 MB) you could use the whole
of Wikipedia as the test to compress.
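Something along these lines (a rough sketch only, writing 3 bytes per
word; the dictionary itself would also have to be written out, which is
not shown here):

    // words2ints.cpp - sketch of the suggested transform: assign each
    // distinct word a 24-bit number and write the text as 3-byte codes.
    #include <cstdio>
    #include <map>
    #include <string>

    int main() {
        std::map<std::string, unsigned> dict;
        char buf[1000];
        while (scanf("%999s", buf) == 1) {        // input: words separated by spaces
            std::string w(buf);
            std::map<std::string, unsigned>::iterator it = dict.find(w);
            unsigned id;
            if (it == dict.end()) {
                id = (unsigned)dict.size();        // first occurrence gets the next id
                dict[w] = id;
            } else {
                id = it->second;
            }
            putchar(id & 255);                     // 24-bit code, little endian
            putchar((id >> 8) & 255);
            putchar((id >> 16) & 255);
        }
        fprintf(stderr, "%u distinct words\n", (unsigned)dict.size());
        return 0;
    }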

Matt Mahoney

Jun 14, 2006, 5:06:30 PM

I am currently doing some experiments along these lines, writing my own
text preprocessor similar to xml-wrt. In experiments with xml-wrt,
replacing the most common words with 1-2 byte tokens improves
compression with almost all compressors. However it is better not to
replace every word, just the most common ones, because you have to
compress the dictionary too. With xml-wrt the optimal threshold for
adding to the dictionary is about 180 (-f option) for enwik8 and 800
for enwik9. In both cases the dictionary size is several thousand
words, so 16 bits is enough.

I'm also experimenting with my own version of a text preprocessor that
can be integrated into a context mixing compressor so I can also use
semantic and syntactic modeling. Currently, paq8h (the top compressor)
is not improved by xml-wrt because its input is already preprocessed
with a static English dictionary, and xml-wrt interferes with its
internal word recognition models.

What I plan to do, hopefully by late summer, is abstract out the
lexical model: introduce symbols to override default capitalization
and spacing rules, add symbols corresponding to morphemes (root words,
suffixes, prefixes, punctuation), spell out rare words in letters and
common n-grams, and split off nontext and structured data into a
separate stream. (The spelling of novel words can sometimes reveal
information about their meanings.) After this preprocessing I will
cluster words (including those spelled out) by their immediate and
longer range context to learn their syntactic and semantic roles,
respectively, and use those classifications as context in a
context-mixing compressor. I may experiment with a "soft" clustering
using online LSA (but for both semantics and syntax) implemented with a
neural network.

-- Matt Mahoney

Alexandru

Jun 14, 2006, 6:21:26 PM
Matt Mahoney wrote:
> Alexandru wrote:
> > I have a suggestion for the data set:

> > Assign each word in the text a unique 24-bit number and then transform
> > the text into a list of integers (a binary file). This way you'll force
> > people to think about grammar and word relations and not try to predict
> > individual letters (or bits). Since the file will be much smaller (I
> > expect half of the current size, i.e. ~400 MB) you could use the whole
> > of Wikipedia as the test to compress.

> I am currently doing some experiments along these lines, writing my own
> text preprocessor similar to xml-wrt. In experiments with xml-wrt,
> replacing the most common words with 1-2 byte tokens improves
> compression with almost all compressors. However it is better not to
> replace every word, just the most common ones, because you have to
> compress the dictionary too. With xml-wrt the optimal threshold for
> adding to the dictionary is about 180 (-f option) for enwik8 and 800
> for enwik9. In both cases the dictionary size is several thousand
> words, so 16 bits is enough.

My idea was to make the dictionary publicly available (with the option
to allow decompressors to use it or not) and to have an easy-to-read
file. I don't think anybody will try to mess with the xml-wrt file
format or with codes generated by a Huffman tree. You need a very
simple format whose main utility is not compression (although it does
compress somewhat), but speed (e.g. comparing two ints is faster than
comparing two words) and efficiency (e.g. no memory needed to remember
words, no parsing, and it is a *perfect hash function*).

> I'm also experimenting with my own version of a text preprocessor that
> can be integrated into a context mixing compressor so I can also use
> semantic and syntactic modeling. Currently, paq8h (the top compressor)
> is not improved by xml-wrt because its input is already preprocessed
> with a static English dictionary, and xml-wrt interferes with its
> internal word recognition models.

I've written myself a primitive text preprocessor. It parses the text
and tries to guess each word using the text already seen. It outputs
the word only if the guess was wrong. The input is a big list of words
with only one space between two consecutive tokens. The output is a
list of words with one or more spaces between two consecutive items.
If there is no word between two consecutive spaces then the
preprocessor guessed the word.

eg.: (_ stands for space) "blah_mumu_bubu" -> "blah__bubu" (the
preprocessor guessed the word mumu).
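Stripped down to an order-1 context, the idea looks roughly like this
(only a sketch, not the code in the archive):

    // skip_sketch.cpp - stripped-down version of the idea: predict the
    // next word from the previous word; if the prediction is right,
    // emit nothing between the two spaces.
    #include <cstdio>
    #include <map>
    #include <string>

    int main() {
        std::map<std::string, std::string> pred;   // last word seen after each word
        std::string prev;
        char buf[1000];
        while (scanf("%999s", buf) == 1) {
            std::string w(buf);
            std::map<std::string, std::string>::iterator it = pred.find(prev);
            if (it != pred.end() && it->second == w)
                printf(" ");                       // guessed: leave the slot empty
            else
                printf(" %s", w.c_str());          // not guessed: output the word
            pred[prev] = w;                        // update the order-1 prediction
            prev = w;
        }
        printf("\n");
        return 0;
    }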

You can download the preprocessor from this location:
http://mosoi.lumina.ro/downloads/skip.tar.bz2

Because I'm having my final exams these days I cannot continue my work
on this program (at least not for the next month), but if you have any
ideas (or questions), please drop me a line. At the moment the context
is only the last word (and sometimes no words). The next step will be
to use a variable length context (but it already consumes too much
memory, ~250 MB).

> -- Matt Mahoney

-- Alexandru Mosoi

Alexandru

Jun 15, 2006, 1:26:08 PM

The program moved here:
http://mosoi.lumina.ro/downloads/Chile/large/skip.tar.bz2

I wrote some programs that implement my idea (construct a dictionary
and transform the list of words into a list of integers). The
dictionary size (as a bzip2 archive) is ~450 KB for text8.txt and
~1.5 MB (not very big) for text9.txt.

The results are (bzip2):
text8.txt        100,000,000 bytes
text8.txt.bz2     26,395,389 bytes

dict8.txt          1,187,596 bytes
outp8.txt         51,015,621 bytes  -> ~1/2 of the original file (as I predicted)

dict8.txt.bz2        458,295 bytes  -> dictionary file
outp8.txt.bz2     23,382,764 bytes  -> the binary file with integers
total             23,841,059 bytes  -> total size

An improvement of 2,554,330 bytes!!

For gzip:
text8.txt.gz      33,048,243 bytes
outp8.txt.gz      27,768,161 bytes
dict8.txt.gz         554,332 bytes
total             28,322,493 bytes

Again an improvement of 4,725,756 bytes!!!

With xml-wrt (-f180):
text8.txt.xwrt      37,082,250 bytes
text8.txt.xwrt.bz2  23,869,114 bytes  (a little bit worse than compressing the list of integers)

Moreover, my opinion is that by assigning shorter codes to more
frequent words you'll increase the entropy per byte of the file and,
consequently, compression performance will suffer.

My programs can be downloaded from:
http://mosoi.lumina.ro/downloads/Chile/large/.

Matt Mahoney

Jun 15, 2006, 4:04:23 PM

Does this work on enwik8/9 also? If so, send me some results and I
will post them.

Your "skip" program is like word-level LZP compression.

-- Matt Mahoney

Alexandru

Jun 15, 2006, 5:35:38 PM

No, it doesn't work on enwik8/9, only on text8/9.

Still, I can split enwik8/9 into two streams:
* first stream contains enough information to decode punctuation and
the location of words
* second stream contains only the words.

I can apply my work, including "skip", to the second stream. I wrote a
program to split the files into the streams, but I didn't write the
reverse program (so the results cannot be verified yet). If I get some
notable results (like a 15% improvement in compression ratio) I will
submit them.

At the moment I want to concentrate only on compressing the stream of
words (text8/9 files). I assume that compressing human language (and
not xml/html tags) is your goal, too (correct me if I'm wrong).

The results I posted earlier show that using a dictionary with all the
words is a very good idea. Not only does it boost compression with
other programs, but the transformation of words into integers (like a
perfect hash function) also allows better text preprocessing by
decreasing memory usage and increasing computation speed. Moreover,
"the hash function" is bijective, so integers can be converted easily
into words and vice versa.

> Your "skip" program is like word-level LZP compression.

Oh, thanks for taking a look at it. I didn't know about the LZP
algorithm until an hour ago (I never paid much attention to the LZ*
algorithms). The difference between LZP and "skip" is that LZP uses a
higher-order context to predict; however, I'm planning to implement a
variable length context for skip, too. It will be a tremendous amount
of work since I will have to rewrite parts of the program to make it
work with integers (now that I have decided to use a dictionary).

--
Alexandru Mosoi

Ted Dunning

Jun 15, 2006, 7:54:13 PM

You can probably grab a few percent more by compressing your dictionary
using prefix compression (standard trick).
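For example, front coding on the sorted list looks something like this
(just a sketch):

    // frontcode.cpp - prefix ("front") compression of a sorted word list:
    // each word is written as the number of leading characters it shares
    // with the previous word, followed by the differing suffix.  Fine for
    // a-z word lists; a real coder would encode the length unambiguously.
    #include <cstdio>
    #include <cstring>
    #include <string>

    int main() {
        std::string prev;
        char buf[1000];
        while (fgets(buf, sizeof(buf), stdin)) {
            buf[strcspn(buf, "\r\n")] = 0;           // strip the newline
            std::string word(buf);
            size_t k = 0;                            // length of shared prefix
            while (k < prev.size() && k < word.size() && prev[k] == word[k]) ++k;
            printf("%zu%s\n", k, word.c_str() + k);  // prefix length + new suffix
            prev = word;
        }
        return 0;
    }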

Using arithmetic coding to minimize the size of your code words will
definitely help as well.

There are interesting methods that try to find an optimal lexicon for
this sort of compression. See Carl de Marcken's thesis about
segmentation of English.

If you use an entropic method for deriving your lexicon, it should
automagically include common multi-word phrases. Matt's observation
that your method is similar to word-level LZP is correct, and if you
use phrases it can be extended to say that you are doing something like
word-level LZ.

Matt Mahoney

Jun 15, 2006, 9:45:29 PM

You could also have codes for punctuation characters, and a special
symbol meaning to capitalize the next letter. Also, you could have
rules about when to insert a space before a word (like after another
word or "," but not after a linefeed or "(" ) and a special symbol to
override the rules.
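A minimal sketch of just the capitalization part (the space rules would
work the same way, with a second escape symbol):

    // caps.cpp - sketch of capitalization modeling: a word starting with
    // a capital letter is lowercased and preceded by an escape byte, so
    // the compressor sees a single dictionary entry per word.  Assumes
    // byte 1 does not occur in the input.
    #include <cstdio>
    #include <cctype>

    const int CAP = 1;               // escape: "capitalize the next letter"

    int main() {
        int c, atWordStart = 1;
        while ((c = getchar()) != EOF) {
            if (atWordStart && isupper(c)) {
                putchar(CAP);        // mark the capital instead of storing it
                putchar(tolower(c));
            } else {
                putchar(c);
            }
            atWordStart = !isalpha(c);   // a letter after a non-letter starts a word
        }
        return 0;
    }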

> At the moment I want to concentrate only on compressing the stream of
> words (text8/9 files). I assume that compressing human language (and
> not xml/html tags) is your goal, too (correct me if I'm wrong).
>
> The results I posted earlier show that using a dictionary with all the
> words is a very good idea. Not only does it boost compression with
> other programs, but the transformation of words into integers (like a
> perfect hash function) also allows better text preprocessing by
> decreasing memory usage and increasing computation speed. Moreover,
> "the hash function" is bijective, so integers can be converted easily
> into words and vice versa.

Yes, but part of language modeling is dealing with the junk at the
lexical level like structure, formatting and misspelled words.
Wikipedia is high quality with few misspellings, but more commonly you
can improve compression of misspelled words using ngrams rather than
creating new dictionary entries. The filtered text is good for testing
ideas, though.

> > Your "skip" program is like word-level LZP compression.
>
> Oh, thanks for taking a look at it. I didn't know about the LZP
> algorithm until an hour ago (I never paid much attention to the LZ*
> algorithms). The difference between LZP and "skip" is that LZP uses a
> higher-order context to predict; however, I'm planning to implement a
> variable length context for skip, too. It will be a tremendous amount
> of work since I will have to rewrite parts of the program to make it
> work with integers (now that I have decided to use a dictionary).

It will be interesting to see what effect this has on language
modeling. I guess if you take all the predictable parts out of text,
it will still be somewhat comprehensible but no longer grammatical. It
might make compression worse for an n-gram model, but I wonder if it
would help a more sophisticated model like LSA.

>
> --
> Alexandru Mosoi

-- Matt Mahoney

Alexandru

Jun 16, 2006, 6:01:18 AM
> > At the moment I want to concentrate only on compressing the stream of
> > words (text8/9 files). I assume that compressing human language (and
> > not xml/html tags) is your goal, too (correct me if I'm wrong).
> >
> > The results I posted earlier show that using a dictionary with all the
> > words is a very good idea. Not only does it boost compression with
> > other programs, but the transformation of words into integers (like a
> > perfect hash function) also allows better text preprocessing by
> > decreasing memory usage and increasing computation speed. Moreover,
> > "the hash function" is bijective, so integers can be converted easily
> > into words and vice versa.
>
> Yes, but part of language modeling is dealing with the junk at the
> lexical level like structure, formatting and misspelled words.
> Wikipedia is high quality with few misspellings, but more commonly you
> can improve compression of misspelled words using ngrams rather than
> creating new dictionary entries. The filtered text is good for testing
> ideas, though.

I plotted the distribution of new words over text9.txt (fil9). Check
the image:
http://mosoi.lumina.ro/downloads/Chile/large/words.jpg

You may observe that the number of words added to the dictionary is
almost constant (it decreases very slowly over time). This happens
because Wikipedia contains articles on many subjects, each subject
with its own specific vocabulary (this also explains the huge number
of words in the dictionary).

Alexandru Mosoi

Matt Mahoney

Jun 16, 2006, 11:46:53 AM

Very interesting. This distribution is consistent with Zipf's law.
Zipf's law says the n'th most frequent word has a frequency
proportional to 1/n. At every point in the learning process, about
half of the words in the dictionary should occur only once. The rate
at which novel words are encountered is nearly constant (after an
initial spurt). This implies that natural language is a nonstationary
source over large time scales.

If Zipf's law holds, then about 40,000 words have a frequency of 20 or
higher (since 800,000 have a frequency of 1). This is significant
because a child has to read a word about 20 times to "learn" it, so
that it becomes part of the vocabulary. The typical adult vocabulary
is 30,000 to 50,000 words.
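A quick sanity check of those two numbers under an ideal Zipf
distribution:

    // zipf_check.cpp - check the claims above assuming frequency(rank)
    // = C/rank, with C ~ 800,000 (the rank at which the frequency drops
    // to 1, i.e. roughly the vocabulary size).
    #include <cstdio>
    int main() {
        double C = 800000;
        // words with frequency >= f are those with rank <= C/f
        printf("words occurring exactly once: about %.0f (half the vocabulary)\n",
               C - C / 2);
        printf("words with frequency >= 20:   about %.0f\n", C / 20);
        return 0;
    }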

The relatively flat section in the middle looks like it corresponds to
the region of low compression ratio in enwik9 observed with ppmd.

This is exactly the kind of research I was hoping this benchmark would
encourage.

-- Matt Mahoney

Matt Mahoney

Jun 16, 2006, 4:45:24 PM

I did an analysis of the unfiltered and filtered text (enwik9 and fil9)
to test if Zipf's law holds. In both cases, I counted any sequence of
a-z as a word, ignoring case. This is cruder than Zipf's analysis.
Zipf analyzed text by hand (this was in 1935) and considered
morphology, so suffixes like -s, -ed, -ing, etc. were counted as
separate words from their roots. It turns out with my crude parsing
that the fit to Zipf's law is rather poor, with or without filtering.
Filtering just makes it a little worse.
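The counting itself is trivial; roughly this (a sketch, not the exact
program I used):

    // wordfreq.cpp - count every maximal a-z sequence (case folded) as a
    // word and print frequency, rank and word, most frequent first.
    // Rank x frequency should be roughly constant if Zipf's law holds.
    #include <cstdio>
    #include <cctype>
    #include <map>
    #include <string>
    #include <vector>
    #include <algorithm>

    int main() {
        std::map<std::string, long> count;
        std::string w;
        int c;
        while ((c = getchar()) != EOF) {
            if (isalpha(c)) w += (char)tolower(c);
            else if (!w.empty()) { ++count[w]; w.clear(); }
        }
        if (!w.empty()) ++count[w];

        // sort by descending frequency and print frequency, rank, word
        std::vector<std::pair<long, std::string> > v;
        for (std::map<std::string, long>::iterator it = count.begin();
             it != count.end(); ++it)
            v.push_back(std::make_pair(-it->second, it->first));
        std::sort(v.begin(), v.end());
        for (size_t i = 0; i < v.size(); ++i)
            printf("%ld %lu %s\n", -v[i].first, (unsigned long)(i + 1),
                   v[i].second.c_str());
        return 0;
    }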

Here are the words in each file ranked by frequency. Zipf's law says
rank x frequency = constant (dictionary size). Here are the results for enwik9.
The Zipf constant is 7-14 million for the top 10,000 words but only
about 1.2 million around the bottom. (In case of words tied by
frequency, the rank is shown for the last word in the list).

1171106 types, 141176630 tokens
Freq Rank
7803966 1 the
4864784 2 of
3064252 3 and
2634604 4 in
2366965 5 a
2147564 6 to
1875373 7 quot
1638933 8 is
1423733 9 id
1379670 10 s
1153722 11 lt
1152134 12 gt
994178 13 for
919854 14 amp
861329 15 as
855885 16 are
817812 17 was
740085 18 by
726845 19 with
664439 20 from
...
143452 100 most
...
14368 1000 songs
...
956 10002 ale arrows axioms cal charlemagne expenditures
paradigm...
...
26 100665 aaaa aabybro aak aalsmeer abdolhossein abdoulaye
aberfoyle...
...
5 277712 aaaba aabye aaca aacap aachtopf aafjes aafss aakal
aamer...
4 326414 aaaaaaa aaaaahh aaaai aaaargh aaad aaasouth aabab
aabbff...
3 403550 aaaab aaaar aaalac aaargh aabaa aababb aabadie aabrekk
aacc...
2 577832 aaaaaaaa aaaaaaaaaah aaaabbbbccccddddeeeeffffgggg
aaaay...
1 1171106 aaaaaaaaa aaaaaaaaaaa aaaaaaaaaaaa...


Here are the results for fil9. Zipf's constant is about 7-20 million for the top
10,000 words, then drops to less than 1 million.

833184 types, 124301826 tokens
Freq Rank
7446708 1 the
4453926 2 of
3776770 3 one
3085174 4 zero
2916968 5 and
2480552 6 in
2339802 7 two
2241744 8 a
2063649 9 nine
2028129 10 to
1594091 11 is
1489530 12 eight
1483148 13 three
1459622 14 four
1456363 15 five
1283963 16 six
1123114 17 seven
938010 18 for
840437 19 are
829101 20 as
...
109212 100 line
...
11177 1000 route
...
752 10007 angela bulls costly fare finch hulk hypothetical
inputs...
...
18 101387 abbate abderus abdomens abides abigor abp abridge...
...
5 218316 aaaaaa aabba aabf aachtopf aafss aaiyangar aakal
aakirkeby...
4 253871 aaaaahh aaaba aabab aachener aadt aafjes aafk aafla
aagaard...
3 309475 aaaaaaa aaaab aaaar aaaargh aaad aabaa aabc aabye
aadolf...
2 425051 aaaaaaaa aaaabbbbccccddddeeeeffffgggg aaaay aaab
aaabb...
1 833184 aaaaaaaaa aaaaaaaaaaa aaaaaaaaaaaa aaaaaaaaaaaaaaarg...


-- Matt Mahoney

Matt Mahoney

Jun 16, 2006, 5:28:16 PM
Ted Dunning wrote:
> You can probably grab a few percent more by compressing your dictionary
> using prefix compression (standard trick).

Here are some compression benchmarks for a sorted list of words.
http://maximumcompression.com/data/dict.php?historic=true

In writing PAQ8 I saw a big jump in compression when I added an
indirect context model. At the time I was writing it, I was not even
trying to tune it for sorted words, although it helped on almost every
test I tried. Still, I was rather surprised by the effect. The other
models that help are a text column model (using the column position and
character above as context), and an extra level of model indirection at
the bit level. In other words, if the bit sequence 0001 is observed in
4 successive occurrences of some context, then the next bit prediction
depends on the 0/1 count in this indirect context based on what happens
after the sequence 0001 is observed in other contexts. In a dictionary
the next bit will usually be 1 because the words are in ascending
order. (01 is more common than 10). This is an example of indirect
context modeling at the bit level, which is present in all versions of
PAQ7 and PAQ8.
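To make the idea concrete, here is a toy sketch of bit-level indirect
context modeling (this is not the PAQ code, just an illustration of the
indirection):

    // indirect_sketch.cpp - toy illustration of an indirect bit model:
    // each context remembers the last 4 bits that occurred in it, and
    // that 4-bit history selects a shared pair of 0/1 counters which is
    // used for the prediction.
    #include <cstdio>
    #include <map>

    struct Counter { int n0, n1; Counter() : n0(0), n1(0) {} };

    int main() {
        std::map<unsigned, unsigned> history;   // context -> last 4 bits seen in it
        Counter shared[16];                     // one counter pair per 4-bit history

        // a toy bit stream and a toy order-1 "context" (the previous bit)
        const int bits[16] = {0,0,0,1, 0,0,0,1, 0,0,0,1, 0,0,0,1};
        unsigned context = 0;
        for (int i = 0; i < 16; ++i) {
            unsigned h = history[context] & 15;              // the indirect context
            Counter& c = shared[h];
            double p1 = (c.n1 + 1.0) / (c.n0 + c.n1 + 2.0);  // P(next bit = 1)
            printf("bit %2d  context %u  history %2u  p1=%.2f  actual=%d\n",
                   i, context, h, p1, bits[i]);
            if (bits[i]) ++c.n1; else ++c.n0;                // update shared counters
            history[context] = ((history[context] << 1) | (unsigned)bits[i]) & 15;
            context = (unsigned)bits[i];                     // next order-1 context
        }
        return 0;
    }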

PAQ6 and PAQAR use fixed tables, so are more likely to predict 0.
However there is also some improvement in PAQ7-8 due to using a neural
network to mix predictions instead of adaptive weighted linear mixing.
The main difference is that instead of a weighted average of what the
various models predict the next bit to be, the probabilities are first
transformed into the logistic domain, p -> ln(p/(1-p)), before the
weighted averaging, then transformed back, x -> 1/(1 + e^-x), prior to
arithmetic coding.
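For example (a sketch; the real mixer also adapts the weights online):

    // mix_sketch.cpp - linear mixing vs. mixing in the logistic domain.
    #include <cstdio>
    #include <cmath>

    double stretch(double p) { return log(p / (1 - p)); }   // p -> ln(p/(1-p))
    double squash(double x)  { return 1 / (1 + exp(-x)); }  // x -> 1/(1+e^-x)

    int main() {
        // two model predictions for "the next bit is 1" and their weights
        double p[2] = {0.9, 0.99}, w[2] = {0.5, 0.5};

        double linear = 0, logistic = 0;
        for (int i = 0; i < 2; ++i) {
            linear   += w[i] * p[i];              // weighted linear mixing
            logistic += w[i] * stretch(p[i]);     // mixing in the logistic domain
        }
        // the logistic mix is pulled further toward 1 by the confident model
        printf("linear mix:   %.4f\n", linear);
        printf("logistic mix: %.4f\n", squash(logistic));
        return 0;
    }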

In PAQ8F and higher, I also added indirect models at the byte level.
This means that the sequence "AB...AC...AB...AC...AB...A?" uses the
history "BCBCB" (the sequence that occurred in context A) to predict
the next byte. Note that B is always followed by C, so C is predicted
in the indirect contexts "B" and "CB".

Neither XML-WRT nor the external dictionary in PAQ8H is sorted, so I
think there might be some room for improvement here.

-- Matt Mahoney

Alexandru

Jun 16, 2006, 7:11:27 PM

Ted Dunning wrote:
> You can probably grab a few percent more by compressing your dictionary
> using prefix compression (standard trick).

Already did this, but the trick doesn't work with paq7 (it works with
bzip and gzip). Paq7 does better if I leave the sorted list of words
untouched. I tried different tricks, none of which improved compression
with paq7. I think paq8h does better, but I cannot run it on my
computer.

I kindly ask Matt to publish a Linux binary executable (with static
libraries) of paq8h so I can run tests on it. Currently I'm doing all
my tests with bzip2 because it's fast and has no model for English text
(which is what I'm working on), but I plan to also try paq from time to
time.

Ted Dunning

Jun 16, 2006, 7:57:39 PM

This introduction of new words also occurs due to tokens such as
numbers. For instance, in a novel, you expect to see at least one new
token on each of the later pages (the page number, at least).

Whether that sort of vocabulary growth is interesting is a debatable
issue.

As you get to very, very large corpora, the rate of new tokens which
are actually words in the sense used by most linguists decreases to
very nearly zero. This is because Zipf's law doesn't precisely hold
for very high or very low frequency words; both are found in lesser
abundance than predicted.

Ted Dunning

Jun 16, 2006, 8:15:11 PM

Matt Mahoney wrote:
> Matt Mahoney wrote:
>
> I did an analysis of the unfiltered and filtered text (enwik9 and fil9)
> to test if Zipf's law holds. ... It turns out with my crude parsing

> that the fit to Zipf's law is rather poor, with or without filtering.
> Filtering just makes it a little worse.
...

> The Zipf constant is 7-14 million for the top 10,000 words but only
> about 1.2 million around the bottom. (In case of words tied by
> frequency, the rank is shown for the last word in the list).

These results are typical. If you plot the results on a log-log graph,
you see a goodly stretch that is roughly linear with droop at each end.
The drop that you see is not all that large.

Working with lemmatized text doesn't change the results all that much.
You still get droop at each end.

Ted Dunning

Jun 16, 2006, 8:19:22 PM

Matt Mahoney wrote:

> ... The rate


> at which novel words are encountered is nearly constant (after an
> initial spurt). This implies that natural language is a nonstationary
> source over large time scales.

Au contraire.

A hyperbolic distribution with a very large number of unobserved
symbols does not imply non-stationarity.

In fact, word frequencies that vary by topic also do not imply
non-stationarity. That just implies that the probability of the next
word can't be modeled by a stationary multinomial. A stationary
hidden-variable approach or a history-based model might do just fine.

Matt Mahoney

Jun 16, 2006, 8:31:21 PM

Unfortunately PAQ8H (and PAQ8G) have some Windows-specific code in the
text preprocessor written by Przemyslaw Skibinski. I think he plans to
replace the static dictionaries with xml-wrt in a future version of PAQ,
which should compile under Linux as xml-wrt currently does.
Meanwhile you can use PAQ8F, which does no dictionary preprocessing and
compiles under Linux. However, it lacks Alexander Ratushnyak's
improvement to the mixer, so the compression is not quite as good.

-- Matt Mahoney

Matt Mahoney

Jun 17, 2006, 9:55:40 AM

Yes, I guess you're right; you need more than that to show the
nonstationarity of text.

But even huge corpora like libraries, the web, or Usenet can be
organized hierarchically. A depth-first traversal of this hierarchy
will show nonuniform statistics over the largest time scales. You can
artificially make a corpus stationary by chopping it into small pieces
and shuffling them, but this does not remove the underlying
organization. These pieces can be clustered to recover the hierarchy.

Even if you consider every word ever spoken or written, this source is
nonstationary because language evolves over time.

-- Matt Mahoney

Alexandru

Jun 17, 2006, 11:27:06 AM

I also checked paq7 and paq8f. With paq7 there is no improvement, but
with paq8f:

text.txt.paq8f    18,554,164 bytes

dict.txt.paq8f       359,198 bytes
outp.txt.paq8f    17,940,311 bytes
total             18,320,744 bytes  (the improvement is small, though)

* Note: The words in the dictionary are not compressed with prefix
compression.

Matt Mahoney

Jun 17, 2006, 11:49:04 AM
> ...
> 26 100665 aaaa aabybro aak aalsmeer abdolhossein abdoulaye
> ...
> 5 277712 aaaba aabye aaca aacap aachtopf aafjes aafss aakal
> 4 326414 aaaaaaa aaaaahh aaaai aaaargh aaad aaasouth aabab
> 3 403550 aaaab aaaar aaalac aaargh aabaa aababb aabadie aabrekk
> 2 577832 aaaaaaaa aaaaaaaaaah aaaabbbbccccddddeeeeffffgggg
> 1 1171106 aaaaaaaaa aaaaaaaaaaa aaaaaaaaaaaa...

Here is the distribution of enwik9, counting words only once per page
(article). In enwik9, the string "<p" (start of the <page> tag) is
used to delimit articles. This 2 character string occurs nowhere else.
The distribution is again Zipf-like, but the droop at the ends is more
pronounced. The product of rank x frequency peaks at around 8 million.

1171106 types, 52789779 tokens, 243426 pages
Freq Rank
243427 2 page xml
243426 12 contributor id preserve revision space t text
timestamp...
205420 13 username
190590 14 comment
175772 15 of
169458 16 the
165157 17 to
156329 18 in
155269 19 and
153842 20 a
146587 21 category
143877 22 s
141985 23 is
131662 24 for
129792 25 as
126046 26 from
125279 27 with
121777 28 it
117671 29 an
115512 30 by
...
85080 50 no
...
51218 100 some
...
36246 200 being
...
14389 500 george
...
7574 1000 running
...
4014 2001 edge seem
...
1421 5001 introduce targets
...
552 10009 covenant daytime dining drift exciting grip guarded
issuing...
...
196 20061 acquaintances afc appleton barclay beers carriages
cayman...
...
45 50213 abdur abrogation abruzzo abugida acci accueil
acidosis...
...
15 100727 aacsb aalen aao aarons abacha abaddon abanet abcl
abdali...
...
10 130964 aaai aadd aands aarseth aarti aaup abascal abaya
abbad...
...
5 213602 aaah aabba aach aachener aacm aadolf aafa aalberg
aalge...
4 253832 aaaaaaa aaasouth aabye aacap aachtopf aadl aafjes
aafrika...
3 322435 aaaai aaad aaargh aabab aabc aabrekk aaca aacc
aachencath...
2 475773 aaaaaaaaaah aaaab aaab aaaba aaade aaalac aaarrr
aabaa...
1 1171106 aaaaaaaa aaaaaaaaa aaaaaaaaaaa aaaaaaaaaaaa...

-- Matt Mahoney

Ted Dunning

Jun 17, 2006, 5:50:06 PM

Alexandru wrote:
> Ted Dunning wrote:
> > You can probably grab a few percent more by compressing your dictionary
> > using prefix compression (standard trick).
>
> Already did this, but the trick doesn't work with paq7 (it works with
> bzip and gzip). Paq7 does better if I leave the sorted list of words
> untouched.

Score one for Matt's cleverness in defining the rules in such a way
that the compressor designers can focus on important issues.
