Re: Ian's Compression Algorithm

Rob Freeman

unread,

Jan 15, 2007, 5:15:21 AM1/15/07

to Hutter Prize

Hi Ian,

This cross-posted from Creat...@googlegroups.com to comp.compression,
and Hutter Pr...@googlegroups.com. It seems relevant.

Ian Parker wrote:
>
> Either natural language is compressible or it is not. You now seem to
> be saying it is compressible but not by very much - am I right?

You can think of it as an attempt to explain why text can't be
compressed as much as we expect if you like. In particular why we can't
compress it by as much as human estimates for the entropy of text
suggest we should (e.g. Matt Mahoney on Shannon.)

Something like... that the sum of human knowledge needed to compress a
given text that much, will always be bigger than the text itself.

We can understand this if we accept that knowledge is coded in the
different ways of ordering text (or the different ways of ordering any
sufficiently complex/random set.)

It will always take more information to code all the different ways of
ordering a random set than it takes to code the set itself. (It is only
a corollary that you can't use information about the different possible
orderings of a text to compress text, because your knowledge of all the
possible orders of a text will always be bigger than the text itself.)

It is important to note that while language is statistically random
(permitting the same tokens to specify more orders), language is not
indeterminate (c.f. Adrian Barnett's rather amusing "bitmap" Theory Of
Everything (http://www.abarnett.demon.co.uk/theories.html#BITMAP.) With
language the different orders are precisely determined by the set
itself. It is just many are equally likely.

Which IMHO is the insight we have been looking for to model AI (NLP
etc.)

-Rob

Matt Mahoney

unread,

Jan 15, 2007, 1:54:29 PM1/15/07

to Hutter Prize

Rob Freeman wrote:
> Hi Ian,
>
> This cross-posted from Creat...@googlegroups.com to comp.compression,
> and Hutter Pr...@googlegroups.com. It seems relevant.
>
> Ian Parker wrote:
> >
> > Either natural language is compressible or it is not. You now seem to
> > be saying it is compressible but not by very much - am I right?
>
> You can think of it as an attempt to explain why text can't be
> compressed as much as we expect if you like. In particular why we can't
> compress it by as much as human estimates for the entropy of text
> suggest we should (e.g. Matt Mahoney on Shannon.)
>
> Something like... that the sum of human knowledge needed to compress a
> given text that much, will always be bigger than the text itself.

No, the Kolmogorov complexity of a string x cannot be significantly
more than the size of x. You can always write a program to store and
print x verbatim.

> We can understand this if we accept that knowledge is coded in the
> different ways of ordering text (or the different ways of ordering any
> sufficiently complex/random set.)
>
> It will always take more information to code all the different ways of
> ordering a random set than it takes to code the set itself. (It is only
> a corollary that you can't use information about the different possible
> orderings of a text to compress text, because your knowledge of all the
> possible orders of a text will always be bigger than the text itself.)

No, if you have a set of k items out of a possible n, then there are (n
choose k) = n!/(k!(n-k)!) possible sets, but there are k! ways to order
the elements. It is not true that k! > (n choose k) when n is much
larger than k.

>
> It is important to note that while language is statistically random
> (permitting the same tokens to specify more orders), language is not
> indeterminate (c.f. Adrian Barnett's rather amusing "bitmap" Theory Of
> Everything (http://www.abarnett.demon.co.uk/theories.html#BITMAP.) With
> language the different orders are precisely determined by the set
> itself. It is just many are equally likely.

I like the idea of launching satellites by building a giant ramp into
space, driving them up by truck and pushing them over the edge. :-)

> Which IMHO is the insight we have been looking for to model AI (NLP
> etc.)
>
> -Rob

I didn't see any great insights there. Ian's compression algorithm at
http://ipcomp.blogspot.com/ looks like an attempt at random data
compression.

-- Matt Mahoney

Rob Freeman

unread,

Jan 15, 2007, 11:12:14 PM1/15/07

to Hutter Prize

Matt Mahoney wrote:
> Rob Freeman wrote:
> >
> > ...the sum of human knowledge needed to compress a

> > given text that much, will always be bigger than the text itself.
>
> No, the Kolmogorov complexity of a string x cannot be significantly
> more than the size of x. You can always write a program to store and
> print x verbatim.

You misunderstand me Matt. I am talking about attempts to compress text
using "human knowledge".

I define "human knowledge" to be different ways of ordering a system
(i.e. distributions of relations.) The information needed to specify
different ways of ordering a (random) system will always be bigger than
the information needed to specify the system itself.

If you choose to define "human knowledge" another way you will get
another result. But on the basis of past discussions I think you agree
that distributions of relations are the best model.

> > We can understand this if we accept that knowledge is coded in the
> > different ways of ordering text (or the different ways of ordering any
> > sufficiently complex/random set.)
> >
> > It will always take more information to code all the different ways of
> > ordering a random set than it takes to code the set itself. (It is only
> > a corollary that you can't use information about the different possible
> > orderings of a text to compress text, because your knowledge of all the
> > possible orders of a text will always be bigger than the text itself.)
>
> No, if you have a set of k items out of a possible n, then there are (n
> choose k) = n!/(k!(n-k)!) possible sets, but there are k! ways to order
> the elements. It is not true that k! > (n choose k) when n is much
> larger than k.

You misunderstand me again. My point is that "n choose k" is always
much larger than n (or k! is always >> k.)

> > It is important to note that while language is statistically random
> > (permitting the same tokens to specify more orders), language is not
> > indeterminate (c.f. Adrian Barnett's rather amusing "bitmap" Theory Of
> > Everything (http://www.abarnett.demon.co.uk/theories.html#BITMAP.) With
> > language the different orders are precisely determined by the set
> > itself. It is just many are equally likely.
>
> I like the idea of launching satellites by building a giant ramp into
> space, driving them up by truck and pushing them over the edge. :-)

The guy is inspired :-)

> > Which IMHO is the insight we have been looking for to model AI (NLP
> > etc.)
>

> I didn't see any great insights there. Ian's compression algorithm at
> http://ipcomp.blogspot.com/ looks like an attempt at random data
> compression.

I thought you analyzed Ian's algorithm well. It seems to come down to a
re-invention of compression based on first order probability
distributions. All credit to Ian for rediscovering that from first
principles.

But Ian's algorithm was not the insight I was talking about.

I think the insight we have been looking for to model AI/NLP is that
the information needed to code different ways of ordering a system
(knowledge) is always greater than the information to code the system
itself (for a random system.)

In the context of AI/NLP it is important to note that random need not
mean indeterminate. I hope I demonstrated this in our earlier thread on
comp.compression. In the case of language relational distributions can
precisely determine a syntax which nevertheless has random statistics.
To be clear I am talking once again about our toy example:
"AX...DX...DB...AZ...YZ...YC...A_". Here different relational
distributions (A,D), (A,Y) (corresponding to "knowledge": in some ways
"A is close to D but not Y" but in other ways "A is close to Y but not
D"?) deterministically specify syntax, but still produce a random
language model A_ -> AB/AC (probably because the distributions are
contradictory.)

These distributions specify the system, but it is more expensive to
enumerate all the different possible relational distributions than it
is to enumerate the system itself. (So they can't be used to compress
it.)

The way to model NLP/AI is to specify the underlying random (but
determinate) system (e.g. text), and generate different possible
distributions over that system to model knowledge, or limit syntax (the
same thing), as required.

-Rob

James Bowery

unread,

Jan 16, 2007, 12:01:21 AM1/16/07

to Hutter...@googlegroups.com

On 1/15/07, Rob Freeman <gro...@chaoticlanguage.com> wrote:

I define "human knowledge" to be different ways of ordering a system
(i.e. distributions of relations.) The information needed to specify
different ways of ordering a (random) system will always be bigger than
the information needed to specify the system itself.

When you talk about "different ways of ordering" you are really talking about the equivalence class of strings under a grammar.

Take a very simple grammar where two strings are equivalent:

X == "not not "X

We can see from this that you can index this infinite equivalence class with an integer i that places 2i "not " strings in front of X. But not all members of this equivalence class are equally likely. Therefore arithmetic coding can be used for the index.

What you're claiming is that the expression of knowledge being compressed has an index pretty close to 0 -- in other words, there isn't much you can do to simplify the expression.

I think most of us who believe the Hutter Prize will result in breakthroughs (AI, epistemological, scientific, philosophical, social, political, economic, etc.) believe that there is a _lot_ more to be done to the knowledge expressed in Wikipedia for it to have an index anywhere near 0.

ianpa...@gmail.com

unread,

Jan 16, 2007, 7:17:50 AM1/16/07

to Hutter Prize

I have no objections to cross posting under the new moderation scheme.
Hutter is in fact completely unmoderated.
It may well be that natural language is less compressible than Matt and
Co imagine. However it certainly is compressible. Obviously without a
definte benchmark you cannot tell how compressible.

I would like to make a number of remarks.

1) Rrimavera, Ressorte, Mamanthal. We know that English can be
translated into Spanish using a human translator. It would be hard for
me to concede therefore that techniques do not exist to eliminate at
least 2/3 of words using context. Primavera etc. implies log(2)3 bits
per word (at least).
2) This is a much more mathematical point. We know that the present
prize-winner has compressed to 17MB. If we allocate a number to a word
there must be a technique (we will ignore Kolmagorov for the moment)
which will compress our sequence to at most 17MB. We can at this stage
view punctuation marks as words. Algorithms like 7-zip take raw text
and compress using fragments as does the current prize-winner. Now if a
sequence is random to the extent that the probability of a word
occurring is simply the mean frequency of that word there cannot be any
better method than factorials, the Gosper approximation giving the
minimum compressed size. Now if V is our vector of words VP is going to
represent a fragment representation. We can get to fragments by a
simple matrix multiplication. VP = 17MB compressed. A matrix X can (and
has been) found which will give us VPX. The probability decomposition
of VPX MUST be 17MB. Of course multiplying one matrix by another simply
produces another matrix. Why not take the bull by the horns, apply LSA
and produce the 17MB minus matrix? The answer, of course, is
Kolmogorov. If I were not concerned with Hutter, but were simply
concerned with GALE and translating languages I would not care a damn
about Kolmogorov but would simply demand a larger training set.

This raises yet another point. We can get an initial semantic
assessment by simply associating fragments to words. This tends to be
what is done in assessing relationships between different languages,
cerradura does indeed have a very similar fragment structure to serrure
(French) arana, arraigne - Papers please! If we were to take the
prize-winning program and construct a matrix P linking it to a word
list. This could be a springboard to further compression.

- Ian Parker

Rob Freeman

unread,

Jan 16, 2007, 5:42:03 AM1/16/07

to Hutter Prize

James Bowery wrote:
> On 1/15/07, Rob Freeman <gro...@chaoticlanguage.com> wrote:
> >
> > I define "human knowledge" to be different ways of ordering a system
> > (i.e. distributions of relations.) The information needed to specify
> > different ways of ordering a (random) system will always be bigger than
> > the information needed to specify the system itself.
>
> When you talk about "different ways of ordering" you are really talking
> about the equivalence class of strings under a grammar.
>

> ...

>
> X == "not not "X

If I understand you correctly James, I do not mean "the equivalence
class of strings under a grammar" when I say "different ways of
ordering." I don't think the assumption of a grammar is justified at
all.

Why do you think the assumption of a grammar is justified?

I'm considering equivalence classes yes, but not classes under a
grammar. You could say I let the classes govern the grammar. In some
cases the two come to the same thing, but they are not the same.

I don't think there is any reason to assume that natural language
limits itself to paradigmatic relationships of the kind X == "not not
"X. I would go as far as to say it is demonstrably not the case.

A case in point. My example. For a language
"AX...DX...DB...AZ...YZ...YC...A_" you have two paradigmatic classes
(A,D) and (A,Y). You cannot reduce those to a grammar of the kind X ==
"not not "X. Is there any mapping (A,D) == "not not "(A,Y)?

For any real language there are many paradigmatic classes of this sort.
Just look at how many you can get.

Calculate them out for yourself. Expand my example. If A_1, A_2,... A_n
are the contexts of A in some text, and X_1, X_2,...X_n are contexts of
other tokens, then the number of ways A can have common contexts with
other tokens in the text, and thus uniquely specify some new
paradigmatic class, are just Matt's "(n choose k) = n!/(k!(n-k)!)
possible sets", where k is the number of common contexts between A and
some other token.

The syntagmatic distribution of sequences AX_? specified by these
classes in the text can be random, because many different paradigmatic
distributions (A_i,...A_i+k) can be equally likely (and must be,
because many of the "n choose k" possible classes will overlap, and
thus form complementary distributions with each other??) But the
relationship between any given syntax and its corresponding
paradigmatic distribution is not random. And the different paradigmatic
distributions (knowledge?) governing that random syntax are not random
either, just very numerous. Much more numerous than the sequence of 2n
or so tokens needed to specify them.

I don't know if any of that makes sense. Perhaps I have misunderstood
your objection.

-Rob

jabo...@gmail.com

unread,

Jan 17, 2007, 9:03:45 AM1/17/07

to Hutter Prize

Rob Freeman wrote:
> If I understand you correctly James, I do not mean "the equivalence
> class of strings under a grammar" when I say "different ways of
> ordering." I don't think the assumption of a grammar is justified at
> all.
>
> Why do you think the assumption of a grammar is justified?

Because you can think of a compressed file as a grammar specification
which has a single well formed string:

The original file.

This is most obvious in the case of simple hierarchical grammars like
those generated by Nevill-Manning's Sequitur compressor.

Rob Freeman

unread,

Jan 17, 2007, 10:43:01 AM1/17/07

to Hutter Prize

jabo...@gmail.com wrote:
> Rob Freeman wrote:
> > If I understand you correctly James, I do not mean "the equivalence
> > class of strings under a grammar" when I say "different ways of
> > ordering." I don't think the assumption of a grammar is justified at
> > all.
> >
> > Why do you think the assumption of a grammar is justified?
>
> Because you can think of a compressed file as a grammar specification
> which has a single well formed string

So you feel justified in assuming grammar to make compression possible,
on the grounds that if compression is possible you will have a grammar?

Surely your argument is circular.

As a case in point, what are the equivalence classes which compress my
example string: "AX...DX...DB...AZ...YZ...YC"?

In particular how do you capture the information that "in some ways A
is close to D but not Y, but in other ways A is close to Y but not D"?

-R

jabo...@gmail.com

unread,

Jan 17, 2007, 11:38:32 AM1/17/07

to Hutter Prize

Rob Freeman wrote:
> So you feel justified in assuming grammar to make compression possible,
> on the grounds that if compression is possible you will have a grammar?
>
> Surely your argument is circular.

No I was merely using grammar as an example of a knowledge
representation language and then you said it isn't powerful enough to
compress natural language and then I said that there is an existence
proof that it is: Sequitur with its hierarchical grammar output.

> As a case in point, what are the equivalence classes which compress my
> example string: "AX...DX...DB...AZ...YZ...YC"?
>
> In particular how do you capture the information that "in some ways A
> is close to D but not Y, but in other ways A is close to Y but not D"?

Sorry, but you're not making sense to me. Find some better way of
defining "information" aka: algorithmic information theory, and we can
continue this discussion here. Otherwise, keep it in comp.compression.

Reply all

Reply to author

Forward