This cross-posted from CreatingAI@googlegroups.com to comp.compression, and Hutter Prize@googlegroups.com. It seems relevant.
Ian Parker wrote:
> Either natural language is compressible or it is not. You now seem to > be saying it is compressible but not by very much - am I right?
You can think of it as an attempt to explain why text can't be compressed as much as we expect if you like. In particular why we can't compress it by as much as human estimates for the entropy of text suggest we should (e.g. Matt Mahoney on Shannon.)
Something like... that the sum of human knowledge needed to compress a given text that much, will always be bigger than the text itself.
We can understand this if we accept that knowledge is coded in the different ways of ordering text (or the different ways of ordering any sufficiently complex/random set.)
It will always take more information to code all the different ways of ordering a random set than it takes to code the set itself. (It is only a corollary that you can't use information about the different possible orderings of a text to compress text, because your knowledge of all the possible orders of a text will always be bigger than the text itself.)
It is important to note that while language is statistically random (permitting the same tokens to specify more orders), language is not indeterminate (c.f. Adrian Barnett's rather amusing "bitmap" Theory Of Everything (http://www.abarnett.demon.co.uk/theories.html#BITMAP.) With language the different orders are precisely determined by the set itself. It is just many are equally likely.
Which IMHO is the insight we have been looking for to model AI (NLP etc.)
> This cross-posted from CreatingAI@googlegroups.com to comp.compression, > and Hutter Prize@googlegroups.com. It seems relevant.
> Ian Parker wrote:
> > Either natural language is compressible or it is not. You now seem to > > be saying it is compressible but not by very much - am I right?
> You can think of it as an attempt to explain why text can't be > compressed as much as we expect if you like. In particular why we can't > compress it by as much as human estimates for the entropy of text > suggest we should (e.g. Matt Mahoney on Shannon.)
> Something like... that the sum of human knowledge needed to compress a > given text that much, will always be bigger than the text itself.
No, the Kolmogorov complexity of a string x cannot be significantly more than the size of x. You can always write a program to store and print x verbatim.
> We can understand this if we accept that knowledge is coded in the > different ways of ordering text (or the different ways of ordering any > sufficiently complex/random set.)
> It will always take more information to code all the different ways of > ordering a random set than it takes to code the set itself. (It is only > a corollary that you can't use information about the different possible > orderings of a text to compress text, because your knowledge of all the > possible orders of a text will always be bigger than the text itself.)
No, if you have a set of k items out of a possible n, then there are (n choose k) = n!/(k!(n-k)!) possible sets, but there are k! ways to order the elements. It is not true that k! > (n choose k) when n is much larger than k.
> It is important to note that while language is statistically random > (permitting the same tokens to specify more orders), language is not > indeterminate (c.f. Adrian Barnett's rather amusing "bitmap" Theory Of > Everything (http://www.abarnett.demon.co.uk/theories.html#BITMAP.) With > language the different orders are precisely determined by the set > itself. It is just many are equally likely.
I like the idea of launching satellites by building a giant ramp into space, driving them up by truck and pushing them over the edge. :-)
> Which IMHO is the insight we have been looking for to model AI (NLP > etc.)
> -Rob
I didn't see any great insights there. Ian's compression algorithm at http://ipcomp.blogspot.com/ looks like an attempt at random data compression.
> > ...the sum of human knowledge needed to compress a > > given text that much, will always be bigger than the text itself.
> No, the Kolmogorov complexity of a string x cannot be significantly > more than the size of x. You can always write a program to store and > print x verbatim.
You misunderstand me Matt. I am talking about attempts to compress text using "human knowledge".
I define "human knowledge" to be different ways of ordering a system (i.e. distributions of relations.) The information needed to specify different ways of ordering a (random) system will always be bigger than the information needed to specify the system itself.
If you choose to define "human knowledge" another way you will get another result. But on the basis of past discussions I think you agree that distributions of relations are the best model.
> > We can understand this if we accept that knowledge is coded in the > > different ways of ordering text (or the different ways of ordering any > > sufficiently complex/random set.)
> > It will always take more information to code all the different ways of > > ordering a random set than it takes to code the set itself. (It is only > > a corollary that you can't use information about the different possible > > orderings of a text to compress text, because your knowledge of all the > > possible orders of a text will always be bigger than the text itself.)
> No, if you have a set of k items out of a possible n, then there are (n > choose k) = n!/(k!(n-k)!) possible sets, but there are k! ways to order > the elements. It is not true that k! > (n choose k) when n is much > larger than k.
You misunderstand me again. My point is that "n choose k" is always much larger than n (or k! is always >> k.)
> > It is important to note that while language is statistically random > > (permitting the same tokens to specify more orders), language is not > > indeterminate (c.f. Adrian Barnett's rather amusing "bitmap" Theory Of > > Everything (http://www.abarnett.demon.co.uk/theories.html#BITMAP.) With > > language the different orders are precisely determined by the set > > itself. It is just many are equally likely.
> I like the idea of launching satellites by building a giant ramp into > space, driving them up by truck and pushing them over the edge. :-)
The guy is inspired :-)
> > Which IMHO is the insight we have been looking for to model AI (NLP > > etc.)
> I didn't see any great insights there. Ian's compression algorithm at > http://ipcomp.blogspot.com/ looks like an attempt at random data > compression.
I thought you analyzed Ian's algorithm well. It seems to come down to a re-invention of compression based on first order probability distributions. All credit to Ian for rediscovering that from first principles.
But Ian's algorithm was not the insight I was talking about.
I think the insight we have been looking for to model AI/NLP is that the information needed to code different ways of ordering a system (knowledge) is always greater than the information to code the system itself (for a random system.)
In the context of AI/NLP it is important to note that random need not mean indeterminate. I hope I demonstrated this in our earlier thread on comp.compression. In the case of language relational distributions can precisely determine a syntax which nevertheless has random statistics. To be clear I am talking once again about our toy example: "AX...DX...DB...AZ...YZ...YC...A_". Here different relational distributions (A,D), (A,Y) (corresponding to "knowledge": in some ways "A is close to D but not Y" but in other ways "A is close to Y but not D"?) deterministically specify syntax, but still produce a random language model A_ -> AB/AC (probably because the distributions are contradictory.)
These distributions specify the system, but it is more expensive to enumerate all the different possible relational distributions than it is to enumerate the system itself. (So they can't be used to compress it.)
The way to model NLP/AI is to specify the underlying random (but determinate) system (e.g. text), and generate different possible distributions over that system to model knowledge, or limit syntax (the same thing), as required.
On 1/15/07, Rob Freeman <gro...@chaoticlanguage.com> wrote:
> I define "human knowledge" to be different ways of ordering a system > (i.e. distributions of relations.) The information needed to specify > different ways of ordering a (random) system will always be bigger than > the information needed to specify the system itself.
When you talk about "different ways of ordering" you are really talking about the equivalence class of strings under a grammar.
Take a very simple grammar where two strings are equivalent:
X == "not not "X
We can see from this that you can index this infinite equivalence class with an integer i that places 2i "not " strings in front of X. But not all members of this equivalence class are equally likely. Therefore arithmetic coding can be used for the index.
What you're claiming is that the expression of knowledge being compressed has an index pretty close to 0 -- in other words, there isn't much you can do to simplify the expression.
I think most of us who believe the Hutter Prize will result in breakthroughs (AI, epistemological, scientific, philosophical, social, political, economic, etc.) believe that there is a _lot_ more to be done to the knowledge expressed in Wikipedia for it to have an index anywhere near 0.
I have no objections to cross posting under the new moderation scheme. Hutter is in fact completely unmoderated. It may well be that natural language is less compressible than Matt and Co imagine. However it certainly is compressible. Obviously without a definte benchmark you cannot tell how compressible.
I would like to make a number of remarks.
1) Rrimavera, Ressorte, Mamanthal. We know that English can be translated into Spanish using a human translator. It would be hard for me to concede therefore that techniques do not exist to eliminate at least 2/3 of words using context. Primavera etc. implies log(2)3 bits per word (at least). 2) This is a much more mathematical point. We know that the present prize-winner has compressed to 17MB. If we allocate a number to a word there must be a technique (we will ignore Kolmagorov for the moment) which will compress our sequence to at most 17MB. We can at this stage view punctuation marks as words. Algorithms like 7-zip take raw text and compress using fragments as does the current prize-winner. Now if a sequence is random to the extent that the probability of a word occurring is simply the mean frequency of that word there cannot be any better method than factorials, the Gosper approximation giving the minimum compressed size. Now if V is our vector of words VP is going to represent a fragment representation. We can get to fragments by a simple matrix multiplication. VP = 17MB compressed. A matrix X can (and has been) found which will give us VPX. The probability decomposition of VPX MUST be 17MB. Of course multiplying one matrix by another simply produces another matrix. Why not take the bull by the horns, apply LSA and produce the 17MB minus matrix? The answer, of course, is Kolmogorov. If I were not concerned with Hutter, but were simply concerned with GALE and translating languages I would not care a damn about Kolmogorov but would simply demand a larger training set.
This raises yet another point. We can get an initial semantic assessment by simply associating fragments to words. This tends to be what is done in assessing relationships between different languages, cerradura does indeed have a very similar fragment structure to serrure (French) arana, arraigne - Papers please! If we were to take the prize-winning program and construct a matrix P linking it to a word list. This could be a springboard to further compression.
James Bowery wrote: > On 1/15/07, Rob Freeman <gro...@chaoticlanguage.com> wrote:
> > I define "human knowledge" to be different ways of ordering a system > > (i.e. distributions of relations.) The information needed to specify > > different ways of ordering a (random) system will always be bigger than > > the information needed to specify the system itself.
> When you talk about "different ways of ordering" you are really talking > about the equivalence class of strings under a grammar.
> ...
> X == "not not "X
If I understand you correctly James, I do not mean "the equivalence class of strings under a grammar" when I say "different ways of ordering." I don't think the assumption of a grammar is justified at all.
Why do you think the assumption of a grammar is justified?
I'm considering equivalence classes yes, but not classes under a grammar. You could say I let the classes govern the grammar. In some cases the two come to the same thing, but they are not the same.
I don't think there is any reason to assume that natural language limits itself to paradigmatic relationships of the kind X == "not not "X. I would go as far as to say it is demonstrably not the case.
A case in point. My example. For a language "AX...DX...DB...AZ...YZ...YC...A_" you have two paradigmatic classes (A,D) and (A,Y). You cannot reduce those to a grammar of the kind X == "not not "X. Is there any mapping (A,D) == "not not "(A,Y)?
For any real language there are many paradigmatic classes of this sort. Just look at how many you can get.
Calculate them out for yourself. Expand my example. If A_1, A_2,... A_n are the contexts of A in some text, and X_1, X_2,...X_n are contexts of other tokens, then the number of ways A can have common contexts with other tokens in the text, and thus uniquely specify some new paradigmatic class, are just Matt's "(n choose k) = n!/(k!(n-k)!) possible sets", where k is the number of common contexts between A and some other token.
The syntagmatic distribution of sequences AX_? specified by these classes in the text can be random, because many different paradigmatic distributions (A_i,...A_i+k) can be equally likely (and must be, because many of the "n choose k" possible classes will overlap, and thus form complementary distributions with each other??) But the relationship between any given syntax and its corresponding paradigmatic distribution is not random. And the different paradigmatic distributions (knowledge?) governing that random syntax are not random either, just very numerous. Much more numerous than the sequence of 2n or so tokens needed to specify them.
I don't know if any of that makes sense. Perhaps I have misunderstood your objection.
Rob Freeman wrote: > If I understand you correctly James, I do not mean "the equivalence > class of strings under a grammar" when I say "different ways of > ordering." I don't think the assumption of a grammar is justified at > all.
> Why do you think the assumption of a grammar is justified?
Because you can think of a compressed file as a grammar specification which has a single well formed string:
The original file.
This is most obvious in the case of simple hierarchical grammars like those generated by Nevill-Manning's Sequitur compressor.
jabow...@gmail.com wrote: > Rob Freeman wrote: > > If I understand you correctly James, I do not mean "the equivalence > > class of strings under a grammar" when I say "different ways of > > ordering." I don't think the assumption of a grammar is justified at > > all.
> > Why do you think the assumption of a grammar is justified?
> Because you can think of a compressed file as a grammar specification > which has a single well formed string
So you feel justified in assuming grammar to make compression possible, on the grounds that if compression is possible you will have a grammar?
Surely your argument is circular.
As a case in point, what are the equivalence classes which compress my example string: "AX...DX...DB...AZ...YZ...YC"?
In particular how do you capture the information that "in some ways A is close to D but not Y, but in other ways A is close to Y but not D"?
Rob Freeman wrote: > So you feel justified in assuming grammar to make compression possible, > on the grounds that if compression is possible you will have a grammar?
> Surely your argument is circular.
No I was merely using grammar as an example of a knowledge representation language and then you said it isn't powerful enough to compress natural language and then I said that there is an existence proof that it is: Sequitur with its hierarchical grammar output.
> As a case in point, what are the equivalence classes which compress my > example string: "AX...DX...DB...AZ...YZ...YC"?
> In particular how do you capture the information that "in some ways A > is close to D but not Y, but in other ways A is close to Y but not D"?
Sorry, but you're not making sense to me. Find some better way of defining "information" aka: algorithmic information theory, and we can continue this discussion here. Otherwise, keep it in comp.compression.