
Hutter Prize awarded


Matt Mahoney

Oct 27, 2006, 9:24:41 PM
The first Hutter prize was awarded to Alexander Ratushnyak and
Przemyslaw Skibinski for their work on the paq8hp5 text compressor.
The full announcement is here:
http://groups.google.com/group/Hutter-Prize/browse_frm/thread/3f3f80c76dd14513/#

paq8hp5 source code is now available, linked from the large text
compression benchmark page:
http://cs.fit.edu/~mmahoney/compression/text.html
(direct link: http://start.binet.com.ua/~artest/HKCC/ )
There is also a Linux port by Luchezar Georgiev.

Both paq8hp5 (top ranked on enwik8, 100 MB) and durilca4linux_2 by
Dmitry Shkarin (top ranked on enwik9, 1 GB, no prize money) incorporate
low level syntactic and semantic language modeling. Both compressors
preprocess text by replacing words with dictionary codes using a
dictionary of the most frequent words occurring in the benchmark. In
paq8hp5 related words were manually grouped. In durilca4linux_2 the
grouping was done automatically to cluster words that appear in similar
context. In both, these groups are also suffix sorted to improve
dictionary compression. Although language models of syntax and
semantics have been around for a while, these techniques were not
incorporated into data compressors until after the Hutter prize was
launched in August.
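
A minimal sketch of that preprocessing idea in Python; the two-byte escape coding, the case folding, and the tiny dictionary here are illustrative assumptions, not the actual paq8hp5 or durilca transform:

  from collections import Counter
  import re

  def build_dictionary(text, size=43000):
      # the most frequent words get dictionary slots (and, in effect, short codes)
      words = re.findall(r"[a-z]+", text.lower())
      return [w for w, _ in Counter(words).most_common(size)]

  def encode(text, dictionary):
      index = {w: i for i, w in enumerate(dictionary)}
      out = []
      for token in re.findall(r"[a-z]+|[^a-z]+", text.lower()):
          if token in index:
              i = index[token]
              out.append(bytes([0x80 | (i >> 8), i & 0xFF]))   # two-byte word code
          else:
              out.append(token.encode("latin-1", "replace"))   # pass through literally
      return b"".join(out)

  sample = "the cat sat on the mat because the mat sat on the cat"
  print(encode(sample, build_dictionary(sample, size=8)))

The back-end compressor then models the code stream instead of raw characters; grouping and suffix sorting the dictionary makes nearby codes correspond to related words.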

Other than the method of building the dictionary, the main difference
in the two programs is that paq8hp5 uses context mixing and
durilca4linux_2 uses PPM. PPM is more memory efficient, so this plays
a larger role on the larger (1 GB) benchmark. Still, it requires 3 GB
of memory. The Hutter prize (enwik8) limits programs to 1 GB memory
but my tests show improved compression using 2 GB. The programs also
differ in the dictionary size: 43K words for paq8hp5, 80K for
durilca4linux_2.
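
For readers who have not seen context mixing, here is a toy Python version of the logistic mixing used in the PAQ family: several models predict the next bit, the predictions are combined in the logistic domain, and the weights are nudged toward models that predicted well. Real mixers select weight sets by context and take many more inputs; this only shows the core step.

  import math

  def stretch(p):                        # logit
      return math.log(p / (1.0 - p))

  def squash(x):                         # inverse logit
      return 1.0 / (1.0 + math.exp(-x))

  def mix(probs, weights):
      # combined estimate that the next bit is 1
      return squash(sum(w * stretch(p) for w, p in zip(weights, probs)))

  def train(probs, weights, bit, rate=0.02):
      # move weights toward the models that predicted the observed bit
      error = bit - mix(probs, weights)
      return [w + rate * error * stretch(p) for w, p in zip(weights, probs)]

  probs = [0.9, 0.6, 0.2]                # three models' P(next bit = 1)
  weights = [0.3, 0.3, 0.3]
  print(mix(probs, weights))             # mixed prediction
  print(train(probs, weights, 1))        # updated weights after seeing a 1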

For more information on the Hutter prize, see http://prize.hutter1.net/

-- Matt Mahoney

Ian Parker

Oct 28, 2006, 6:57:28 AM

I have downloaded the source code in the *.rar files. What program
decompresses it? WinZip does not seem to.

- Ian Parker

Claus Dragon

Oct 28, 2006, 10:42:30 AM
Words to the wise, "Ian Parker" <ianpa...@gmail.com> wrote:


>I have downloaded the source code in the *.rar files. What program
>decompresses it. WinZip does not seem to.

WinRar or 7Zip, the former has a free trial version, the latter is
completely free.

--
Claus Dragon <clau...@mpsahotmail.com>
=(UDIC)=
d++ e++ T--
K1!2!3!456!7!S a27
"Coffee is a mocker. So, I am going to mock."

- Me, lately.

Ian Parker

Oct 29, 2006, 3:46:04 PM

Thank you very much. I have now downloaded the program.

- Ian Parker

Rob Freeman

Nov 13, 2006, 6:32:55 AM
Matt,

A quick question.

I may have missed this at some point, but are there any scalability
figures on the entrants for the Hutter prize?

Do the compression ratios change as file sizes change, and if so how?

Thanks,

Rob

Matt Mahoney

Nov 13, 2006, 3:48:50 PM

Rob Freeman wrote:
> Matt,
>
> A quick question.
>
> I may have missed this at some point, but are there any scalability
> figures on the entrants for the Hutter prize?
>
> Do the compression ratios change as file sizes change, and if so how?

I only evaluated for enwik8 and enwik9 (100 MB and 1 GB). The Hutter
prize entrants are tuned to enwik8 (smaller dictionary, less emphasis
on memory economy) but in general most compressors will do better on
larger natural language text files. I am not aware of any upper bound
where a larger input no longer helps. This is apparent in the BBB
results, in the table showing compression vs. block size. (BBB is the
only BWT compressor that can fit enwik9 in one block with 2 GB memory).
http://cs.fit.edu/~mmahoney/compression/text.html#1640

Rob Freeman

Nov 13, 2006, 11:27:30 PM
Matt Mahoney wrote:
> Rob Freeman wrote:
> > Matt,
> >
> > A quick question.
> >
> > I may have missed this at some point, but are there any scalability
> > figures on the entrants for the Hutter prize?
> >
> > Do the compression ratios change as file sizes change, and if so how?
>
> I only evaluated for enwik8 and enwik9 (100 MB and 1 GB). The Hutter
> prize entrants are tuned to enwik8 (smaller dictionary, less emphasis
> on memory economy) but in general most compressors will do better on
> larger natural language text files. I am not aware of any upper bound
> where a larger input no longer helps. This is apparent in the BBB
> results, in the table showing compression vs. block size. (BBB is the
> only BWT compressor that can fit enwik9 in one block with 2 GB memory).

It makes sense that for a fixed file size, larger blocks will always be
better. But what about as the file itself gets bigger?

I guess it depends on how Zipf's Law changes as file sizes get bigger.
Does the tail get big faster than the head or vice versa?

Vocabularies are basically fixed (perhaps by definition), so for words
the tail will get fatter and fatter and eventually disappear, giving
you better compression for bigger texts. But as the text gets even
bigger the information in combinations of words will eventually
dominate. Are new sentences added faster than old ones are repeated?

Anybody know?

-Rob

Matt Mahoney

Nov 15, 2006, 9:27:14 PM

Rob Freeman wrote:
> Matt Mahoney wrote:
> > Rob Freeman wrote:
> > > Matt,
> > >
> > > A quick question.
> > >
> > > I may have missed this at some point, but are there any scalability
> > > figures on the entrants for the Hutter prize?
> > >
> > > Do the compression ratios change as file sizes change, and if so how?
> >
> > I only evaluated for enwik8 and enwik9 (100 MB and 1 GB). The Hutter
> > prize entrants are tuned to enwik8 (smaller dictionary, less emphasis
> > on memory economy) but in general most compressors will do better on
> > larger natural language text files. I am not aware of any upper bound
> > where a larger input no longer helps. This is apparent in the BBB
> > results, in the table showing compression vs. block size. (BBB is the
> > only BWT compressor that can fit enwik9 in one block with 2 GB memory).
>
> It makes sense that for a fixed file size, larger blocks will always be
> better. But what about as the file itself gets bigger?

Same thing. Compressing in blocks is the same as splitting the file
into smaller parts, compressing separately and taking the average.

> I guess it depends on how Zipf's Law changes as file sizes get bigger.
> Does the tail get big faster than the head or vice versa?
>
> Vocabularies are basically fixed (perhaps by definition), so for words
> the tail will get fatter and fatter and eventually disappear, giving
> you better compression for bigger texts. But as the text gets even
> bigger the information in combinations of words will eventually
> dominate. Are new sentences added faster than old ones are repeated?
>
> Anybody know?
>
> -Rob

The vocabulary seems to grow without bound. This is shown by the
graphs at the bottom of
http://cs.fit.edu/~mmahoney/compression/textdata.html
The distribution using crude parsing seems to be steeper than a Zipf
distribution, but it still seems to hold that about half the vocabulary
is singletons regardless of how big the file is.
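
Anyone who wants to check this can reproduce the measurement with a few lines of Python. The crude [a-z]+ parse below is an assumption meant to approximate the filtered-text counts, and the path is whatever local copy of enwik8 or enwik9 you have:

  from collections import Counter
  import re

  def vocab_growth(path, checkpoints=(10**6, 10**7, 10**8)):
      counts = Counter()
      seen = 0
      with open(path, "rb") as f:
          for size in checkpoints:
              chunk = f.read(size - seen)
              if not chunk:
                  break
              seen += len(chunk)
              counts.update(re.findall(r"[a-z]+", chunk.decode("latin-1").lower()))
              singletons = sum(1 for c in counts.values() if c == 1)
              print(seen, "bytes:", len(counts), "words,",
                    round(singletons / len(counts), 2), "fraction singletons")

  vocab_growth("enwik8")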

I suspect that the great majority of sentences don't repeat, and this
would still hold for much larger data sets.

-- Matt Mahoney

Rob Freeman

Nov 15, 2006, 11:52:54 PM
"Matt Mahoney 写道:"

>
> > It makes sense that for a fixed file size, larger blocks will always be
> > better. But what about as the file itself gets bigger?
>
> Same thing. Compressing in blocks is the same as splitting the file
> into smaller parts, compressing separately and taking the average.

Perhaps it depends on what you mean by "taking the average". If you
store a separate dictionary for each block then splitting into blocks
will be clearly worse.

> The vocabulary seems to grow without bound. This is shown by the
> graphs at the bottom of
> http://cs.fit.edu/~mmahoney/compression/textdata.html

You mean the one by Alexandru Mosoi?

> I suspect that the great majority of sentences don't repeat, and this
> would still hold for much larger data sets.

I agree. Whether this will make a difference will depend on whether the
ones which do repeat, repeat fast enough to make up for all the new
singletons.

Alexandru Mosoi's plot is interesting (if that is the one you mean.) On
what basis does he add a new entry to his dictionary? If new dictionary
entries are all repeated about the same number of times it would seem
to be telling us that repetitions keep number of new combinations in
check and compression will scale indefinitely.

I would like to see more figures on this. It might tell us something
about, for instance, what it means to be a "word" (when do you choose
to add something to your "dictionary".)

It also has significance for, for instance, language change. If new
sentences are added less frequently than old ones are repeated then the
language will not change. It will become more and more standard,
reified (like an inflected language?) On the other hand if new
sentences are added more frequently than old ones are repeated then the
language will gradually change into a completely different language
over time.

What gradually changing into another language means for compression is
another question!

-Rob

Matt Mahoney

Nov 16, 2006, 7:06:18 PM

Rob Freeman wrote:
> "Matt Mahoney 写道:"
> >
> > > It makes sense that for a fixed file size, larger blocks will always be
> > > better. But what about as the file itself gets bigger?
> >
> > Same thing. Compressing in blocks is the same as splitting the file
> > into smaller parts, compressing separately and taking the average.
>
> Perhaps it depends on what you mean by "taking the average". If you
> store a separate dictionary for each block then splitting into blocks
> will be clearly worse.

Block sorting (BWT) compressors like BBB essentially compress each
block independently. Block size is limited by memory. The other types
of compressors like LZ77, PPM or context mixing have some
representation of past input which is used to predict or compress new
input. When all of the available memory is used up, old input is
usually discarded and replaced with newer input. In all cases,
compression is limited by available memory. Compression improves as
the input gets larger until memory is full, then levels off.
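
To make the block-sorting point concrete, here is a naive Burrows-Wheeler transform of one block in Python. Memory bounds the block size, and the transform only exploits context inside the block; real BWT compressors use far more memory-efficient suffix sorting than this quadratic rotation sort.

  def bwt(block):
      # sort all rotations of the block and keep the last column
      n = len(block)
      order = sorted(range(n), key=lambda i: block[i:] + block[:i])
      return bytes(block[(i - 1) % n] for i in order)

  print(bwt(b"banana"))    # b'nnbaaa' -- characters from similar contexts end up adjacent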

>
> > The vocabulary seems to grow without bound. This is shown by the
> > graphs at the bottom of
> > http://cs.fit.edu/~mmahoney/compression/textdata.html
>
> You mean the one by Alexandru Mosoi?

Yes.

>
> > I suspect that the great majority of sentences don't repeat, and this
> > would still hold for much larger data sets.
>
> I agree. Whether this will make a difference will depend on whether the
> ones which do repeat, repeat fast enough to make up for all the new
> singletons.
>
> Alexandru Mosoi's plot is interesting (if that is the one you mean.) On
> what basis does he add a new entry to his dictionary? If new dictionary
> entries are all repeated about the same number of times it would seem
> to be telling us that repetitions keep number of new combinations in
> check and compression will scale indefinitely.
>
> I would like to see more figures on this. It might tell us something
> about, for instance, what it means to be a "word" (when do you choose
> to add something to your "dictionary".)

The article explains that the graph is for filtered text (27 character
alphabet). But it also holds for the raw text.

> It also has significance for, for instance, language change. If new
> sentences are added less frequently than old ones are repeated then the
> language will not change. It will become more and more standard,
> reified (like an inflected language?) On the other hand if new
> sentences are added more frequently than old ones are repeated then the
> language will gradually change into a completely different language
> over time.
>
> What gradually changing into another language means for compression is
> another question!
>
> -Rob

It means that text is a nonstationary source. In just about any
collection, it never settles to uniform statistics, no matter how much
you have. For example, a library, usenet, or even the web is often
organized by genre (fiction, nonfiction, etc), language or social
groups. Presumably this would hold even if you took every word ever
spoken or written by every person on Earth during some period of time.
To extend this further, language evolves over time.

A lot of compression and information theory assumes stationary sources
for simplicity. But nonstationary sources have to be modeled
differently, favoring newer data over old rather than treating it all
equally. However, this is how most practical compressors actually
work.

-- Matt Mahoney

Ted Dunning

Nov 17, 2006, 2:14:39 PM

It is also possible (though often quite difficult) to model patently
non-stationary sources as hidden stationary meta-processes that provide
parameters to the observable process. In some cases, you can infer
point-estimates or distributions for the hidden process.

The virtues here include:

a) you get back to a stationary model that provides better matching to
your observables

b) you may be able to massively extend your history for some phenomena
(like the closed class words or basic syntax) while still being
flexible enough to incorporate new phenomena (like neologisms)

c) it is hard enough to do this that it could provide a veritable
fountain of dissertation topics for those still in the academic world.

One example of this general approach is the way that LDA or MCDA model
the distribution of topic probability vectors on a per document basis.
These document vectors then provide the definition of an apparently
non-stationary word generator. This allows sharper and more
differentiated topic specific models that still deal well with the fact
that documents talk about more than one thing.
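
A minimal sketch of that per-document topic mixture, using scikit-learn's LDA purely as an illustration (the toy documents and the choice of library are my assumptions, not something from the thread):

  from sklearn.feature_extraction.text import CountVectorizer
  from sklearn.decomposition import LatentDirichletAllocation

  docs = [
      "the compressor predicts the next bit from its context model",
      "the parser assigns a syntax tree to the next sentence",
      "the network weights are trained on compressed text",
  ]
  counts = CountVectorizer().fit_transform(docs)
  lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
  print(lda.transform(counts))   # one topic-probability vector per document

Each row is the hidden parameter vector for one document; conditioned on it, the word distribution looks stationary even though the pooled stream does not.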

Rob Freeman

Nov 22, 2006, 12:28:47 AM
"Matt Mahoney 写道:"
> Rob Freeman wrote:
> > ...

> > What gradually changing into another language means for compression is
> > another question!
>
> It means that text is a nonstationary source.

Yes, this is probably what the incompressibility I talked about in our
earlier thread on comp.ai.nat-lang, and elsewhere, comes down to in the
limit. Language changes.

Favouring newer data over old is just another way of selecting a
sub-set, which is all that can be compressed.

Perhaps you could gloss my assertion of incompressibility (in the
limit) as predicting language change.

The difference being we have not had a model which predicts language
change up to now.

That is in the long term. Over short periods of time what this
"incompressibility" I talk about probably means is just that syntax
will be ambiguous, or partly random.

We have not had a model which predicts ambiguity up to now either.

Just to put all this in context I am of course advocating that we model
natural language syntax in a certain way, for compression or any other
task. Generally put what I suggest is that syntax should not be
modelled by rules, but by ad-hoc generalizations over examples.

In this context I was attracted first to Marcus Hutter's proof that
compression is enough to model AI. That liberates us from rules. Then
you referred to Kolmogorov's(?) general result that incompressible
strings exist. It is a small step from there to see the power we get if
we assume not only that incompressible strings exist, but that natural
language is one such string (the power to explain language change, for
instance.)

Now, it is probably not possible to prove natural language is
incompressible (in the limit), by the general result that
incompressibility is impossible to prove(?) but incompressibility is
consistent with the observations that language changes, and has
ambiguous (partly random) statistics. A model where natural language is
incompressible (in the limit) is the only one which predicts these
things.

At least it is the only one that I know of. Anyone know of another?

-Rob

Rob Freeman

Nov 22, 2006, 1:00:07 AM
This is interesting Ted. It would be great if we could go back from the
data and find a model which produced the observable language change.

How good are these estimation techniques? I assume the problem is just
the stated academic "virtue" that such models should be ferociously
difficult to find!

What kinds of assumptions do you need to make? Do you need to assume
the kinds of "meta-processes" you expect to find, or are there
algorithms to estimate those from first principles too?

In my terms I would assume those "meta-processes" are certain ways of
generalizing text, much like those which have been used to look for
"grammar" since before Chomsky (distributional analysis.) It should
then be relatively trivial to get an observable moving distribution of
words, which both predicts new text and is influenced by that new text,
a kind of self-referencing automaton.

-Rob

"Ted Dunning 写道:"

Matt Mahoney

Nov 23, 2006, 7:16:37 PM

Rob Freeman wrote:
> "Matt Mahoney 写道:"
> > Rob Freeman wrote:
> > > ...
> > > What gradually changing into another language means for compression is
> > > another question!
> >
> > It means that text is a nonstationary source.
>
> Yes, this is probably what the incompressibility I talked about in our
> earlier thread on comp.ai.nat-lang, and elsewhere, comes down to in the
> limit. Language changes.
>
> Favouring newer data over old is just another way of selecting a
> sub-set, which is all that can be compressed.
>
> Perhaps you could gloss my assertion of incompressibility (in the
> limit) as predicting language change.
>
> The difference being we have not had a model which predicts language
> change up to now.
>
> That is in the long term. Over short periods of time what this
> "incompressibility" I talk about probably means is just that syntax
> will be ambiguous, or partly random.

What do you mean by "incompressible"? Text can obviously be
compressed. Do you mean that as the text sample gets larger, the
compression ratio levels off at some point? There is no evidence that
such a point exists. Every text corpus, no matter how large, seems to
have redundancy that spans the length of the corpus.

> We have not had a model which predicts ambiguity up to now either.

Natural language differs from formal language in that it is designed to
be processed in parallel. Formal languages are designed for sequential
processing: tokenize, parse, semantics. This model fails for natural
language. Syntax is ambiguous and depends on semantics. For example:

I ate pizza with a fork.
I ate pizza with pepperoni.
I ate pizza with Sam.

Ambiguity is resolved by combining multiple, weak constraints. This
has an efficient parallel implementation (e.g. neural networks).
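
A toy illustration of combining weak constraints for those three sentences: the attachment of the with-phrase is chosen by summing association scores rather than by a hard rule. The scores are made-up stand-ins for statistics a model would learn from a corpus.

  # learned association strengths between head words and the with-phrase noun (invented numbers)
  association = {
      ("eat", "fork"): 2.0, ("pizza", "fork"): -1.0,            # instrument reading
      ("eat", "pepperoni"): -0.5, ("pizza", "pepperoni"): 2.5,  # topping reading
      ("eat", "Sam"): 1.5, ("pizza", "Sam"): -2.0,              # accompaniment reading
  }

  def attach(noun):
      verb_score = association.get(("eat", noun), 0.0)
      object_score = association.get(("pizza", noun), 0.0)
      return "verb" if verb_score > object_score else "object"

  for noun in ("fork", "pepperoni", "Sam"):
      print("with", noun, "-> attaches to the", attach(noun))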

> Just to put all this in context I am of course advocating that we model
> natural language syntax in a certain way, for compression or any other
> task. Generally put what I suggest is that syntax should not be
> modelled by rules, but by ad-hoc generalizations over examples.

There is no "simple" set of rules for natural grammar, as there is for
formal language. There are perhaps millions of rules. We don't know
how many. I estimate the total complexity of a language model
(vocabulary, semantics, syntax, and worldly knowledge) of the average
adult is about 10^9 bits based on 4 lines of evidence:

1. Turing's 1950 estimate, which he did not explain.
2. Landauer's estimate of human long term memory capacity based on
recall tests.
3. The amount of language processed over a lifetime (about 1 GB) times
Shannon's 1950 estimate of 1 bit per character (rough arithmetic is
sketched after this list).
4. Extrapolating this graph: http://cs.fit.edu/~mmahoney/dissertation/
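
The rough arithmetic behind item 3, with the daily input rate as an assumed order of magnitude:

  chars_per_day = 50_000                      # assumed: reading plus listening
  lifetime_chars = chars_per_day * 365 * 50
  print(lifetime_chars)                       # ~9 x 10^8 characters, i.e. ~10^9 bits at 1 bit per character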

> In this context I was attracted first to Marcus Hutter's proof that
> compression is enough to model AI. That liberates us from rules. Then
> you referred to Kolmogorov's(?) general result that incompressible
> strings exist. It is a small step from there to see the power we get if
> we assume not only that incompressible strings exist, but that natural
> language is one such string (the power to explain language change, for
> instance.)
>
> Now, it is probably not possible to prove natural language is
> incompressible (in the limit), by the general result that
> incompressibility is impossible to prove(?) but incompressibility is
> consistent with the observations that language changes, and has
> ambiguous (partly random) statistics. A model where natural language is
> incompressible (in the limit) is the only one which predicts these
> things.
>
> At least it is the only one that I know of. Anyone know of another?
>
> -Rob

We know from the counting argument that the vast majority of strings
are not compressible. Furthermore, the vast majority of those strings
cannot be proven to be incompressible either. Kolmogorov's proof that
algorithmic complexity is not computable only says that at least one
such string must exist, just as Godel proved that in any system of
axioms rich enough to express arithmetic, there must be at least one
true statement which cannot be proven. Godel's proof is complex, but
is more easily understood as a proof of the uncomputability of the
halting problem.
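
The counting argument itself fits in a few lines:

  n, k = 20, 8
  strings_of_length_n = 2 ** n
  short_descriptions = sum(2 ** i for i in range(n - k))    # fewer than 2^(n-k) programs shorter than n-k bits
  print(short_descriptions / strings_of_length_n)           # < 2^-k: under 0.4% can be compressed by 8 or more bits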

Chaitin showed that such unprovable statements are ubiquitous, that
there are an infinite number of them.
http://www.cs.auckland.ac.nz/CDMTCS/chaitin/georgia.html

However there are also an infinite number of provable statements, so
this does not show that unprovable statements are in the majority.
Calude and Jürgensen did so. They proved that the vast majority (in a
certain sense) of true statements which have greater algorithmic
complexity than the system of axioms are not provable.
http://www.cs.auckland.ac.nz/~cristian/aam.pdf

Of course this applies to true statements of the form "x is not
compressible". Note that the statement "x is not compressible" has
algorithmic complexity |x| for the vast majority of x. If x is a
random string, then the shortest program that could output "x is not
compressible" would have about the same size as x. So you could not
prove it unless your system of axioms required a program as large as x
to print them. For example, you could have "x is not compressible" as
an axiom, but that is about the best you could do.

[1] Turing, A. M. (1950), "Computing Machinery and Intelligence",
Mind, 59:433-460.
[2] Landauer, Tom (1986), "How much do people remember? Some
estimates of the quantity of learned information in long term
memory", Cognitive Science (10) pp. 477-493.
[3] Shannon, Claude E. (1950), "Prediction and Entropy of Printed
English", Bell Sys. Tech. J (3) p. 50-64.

-- Matt Mahoney

Rob Freeman

Nov 24, 2006, 3:45:46 AM
You've put lots of facts (and not a few speculations) here Matt. I see
no reason to fault them as such.

I think the most salient comment is "We don't know..."

As a candidate for eventual knowledge, I'm just suggesting natural
language appears to be something more like the "hidden stationary
meta-processes that provide parameters to the observable process" Ted
Dunning mentions, and not the ever changing (ambiguous, indeterminate,
innumerable and ultimately still unknown) rules we have always assumed
it to be (at least all models of natural language I know of assume it
to be.)

My assertion of "incompressiblity" may come down to the fact that the
observable distribution is moving.

Ambiguity too suggests something about a hypothesized underlying static
model. Essentially that its statistics are random. This is also a
property of incompressible strings.

YMMV, but I think this hypothesis that natural language consists of
"hidden stationary meta-processes that provide parameters to the
observable process" is worth a look.

To put this in yet broader context, let me say that I very much like
the work on AI being done by Juergen Schmidhuber at IDSIA. I understand
Marcus Hutter was Schmidhuber's student and did his work relating AI
and compression under him.

I see Schmidhuber's focus on multiple theories at the heart of their
prediction algorithms as essentially the same as the
"incompressibility" I am talking about.

I can't understand why I am trying to make a point here which I see
presented on the IDSIA site:

http://www.idsia.ch/~juergen/newai/newai.html

Am I misunderstanding the IDSIA focus on computability over logic
(rules)? That it is computability limits motivated by the likes of
Kolmogorov which should govern our thinking, and not logic and rules.

What do you see as the significance of the complexity issues, and focus
on computability over observable distributions (logic/rules), which
seem to be central to the IDSIA approach? E.g. their statement that
"...there is no universal approximable distribution... " (
http://www.idsia.ch/~juergen/kolmogorov.html .) Don't you too think it
is likely there is an underlying process computing the (constantly
changing) observable distributions of language (actually computing the
"logic" and "rules" of the observable distributions), and that it will
only be possible to ultimately understand language in terms of this
underlying process, not the observable distributions themselves? Seems
very like the IDSIA stuff to me.

-Rob

Matt Mahoney

Nov 24, 2006, 5:52:05 PM

Rob Freeman wrote:
> You've put lots of facts (and not a few speculations) here Matt. I see
> no reason to fault them as such.
>
> I think the most salient comment is "We don't know..."
>
> As a candidate for eventual knowledge, I'm just suggesting natural
> language appears to be something more like the "hidden stationary
> meta-processes that provide parameters to the observable process" Ted
> Dunning mentions, and not the ever changing (ambiguous, indeterminate,
> innumerable and ultimately still unknown) rules we have always assumed
> it to be (at least all models of natural language I know of assume it
> to be.)
>
> My assertion of "incompressiblity" may come down to the fact that the
> observable distribution is moving.

Nonstationary processes are not necessarily incompressible. For
example, the infinite sequence 1234567891011121314... is nonstationary,
but there is a simple Turing machine that outputs it.
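
(That sequence is generated by a tiny program, for instance:)

  from itertools import count, islice

  def counting_sequence():
      for n in count(1):
          yield from str(n)     # nonstationary output, constant-size program

  print("".join(islice(counting_sequence(), 30)))   # 123456789101112131415161718192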

> Ambiguity too suggests something about a hypothesized underlying static
> model. Essentially that its statistics are random. This is also a
> property of incompressible strings.

I think that the source of ambiguity in natural language is that it is
constrained to be modeled on a very large but slow neural network.
Forward error correction protocols that combine large numbers of weak
constraints from a huge database can be efficiently computed in
parallel. This allows parts of the message to be dropped to speed up
communication without loss of meaning.

>
> YMMV, but I think this hypothesis that natural language consists of
> "hidden stationary meta-processes that provide parameters to the
> observable process" is worth a look.
>
> To put this in yet broader context, let me say that I very much like
> the work on AI being done by Juergen Schmidhuber at IDSIA. I understand
> Marcus Hutter was Schmidhuber's student and did his work relating AI
> and compression under him.
>
> I see Schmidhuber's focus on multiple theories at the heart of their
> prediction algorithms as essentially the same as the
> "incompressibility" I am talking about.
>
> I can't understand why I am trying to make a point here which I see
> presented on the IDSIA site:
>
> http://www.idsia.ch/~juergen/newai/newai.html

Schmidhuber proposes AIS, a variant of Hutter's AIXI, in which the
fastest (rather than the simplest) explanation of the environment is to be
preferred. His rationale is that it makes the general AI problem
tractable. It also explains some fundamental observations about
physics, such as intractability of quantum mechanical computation and
its apparent randomness. However there are other explanations that
depend only on the more general assumption that the universe is
computable. Legg showed in
http://www.vetta.org/documents/IDSIA-12-06-1.pdf that a Turing machine
cannot predict the output of another machine with greater algorithmic
complexity. Since the universe must have greater complexity than any
computer in it (including brains), the universe must therefore appear
probabilistic to us.

> Am I misunderstanding the IDSIA focus on computability over logic
> (rules)? That it is computability limits motivated by the likes of
> Kolmogorov which should govern our thinking, and not logic and rules.
>
> What do you see as the significance of the complexity issues, and focus
> on computability over observable distributions (logic/rules), which
> seem to be central to the IDSIA approach? E.g. their statement that
> "...there is no universal approximable distribution... " (
> http://www.idsia.ch/~juergen/kolmogorov.html .) Don't you too think it
> is likely there is an underlying process computing the (constantly
> changing) observable distributions of language (actually computing the
> "logic" and "rules" of the observable distributions), and that it will
> only be possible to ultimately understand language in terms of this
> underlying process, not the observable distributions themselves? Seems
> very like the IDSIA stuff to me.
>
> -Rob

I believe that language is learnable by example. People do it. The
learned knowledge is too complex to describe or explicitly model, even
though we know it. This is a simple consequence of the fact that the brain is a
computer. A finite state machine cannot simulate itself. Nor can a
Turing machine simulate another machine with greater algorithmic
complexity.

However, this restriction does not apply to the learning algorithm
itself. The algorithm, which you are born with, should be easy enough
to describe and model in a computer. Its complexity is bounded by the
information capacity of the human genome, which must be less than 6 x
10^9 bits. I think only a small portion (< 1% ?) controls the
architecture of the brain.

It is somewhat unsatisfying to know that we can only build AI by giving
up on understanding what we have built.

niels.f...@seies.de

Nov 24, 2006, 6:02:57 PM

> It is somewhat unsatisfying to know that we can only build AI by giving
> up on understanding what we have built.

So having children is rather unsatisfying? :--)

Ciao
Niels

Matt Mahoney

Nov 24, 2006, 9:20:39 PM

That is a perfect analogy. An AI might not grow up the way you had
hoped, either. At the very least, it will do things you can't explain,
predict, or control. You can't know everything it knows.

-- Matt Mahoney

Rob Freeman

Nov 25, 2006, 4:12:36 AM
"Matt Mahoney 写道:"
> Rob Freeman wrote:
> >
> > My assertion of "incompressiblity" may come down to the fact that the
> > observable distribution is moving.
>
> Nonstationary processes are not necessarily incompressible. For
> example, the infinite sequence 1234567891011121314... is nonstationary,
> but there is a simple Turing machine that outputs it.

Perhaps not. You could imagine a generative grammar constantly
producing new sentences which could be compressed to the same rules,
too. But in fact not only do the strings of language change, the rules
also change.

On the subject of nonstationary distributions and incompressible
strings, what is the relationship between non-terminating sequences,
non-halting problems etc, and incompressible strings?

> I think that the source of ambiguity in natural language is that it is
> constrained to be modeled on a very large but slow neural network.
> Forward error correction protocols that combine large numbers of weak
> constraints from a huge database can be efficiently computed in
> parallel. This allows parts of the message to be dropped to speed up
> communication without loss of meaning.

This is robustness. You get this if your message is spread across
tokens. If different messages are spread across the same tokens you
will also get ambiguity. Then the system will also be incompressible
(because you can't resolve both messages at once.)

I don't see how robustness in itself gives ambiguity, though. You could
have a message spread across many tokens, but which only used each
token in one way. Then the message would be robust, deleting tokens
would not destroy it, but the tokens would not be ambiguous.

How does a system which "allows parts of the message to be dropped ...
without loss of meaning" give us ambiguity?

> Schmidhuber proposes AIS, a variant of Hutter's AIXI, that the fastest
> (rather than the simplest) explanation of the environment is to be
> preferred. His rationale is that it makes the general AI problem
> tractable. It also explains some fundamental observations about
> physics, such as intractability of quantum mechanical computation and
> its apparent randomness. However there are other explanations that
> depend only on the more general assumption that the universe is
> computable.

It is the more general assumption that the universe is computable which
I like. This is to be contrasted with the more traditional assumption
that the universe obeys rules (i.e. the laws of Physics.)

I am proposing we make a similar assumption about language. Assume only
that language is computable. This is to be contrasted with the more
traditional assumption that language obeys rules (i.e. grammar.)

Assuming only that a system is computable gives you more power. More
power than any logical characterization of that system, for instance.
It was this realization which caused the crisis in maths (which had
assumed logic was supreme, Russell etc.) This is the significance of
the results of Goedel, Kolmogorov, Chaitin, etc.

> Legg showed in
> http://www.vetta.org/documents/IDSIA-12-06-1.pdf that a Turing machine
> cannot predict the output of another machine with greater algorithmic
> complexity. Since the universe must have greater complexity than any
> computer in it (including brains), the universe must therefore appear
> probabilistic to us.

Interesting idea. I guess there will remain some things which are
forever unknowable.

> I believe that language is learnable by example. People do it. The
> learned knowledge is too complex to describe or explicitly model, even
> though we know it. This is a simple consequence that the brain is a
> computer. A finite state machine cannot simulate itself. Nor can a
> Turing machine simulate another machine with greater algorithmic
> complexity.

You seem to be suggesting that the brain can't understand itself. This
is a slightly different problem from that of understanding the
universe. The brain is not bigger than itself.

Anyway, these theoretical results, while interesting, don't tell us
anything in particular about language. It is only a theory that grammar
is too complex for us to understand. It might easily be that we just
haven't figured out the simplicity of it yet.

I think it is worth changing our assumptions about language. Up to now
we have assumed language to be logical. Why don't we try assuming only
that it is computable? A bit like what IDSIA has done for AI (and the
universe.) If it turns out that the less stringent assumption of a
computable process gives us a better model for language than the
logical models we have assumed up to now, we won't need to worry about
any possibility language is forever beyond understanding. We will
understand it.

As a particular candidate for computable processes underlying language
I would suggest certain ways of generalizing text, much like those which
have been used to look for "grammar" since before Chomsky
(distributional analysis.) It should then be relatively trivial to get
an observable moving distribution of words, which both predicts new
text and is influenced by that new text, a kind of self-referencing
automaton.

-Rob

Matt Mahoney

Nov 25, 2006, 8:26:08 PM

Rob Freeman wrote:
> "Matt Mahoney 写道:"
> > Rob Freeman wrote:
> > >
> > > My assertion of "incompressiblity" may come down to the fact that the
> > > observable distribution is moving.
> >
> > Nonstationary processes are not necessarily incompressible. For
> > example, the infinite sequence 1234567891011121314... is nonstationary,
> > but there is a simple Turing machine that outputs it.
>
> Perhaps not. You could imagine a generative grammar constantly
> producing new sentences which could be compressed to the same rules,
> too. But in fact not only do the strings of language change, the rules
> also change.
>
> On the subject of nonstationary distributions and incompressible
> strings, what is the relationship between non-terminating sequences,
> non-halting problems etc, and incompressible strings?

Nonstationary sources must have infinite memory. For example, the
simple counting sequence I gave requires a counter that can store
arbitrarily large integers. Otherwise the sequence must eventually
repeat, which would make the source stationary.

There is no infinite, incompressible sequence that is computable. All
computable strings, whether finite or infinite, have finite algorithmic
complexity because they are generated by a program of finite length.

The vast majority of finite strings are incompressible, but all are
computable (just not computable by any program shorter than the
string).

> > I think that the source of ambiguity in natural language is that it is
> > constrained to be modeled on a very large but slow neural network.
> > Forward error correction protocols that combine large numbers of weak
> > constraints from a huge database can be efficiently computed in
> > parallel. This allows parts of the message to be dropped to speed up
> > communication without loss of meaning.
>
> This is robustness. You get this if your message is spread across
> tokens. If different messages are spread across the same tokens you
> will also get ambiguity. Then the system will also be incompressible
> (because you can't resolve both messages at once.)

What do you mean by incompressible? I can compress any English
sentence with gzip.
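
(For instance, with the same DEFLATE algorithm gzip uses, via the standard library:)

  import zlib

  text = b"I ate pizza with a fork. I ate pizza with pepperoni. I ate pizza with Sam. " * 10
  print(len(text), "->", len(zlib.compress(text, 9)), "bytes")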

> I don't see how robustness in itself gives ambiguity, though. You could
> have a message spread across many tokens, but which only used each
> token in one way. Then the message would be robust, deleting tokens
> would not destroy it, but the tokens would not be ambiguous.
>
> How does a system which "allows parts of the message to be dropped ...
> without loss of meaning" give us ambiguity?

Natural language is only ambiguous when you don't process all of the
soft constraints. So far this is something only humans can do.

Lenat (Cyc) gave the following example: "The police arrested the
demonstrators because they { feared | advocated } violence." Who does
"they" refer to? To a dumb parser it is ambiguous, but not to you.
Humans use pronouns to compress messages because they can.

> > Schmidhuber proposes AIS, a variant of Hutter's AIXI, that the fastest
> > (rather than the simplest) explanation of the environment is to be
> > preferred. His rationale is that it makes the general AI problem
> > tractable. It also explains some fundamental observations about
> > physics, such as intractability of quantum mechanical computation and
> > its apparent randomness. However there are other explanations that
> > depend only on the more general assumption that the universe is
> > computable.
>
> It is the more general assumption that the universe is computable which
> I like. This is to be contrasted with the more traditional assumption
> that the universe obeys rules (i.e. the laws of Physics.)
>
> I am proposing we make a similar assumption about language. Assume only
> that language is computable. This is to be contrasted with the more
> traditional assumption that language obeys rules (i.e. grammar.)
>
> Assuming only that a system is computable gives you more power. More
> power than any logical characterization of that system, for instance.
> It was this realization which caused the crisis in maths (which had
> assumed logic was supreme, Russell etc.) This is the significance of
> the results of Goedel, Kolmogorov, Chaitin, etc.

I always argued that language is computable. A language model is a
probability distribution P over strings that humans say, hear, read or
write. I argue that the algorithmic complexity of computing an
approximation of P close enough to pass the Turing test is about 10^9
bits.

> > Legg showed in
> > http://www.vetta.org/documents/IDSIA-12-06-1.pdf that a Turing machine
> > cannot predict the output of another machine with greater algorithmic
> > complexity. Since the universe must have greater complexity than any
> > computer in it (including brains), the universe must therefore appear
> > probabilistic to us.
>
> Interesting idea. I guess there will remain some things which are
> forever unknowable.
>
> > I believe that language is learnable by example. People do it. The
> > learned knowledge is too complex to describe or explicitly model, even
> > though we know it. This is a simple consequence that the brain is a
> > computer. A finite state machine cannot simulate itself. Nor can a
> > Turing machine simulate another machine with greater algorithmic
> > complexity.
>
> You seem to be suggesting that the brain can't understand itself. This
> is a slightly different problem from that of understanding the
> universe. The brain is not bigger than itself.
>
> Anyway, these theoretical results, while interesting, don't tell us
> anything in particular about language. It is only a theory that grammar
> is too complex for us to understand. It might easily be that we just
> haven't figured out the simplicity of it yet.

It is theoretically possible that a simple description of a complex
object may exist but be hard to find. For example, an encrypted string
of zero bits appears random if you don't know the key.

However, I doubt this is the case with language. The brain can store a
lot of information, which changes slowly over time. It seems unlikely
that this information could be compressed to a very small size.

> I think it is worth changing our assumptions about language. Up to now
> we have assumed language to be logical. Why don't we try assuming only
> that it is computable? A bit like what IDSIA has done for AI (and the
> universe.) If it turns out that the less stringent assumption of a
> computable process gives us a better model for language than the
> logical models we have assumed up to now, we won't need to worry about
> any possibility language is forever beyond understanding. We will
> understand it.
>
> As a particular candidate for computable processes underlying language
> I would suggest certain ways of generalizing text, much those which
> have been used to look for "grammar" since before Chomsky
> (distributional analysis.) It should then be relatively trivial to get
> an observable moving distribution of words, which both predicts new
> text and is influenced by that new text, a kind of self-referencing
> automaton.
>
> -Rob

I do believe that simple language learning algorithms exist (as opposed
to the learned knowledge being simple).

-- Matt Mahoney

Rob Freeman

Nov 26, 2006, 3:43:29 AM
"Matt Mahoney 写道:"
> Rob Freeman wrote:

(big snip)

I could argue these points individually Matt, but as you have agreed
with me below, I'll focus on that...

> > Anyway, these theoretical results, while interesting, don't tell us
> > anything in particular about language. It is only a theory that grammar
> > is too complex for us to understand. It might easily be that we just
> > haven't figured out the simplicity of it yet.
>
> It is theoretically possible that a simple description of a complex
> object may exist but be hard to find. For example, an encrypted string
> of zero bits appears random if you don't know the key.
>
> However, I doubt this is the case with language. The brain can store a
> lot of information, which changes slowly over time. It seems unlikely
> that this information could be compressed to a very small size.

Thanks. That is exactly my point. Language cannot be compressed.

Note what has happened. You have unwittingly slipped from my hypothesis
about grammar (that the enormous apparent complexity of grammar can be
compressed to a simple computational solution), and treated it as a
hypothesis about language.

It is preconceptions of this kind (that language and grammar are the
same) which have trapped us for so long. The insight of the IDSIA
solution (which assumes _only_ computability) should free us from this,
but it is clearly not so. We have become so used to the assumption that
language and grammar are the same that we don't even notice it any
more.

It was the complexity of grammar I was suggesting might have a simple
solution, not language, which it is indeed my thesis cannot be
compressed.

To return to the solution, with its parallel to the IDSIA focus _only
on computability_. You are making an assumption that language is
governed by grammar. Try assuming _only_ that language is computable.
You will find you have much more power.

In particular you will find you have the power to free yourself from
the apparent complexity of grammar, and see a computational solution in
terms of language, especially the power of an uncompressible language
(constantly generalized in certain ways--much like those which have been
used to look for grammar since before Chomsky, e.g. distributional
analysis--as a kind of self-referencing automaton.)

-Rob

Ian Parker

Nov 26, 2006, 7:06:08 AM

Rob Freeman wrote:

> Thanks. That is exactly my point. Language cannot be compressed.

If that is true it cannot be translated from one NL into another.
Compression exists iff (two f's, not finger trouble) you can predict
the next word, thereby obtaining a string in which more probable words
are coded with lower numbers.

spring = primavera | ressorte | mamanthal

If the correct one (primavera, rather than the literal "La estacion de
ressorte") is not of higher probability in the generated list, then
translation of a higher standard than Google's is impossible. Clearly
an absurdity.

-Ian Parker

Matt Mahoney

Nov 26, 2006, 7:13:37 PM

I understand that grammar is just one of many constraints on language.
I understand that it is therefore possible to have a complex language
with a simple grammar. There are artificial languages with simple
grammars, such as Esperanto, Attempto, Lojban, as well as most
programming and formal languages.

But I don't believe that there are any natural languages with simple
grammars. Of course there are a few dominant rules, e.g. S -> Subj
Verb Obj, NP -> Det N, etc. but there are also thousands (or millions)
of smaller rules and idioms. We don't know how many because nobody has
successfully built a natural language grammar.

So if I assume only that language is computable, how does it answer
this question?

-- Matt Mahoney

Rob Freeman

Nov 27, 2006, 2:52:39 AM
"Matt Mahoney 写道:"

>
> I understand that grammar is just one of many constraints on language.
> I understand that it is therefore possible to have a complex language
> with a simple grammar. There are artificial languages with simple
> grammars, such as Esperanto, Attempto, Lojban, as well as most
> programming and formal languages.
>
> But I don't believe that there are any natural languages with simple
> grammars. Of course there are a few dominant rules, e.g. S -> Subj
> Verb Obj, NP -> Det N, etc. but there are also thousands (or millions)
> of smaller rules and idioms. We don't know how many because nobody has
> successfully built a natural language grammar.

I'm not sure how you are seeing this Matt. I'm not saying grammar is
simple, I'm saying it is complex. I'm just saying that the complexity
of grammar has a simple explanation in terms of an underlying
incompressible model (a self-referencing automaton finding patterns on
a collection of incompressible facts, the new definition of language.)

There is a reversal of perspective which says now we are trying to
compress grammar in terms of language, not language in terms of
grammar. This is how you slipped into equating a compression of grammar
with a compression of language in your last message.

To see this model we need to separate grammar and language. Language is
now seen to have a separate computational explanation. Grammar is just
another product of that. Grammar is not the basis of language in this
model (rather language, the computational model, is the basis of
grammar. Language compresses grammar, and note: if language is
incompressible, grammar can be larger than the language itself, bigger
than our heads indeed.)

It is also a feature of this solution that it suggests "nobody has ever
successfully built a natural language grammar", because if the
underlying computational model is incompressible no such (complete)
grammar is possible. (By the mathematical result that an incompressible
underlying computational model will result in statistics, like grammar,
which are random or incomplete, exactly because the computational model
is incompressible.)

> So if I assume only that language is computable, how does it answer
> this question?

How does it work for IDSIA?

I'm confused, because IDSIA seem to see the importance of assuming only
computability for systems like physics and AI, but I'm having
difficultly pitching exactly the same model to you for language.

I assume IDSIA assume only computability because doing so lets you get
beneath rules (laws of physics, the traditional assumptions of
statistical models.) Rules (statistics etc.) become explicable in terms
of the underlying computational model. In particular if the underlying
computational model is incompressible you expect the rules (statistics
on the model) to appear (partly?) random, which randomness is now
explained.

This is the randomness Chaitin finds in maths. It exists because the
world maths attempts to order is incompressible (the same thing as
Goedel saying maths is incomplete. If it were not incompressible then
we would not have Russell's paradox and logicism would not have failed.
Though by the same token Turing would not have succeeded...)

This is the randomness physicists find in physics. There is a slew of
(at least two, anyway...) new models for physics based on
computability. I can only assume this is because such models allow us
to explain the randomness of quantum mechanics, the uncertainty
principle, etc.

This appears to be the randomness we see in natural language grammar,
and which prevents us from finding a complete grammar for any natural
language.

"How does it answer this question?" This theory explains the complexity
and elusiveness of grammar simply. The cost is that language itself
must be seen not as an expression of grammar, but a collection of
incompressible facts.

That is not really a cost. But we do need to accept language, as a
collection of facts, is incompressible. (It is powerful, and moving(?),
to the extent it is incompressible, anyway.)

I don't know if this helps. You need to give me more information about
how you understand the IDSIA model of "New AI" if you want me to be
more precise. I might be able to explain it by relating it to them.
They seem to have the same insights. It is just that we need to extend
them to language.

-Rob

Matt Mahoney

Nov 27, 2006, 9:21:52 PM

Rob Freeman wrote:
> "Matt Mahoney 写道:"
> >
> > I understand that grammar is just one of many constraints on language.
> > I understand that it is therefore possible to have a complex language
> > with a simple grammar. There are artificial languages with simple
> > grammars, such as Esperanto, Attempto, Lojban, as well as most
> > programming and formal languages.
> >
> > But I don't believe that there are any natural languages with simple
> > grammars. Of course there are a few dominant rules, e.g. S -> Subj
> > Verb Obj, NP -> Det N, etc. but there are also thousands (or millions)
> > of smaller rules and idioms. We don't know how many because nobody has
> > successfully built a natural language grammar.
>
> I'm not sure how you are seeing this Matt. I'm not saying grammar is
> simple, I'm saying it is complex. I'm just saying that the complexity
> of grammar has a simple explanation in terms of an underlying
> incompressible model (a self-referencing automaton finding patterns on
> a collection of incompressible facts, the new definition of language.)

I guess we are in agreement then. Language (including its grammar) is
complex, but there is a simple algorithm for learning it.

I am not familiar with all of the work of IDSIA. I am familiar with
Hutter's AIXI and some papers by Schmidhuber, Chaitin, Solomonoff and
Legg. Their work on computational limits of learning is very general.
The understanding I get is that it is not possible to explicitly
specify a language model. It must be learned. It is not possible to
know what is in the training data. It will not be possible to predict
what a language model will do, or understand the decisions it makes.
The brain can build but not model something as complex as itself.

I don't see any theory that separates grammar from the other types of
knowledge such as a lexicon, semantics, or pragmatics that make up a
model. I think to understand these things, we need to know how they
work in the brain. We need to experimentally determine the cognitive
limits on human language learning and usage, and build language models
with neural networks. The computational theory can only tell us about
the size of the models, not the details.

-- Matt Mahoney

jasen

Nov 28, 2006, 12:56:10 AM
On 2006-11-26, Matt Mahoney <matma...@yahoo.com> wrote:

>> On the subject of nonstationary distributions and incompressible
>> strings, what is the relationship between non-terminating sequences,
>> non-halting problems etc, and incompressible strings?
>
> Nonstationary sources must have infinite memory. For example, the
> simple counting sequence I gave requires a counter that can store
> arbitrarily large integers.

like a turing machine?

--

Bye.
Jasen

snork...@gmail.com

Nov 28, 2006, 10:30:11 AM

Matt Mahoney wrote:
> I guess we are in agreement then. Language (including its grammar) is
> complex, but there is a simple algorithm for learning it.

!

I don't know much about this (no Hutter prize for me). I guess from the
remarks in the thread the theory is that the algorithm for learning
language is bounded by the size of the information we inherit -
which we now assume is mostly found in the genome. And if we throw away
junk sequences, and maybe then throw away sequences for building
fingers and toes, maybe the algorithm fits in a 1G USB key.

Still, I find your statement remarkable. Would I be on safe ground if I
stated it as:

"Language (including its grammar) is complex, but there is a simple

(but unknown) algorithm for learning it."

?

|
| Mark Nelson - http://marknelson.us
|

Matt Mahoney

Nov 28, 2006, 11:26:40 AM

That's what I believe. All learning in animals can be classified as
classical conditioning (learning associations) or operant conditioning
(reward/punishment). For language modeling, operant conditioning is
insignificant. Classical conditioning has a simple neural model first
proposed by Hebb in 1949 and is currently the basis of neural network
training. I think the main obstacle to AI has been the huge
computational power needed to simulate the brain.
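
For concreteness, a Hebbian update in its simplest form; this is a sketch of the textbook rule, not a claim about how the brain implements it:

  import numpy as np

  def hebb(weights, pre, post, rate=0.1):
      # strengthen connections between units that are active together
      return weights + rate * np.outer(post, pre)

  pre = np.array([1.0, 0.0, 1.0])     # input activity
  post = np.array([0.0, 1.0])         # output activity
  print(hebb(np.zeros((2, 3)), pre, post))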

Our DNA puts an upper bound on the complexity of the brain's learning
algorithm. That is 6 x 10^9 bits, but a lot of that is junk DNA (98% I
think), and a lot of the remainder is redundant copies of genes, or DNA
that codes for other parts of the body (as you mentioned), or
specialized structures in the brain that don't deal with language, or
for brain chemistry needed to make neurons work which can be abstracted
out in a model.

The brain has about 10^11 neurons and maybe 10^14 synapses, but I think
a language model can be much smaller. If the information content of a
language model needed to pass the Turing test is 10^9 bits (based on
what we read/hear in a lifetime), then a neural network with 10^5
neurons and 10^9 connections trained on 1 GB of text should be
sufficient. If we represent each concept (word, idiom, or grammatical
structure, plus time delays) as an association with a few dozen neurons
each (overlapping), then this neural network ought to be trainable on a
modern PC in about a day using 2 GB memory and MMX/SSE2 vector
processing. Already we have statistical models trained on 10^8 bits,
which is equivalent to the language model of a 2 or 3 year old child.
Google has much larger models, but I don't believe they are modeling
higher grammatical structures found in adult language, either because
they haven't figured out how, or it is not important to their business
goals.
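
A quick sanity check on those sizes (the byte width per weight and the
token count are assumptions, and the op count is only order-of-magnitude):

    neurons = 10**5
    connections = 10**9
    bytes_per_weight = 2                       # assume 16-bit weights
    print(connections * bytes_per_weight / 2.0**30)  # ~1.9 GB, near the 2 GB figure

    tokens = 2 * 10**8        # ~1 GB of text at ~5 bytes per word (assumption)
    active = 40               # "a few dozen" neurons active per concept
    print("%.0e" % (tokens * active * active))
    # ~3e11 multiply-adds, comfortably within the "about a day" budget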

-- Matt Mahoney

Matt Mahoney

unread,
Nov 28, 2006, 11:31:24 AM11/28/06
to

Exactly. Real computers have finite memory, so their output must
eventually repeat, making them stationary sources. But it might take a
long time. If a language model has 10^9 bits of memory, then the cycle
time could be up to 2^(10^9), allowing it to cycle through 1 GB of text
in every possible natural language that could theoretically be learned
before it had to repeat.
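
The pigeonhole argument behind that, as a sketch (the 8-bit toy state and
the update rule are arbitrary; only the bound matters):

    def period(step, state):
        # A deterministic source with finitely many states must revisit one,
        # and from then on its output repeats.
        seen = {}
        t = 0
        while state not in seen:
            seen[state] = t
            state = step(state)
            t += 1
        return t - seen[state]

    # 8 bits of memory: at most 2^8 = 256 states, so the period is <= 256.
    # With 10^9 bits the same bound becomes 2^(10^9).
    print(period(lambda s: (5 * s + 3) % 256, 7))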

-- Matt Mahoney

jasen

unread,
Nov 29, 2006, 2:00:35 AM11/29/06
to
On 2006-11-28, Matt Mahoney <matma...@yahoo.com> wrote:
>
> jasen wrote:
>> On 2006-11-26, Matt Mahoney <matma...@yahoo.com> wrote:
>>
>> >> On the subject of nonstationary distributions and incompressible
>> >> strings, what is the relationship between non-terminating sequences,
>> >> non-halting problems etc, and incompressible strings?
>> >
>> > Nonstationary sources must have infinite memory. For example, the
>> > simple counting sequence I gave requires a counter that can store
>> > arbitrarily large integers.
>>
>> like a Turing machine?
>
> Exactly.

If a Turing machine can do it, it's computable.

Bye.
Jasen

Rob Freeman

unread,
Nov 29, 2006, 7:36:39 AM11/29/06
to
Matt Mahoney wrote:
>
> I guess we are in agreement then. Language (including its grammar) is
> complex, but there is a simple algorithm for learning it.

I'm tempted to accept this. There is a kernel of agreement there. Your
statement that grammar is not expressible, but it is learnable, has
much the right flavor.

But I worry about preconceptions that come with the word "learnable".

There is an ordering experiment I've used before which demonstrates why
I don't like to use the word "learning". I expect I've mentioned it to
you. It is the one where you take a group of people and try to order
them wrt two independent characteristics, say golf score and salary. In
general ordering according to one will mix the distribution wrt the
other and vice-versa. (You can order them wrt golf-score, and they will
be out of order wrt salary, and order wrt salary will in general
disorder them wrt golf-score--though if you play with the boss there
might be an anti-correlation.)

You could say that it is not possible to describe the system completely
in terms of any one order. On the other hand you can order the system
completely in the sense that you can order it perfectly either way, at
will.

Anyway, if you try to do both you get a statistical compromise which
predicts neither perfectly.
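
That experiment is easy to simulate (the sample size and the uniform random
attributes are arbitrary choices):

    import random

    people = [{"golf": random.random(), "salary": random.random()}
              for _ in range(100)]

    def inversions(seq):
        # Number of out-of-order pairs; 0 means perfectly sorted.
        return sum(1 for i in range(len(seq)) for j in range(i + 1, len(seq))
                   if seq[i] > seq[j])

    by_golf = sorted(people, key=lambda p: p["golf"])
    print(inversions([p["salary"] for p in by_golf]))
    # Sorting by either key is perfect for that key, but the independent key
    # stays about half shuffled (~n*(n-1)/4 inversions on average): no single
    # ordering captures both at once.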

This is what I think we have for language. You can't completely order
the system (no single complete order.) Attempts lead to a statistical
compromise--statistical language models or probabilistic grammars.

Now if "learning" means ordering, then I agree with your analysis.
Ordering is the essence of how we must understand such a system. But if
"learning" means finding a single complete ordering, then obviously
that is not possible.

It is this assumption (single complete ordering) which has tripped us
up in linguistics. I think talk of "learning" continues to lock us in
that mistaken assumption.

But if we clearly understand that "learning" is a process, and only the
process, then OK.

Oh, and in concrete terms we have to understand that the process
(ordering) acts on an incompressible collection of facts. You don't get
the power you need if the facts are not incompressible (independent.)

> I am not familiar with all of the work of IDSIA. I am familiar with
> Hutter's AIXI and some papers by Schmidhuber, Chaitin, Solomonoff and
> Legg. Their work on computational limits of learning is very general.
> The understanding I get is that it is not possible to explicitly
> specify a language model. It must be learned.

I can see how you might interpret IDSIA's work this way. There is a
kernel of truth. You are seeing the idea that you cannot "explicitly
specify" something (=no complete ordering.)

But as I say, I think talking of "learning" is dangerous. Most people
will understand this to mean it is the thing to be learned which
matters (grammar, language model), not the process of learning, whereas
in fact understanding depends on realizing the _process_ is _all_ we
have.

I prefer to emphasize this idea that there is no end point to the
"learning" process by saying the order found must be ad-hoc or chaotic.

> The brain can build but not model something as complex as itself.

You are putting a lot of burden on the words "build" and "model", but
the fact you want to make a distinction is good (this is the same
distinction I am making between "language" and "grammar", or "process"
and "thing to be learned".)

BTW, there is a trick in this size argument. It is quite nice.

If you look at what Stephen Wolfram is doing, he seems to be
emphasizing a sense in which these kinds of systems (automata)
spontaneously generate complexity.

I think the content is this. The brain can't model something more
complex than itself. But there are more orders to be found between
objects in a set than there are objects in the set. So in a funny way,
a set can model something larger than itself (if the objects in the set
are independent=the set is random, or incompressible) because a set is
a model for all the orders which can be placed upon it.
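
The counting behind that is simple to check (the values of n are arbitrary):

    import math

    # n objects admit n! orderings (and 2^n subsets), so the space of orders
    # dwarfs the set itself as n grows.
    for n in (5, 20, 100):
        bits_per_order = math.log2(math.factorial(n))
        print(n, "objects:", "%.0f" % bits_per_order,
              "bits to name one ordering vs",
              "%.1f" % math.log2(n), "bits to name one object")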

I don't know how new/surprising this result is. Hofstadter hints at it
when he says self-referencing systems might be a model for
self-awareness or consciousness.

For us it just means grammar/meaning can be bigger than the sum of all
our knowledge=we can have new ideas/new syntax.

> I don't see any theory that separated grammar from the other types of
> knowledge such as a lexicon, semantics, or pragmatics that make up a
> model.

No, there is no such theory. I am not suggesting one, either. I just
suggest that syntax, as well as lexicon, semantics, pragmatics, all of
which I would call grammar, should best be modeled as patterns among
sets of incompressible facts.

The idea that language is best modeled as sets of incompressible facts
is new, to my knowledge. Though Paul Hopper's Emergent Grammar, circa
1987, does hint at the idea of a grammar that cannot be described.
Hopper:

"Because grammar is always emergent but never present, it could be said
that it never exists as such, but is always coming into being. There
is, in other words, no 'grammar' but only 'grammaticization'."

"...grammar ... is always deferred, always in a process but never
arriving ... and since I can only choose a tiny fraction of data to
describe, any decision I make ... is very likely to be ... disputed."

"The notion of emergence is a pregnant one. It is not intended to be a
standard sense of origins or genealogy, not a historical question of
'how' the grammar came to be the way it 'is', but instead it takes the
adjective emergent seriously as a continual movement towards structure,
a postponement or 'deferral' of structure, a view of structure as
always provisional, always negotiable, and in fact as epiphenomenal,
that is at least as much an effect as a cause."

http://eserver.org/home/hopper/emergence.html

Hopper is not a computer scientist, and as far as I know never tried to
model this precisely, or explored the computational/mathematical
angles. In practice he seems to have limited himself to describing word
formation, which is a nice smooth process where the really interesting
mathematical complexities don't appear.

> I think to understand these things, we need to know how they
> work in the brain. We need to experimentally determine the cognitive
> limits on human language learning and usage, and build language models
> with neural networks. The computational theory can only tell us about
> the size of the models, not the details.

It might still be true that some detail of brain structure is
important. But computational theory need not only tell us about size.
It also has things to say about knowability.

Already we have the result that if language is an incompressible
collection of facts, grammar will be unknowable (random) and bigger
than the collection of facts which describe it (if those facts are
random/incompressible.)

Without any more details about the brain those results explain all(?)
the problems we have had modeling grammar. (What remains to be
explained?)

We should take the hint and explore a model of language as a collection
of incompressible facts (producing grammar by generalization, as a kind
of self-referencing automaton on those facts.) If we strike any more
problems we can look again at biological detail. The problem now is
that we are not even applying what we know about the maths of the
problem.

-Rob

Rob Freeman

unread,
Nov 30, 2006, 11:07:28 AM11/30/06
to
ma...@ieee.org wrote:
> Matt Mahoney wrote:
> > I guess we are in agreement then. Language (including its grammar) is
> > complex, but there is a simple algorithm for learning it.
>
> ...
>
> ... I find your statement remarkable. Would I be on safe ground if I
> stated it as:
>
> "Language (including its grammar) is complex, but there is a simple
> (but unknown) algorithm for learning it."

We know the algorithm to "learn" grammar, Mark. We have known it these
last 50 years.

The problem has been with our goals, not our methods. We have failed to
model language because we have not understood there is no single
complete grammar to be learned.

You could call this abandoning the "assumption of grammar" (and keeping
only the assumption of computability.)

-Rob

snork...@gmail.com

unread,
Dec 1, 2006, 9:48:38 AM12/1/06
to

Rob Freeman wrote:

> We know the algorithm to "learn" grammar, Mark. We have known it these
> last 50 years.

As I said, I don't know much about this. (and because of that I
apologize for the ongoing crosspost to comp.ai-nat-lang.) But Matt used
the phrase "language (including grammar)" and the part I found more
remarkable was the word "language", which to me implied semantics.

I concede that I possess this algorithm, and I think I know where it
is, and I think I know what hardware is required to run it, but if
somebody "knows" it I'd like to know more. (And as always, would like
to see working code!)

|
| Mark - http://marknelson.us
|

Rob Freeman

unread,
Dec 1, 2006, 10:24:27 PM12/1/06
to

As far as working code to "learn" the semantics of language goes this
SourceForge project might be a place to start:

http://www.d.umn.edu/~tpederse/senseclusters.html

Note, I think that like everyone else, Ted Pedersen et al. fail to
grasp the complexity issue implicit in their representations. In my
terms: their goals (optimal global clusters?) are wrong, rather than
their methods. But as a free software project collecting together
algorithms to "learn" semantics from raw text, it is probably as good
as any.

Also, Dekang Lin has some nice on-line demos which cluster together
words of similar meaning:

e.g. http://www.cs.ualberta.ca/~lindek/demos/proxysim.htm
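
The general idea behind those demos can be sketched in a few lines (this is
not Pedersen's or Lin's code; the toy sentence, window size, and raw counts
stand in for a large corpus and a proper association weight such as
pointwise mutual information):

    from collections import Counter, defaultdict
    from math import sqrt

    def context_vectors(tokens, window=2):
        # Map each word to a bag of its neighbouring words.
        vecs = defaultdict(Counter)
        for i, w in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    vecs[w][tokens[j]] += 1
        return vecs

    def cosine(a, b):
        dot = sum(a[k] * b[k] for k in a if k in b)
        na = sqrt(sum(v * v for v in a.values()))
        nb = sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    text = ("the hungry cat chased a mouse while the hungry dog chased a "
            "ball and a sleepy cat watched the sleepy dog").split()
    vecs = context_vectors(text)
    print(cosine(vecs["cat"], vecs["dog"]))      # similar slots, high score
    print(cosine(vecs["cat"], vecs["chased"]))   # different roles, lower score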

As I say, these algorithms have been known for more than 50 years. What
happened was that 50 years ago Chomsky observed they did not result in
nice global clusters, and concluded the methods were flawed!

Quite ironic really.

What with the pervasiveness of computer technology the algorithms are
becoming popular again. But we still haven't come up to speed that the
goal of finding nice global clusters is flawed. Even though it follows
from the results Turing used as the basis of computability (and agrees
with Chomsky's observations.)

Double irony.

If you have any concrete goals I can give you other references. I have
my own code for a parser assuming only computability (i.e. not assuming
global clusters.)

-Rob
