Shannon's Benchmark Exceeded?


James Bowery

Jul 2, 2018, 10:09:23 AM
to Hutter Prize
The rationale for the Large Text Compression Benchmark states:

"...in 1950, Claude Shannon estimated the entropy (compression limit) of written English to be about 1 bit per character [3]. To date, no compression program has achieved this level."

The "date" referred to is July 23, 2009.

The current best compression ratio is about 1 bit per 8.5 bits of input, i.e., better than one bit per byte (character).  Given that the entropy of the benchmark corpus is lower than that of the text Shannon used, is there a principled way of rendering the two measures comparable?

If one normalizes with a trivial measure of entropy, say, arithmetic coding, and then uses that size as the numerator in the compression ratio, would that render the measures comparable enough to begin making reasonable statements about the intelligence represented by the recent submissions relative to human intelligence?
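For concreteness, here is a minimal sketch of the kind of normalization I have in mind, assuming an order-0 character model as the "trivial" entropy measure (roughly what a simple adaptive arithmetic coder approaches); the file path and the compressed size below are placeholders to be filled in with a real corpus and submission:

import math
from collections import Counter

def order0_entropy_bits(text):
    # Total order-0 (character-frequency) entropy of the text, in bits.
    counts = Counter(text)
    n = len(text)
    return -sum(c * math.log2(c / n) for c in counts.values())

def normalized_ratio(compressed_bytes, text):
    # Compressed size divided by the trivial-entropy baseline,
    # rather than by the raw 8 bits per byte.
    return (8 * compressed_bytes) / order0_entropy_bits(text)

text = open("enwik9", encoding="utf-8", errors="replace").read()  # placeholder corpus
print(normalized_ratio(123_456_789, text))  # placeholder compressed size in bytes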

James Bowery

Jul 2, 2018, 10:18:26 AM
to Hutter Prize
Perhaps Nevill-Manning's SEQUITUR would be an appropriate preprocessor for the arithmetic coder, to take out a higher-level, but still trivial, kind of order.
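To illustrate the flavor of grammar-based preprocessing, here is a simplified offline digram-replacement pass in the spirit of SEQUITUR/Re-Pair (not Nevill-Manning's actual online algorithm); the shortened sequence plus the rules would then be handed to the arithmetic coder:

from collections import Counter

def digram_preprocess(data):
    # Repeatedly replace the most frequent adjacent symbol pair with a
    # fresh nonterminal until no pair occurs twice, returning the
    # shortened sequence plus the replacement rules.
    seq = list(data)
    rules = {}
    next_sym = 256  # nonterminals start above the byte range
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break
        rules[next_sym] = pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(next_sym)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq, next_sym = out, next_sym + 1
    return seq, rules

symbols, grammar = digram_preprocess(b"to be or not to be")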


James Bowery

Jul 2, 2018, 10:34:13 AM
to Hutter Prize
Perhaps "Limits of Human Word Prediction Performance" by Lesher et al. would be a better human benchmark than Shannon's.

Matt Mahoney

Jul 2, 2018, 12:11:33 PM
to Hutter-Prize
It's true that phda9 and cmix now compress the large text benchmark to better than 1 bit per character. However, Shannon's estimated range for the entropy of English is 0.6 to 1.3 bpc with a 100-character context. Shannon tested by having subjects guess the next character (A-Z or space) until correct. About 80% of guesses were correct on the first try, 7% on the second, and down from there. The range of probability distributions consistent with the observed results gave entropies of 0.6 to 1.3 bpc.
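For reference, here is a minimal sketch of how those bounds are computed from guess-rank frequencies (the distribution below is illustrative, not Shannon's actual data):

import math

def shannon_bounds(q):
    # q[i] = fraction of positions where the correct character was the
    # (i+1)-th guess; entries should be non-increasing and sum to 1.
    # Returns Shannon's (lower, upper) entropy bounds in bits/character.
    upper = -sum(p * math.log2(p) for p in q if p > 0)
    q_ext = list(q) + [0.0]
    lower = sum((i + 1) * (q_ext[i] - q_ext[i + 1]) * math.log2(i + 1)
                for i in range(len(q)))
    return lower, upper

q = [0.80, 0.07, 0.04, 0.03, 0.02, 0.02, 0.01, 0.01]  # illustrative only
print(shannon_bounds(q))  # roughly (0.6, 1.2) for this made-up distribution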

Cover and King tried to resolve this ambiguity by having subjects directly assign probabilities using a gambling game, and their result was 1.3 bpc. However, this method is time consuming (5 hours for 1 paragraph) and humans are not very good at assigning numeric probabilities. We tend to overestimate the probability of rare events (which is why lottery tickets sell), which gives an artificially high entropy estimate.
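The gambling estimate works roughly like this (a sketch of the idea, not Cover and King's exact protocol):

import math

def gambling_entropy_estimate(bets, alphabet_size=27):
    # bets[i] = fraction of capital the subject placed on the character
    # that actually occurred at position i, at fair alphabet_size-for-1
    # odds. The log-wealth growth rate converts into an entropy
    # estimate in bits per character.
    n = len(bets)
    log_wealth = sum(math.log2(alphabet_size * b) for b in bets)
    return math.log2(alphabet_size) - log_wealth / n

# A subject who always puts half the capital on the right character
# would be measured at 1.0 bpc:
print(gambling_entropy_estimate([0.5] * 100))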

The large text benchmark is about 70% English text and 30% formatting, tables, links, XML, automatically generated articles, etc., which probably has lower entropy than the pure text, without punctuation or capitalization, that Shannon used. I can't give a better number because I never attempted to measure this text using human character or word prediction. I would be surprised if the phda9 and cmix models are high level enough to produce responses that would pass a Turing test. The models are more comparable to toddler-level grammars: dictionaries, simple clauses and sentences using n-grams, and word-proximity semantics. I don't believe they model grammar at a high enough level to handle arithmetic or compound sentences.



--
-- Matt Mahoney, mattma...@gmail.com

Matthias Gallé

Jul 2, 2018, 1:13:34 PM
to hutter...@googlegroups.com
James,

For the combination of a grammar plus arithmetic coding, Kieffer and Yang formalized that in their "Grammar-based codes: a new class of universal lossless source codes" papers (parts I and II, if I remember correctly).

--mg



James Bowery

Jul 2, 2018, 2:54:06 PM
to Hutter Prize
The uncertainty in the human measurement, which stems primarily from the cost of labor, might be overcome by turning it into a crowd-sourced online game in which the corpus is yet-to-be-published prose, sampled without replacement.  There would need to be a fairly high flow of such prose, since more recent efforts (such as the cited "Limits of Human Word Prediction Performance") found that prior context sentences are necessary, so each challenge must consume at least a few sentences at once.

The uncertainty in the _relative_ entropy of prose vs Wikipedia source is something I sought to address by suggesting they be normalized with some very basic kind of compression.  Matthias's suggestion of "A Universal Grammar-Based Code For Lossless Compression of Binary Trees" followed by arithmetic coding is appealing.

Is there some way of testing the adequacy of such automated normalization short of human measurement on the Wikipedia source?


Ruchira Datta

Jul 2, 2018, 7:54:01 PM
to hutter...@googlegroups.com
Wow, that is both ingenious and hilarious. Thank you!

On Mon, Jul 2, 2018 at 2:59 PM, Brian Mingus <refle...@gmail.com> wrote:
There is a fascinating corpus I am aware of that could be useful. What I like about it is that the ratio of text lengths speaks more to how much of the text is needed to communicate the actual meaning. 

I give you: http://sqapo.com