It's true that phda9 and cmix exceed 1 bit per character on the large
text benchmark. However, Shannon's estimated range of the entropy of
English is 0.6 to 1.3 bpc with a 100-character context. Shannon tested
by having subjects guess the next character (A-Z or space) until
correct. About 80% were correct on the first guess, 7% on the second,
and down from there. The range of probability distributions consistent
with those observed guess frequencies gives entropies from 0.6 to 1.3
bpc.
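
For anyone who wants to check the arithmetic, here is a minimal sketch
of Shannon's bound calculation in Python. The guess-rank distribution
below is hypothetical: the 0.80 and 0.07 echo the figures above, and
the tail is made up so it sums to 1.

import math

# Hypothetical guess-rank distribution: q[i] = fraction of characters
# the subject identified on guess i+1, over 27 symbols (A-Z plus space).
q = [0.80, 0.07, 0.04, 0.03, 0.02, 0.015, 0.01, 0.008, 0.007]
q += [0.0] * (27 - len(q))

# Shannon's upper bound: the entropy of the guess-rank distribution.
upper = -sum(p * math.log2(p) for p in q if p > 0)

# Shannon's lower bound: sum over i of i*(q_i - q_{i+1})*log2(i),
# with ranks 1-indexed and q_28 = 0.
lower = sum(i * (q[i - 1] - (q[i] if i < 27 else 0.0)) * math.log2(i)
            for i in range(1, 28))

print("entropy bounds: %.2f to %.2f bpc" % (lower, upper))

With these made-up numbers it prints roughly 0.59 to 1.24 bpc, which
shows how wide the gap between the two bounds is even with a sharp
first-guess rate.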
Cover and King tried to resolve this ambiguity by having subjects
assign probabilities directly through a gambling game; their result was
1.3 bpc. However, this method is time consuming (5 hours for one
paragraph), and humans are not very good at assigning numeric
probabilities. We tend to overestimate the probability of rare events
(which is why lottery tickets sell), and shifting probability mass onto
characters that never appear gives an artificially high entropy
estimate.
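
As a sketch of the accounting behind that (hypothetical numbers, not
the actual Cover-King protocol): the probabilities a subject assigns
translate into an average code length of -(1/n) * sum(log2 p(x_i)) over
the characters that actually occurred, so hedging toward long shots
directly inflates the estimate.

import math

def bits_per_char(assigned_probs):
    # Average code length over the characters that actually occurred.
    return -sum(math.log2(p) for p in assigned_probs) / len(assigned_probs)

# A well-calibrated subject puts 0.8 on the character that appears...
calibrated = [0.8] * 100
# ...while a hedging subject moves mass onto long shots, leaving 0.7.
hedged = [0.7] * 100

print(bits_per_char(calibrated))  # about 0.32 bpc
print(bits_per_char(hedged))      # about 0.51 bpc

The 0.2 bpc gap comes entirely from the misplaced probability mass,
which is the direction of bias you'd expect from lottery-ticket
psychology.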
The large text benchmark is about 70% English text and 30% formatting,
tables, links, XML, automatically generated articles, etc., which
probably has lower entropy than the pure text, without punctuation or
capitalization, that Shannon used. I can't give a better number because
I never attempted to measure this text with human character or word
prediction. I would be surprised if the phda9 and cmix models are high
level enough to produce responses that would pass a Turing test. The
models are more comparable to toddler-level grammars: dictionaries,
simple clauses and sentences built from n-grams, and word-proximity
semantics. I don't believe they model grammar at a high enough level to
handle arithmetic or compound sentences.
On Mon, Jul 2, 2018 at 10:34 AM James Bowery <jabo...@gmail.com> wrote:
>
> Perhaps LIMITS OF HUMAN WORD PREDICTION PERFORMANCE by Lesher et al would be a better human benchmark than Shannon's.
>
--
-- Matt Mahoney,
mattma...@gmail.com