Stochastic Context Free Grammar

23 views
Skip to first unread message

nadam

unread,
Sep 19, 2008, 8:21:24 AM9/19/08
to Hutter Prize
Hi!

I've only found the Hutter prize some days ago...:)
I am wondering: did anyone try an approach of creating an algorithm,
which automatically learns a Stochastic Context Free Grammar (
http://en.wikipedia.org/wiki/PCFG ) based on the data?

Creating such an algorithm seems to be non-trivial, so before working
on it harder, I would like to know whether anyone tried that approach
for this problem.

Regards:
Adam

Matt Mahoney

unread,
Sep 19, 2008, 10:20:21 AM9/19/08
to Hutter...@googlegroups.com
I'm not aware of PCFG being applied to text compression, although in theory it might work. The top compressors like paq8hp12 and durilca4linux model language mostly at the level of n-grams and a crude semantic/part-of-speech model in which related words are grouped in a predefined dictionary and whole groups are used as context. I believe that durilca4linux used a clustering algorithm to find related words. However, its top position in the LTCB is probably because it uses twice as much memory. The model itself is more primitive because it uses PPM rather than context mixing, which restricts the semantic model to a 1-dimensional space.

In earlier work in language models, modeling grammar has not been as successful as modeling semantics using sparse n-grams or LSA, but that might be because the corpus was not large enough to learn English grammar. That might be a problem for enwik8, although I believe enwik9 should be large enough. enwik8 is about the amount of language that a 3 year old child is exposed to, so it seems to me to be too small to train an adult level grammar model on it.

-- Matt Mahoney, matma...@yahoo.com


--- On Fri, 9/19/08, nadam <na...@freemail.hu> wrote:

Reply all
Reply to author
Forward
0 new messages