Grammar Preprocessing?


James Bowery

Jul 14, 2022, 9:55:12 AM7/14/22
to Hutter Prize
Has anyone looked at lossless compression of a preprocessed corpus in the form of a grammar tree?   Admittedly, a small C++ English-language parser will produce inadequate trees on the occasional sentence, such as:

"I once shot an elephant in my pajamas. How he got in my pajamas I'll never know"

And then there is the problem of finding an adequate Mediawiki grammar.

But the point here is that just as converting a .zip-compressed text to .bz2 requires decompression preprocessing into a higher-dimensional space, so it may make sense to "decompress" Mediawiki text into a higher-dimensional space that makes semantic content more apparent to a compression algorithm.


Neil Harding

Jul 14, 2022, 4:11:51 PM7/14/22
to hutter...@googlegroups.com
I spent about 6 months trying to transform it into a higher-level representation, but that was with the 100MB version, not with the 1GB version.

Neil Harding


Matt Mahoney

Jul 14, 2022, 4:51:00 PM7/14/22
to Hutter-Prize
Look up "Compression by Induction of Hierarchical Grammars" by Nevill-Manning, Witten, and Maulsby, 1994. Yes it works. It might be something that could be added to a mix of models for a slight gain.
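For readers who haven't seen the paper: SEQUITUR builds a hierarchical grammar online by enforcing digram uniqueness. A rough offline sketch of the same idea (closer in spirit to Re-Pair: repeatedly replace the most frequent adjacent pair with a fresh nonterminal) might look like this:

```python
from collections import Counter

def induce_grammar(seq, min_count=2):
    """Greedy offline grammar induction: while some adjacent pair occurs
    at least min_count times, replace it with a new nonterminal."""
    seq = list(seq)
    rules = {}            # nonterminal -> (left_symbol, right_symbol)
    next_id = 0
    while len(seq) > 1:
        pairs = Counter(zip(seq, seq[1:]))
        pair, count = pairs.most_common(1)[0]
        if count < min_count:
            break
        nt = ("R", next_id)
        next_id += 1
        rules[nt] = pair
        # Rewrite the sequence, replacing non-overlapping occurrences.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(nt)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules

def expand(sym, rules):
    """Recursively expand a symbol back to the original terminals."""
    if sym in rules:
        left, right = rules[sym]
        return expand(left, rules) + expand(right, rules)
    return [sym]

compressed, rules = induce_grammar("abcabcabcabc")
restored = [t for sym in compressed for t in expand(sym, rules)]
print(compressed, len(rules))
```

This is only a sketch of the induction step; an actual compressor would still need to encode the start sequence and the rule set compactly, which is where the slight gain in a model mix would come from.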

James Bowery

Jul 14, 2022, 10:52:36 PM7/14/22
to Hutter Prize
My question wasn't about grammar induction* but rather about using an existing parser to preparse into a tree.

*Actually SEQUITUR is the first thing I tried about 10 years ago as an approach to the 100M corpus.  I think I have the C code around somewhere.  However, the way I compressed the induced grammar was to try to adapt Hecht-Nielsen's cogent confabulation to a hierarchical "swirl" (his term) that he used to perform plausible next sentence generation.
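The preparsing idea can be illustrated with a toy sketch: flatten a parse tree into a reversible bracketed token stream, the kind of higher-dimensional representation a general-purpose compressor would then model. The tree below is hand-written and hypothetical; a real pipeline would obtain it from an existing parser.

```python
# Hand-written parse tree for illustration only (labels follow common
# Penn Treebank conventions, but nothing here comes from a real parser).
tree = ("S",
        ("NP", ("PRP", "I")),
        ("VP", ("VBD", "shot"),
               ("NP", ("DT", "an"), ("NN", "elephant"))))

def serialize(node):
    """Flatten a tree into a bracketed token stream a compressor can model."""
    if isinstance(node, str):
        return [node]
    label, *children = node
    out = ["(", label]
    for child in children:
        out += serialize(child)
    out.append(")")
    return out

tokens = serialize(tree)
print(" ".join(tokens))
```

Because brackets balance and labels are drawn from a small closed set, a context-mixing compressor sees far more exploitable structure in this stream than in raw prose, which is the hoped-for payoff of preparsing.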

Sebastian Nerger

Jul 15, 2022, 10:54:17 AM7/15/22
to hutter...@googlegroups.com
I think the old-style lossless-compression approach is a dead end. Of course it can yield another half percent of compression ratio, but in my opinion some combination of more or less new algorithms and patterns, amounting to a new art of compression, is required.
