Calculating empirical entropy of enwik8


Alex

Aug 1, 2013, 2:00:54 AM
to hutter...@googlegroups.com
Recently, many publications have been talking about the "k-th order empirical entropy" of a text.
They say it can be calculated in linear time.
See for example http://simongog.github.io/lessons/2012/08/26/Calculating_H_k_in_linear_time/
What about calculating the empirical entropy with a variable (unbounded) context length?
I guess it can be computed in reasonable time for enwik8, since the k-th order algorithm is linear.
On the other hand, such an empirical value for enwik8 would be a good lower-bound estimate of the compression ratio for PPM-based
compression methods, which sounds very interesting.
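For concreteness, the H_k computation can be sketched by direct counting of context/successor statistics (this is the naive approach, not the linear-time suffix-tree method from the linked lesson; the function name is illustrative):

```python
from collections import Counter
from math import log2

def empirical_entropy_k(text, k):
    """k-th order empirical entropy H_k of text, in bits per symbol:
    (1/n) * sum over length-k contexts w of n_w * H_0(symbols following w)."""
    n = len(text)
    # Count each (length-k context, next symbol) pair and each context.
    pair_counts = Counter((text[i:i + k], text[i + k]) for i in range(n - k))
    ctx_counts = Counter(text[i:i + k] for i in range(n - k))
    bits = 0.0
    for (ctx, _sym), c in pair_counts.items():
        bits -= c * log2(c / ctx_counts[ctx])
    return bits / n

print(empirical_entropy_k("abracadabra", 1))
```

With k = 0 the single empty context is used, so the function reduces to the ordinary zeroth-order entropy of the symbol frequencies.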




Matt Mahoney

Aug 1, 2013, 9:52:14 AM
to Hutter-Prize
The unbounded empirical entropy of any string is 0. When the context
becomes sufficiently large, each next symbol occurs at most once in
each context.


-- Matt Mahoney, mattma...@gmail.com
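This observation can be checked directly: once k is large enough that every length-k context is followed by at most one distinct symbol, H_k is exactly 0. A small illustrative sketch (names are mine, not from the thread):

```python
from collections import defaultdict

def hk_is_zero(text, k):
    """True iff every length-k context in text is followed by at most
    one distinct symbol -- i.e. the k-th order empirical entropy is 0."""
    successors = defaultdict(set)
    for i in range(len(text) - k):
        successors[text[i:i + k]].add(text[i + k])
    return all(len(s) <= 1 for s in successors.values())

s = "to be or not to be"
# Smallest order at which the empirical entropy of s drops to 0.
print(min(k for k in range(len(s)) if hk_is_zero(s, k)))
```

For any string, some finite k suffices, which is why the unbounded-context empirical entropy is always 0.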

James Bowery

Aug 1, 2013, 5:45:06 PM
to Hutter Prize
On Thu, Aug 1, 2013 at 8:52 AM, Matt Mahoney <mattma...@gmail.com> wrote:
When the context
becomes sufficiently large, each next symbol occurs at most once in
each context.

Perhaps this is a naive question:

You encoded a text as a table with N columns where each row is a node in a hierarchical grammar, and chose N such that it was "sufficiently large".

You then use the fact that the rows can be re-ordered to compress the columns.
 
Is there a name for the class of compression algorithms into which this strategy falls?

Matt Mahoney

Aug 1, 2013, 5:50:50 PM
to Hutter-Prize
On Thu, Aug 1, 2013 at 5:45 PM, James Bowery <jabo...@gmail.com> wrote:
> Perhaps this is a naive question:
>
> You encoded a text as a table with N columns where each row is a node in a
> hierarchical grammar, and chose N such that it was "sufficiently large".
>
> You then use the fact that the rows can be re-ordered to compress the
> columns.
>
> Is there a name for the class of compression algorithms into which this
> strategy falls?

You mean that each row is a context and the sequence of words that
occur in that context? What do you mean by a "node in a hierarchical
grammar"?

--
-- Matt Mahoney, mattma...@gmail.com

James Bowery

Aug 1, 2013, 5:56:55 PM
to Hutter Prize
On Thu, Aug 1, 2013 at 4:50 PM, Matt Mahoney <mattma...@gmail.com> wrote:
On Thu, Aug 1, 2013 at 5:45 PM, James Bowery <jabo...@gmail.com> wrote:
> Perhaps this is a naive question:
>
> You encoded a text as a table with N columns where each row is a node in a
> hierarchical grammar, and chose N such that it was "sufficiently large".
>
> You then use the fact that the rows can be re-ordered to compress the
> columns.
>
> Is there a name for the class of compression algorithms into which this
> strategy falls?

You mean that each row is a context and the sequence of words that
occur in that context?

Yes.
 
What do you mean by a "node in a hierarchical
grammar"?

I'm thinking of trees derived from Nevill-Manning and Witten's work SEQUITUR but of arbitrary degree, such as:

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.150.5583

James Bowery

Aug 1, 2013, 6:14:59 PM
to Hutter Prize
Giving it a little more thought, the node id prevents you from gaining much, if anything, by reordering the rows.

Probably the best you can do is keep the rows in node id order and Burrows–Wheeler transform each of the remaining columns to compress them.
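The Burrows-Wheeler transform mentioned here can be sketched via sorted rotations (an O(n² log n) illustration; real implementations use suffix arrays, and the sentinel choice below is an assumption):

```python
def bwt(s, sentinel="\x00"):
    """Burrows-Wheeler transform of s via sorted rotations.
    A unique smallest sentinel is appended to make the transform invertible."""
    assert sentinel not in s
    s += sentinel
    # Sort all cyclic rotations; the BWT is the last column of that matrix.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

print(repr(bwt("banana")))
```

The transform groups symbols that share a context, which is what makes the resulting columns easy to compress with a simple model.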

Matt Mahoney

Aug 1, 2013, 6:19:13 PM
to Hutter-Prize
On Thu, Aug 1, 2013 at 5:56 PM, James Bowery <jabo...@gmail.com> wrote:
> I'm thinking of trees derived from Nevill-Manning and Witten's work SEQUITUR
> but of arbitrary degree such as:
>
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.150.5583

Like byte pair encoding. Find the most frequent pair of symbols and
replace each instance with a new symbol. Repeat until there is no pair
of symbols that occurs more than once. You could generalize this to
finding high frequency strings of length greater than 2, but finding
pairs would find larger strings as well and be simpler.
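The pair-replacement procedure described above can be sketched as follows (a naive illustration: pair counts include overlaps, and replacement is greedy left-to-right):

```python
from collections import Counter

def byte_pair_encode(data):
    """Repeatedly replace the most frequent adjacent pair with a fresh
    symbol until no pair occurs more than once.  Returns the reduced
    sequence (as ints) and the replacement rules."""
    seq = list(data)
    rules = {}
    next_sym = 256  # input is bytes, so codes >= 256 are free for new symbols
    while len(seq) > 1:
        pair, count = Counter(zip(seq, seq[1:])).most_common(1)[0]
        if count < 2:
            break
        rules[next_sym] = pair
        out, i = [], 0
        while i < len(seq):  # greedy left-to-right replacement
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(next_sym)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq, next_sym = out, next_sym + 1
    return seq, rules

seq, rules = byte_pair_encode(b"abababab")
```

Decompression just expands each rule recursively, in the same way a grammar-based decoder would.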

I tested a byte pair encoder in
http://mattmahoney.net/dc/text.html#5322 but the compression was not
very good. I don't know of any recent attempts.

James Bowery

Aug 1, 2013, 6:47:38 PM
to Hutter Prize
That's basically the digram approach taken by SEQUITUR.

Something I find intriguing about "Structuring labeled trees for optimal succinctness, and beyond" is that they have an N-gram approach that uses a transform analogous to the BWT that they call "xbw":

We present a novel approach to the problem of succinct manipulation of labeled trees by designing what we call the xbw transform of the tree, in the spirit of the well-known Burrows-Wheeler transform for strings. The xbw transform uses path-sorting and grouping to linearize the labeled tree T into two coordinated arrays, one capturing the structure and the other the labels. Using the properties of the xbw transform, we (i) derive the first-known (near-)optimal results for succinct representation of labeled trees with O(1) time for navigation operations, (ii) optimally support the powerful subpath search operation for the first time, and (iii) introduce a notion of tree entropy and present linear time algorithms for compressing a given labeled tree up to its entropy beyond the information-theoretic lower bound averaged over all tree inputs. Our xbw transform is simple and likely to spur new results in the theory of tree compression and indexing, and may have some practical impact in XML data processing.

Matthias Gallé

Aug 1, 2013, 6:42:32 PM
to hutter...@googlegroups.com
On Thu, Aug 1, 2013 at 11:19 PM, Matt Mahoney <mattma...@gmail.com> wrote:
On Thu, Aug 1, 2013 at 5:56 PM, James Bowery <jabo...@gmail.com> wrote:
> I'm thinking of trees derived from Nevill-Manning and Witten's work SEQUITUR
> but of arbitrary degree such as:
>
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.150.5583

These are "grammar-based compressors". Given a sequence, you try to find a grammar that generates exactly that one sequence, and then compress the grammar. The other side decompresses the grammar and regenerates the (unique) string.
If the grammar is context-free, this is also known as the "Smallest Grammar Problem".
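A toy sketch of the decompression side: each nonterminal of a straight-line grammar has exactly one production, so expanding the start symbol regenerates exactly one string (the grammar below is invented for illustration):

```python
def expand(grammar, symbol):
    """Expand a straight-line grammar (one production per nonterminal,
    so it generates exactly one string) from the given start symbol."""
    return "".join(
        expand(grammar, s) if s in grammar else s
        for s in grammar[symbol]
    )

# Toy grammar: S -> A A a b d,  A -> a b c
g = {"S": ["A", "A", "a", "b", "d"], "A": ["a", "b", "c"]}
print(expand(g, "S"))
```

Compression then amounts to finding a small such grammar and encoding its productions compactly.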

 


Like byte pair encoding. Find the most frequent pair of symbols and
replace each instance with a new symbol. Repeat until there is no pair
of symbols that occurs more than once. You could generalize this to
finding high frequency strings of length greater than 2, but finding
pairs would find larger strings as well and be simpler.
Actually, as far as I know, a greedy approach obtains the best results: repeatedly find the string whose replacement would reduce the grammar size the most, i.e. a variant of a length × occurrences score.

 

I tested a byte pair encoder in
http://mattmahoney.net/dc/text.html#5322 but the compression was not
very good. I don't know of any recent attempts.
Most of these algorithms run in super-linear time. They are getting a revival (I see a couple of papers per year) because they easily permit compressed pattern matching: finding a subsequence (possibly itself compressed) in a compressed string without decompressing it.



 

--
-- Matt Mahoney, mattma...@gmail.com



