When the contextbecomes sufficiently large, each next symbol occurs at most once in
each context.
On Thu, Aug 1, 2013 at 5:45 PM, James Bowery <jabo...@gmail.com> wrote:You mean that each row is a context and the sequence of words that
> Perhaps this is a naive question:
>
> You encoded a text as a table with N columns where each row is a node in a
> hierarchical grammar, and chose N such that it was "sufficiently large".
>
> You then use the fact that the rows can be re-ordered to compress the
> columns.
>
> Is there a name for the class of compression algorithms into which this
> strategy falls?
occur in that context?
What do you mean by a "node in a hierarchical
grammar"?
We present a novel approach to the problem of succinct manipulation of labeled trees by designing what we call the xbw transform of the tree, in the spirit of the well-known Burrows-Wheeler transform for strings. xbw transform uses path-sorting and grouping to linearize the labeled tree T into two coordinated arrays, one capturing the structure and the other the labels. Using the properties of the xbw transform, we (i) derive the first-known (near-)optimal results for succinct representation of labeled trees with O(1) time for navigation operations, (ii) optimally support the powerful subpath search operation for the first time, and (iii) introduce a notion of tree entropy and present linear time algorithms for compressing a given labeled tree up to its entropy beyond the information-theoretic lower bound averaged over all tree inputs.Our xbw transform is simple and likely to spur new results in the theory of tree compression and indexing, and may have some practical impact in XML data processing
On Thu, Aug 1, 2013 at 5:56 PM, James Bowery <jabo...@gmail.com> wrote:
> I'm thinking of trees derived from Nevill-Manning and Witten's work SEQUITUR
> but of arbitrary degree such as:
>
> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.150.5583
Like byte pair encoding. Find the most frequent pair of symbols and
replace each instance with a new symbol. Repeat until there is no pair
of symbols that occurs more than once. You could generalize this to
finding high frequency strings of length greater than 2, but finding
pairs would find larger strings as well and be simpler.
I tested a byte pair encoder in
http://mattmahoney.net/dc/text.html#5322 but the compression was not
very good. I don't know of any recent attempts.
--
You received this message because you are subscribed to the Google Groups "Hutter Prize" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hutter-prize...@googlegroups.com.
To post to this group, send email to hutter...@googlegroups.com.
Visit this group at http://groups.google.com/group/hutter-prize.
For more options, visit https://groups.google.com/groups/opt_out.