removing observations, conditional training, etc.

74 views
Skip to first unread message

Jason Eisner

unread,
Jan 20, 2011, 1:40:48 AM1/20/11
to sequence...@googlegroups.com
Hi -- Thanks very much for publicizing and maintaining this code.  I'm interested in using a sequence memoizer as a component of a larger model, which leads me to a few questions.

(1) In my setting, the training sequences are not fixed.  They are themselves imputed by a sampler.  So I can't just feed in all the training sequences once and for all.  Rather, as the state of the sampler evolves, I need to continually add and remove short independent training sequences.  In principle, this is straightforward thanks to the exchangeability of the training sequences with one another.  However, the implementation probably involves removing data from a trie.  I see at
     http://www.sequencememoizer.com/documentation/sequencememoizer/index.html
that newSequence() and continueSequence() make it possible to add new sequences.  However, it doesn't look like there are methods currently for removing sequences.  I haven't looked at the source code.  Would it be easy for me (or you!) to add this functionality?

(2) Is there anything wrong with using a very large alphabet size?

(3) Is any support coming for graphical PY processes?

(4) Actually, I am using the sequence memoizer merely to encode a kind of conditional backoff.  I care about p(yz | ...tuvwx), and I want a model of it that backs off to p(yz | uvwx) and p(yz | vwx) and so on.  This is roughly what your code is for.  However, note that I shouldn't train the sequence ...tuvwxyz.  In my setting, the arbitrarily long context ...tuvwx is given by some OTHER model, and I only want the SM model to be trained on and to predict the last two characters, yz.  If I trained on ...tuvwxyz, then I would be learning also that uv tends to be followed by w, which is a fact about the distribution over contexts, not a fact about the conditional distribution that the SM is supposed to model.  Spuriously learning that uv tends to be followed by w would both waste memory and distort the predictive distributions.  It is possible to hack around the latter problem with a bit of extra expense, but that doesn't save the waste.  Suggestions?

Thanks again!
-cheers, jason

Nicholas Bartlett

unread,
Jan 20, 2011, 3:00:56 PM1/20/11
to sequence...@googlegroups.com
Jason, 

This kind of functionality is not currently built into the sequence memoizer code.  Primarily this is because the sequential nature of the training sequence is fundamental for the data structure, namely the prefix tree, being used by the sequence memoizer.  It is not obvious how to update the underlying data structure to accommodate the addition and removal of counts in different contexts.

Using a very large alphabet is natural when using Pitman-Yor processes since they naturally model power law type behavior.

There is no current plan to release code for the graphical Pitma-Yor process.

Nick Bartlett
--
BARTLETT Nick
Department of Statistics
Columbia University, New York

Jason Eisner

unread,
Jan 20, 2011, 3:19:38 PM1/20/11
to sequence...@googlegroups.com
Thanks, Nick.  I think it actually is possible to modify the data structures to handle (1) and probably also (4), but I haven't looked at the code so I may be missing something. 

For the moment, I'll probably just roll my own hierarchical CRP or PYP, or use a distance-dependent CRP.  But I have to decide how much effort to invest in class design and things like resampling the hyperparameters; would have been nice to build on your code.

-cheers, jason

Jan Gasthaus

unread,
Jan 21, 2011, 7:38:54 AM1/21/11
to sequence...@googlegroups.com
Hi Jason,

I have recently released some code that also implements the sequence memoizer
and related models. It is written in C++ and exposes a slightly lower-level
interface than Nick's Java version, and as such may be easier to tailor to your
purposes.

It does not support (1), but it does separate the manipulation of the
underlying context tree from the HPYP model, so (4) can be easily done by
separately inserting contexts into the context tree (using the insertContext()
method in HPYPModel) and observations into these contexts (using
insertObservation). Observations can also be removed from contexts
(removeObservation() method), but currently there is no way to remove contexts.

As Nick mentioned, is not exactly clear how to efficiently remove contexts from
the underlying compressed prefixed tree -- but if you have any ideas how to do
this I would love to hear them.

The code is at: http://www.gatsby.ucl.ac.uk/~ucabjga/code.html

Currently it's lacking a bit in documentation, but the code is commented
reasonably well.

Cheers,

Jan

Frank Wood

unread,
Jan 21, 2011, 7:52:21 AM1/21/11
to sequence...@googlegroups.com
Jan,

I am _SO_ sorry.  I keep forgetting to put your link on the website.  Remind me this weekend and I'll do it.  I was gone for so long it slipped my mind.

Frank

Jason Eisner

unread,
Jan 21, 2011, 10:22:07 AM1/21/11
to sequence...@googlegroups.com
Great, thank you!  We will have a look ...
-jason

On Fri, Jan 21, 2011 at 7:38 AM, Jan Gasthaus <jgas...@googlemail.com> wrote:
Reply all
Reply to author
Forward
0 new messages