Combining word and sub-word LMs for a large OOV task


Joan Puigcerver

Jan 14, 2016, 9:54:37 AM
to kaldi-help
Hi,

I am facing a task with a considerable number of OOV words, and I thought of combining a word-level n-gram with a sub-word n-gram language model (i.e. a character-level language model).

The first thing I thought of was to expand the arcs in the basic word-level n-gram model into their sub-word correspondences, and then interpolate the two language-model WFSTs (G_1 and G_2). However, this makes the resulting WFST non-deterministic, and during Viterbi decoding only the score from one of the two branches would be used (the max). Instead, I need the scores of the two LMs to be interpolated.

Thus, I thought of producing two lattices, one with each LM, and then trying to combine them using lattice-combine or lattice-interp, but I am not sure this would work either, since both use an fstprojection and I think the scores would not be summed up either.

The only approach I know for sure would work is to produce an explicit set of n-best hypotheses and then do the interpolation over those two sets. However, a lot of information from the lattices is lost this way, or n has to be set to a very large number, and I'm wondering whether there is a way of doing this interpolation directly on the lattices.
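For the record, the n-best fallback is easy to sketch. A minimal, illustrative Python version (the function name and the use of dicts mapping hypothesis strings to total log-scores are my own assumptions, not Kaldi code):

```python
import math

def interpolate_nbest(nbest_word, nbest_char, k):
    """Linearly interpolate two n-best lists.

    Each list maps a hypothesis string c to its total log-score,
    log P(x | c) + log P_lm(c), under that system. Hypotheses
    missing from one list get probability 0 there.
    """
    hyps = set(nbest_word) | set(nbest_char)
    scores = {}
    for c in hyps:
        p_w = math.exp(nbest_word.get(c, float("-inf")))
        p_c = math.exp(nbest_char.get(c, float("-inf")))
        # Linear interpolation of the two (unnormalized) probabilities.
        scores[c] = k * p_w + (1 - k) * p_c
    # Re-rank by the interpolated score.
    return max(hyps, key=lambda c: scores[c]), scores
```

As the post notes, whatever is pruned out of both n-best lists is lost entirely, which is why n would have to be very large.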

Has anyone tried this so far? How would you proceed?

Joan Puigcerver

Jan 14, 2016, 10:17:07 AM
to kaldi-help
My bad: I meant fstcomposition, not fstprojection...

Daniel Povey

Jan 14, 2016, 1:19:40 PM
to kaldi-help
You can do it in lattice rescoring as long as your lattice can potentially have all the paths you care about (i.e. as long as it contains the sub-words). Combining different LMs with weights is easy using combinations of lattice-compose and lattice-scale. lattice-compose always adds the LM with a weight of one, and you can use lattice-scale before and after to, in effect, add the LM with a weight different from one -- see lmrescore.sh, where it uses this trick to add the 'old' LM with a weight of -1, to subtract its scores.
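In cost (negative log-probability) terms, the trick amounts to simple per-arc arithmetic. A toy sketch of what the lattice-scale / lattice-compose sequence computes on a single path (illustrative only, not Kaldi internals; the function and argument names are made up):

```python
def rescored_cost(acoustic_cost, old_lm_cost, new_lm_cost,
                  old_weight=-1.0, new_weight=1.0):
    """Emulate the lmrescore.sh trick on one path: lattice-scale the
    old LM cost by -1 (subtracting its scores), then lattice-compose
    with the new LM, which adds its cost with an effective weight of one."""
    return acoustic_cost + old_weight * old_lm_cost + new_weight * new_lm_cost
```

Choosing weights other than -1 and 1 gives the weighted (log-domain) combination of LMs that Dan describes.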


Dan



Joan Puigcerver

Jan 19, 2016, 9:03:28 AM
to kaldi-help, dpo...@gmail.com
Thanks for your reply, Dan.
 
You can do it in lattice rescoring as long as your lattice can potentially have all the paths you care about (i.e. as long as it contains the sub-words).

That is precisely my problem. Since I have a high OOV rate, many character sequences exist in the lattice produced with the char-level LM that do not exist in the word-level one. The opposite may also be true, due to pruning, but it is less likely.
 
Combining different LMs with weights is easy using combinations of lattice-compose and lattice-scale. lattice-compose always adds the LM with a weight of one, and you can use lattice-scale before and after to, in effect, add the LM with a weight different from one -- see lmrescore.sh, where it uses this trick to add the 'old' LM with a weight of -1, to subtract its scores.

As you mention, lattice-compose "adds" the acoustic costs and graph costs from each lattice. But this is not the same as a linear interpolation between the likelihoods, since the costs are negative log-likelihoods. Instead, I would be doing a logarithmic interpolation between the two models, not the linear one that I want.

I am wondering if there is a lattice-determinize version that determinizes in the log semiring instead of the tropical one. This would be perfect, since I could easily create a non-deterministic lattice as the union of the (scaled) original lattices and then determinize it. This should work since, AFAIK, the union of the (scaled) lattices would be determinizable.

Let's assume that the lattices are converted to regular FSTs. Then the recipe would be something like this:

1. Apply a linear scale to each lattice: for each lattice, create a new initial state with an epsilon transition to the old initial state, with cost -log(interpolation_weight).
2. fstunion scaled_lattice1.fst scaled_lattice2.fst | fstproject --project_output=true | fstdeterminizestar --use-log=true > linear_interpolation.fst

I think this should work, provided that the union of the lattices is in fact determinizable on the output symbols, which I think is the case.
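For what it's worth, the log-semiring "plus" that such a determinization would apply is just a log-add of path probabilities. A small Python sketch of how identical label sequences from the two scaled lattices would be merged (helper names are mine; real determinization works arc by arc, not on whole paths):

```python
import math
from collections import defaultdict

def log_add(a, b):
    """Log-semiring 'plus' on costs: -log(exp(-a) + exp(-b))."""
    if a == float("inf"):
        return b
    if b == float("inf"):
        return a
    m = min(a, b)
    return m - math.log1p(math.exp(-(max(a, b) - m)))

def combine_paths(paths):
    """Collapse duplicate label sequences the way log-semiring
    determinization would: an identical string occurring in both
    scaled lattices ends up with the sum of its probabilities."""
    out = defaultdict(lambda: float("inf"))
    for labels, cost in paths:
        out[labels] = log_add(out[labels], cost)
    return dict(out)
```

Under the tropical semiring the merged cost would instead be the min (the max probability), which is exactly the behavior the post is trying to avoid.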

What do you think? Is there another way of doing this in a more direct way?

Daniel Povey

Jan 19, 2016, 1:43:59 PM
to Joan Puigcerver, Nagendra Goel, kaldi-help

As you mention, lattice-compose "adds" the acoustic costs and graph costs from each lattice. But this is not the same as a linear interpolation between the likelihoods, since the costs are negative log-likelihoods. Instead, I would be doing a logarithmic interpolation between the two models, not the linear one that I want.

Yes, the interpolation that you get from lattice-compose is logarithmic.

I am wondering if there is a lattice-determinize version that determinizes in the log semiring instead of the tropical one. This would be perfect, since I could easily create a non-deterministic lattice as the union of the (scaled) original lattices and then determinize it. This should work since, AFAIK, the union of the (scaled) lattices would be determinizable.

Lattices have their own semiring due to the different types of scores.  lattice-combine can scale and combine the lattices in the way you want.  Doing lattice-determinize generally won't make much of a difference here however you do it -- these two types of lattices would almost never contain identical paths, so properly summing the probabilities (or not) wouldn't make a difference.  Doing this and then finding the lattice best path would always give you a path from one lattice or the other.  If you do MBR decoding (lattice-mbr-decode), IIRC it will combine them in a more meaningful way.
 
Let's assume that the lattices are converted to regular FSTs. Then the recipe would be something like this:

1. Apply a linear scale to each lattice: for each lattice, create a new initial state with an epsilon transition to the old initial state, with cost -log(interpolation_weight).
2. fstunion scaled_lattice1.fst scaled_lattice2.fst | fstproject --project_output=true | fstdeterminizestar --use-log=true > linear_interpolation.fst

I think this should work, provided that the union of the lattices is in fact determinizable on the output symbols, which I think is the case.

What do you think? Is there another way of doing this in a more direct way?

lattice-combine does this, without the determinization (which, as I said, wouldn't give you a meaningful combination of the lattices anyway).  MBR decoding will give you a little bit of combination.  But this is not at all equivalent to linear interpolation of the LMs.  In fact, it's not clear to me that it's even meaningful to linearly combine the LMs, since they are over different types of symbols.  What you probably want to do is create an FST representation of an OOV word and use 'fstreplace' to insert it at the 'UNK' symbol in the LM, at the 'LG' stage of graph compilation.  Getting this LM to compile correctly is tricky because of the determinization that happens in building HCLG.  At one point Nagendra Goel was working on scripts like this, but I don't think they are ready.

Dan
 

Joan Puigcerver

Jan 21, 2016, 5:06:50 AM
to kaldi-help, joap...@gmail.com, nagend...@govivace.com, dpo...@gmail.com
Lattices have their own semiring due to the different types of scores.  lattice-combine can scale and combine the lattices in the way you want.  Doing lattice-determinize generally won't make much of a difference here however you do it -- these two types of lattices would almost never contain identical paths, so properly summing the probabilities (or not) wouldn't make a difference.  Doing this and then finding the lattice best path would always give you a path from one lattice or the other.  If you do MBR decoding (lattice-mbr-decode), IIRC it will combine them in a more meaningful way.

I'll take a look at the MBR decoding, thanks for the advice.
 
lattice-combine does this, without the determinization (which, as I said, wouldn't give you a meaningful combination of the lattices anyway).  MBR decoding will give you a little bit of combination.  But this is not at all equivalent to linear interpolation of the LMs.  In fact, it's not clear to me that it's even meaningful to linearly combine the LMs, since they are over different types of symbols.  What you probably want to do is create an FST representation of an OOV word and use 'fstreplace' to insert it at the 'UNK' symbol in the LM, at the 'LG' stage of graph compilation.  Getting this LM to compile correctly is tricky because of the determinization that happens in building HCLG.  At one point Nagendra Goel was working on scripts like this, but I don't think they are ready.

I think I have not explained clearly what I'm trying to do. Basically, I am trying to implement the interpolation of LMs for the decoding, as explained in this paper: http://dx.doi.org/10.1109/ICFHR.2014.64

They report a huge improvement in terms of perplexity when going from a backoff-style combination (exactly the one you propose, which they also tried in their decoder) to the interpolation of the two LMs. Of course, the paper only measures the gains in PPL, because implementing the interpolation of the LMs in the decoder is tricky...

First, you can break the word-level n-gram down into characters (basically, this would be the LG graph plus fstproject --project_output=false). This would be an LM (it's no longer an n-gram) with the same information as the word-level n-gram, but over the set of characters instead of words. Originally, I thought I could linearly interpolate this graph with the graph obtained from the character-level n-gram, but the result is probably not determinizable, so the Viterbi decoding would go through one model or the other...

As I tried to explain earlier, I thought I could maybe try to do the interpolation over the lattices, based on this:

argmax_c P(x | c) * P_int(c)
  = argmax_c P(x | c) * (k * P_w(c) + (1 - k) * P_c(c))
  = argmax_c [ k * P(x | c) * P_w(c) + (1 - k) * P(x | c) * P_c(c) ]

where P_w and P_c are the language models obtained from the word-level and character-level n-grams, respectively.
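The difference between the two combinations is easy to check numerically. A tiny sketch with made-up probabilities (function names are mine); note how the log-linear combination, which is what adding scaled costs computes, vetoes any hypothesis to which either model assigns zero probability -- exactly the OOV case:

```python
def linear_interp(p_w, p_c, k):
    # Linear interpolation: k * P_w(c) + (1 - k) * P_c(c).
    # Nonzero whenever either model gives the hypothesis some mass.
    return k * p_w + (1 - k) * p_c

def log_interp(p_w, p_c, k):
    # Log-linear combination: P_w(c)^k * P_c(c)^(1 - k),
    # i.e. a weighted sum of costs in the log domain.
    # Zero as soon as either model gives zero probability.
    return p_w ** k * p_c ** (1 - k)
```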

Since the lattices don't have any loops, at least I know that the determinization (in the log semiring) is possible (although I'm aware that it could blow up, due to the worst-case exponential cost).

I hope it is clearer now what I am trying to achieve.

Daniel Povey

Jan 21, 2016, 4:08:07 PM
to Joan Puigcerver, kaldi-help, Nagendra Goel
(For everyone's info, this is all in the context of handwriting recognition).
I think the best approach would be to have the core code of all of this not use FSTs, but be more conventional language-modeling code, which you would use DeterministicOnDemandFst to wrap in an FST like way for lattice rescoring based on composition.  (A similar interface to how we use ConstArpaLm).  You could generate the lattices using a simple character-level language model, and then rescore using a special-purpose program.
All of this requires a deep understanding of language modeling, and also some basic understanding of FSTs.  It won't be particularly easy.
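Roughly, the interface being described looks like this. A Python caricature of the idea behind Kaldi's C++ DeterministicOnDemandFst, backed here by a toy bigram LM (class and method names are illustrative, not the real API):

```python
import math

class OnDemandLmFst:
    """Deterministic on-demand FST view of a toy bigram LM.

    States are LM histories, created lazily; answering
    "from this state, which arc does this symbol take, and at what
    cost?" is the only operation composition-based rescoring needs.
    """
    def __init__(self, bigram_probs, bos="<s>", eos="</s>"):
        self.probs = bigram_probs  # {(prev_word, word): probability}
        self.bos, self.eos = bos, eos

    def start(self):
        # The start state is the beginning-of-sentence history.
        return self.bos

    def get_arc(self, state, label):
        # Returns (next_state, cost), or None if the LM gives the
        # transition zero probability. Next state is the new history.
        p = self.probs.get((state, label))
        if p is None:
            return None
        return label, -math.log(p)

    def final(self, state):
        # Final cost: the cost of ending the sentence from this history.
        p = self.probs.get((state, self.eos))
        return -math.log(p) if p else float("inf")
```

The real class would wrap arbitrary LM code (e.g. the interpolated word/character model) and be composed with the lattice during rescoring, so states are only expanded where the lattice actually has paths.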

Dan
