REAL-TIME ONE-PASS DECODING WITH RECURRENT NEURAL NETWORK LANGUAGE MODEL


Filip Jurcicek

Feb 17, 2017, 8:28:55 AM
to kaldi-help
Hi, 

I just read the paper "REAL-TIME ONE-PASS DECODING WITH RECURRENT NEURAL NETWORK LANGUAGE MODEL FOR SPEECH RECOGNITION" (https://pdfs.semanticscholar.org/8ad4/4f5161ad04c71fe052582168bd7a45217d36.pdf), which uses on-the-fly hypothesis rescoring. My understanding is that this is a form of on-the-fly composition.

Does anyone have experience with using an RNNLM in first-pass decoding? Is it worth implementing the ideas mentioned in the paper?

Or is it viable to use Kaldi's standard lattice RNNLM rescoring in an online recognition setting instead?

Best, 
Filip

Daniel Povey

Feb 17, 2017, 10:27:14 PM
to kaldi-help
I think it would end up being more efficient, in terms of total time taken, to rescore the lattices after generating them. This is what we found, for instance, in decoding with small/big ARPA-type LMs (see the "biglm" decoding, which we don't really use because it's slow).

If you really cared about latency, however, it might be better to do the RNNLM rescoring as you go.  I think the best approach would be a "biglm" type of scenario where the decoding graph (HCLG) is composed on-the-fly with a graph that adds the difference between the regular LM score and the RNNLM score; you'd have to identify histories as belonging to the same state once they reached a certain n-gram order (e.g. 4-gram) to stop the state space from blowing up (this is a standard approximation in RNNLM rescoring).

So the framework would be that the decoding FST is generated by on-the-fly composition, as in the "biglm" decoder, and the actual decoder code just sees an Fst<StdArc>, and doesn't even care about the fact that there is on-the-fly composition going on.  With chain models and lower-frame-rate models, decoding is well below real-time so the slowdown from the on-the-fly composition should be acceptable.
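To make this concrete, below is a rough sketch (not code that exists in Kaldi; RnnlmInterface and RnnlmDifferenceFst are just placeholders for whatever RNNLM implementation ends up being used) of what such a "difference LM" could look like on top of the fst::DeterministicOnDemandFst interface.  The state is a word history truncated to a fixed length, and each arc's weight is the RNNLM cost minus the n-gram cost for that word; the n-gram costs are looked up by walking a backoff n-gram on-demand FST, which is only approximately right.

// Rough sketch only -- not existing Kaldi code.  An on-demand FST whose arcs
// carry (RNNLM cost - n-gram cost), intended for on-the-fly composition with
// HCLG in a "biglm"-style setup.  States are word histories truncated to
// max_history words, which is what stops the state space from blowing up.
#include <map>
#include <vector>
#include "fstext/deterministic-fst.h"  // for fst::DeterministicOnDemandFst

namespace fst {

// Placeholder for whatever RNNLM implementation gets used; not a real class.
class RnnlmInterface {
 public:
  // Returns -log P(word | history) under the RNNLM.
  virtual float Cost(const std::vector<int> &history, int word) = 0;
  virtual ~RnnlmInterface() { }
};

class RnnlmDifferenceFst: public DeterministicOnDemandFst<StdArc> {
 public:
  typedef StdArc::StateId StateId;
  typedef StdArc::Weight Weight;
  typedef StdArc::Label Label;

  // 'ngram_lm' would typically be a BackoffDeterministicOnDemandFst built
  // from the G.fst that was compiled into HCLG; max_history = 3 gives
  // 4-gram-like state merging.
  RnnlmDifferenceFst(RnnlmInterface *rnnlm,
                     DeterministicOnDemandFst<StdArc> *ngram_lm,
                     int max_history):
      rnnlm_(rnnlm), ngram_lm_(ngram_lm), max_history_(max_history) {
    start_ = HistoryToState(std::vector<int>());
  }

  virtual StateId Start() { return start_; }

  virtual Weight Final(StateId s) {
    // End-of-sentence cost (RNNLM minus n-gram) would go here; omitted.
    return Weight::One();
  }

  virtual bool GetArc(StateId s, Label word, StdArc *oarc) {
    // Copy, because histories_ may be extended by HistoryToState() below.
    std::vector<int> history(histories_[s]);
    float rnn_cost = rnnlm_->Cost(history, word);
    float ngram_cost = NgramCost(history, word);
    history.push_back(word);
    if (static_cast<int>(history.size()) > max_history_)
      history.erase(history.begin());  // truncation merges the histories.
    oarc->ilabel = word;
    oarc->olabel = word;
    oarc->weight = Weight(rnn_cost - ngram_cost);  // the LM-score difference.
    oarc->nextstate = HistoryToState(history);
    return true;
  }

 private:
  // Maps a truncated history to a dense state id, allocating a new one if
  // this history has not been seen before.
  StateId HistoryToState(const std::vector<int> &h) {
    std::map<std::vector<int>, StateId>::iterator it = history_to_state_.find(h);
    if (it != history_to_state_.end()) return it->second;
    StateId s = static_cast<StateId>(histories_.size());
    histories_.push_back(h);
    history_to_state_[h] = s;
    return s;
  }

  // Approximate cost of 'word' under the n-gram LM, obtained by walking the
  // backoff n-gram FST from its start state through the history words.
  float NgramCost(const std::vector<int> &history, Label word) {
    StateId s = ngram_lm_->Start();
    StdArc arc;
    for (size_t i = 0; i < history.size(); i++)
      if (ngram_lm_->GetArc(s, history[i], &arc)) s = arc.nextstate;
    return ngram_lm_->GetArc(s, word, &arc) ? arc.weight.Value() : 0.0f;
  }

  RnnlmInterface *rnnlm_;
  DeterministicOnDemandFst<StdArc> *ngram_lm_;
  int max_history_;
  std::vector<std::vector<int> > histories_;
  std::map<std::vector<int>, StateId> history_to_state_;
  StateId start_;
};

}  // namespace fst

In the biglm decoder the composition with HCLG happens inside the decoder itself, so only the (state, word) pairs that the decoder actually explores ever get queried; the same would apply here.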

Something that has been holding us back is uncertainty about which RNNLM code or interface to use.  We're leaning towards rolling our own, based on the nnet3 framework, to avoid having to interface with too much external code (it seems that RNNLM toolkits are very prone to being developed quickly and then not maintained much).  Anyway, I'm hoping that within a 6-month timeframe we should be able to check in something like this.


Dan



Filip Jurcicek

Feb 27, 2017, 10:55:26 AM
to kaldi-help, dpo...@gmail.com
Hi Dan,

thank you for the information. 

I looked at gmm-decode-biglm-faster (e.g. https://github.com/kaldi-asr/kaldi/blob/85a3dd5f0b71e419abf1169a26b759bfc423a543/src/gmmbin/gmm-decode-biglm-faster.cc#L168), BiglmFasterDecoder (https://github.com/kaldi-asr/kaldi/blob/master/src/decoder/biglm-faster-decoder.h#L51), and *DeterministicOnDemandFst (https://github.com/kaldi-asr/kaldi/blob/85a3dd5f0b71e419abf1169a26b759bfc423a543/src/fstext/deterministic-fst.h#L100).

I see that one option is to extend LatticeFasterOnlineDecoder (https://github.com/kaldi-asr/kaldi/blob/85a3dd5f0b71e419abf1169a26b759bfc423a543/src/decoder/lattice-faster-online-decoder.h#L47) so that it uses an fst::DeterministicOnDemandFst<fst::StdArc> (lm_diff_fst) in a similar way to BiglmFasterDecoder. Then I would implement something like fst::RNNDeterministicOnDemandFst<StdArc> using some RNN language model, e.g. one based on nnet3. And, as you suggest, I would limit the word history to prevent the state space from growing too large. This would have the advantage that I can get scores for any N-gram history while keeping the RNN LM relatively small in memory, though it may be computationally expensive, depending on the size of the RNN LM and the number of active states (word histories). I could limit the computational cost by using some form of caching, as described in https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/rnnFirstPass.pdf
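To be concrete about the caching, what I have in mind is essentially memoizing GetArc() queries, so that repeated queries for the same (state, word) pair coming from different active tokens do not redo the RNN computation. A rough, hypothetical sketch (the class name is mine; this is not existing Kaldi code), wrapping whatever DeterministicOnDemandFst computes the RNNLM difference scores:

// Hypothetical caching wrapper (not existing Kaldi code) around whatever
// DeterministicOnDemandFst computes the RNNLM difference scores; it memoizes
// GetArc() results per (state, word) so the RNN is queried at most once for
// each pair.
#include <cstddef>
#include <unordered_map>
#include <utility>
#include "fstext/deterministic-fst.h"

namespace fst {

class CachingDeterministicOnDemandFst: public DeterministicOnDemandFst<StdArc> {
 public:
  typedef StdArc::StateId StateId;
  typedef StdArc::Weight Weight;
  typedef StdArc::Label Label;

  explicit CachingDeterministicOnDemandFst(
      DeterministicOnDemandFst<StdArc> *inner): inner_(inner) { }

  virtual StateId Start() { return inner_->Start(); }
  virtual Weight Final(StateId s) { return inner_->Final(s); }

  virtual bool GetArc(StateId s, Label word, StdArc *oarc) {
    std::pair<StateId, Label> key(s, word);
    CacheType::iterator it = cache_.find(key);
    if (it == cache_.end()) {
      // Not cached yet: query the wrapped FST (this is where the expensive
      // RNN computation would happen) and remember the result.
      StdArc arc(0, 0, Weight::Zero(), 0);
      bool ans = inner_->GetArc(s, word, &arc);
      it = cache_.insert(std::make_pair(key, std::make_pair(arc, ans))).first;
    }
    if (!it->second.second) return false;  // cached "no such arc".
    *oarc = it->second.first;
    return true;
  }

 private:
  struct PairHash {
    std::size_t operator()(const std::pair<StateId, Label> &p) const {
      return static_cast<std::size_t>(p.first) * 7853 +
          static_cast<std::size_t>(p.second);
    }
  };
  typedef std::unordered_map<std::pair<StateId, Label>,
                             std::pair<StdArc, bool>, PairHash> CacheType;

  DeterministicOnDemandFst<StdArc> *inner_;
  CacheType cache_;
};

}  // namespace fst

Caching the RNN hidden state per word history, so that scoring the next word needs only a single forward step, would be a further saving, but that is internal to the RNNLM itself and orthogonal to this wrapper.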

However, given that the RNN LM would be approximated with a limited N-gram history (e.g. 4-grams) anyway, I could instead approximate the RNN LM by an N-gram LM offline and extend only LatticeFasterOnlineDecoder to work with lm_diff_fst. My understanding is that this approximation would have to result in a very large LM: the approximated LM must cover all possible (reasonable) N-gram histories, since they are generated offline rather than online as in the approach described in the paragraph above. In this case I would save time on RNN computations, but it would require more memory.

Do you have any intuition about which is the better solution? Do you believe that using the RNN LM directly during decoding (the first approach) would result in (significantly) better WER?

All the best, 
Filip

Daniel Povey

Feb 27, 2017, 1:48:57 PM
to Filip Jurcicek, kaldi-help

> I looked at gmm-decode-biglm-faster (e.g. https://github.com/kaldi-asr/kaldi/blob/85a3dd5f0b71e419abf1169a26b759bfc423a543/src/gmmbin/gmm-decode-biglm-faster.cc#L168), BiglmFasterDecoder (https://github.com/kaldi-asr/kaldi/blob/master/src/decoder/biglm-faster-decoder.h#L51), and *DeterministicOnDemandFst (https://github.com/kaldi-asr/kaldi/blob/85a3dd5f0b71e419abf1169a26b759bfc423a543/src/fstext/deterministic-fst.h#L100).

> I see that one option is to extend LatticeFasterOnlineDecoder (https://github.com/kaldi-asr/kaldi/blob/85a3dd5f0b71e419abf1169a26b759bfc423a543/src/decoder/lattice-faster-online-decoder.h#L47) so that it uses an fst::DeterministicOnDemandFst<fst::StdArc> (lm_diff_fst) in a similar way to BiglmFasterDecoder. Then I would implement something like fst::RNNDeterministicOnDemandFst<StdArc> using some RNN language model, e.g. one based on nnet3.

Yes, although the nnet3-based RNNLMs are not ready (however, Hainan has been making progress).
 
> And, as you suggest, I would limit the word history to prevent the state space from growing too large. This would have the advantage that I can get scores for any N-gram history while keeping the RNN LM relatively small in memory, though it may be computationally expensive, depending on the size of the RNN LM and the number of active states (word histories). I could limit the computational cost by using some form of caching, as described in https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/rnnFirstPass.pdf

> However, given that the RNN LM would be approximated with a limited N-gram history (e.g. 4-grams) anyway, I could instead approximate the RNN LM by an N-gram LM offline and extend only LatticeFasterOnlineDecoder to work with lm_diff_fst. My understanding is that this approximation would have to result in a very large LM: the approximated LM must cover all possible (reasonable) N-gram histories, since they are generated offline rather than online as in the approach described in the paragraph above. In this case I would save time on RNN computations, but it would require more memory.


They tried this at Microsoft (I think Geoff Zweig was involved).  It does not work very well.  The difference from what you get when you limit the state space by mapping, say, everything with the same 4-gram history to the same state is that, with the latter, even though you might not get the correct history from more than 4 words ago, it's probably *almost* correct.  When you completely throw away the longer history, you get a bigger degradation and you lose most of the benefit of using an RNNLM in the first place.

Dan

Filip Jurcicek

Feb 28, 2017, 2:37:39 AM
to kaldi-help, filip.j...@gmail.com, dpo...@gmail.com


On Monday, 27 February 2017 19:48:57 UTC+1, Dan Povey wrote:


>> I looked at gmm-decode-biglm-faster (e.g. https://github.com/kaldi-asr/kaldi/blob/85a3dd5f0b71e419abf1169a26b759bfc423a543/src/gmmbin/gmm-decode-biglm-faster.cc#L168), BiglmFasterDecoder (https://github.com/kaldi-asr/kaldi/blob/master/src/decoder/biglm-faster-decoder.h#L51), and *DeterministicOnDemandFst (https://github.com/kaldi-asr/kaldi/blob/85a3dd5f0b71e419abf1169a26b759bfc423a543/src/fstext/deterministic-fst.h#L100).

>> I see that one option is to extend LatticeFasterOnlineDecoder (https://github.com/kaldi-asr/kaldi/blob/85a3dd5f0b71e419abf1169a26b759bfc423a543/src/decoder/lattice-faster-online-decoder.h#L47) so that it uses an fst::DeterministicOnDemandFst<fst::StdArc> (lm_diff_fst) in a similar way to BiglmFasterDecoder. Then I would implement something like fst::RNNDeterministicOnDemandFst<StdArc> using some RNN language model, e.g. one based on nnet3.

> Yes, although the nnet3-based RNNLMs are not ready (however, Hainan has been making progress).
 
Yes, I may use the nnet3-based RNNLMs when they are ready, or something else if I get to work on this earlier than expected. Thanks for the note!

 
>> And, as you suggest, I would limit the word history to prevent the state space from growing too large. This would have the advantage that I can get scores for any N-gram history while keeping the RNN LM relatively small in memory, though it may be computationally expensive, depending on the size of the RNN LM and the number of active states (word histories). I could limit the computational cost by using some form of caching, as described in https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/rnnFirstPass.pdf

>> However, given that the RNN LM would be approximated with a limited N-gram history (e.g. 4-grams) anyway, I could instead approximate the RNN LM by an N-gram LM offline and extend only LatticeFasterOnlineDecoder to work with lm_diff_fst. My understanding is that this approximation would have to result in a very large LM: the approximated LM must cover all possible (reasonable) N-gram histories, since they are generated offline rather than online as in the approach described in the paragraph above. In this case I would save time on RNN computations, but it would require more memory.


> They tried this at Microsoft (I think Geoff Zweig was involved).  It does not work very well.  The difference from what you get when you limit the state space by mapping, say, everything with the same 4-gram history to the same state is that, with the latter, even though you might not get the correct history from more than 4 words ago, it's probably *almost* correct.  When you completely throw away the longer history, you get a bigger degradation and you lose most of the benefit of using an RNNLM in the first place.

I see. Thanks! 

All the best,
Filip
 

> Dan