How can I get phoneme-level and word-level segmentation during online recognition

Zibo Meng

Nov 25, 2015, 4:38:43 PM
to kaldi-help
Dear all, 

I am currently using the pretrained fisher_english nnet2 model to do online decoding. Can I get phoneme-level segmentation?

Thanks

Daniel Povey

Nov 25, 2015, 4:46:15 PM
to kaldi-help
You can get it the same way you would with any decoder: from the lattice.
You can use lattice-best-path (with the appropriate acoustic scale) and pipe the 'alignments-wspecifier' output into ali-to-phones. ali-to-phones has various options to control the formatting.

Dan
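
As an illustration, here is a minimal sketch of that pipeline (the lattice archive, model path, and output filename below are placeholders; the acoustic scale should match the one used for decoding):

# Take the best path through the lattices, discard the word transcripts,
# and pipe the frame-level alignments (transition-ids) into ali-to-phones,
# which with --ctm-output writes per-phone start times and durations.
lattice-best-path --acoustic-scale=0.1 \
    "ark:gunzip -c exp/decode/lat.1.gz |" \
    ark:/dev/null ark:- | \
  ali-to-phones --ctm-output exp/decode/final.mdl ark:- phones.ctm

Note that the online decoder has to write its lattices to a real archive (rather than ark:/dev/null, as in the command later in this thread) for this to work. For word-level timings, the lattice can first be word-aligned with lattice-align-words.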



Zibo Meng

Nov 25, 2015, 6:32:14 PM
to kaldi-help, dpo...@gmail.com
Dan,

Thank you very much!

It really helped.

Best,

Zibo

Zibo Meng

Nov 25, 2015, 7:09:14 PM
to kaldi-help, dpo...@gmail.com
Dear Dan,

I have a follow-up question.

I tried to use the pretrained nnet2-online model trained on LibriSpeech from http://kaldi-asr.org/downloads/build/6/trunk/egs/librispeech/s5/exp/nnet2_online/, together with the graph from http://kaldi-asr.org/downloads/build/6/trunk/egs/librispeech/s5/exp/tri5b/graph_tgsmall/, using the following command:

online2-wav-nnet2-latgen-faster --do-endpointing=false --online=false \
    --config=nnet_a_online/conf/online_nnet2_decoding.conf \
    --max-active=7000 --beam=15.0 --lattice-beam=6.0 \
    --acoustic-scale=0.1 --word-symbol-table=graph/words.txt \
    nnet_a_online/final.mdl graph/HCLG.fst \
    "ark:echo utterance-id1 utterance-id1|" \
    "scp:echo utterance-id1 cut.wav|" \
    ark:/dev/null

However, I only get the result: utterance-id1 <UNK> <UNK>

Did I do something wrong?

Thanks,

Best,

Zibo




Daniel Povey

Nov 25, 2015, 7:12:45 PM
to Zibo Meng, kaldi-help
If you set the verbose level higher you can see the objective-function improvement from the iVector estimation. If it is very large (more than 10 or 20 or so), it generally indicates severe acoustic-environment mismatch, which can sometimes lead to very bad decoding results. There is a configuration variable in that program that can be used to downweight the silence in the iVector estimation; if it is set to something like 0.001 and you also supply the list of silence phones from lang/phones/silence.csl (or something like that), it can sometimes help improve robustness.
Dan
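
If memory serves, the options being referred to here are the iVector silence-weighting ones; a hypothetical invocation (the flag names and values below are assumptions, so verify them against the binary's --help output) would look something like:

# Assumed flags: downweight frames aligned to silence when estimating
# the iVector, taking the silence-phone list from lang/phones/silence.csl.
online2-wav-nnet2-latgen-faster --verbose=3 \
    --ivector-silence-weighting.silence-weight=0.001 \
    --ivector-silence-weighting.silence-phones=$(cat lang/phones/silence.csl) \
    ...   # remaining arguments as in the command above

Raising --verbose should also make the program log the iVector objective-function improvement mentioned above.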

Zibo Meng

Nov 25, 2015, 7:30:56 PM
to kaldi-help, mzbo...@gmail.com, dpo...@gmail.com
Dear Dan,

Thank you very much!

Would you please give the name of the configuration variable you mentioned?

Best,

Zibo