How can I get phoneme-level and word-level segmentation during online recognition

Zibo Meng

Nov 25, 2015, 4:38:43 PM
to kaldi-help
Dear all, 

I am currently using the pretrained fisher_english nnet2 model to do online decoding. Can I get phoneme-level segmentation?

Thanks

Daniel Povey

Nov 25, 2015, 4:46:15 PM
to kaldi-help
You can get it the same way you would with any decoder: from the lattice.
You can use lattice-best-path (with the appropriate acoustic scale) and pipe the 'alignments-wspecifier' output into ali-to-phones. ali-to-phones has various options to control the formatting.

Dan
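
As an illustration, here is a minimal sketch of that pipeline (the lattice archive, model path, and output filename below are placeholders; the acoustic scale should match the one used for decoding):

# Take the best path through the lattices, discard the word transcripts,
# and pipe the frame-level alignments (transition-ids) into ali-to-phones,
# which with --ctm-output writes per-phone start times and durations.
lattice-best-path --acoustic-scale=0.1 \
    "ark:gunzip -c exp/decode/lat.1.gz |" \
    ark:/dev/null ark:- | \
  ali-to-phones --ctm-output exp/decode/final.mdl ark:- phones.ctm

Note that the online decoder has to write its lattices to a real archive (rather than ark:/dev/null, as in the command later in this thread) for this to work. For word-level timings, the lattice can first be word-aligned with lattice-align-words.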



Zibo Meng

Nov 25, 2015, 6:32:14 PM
to kaldi-help, dpo...@gmail.com
Dan,

Thank you very much!

It really helped.

Best,

Zibo

Zibo Meng

Nov 25, 2015, 7:09:14 PM
to kaldi-help, dpo...@gmail.com
Dear Dan,

I have a follow-up question.

I tried to use the pretrained nnet2-online model trained on LibriSpeech from http://kaldi-asr.org/downloads/build/6/trunk/egs/librispeech/s5/exp/nnet2_online/, together with the graph from http://kaldi-asr.org/downloads/build/6/trunk/egs/librispeech/s5/exp/tri5b/graph_tgsmall/, using the following command:

online2-wav-nnet2-latgen-faster --do-endpointing=false --online=false \
    --config=nnet_a_online/conf/online_nnet2_decoding.conf \
    --max-active=7000 --beam=15.0 --lattice-beam=6.0 \
    --acoustic-scale=0.1 --word-symbol-table=graph/words.txt \
    nnet_a_online/final.mdl graph/HCLG.fst \
    "ark:echo utterance-id1 utterance-id1|" \
    "scp:echo utterance-id1 cut.wav|" \
    ark:/dev/null

However, I only get the result: utterance-id1 <UNK> <UNK>

Did I do something wrong?

Thanks,

Best,

Zibo




Daniel Povey

Nov 25, 2015, 7:12:45 PM
to Zibo Meng, kaldi-help
If you set the verbose level higher you can see the objective-function improvement from the iVector estimation. If it is very large (more than 10 or 20 or so), it generally indicates severe acoustic-environment mismatch, which can sometimes lead to very bad decoding results. There is a configuration variable in that program that can be used to downweight the silence in the iVector estimation; if it is set to something like 0.001 and you also supply the list of silence phones from lang/phones/silence.csl (or something like that), it can sometimes help improve robustness.
Dan
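
If memory serves, the options being referred to here are the iVector silence-weighting ones; a hypothetical invocation (the flag names and values below are assumptions, so verify them against the binary's --help output) would look something like:

# Assumed flags: downweight frames aligned to silence when estimating
# the iVector, taking the silence-phone list from lang/phones/silence.csl.
online2-wav-nnet2-latgen-faster --verbose=3 \
    --ivector-silence-weighting.silence-weight=0.001 \
    --ivector-silence-weighting.silence-phones=$(cat lang/phones/silence.csl) \
    ...   # remaining arguments as in the command above

Raising --verbose should also make the program log the iVector objective-function improvement mentioned above.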

Zibo Meng

Nov 25, 2015, 7:30:56 PM
to kaldi-help, mzbo...@gmail.com, dpo...@gmail.com
Dear Dan,

Thank you very much!

Would you please give the name of the configuration variable you mentioned?

Best,

Zibo