How does beam value in forced alignment affect the performance?


ih4...@gmail.com

Dec 14, 2015, 10:21:33 AM
to kaldi-help
I find that setting different beam values for "gmm-align-compiled" produces different WERs, and that performance does not change monotonically with the beam value.
My test is based on the "train_mono.sh" script, on a handwritten number recognition task similar to TI digits.
So, how does the beam value used in Viterbi training affect decoding performance? Is there a general rule for setting the beam value?

Daniel Povey

Dec 14, 2015, 2:34:00 PM
to kaldi-help
There will generally be an optimal beam, but the differences are very tiny, and in your case they may simply be due to random noise rather than real differences.
As you make the beam smaller, you lose more data to failed alignment (and/or make more alignment errors). As you make the beam larger, you end up aligning data that may have had a wrong transcript, so in the limit a larger beam can sometimes make things slightly worse. This is probably dataset-dependent (it depends on how many bad transcripts you have). In general we expect any differences to be quite tiny.
A larger beam is slower, too.
Also, I think gmm-align-compiled has two beams: one for the initial pass, and one for a second pass if the initial pass fails to reach the final state.

Dan
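
A minimal sketch of the two-beam idea mentioned above, in Python rather than Kaldi code. In Kaldi the two values correspond to the `--beam` and `--retry-beam` options of gmm-align-compiled. Everything else here (`align_with_retry`, `toy_align`, the threshold of 12) is a hypothetical stand-in, and the rule that a retry beam no larger than the first beam disables the second pass is an assumption of this sketch, not a statement about Kaldi's exact logic.

```python
from typing import Callable, List, Optional

Alignment = Optional[List[int]]  # None means the alignment failed


def align_with_retry(align: Callable[[float], Alignment],
                     beam: float, retry_beam: float) -> Alignment:
    """Run `align` with `beam`; on failure, try once more with `retry_beam`."""
    result = align(beam)
    if result is None and retry_beam > beam:
        # Second pass: a wider beam prunes less, so it is slower
        # but more likely to reach a final state.
        result = align(retry_beam)
    return result


def toy_align(beam: float) -> Alignment:
    """Hypothetical aligner that only succeeds once the beam is at least 12."""
    return [0, 0, 1] if beam >= 12.0 else None


print(align_with_retry(toy_align, beam=10.0, retry_beam=40.0))  # [0, 0, 1]
print(align_with_retry(toy_align, beam=10.0, retry_beam=0.0))   # None
```

The first call fails at beam 10 and succeeds on the retry at beam 40. The second call has no usable retry beam, so the utterance is simply reported as failed.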



ih4...@gmail.com

Dec 15, 2015, 4:39:43 AM
to kaldi-help, dpo...@gmail.com
Thanks for your response, Dan.

Why might the procedure end up with a wrong transcript when the beam value is large? In my understanding, the ground-truth transcription is used in forced alignment.

Jan Trmal

Dec 15, 2015, 10:41:45 AM
to kaldi-help, Dan Povey
Theoretically, that's correct. In practice, it's not uncommon for certain files to fail to align with their "official" transcripts. There can be several reasons (the acoustic model does not fit the data well, or the transcript is not right), but if you fight against that by increasing the beam, you will possibly get an alignment. The danger is that the boundaries (i.e. which frame belongs to which phone/state) will be incorrect.
Sometimes you see a bunch of files failing to align during the initial mono stages. That's completely OK unless it's a large portion of the training data: the monophone models are usually quite simple, so they do not account for all the variability in the data.
y.

ih4...@gmail.com

Dec 17, 2015, 9:09:03 AM
to kaldi-help, dpo...@gmail.com
Yenda, could you give more details about the "several reasons" why an observation sequence can fail to align with its ground-truth transcription? In my understanding, no matter how badly the acoustic model fits the data, there would always be a "best" alignment. As long as the number of observation frames is larger than the number of states chained together according to the HMMs, an alignment should exist. When would the failure happen? My understanding may not be accurate, so please correct me if there is any mistake.

Daniel Povey

Dec 17, 2015, 1:37:48 PM
to Qiang Guo, kaldi-help
It relates to pruned Viterbi search.  Not all states are active on any given frame, and if no final-states are active on the last frame, the alignment fails.
Dan
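
To make the pruning argument concrete, here is a toy beam-pruned Viterbi aligner in Python. It is only a sketch under simplifying assumptions and not Kaldi's decoder: the HMM is strictly left-to-right (self-loop or advance by one state), transition costs are ignored, and `forced_align` and the `costs` table are hypothetical names and numbers chosen for illustration. The example mimics a transcript with an extra unit: every frame fits state 0 well and state 1 badly.

```python
from typing import Dict, List, Optional, Tuple


def forced_align(costs: List[List[float]], beam: float) -> Optional[List[int]]:
    """Beam-pruned Viterbi over a left-to-right HMM.

    costs[t][s] is the cost (negative log-likelihood) of frame t in state s.
    Returns one state per frame, or None if the final state is not active
    after the last frame.
    """
    num_states = len(costs[0])
    # Active tokens: state -> (accumulated cost, best path so far).
    tokens: Dict[int, Tuple[float, List[int]]] = {0: (costs[0][0], [0])}
    for frame_costs in costs[1:]:
        new_tokens: Dict[int, Tuple[float, List[int]]] = {}
        for state, (cost, path) in tokens.items():
            for nxt in (state, state + 1):  # self-loop or advance
                if nxt >= num_states:
                    continue
                cand = cost + frame_costs[nxt]
                if nxt not in new_tokens or cand < new_tokens[nxt][0]:
                    new_tokens[nxt] = (cand, path + [nxt])
        # Pruning: drop every token more than `beam` worse than the best one.
        best = min(cost for cost, _ in new_tokens.values())
        tokens = {s: tok for s, tok in new_tokens.items()
                  if tok[0] <= best + beam}
    final = tokens.get(num_states - 1)
    return None if final is None else final[1]


# Three frames, two states; all frames look like state 0.
costs = [[1.0, 10.0], [1.0, 10.0], [1.0, 10.0]]
print(forced_align(costs, beam=5.0))   # None
print(forced_align(costs, beam=20.0))  # [0, 0, 1]
```

With the narrow beam, the token in the final state is always pruned, so no final state is active on the last frame and the alignment fails even though a valid path exists. With the wide beam an alignment comes back, but the frame assigned to state 1 fits it poorly, which is the boundary problem described earlier in the thread.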

ih4...@gmail.com

Dec 17, 2015, 10:43:20 PM
to kaldi-help, ih4...@gmail.com, dpo...@gmail.com
Thanks, Dan, now I totally understand. :)