Alignment using chain models


Guanbo Wang

Mar 7, 2024, 5:48:37 AM
to kaldi-help
Hi there,

I'm currently working on aligning some utterances in WAV format with their corresponding text, aiming to obtain precise timestamps for each word/phoneme. All I have are pretrained chain models.

However, chain models typically use a `frame_subsampling_factor` of 3, which gives the timestamps a resolution of 30 ms. To keep the resolution at 10 ms, I set `frame_subsampling_factor` to 1 during alignment. This basically works: the alignment output looks similar to what I get with `frame_subsampling_factor` set to 3, and the timestamp resolution becomes 10 ms. But I'm not sure whether this simple workaround is a proper or recommended practice.
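
For reference, the resolution arithmetic can be sketched as below. This is only an illustration, not Kaldi code: the function name `frames_to_seconds`, the 10 ms input frame shift, and the example frame counts are all assumptions made up for the example.

```python
def frames_to_seconds(segments, frame_shift=0.01, frame_subsampling_factor=1):
    """Convert (label, start_frame, num_frames) tuples to (label, start_s, duration_s).

    frame_shift is the assumed feature frame shift in seconds (10 ms here);
    each alignment frame spans frame_shift * frame_subsampling_factor seconds.
    """
    step = frame_shift * frame_subsampling_factor  # seconds per alignment frame
    return [
        (label, round(start * step, 3), round(count * step, 3))
        for label, start, count in segments
    ]


# Hypothetical alignment of the same word at the two output frame rates.
coarse = frames_to_seconds([("hello", 4, 5)], frame_subsampling_factor=3)
fine = frames_to_seconds([("hello", 13, 14)], frame_subsampling_factor=1)
print(coarse)  # boundaries can only fall on multiples of 30 ms
print(fine)    # boundaries can fall on any multiple of 10 ms
```

With a factor of 3 every boundary is a multiple of 30 ms, while a factor of 1 allows boundaries on any 10 ms step, which is the difference in resolution described above.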

Any insights about this? Thanks in advance.

nshm...@gmail.com

Mar 8, 2024, 5:59:32 AM
to kaldi-help
Back in the early TTS days it was even popular to use 5 ms frames for better alignment. Often 10 ms is too coarse.