Alignment using chain models

Guanbo Wang

Mar 7, 2024, 5:48:37 AM

to kaldi-help

Hi there,

I'm currently working on aligning some utterances in WAV format with their corresponding text, aiming to obtain precise timestamps for each word/phoneme. All I have are pretrained chain models.

However, chain models typically have a `frame_subsampling_factor` of 3, which makes the timestamp resolution 30 ms. To keep the resolution at 10 ms, I set the `frame_subsampling_factor` to 1 during alignment. Basically, it works: the alignment output looks similar to what I get with `frame_subsampling_factor` set to 3, and the timestamp resolution becomes 10 ms. But I'm not sure whether this simple workaround is a proper or recommended practice.
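To make the resolution arithmetic concrete, here is a minimal sketch of how an alignment frame index maps to a time. It assumes a 10 ms feature frame shift; the function names are hypothetical helpers for illustration, not part of Kaldi, and the sketch says nothing about whether running the model with a factor of 1 is sound.

```python
def time_resolution_ms(frame_shift_ms=10.0, frame_subsampling_factor=3):
    """Time step between consecutive alignment frames, in milliseconds."""
    return frame_shift_ms * frame_subsampling_factor


def frame_to_seconds(frame_index, frame_shift_ms=10.0, frame_subsampling_factor=3):
    """Start time, in seconds, of the given alignment frame (0-based)."""
    step_ms = time_resolution_ms(frame_shift_ms, frame_subsampling_factor)
    return frame_index * step_ms / 1000.0


def segments_to_times(segments, frame_shift_ms=10.0, frame_subsampling_factor=3):
    """Convert (label, start_frame, num_frames) tuples, counted in alignment
    frames, to (label, start_seconds, duration_seconds) tuples."""
    step_s = time_resolution_ms(frame_shift_ms, frame_subsampling_factor) / 1000.0
    return [(label, start * step_s, length * step_s)
            for label, start, length in segments]


if __name__ == "__main__":
    # Assumed 10 ms frame shift: factor 3 gives 30 ms steps, factor 1 gives 10 ms.
    for factor in (3, 1):
        print(factor, time_resolution_ms(10.0, factor), "ms per alignment frame")
```

With a factor of 3, every boundary lands on a multiple of 30 ms; with a factor of 1, boundaries can fall on any multiple of the frame shift.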

Any insights about this? Thanks in advance.

nshm...@gmail.com

Mar 8, 2024, 5:59:32 AM

to kaldi-help

Back in the early TTS days, it was even popular to use 5 ms frames for better alignment. Often 10 ms is too coarse.
