Hi there,
I'm currently working on aligning some utterances in WAV format with their corresponding text, aiming to obtain precise timestamps for each word/phoneme. All I have are pretrained chain models.
However, chain models are normally trained with a `frame_subsampling_factor` of 3, which, with a 10 ms frame shift, limits the timestamp resolution to 30 ms. To get 10 ms resolution, I set `frame_subsampling_factor` to 1 during alignment. Basically, it works: the alignment output looks similar to what I get with `frame_subsampling_factor` set to 3, and the timestamp resolution becomes 10 ms. But I'm not sure whether this simple workaround is a proper or recommended practice.
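To make the resolution point concrete, here is a minimal sketch of the arithmetic I have in mind. It assumes a 10 ms feature frame shift, and `segment_times_ms` is just a hypothetical helper of mine, not a Kaldi API; it only converts per-segment frame counts from an alignment into start/end times.

```python
def segment_times_ms(durations, frame_shift_ms=10, frame_subsampling_factor=3):
    """Convert per-segment frame counts into (start_ms, end_ms) pairs.

    durations: number of output frames each word/phone occupies.
    Each output frame spans frame_shift_ms * frame_subsampling_factor ms.
    """
    step_ms = frame_shift_ms * frame_subsampling_factor
    segments = []
    start = 0  # running frame index of the current segment start
    for n in durations:
        end = start + n
        segments.append((start * step_ms, end * step_ms))
        start = end
    return segments


# Subsampling factor 3: boundaries can only fall on multiples of 30 ms.
print(segment_times_ms([2, 3], frame_subsampling_factor=3))
# Subsampling factor 1: boundaries can fall on any multiple of 10 ms.
print(segment_times_ms([2, 3], frame_subsampling_factor=1))
```

So with a factor of 3 every boundary is a multiple of 30 ms, while with a factor of 1 boundaries land on the 10 ms grid.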
Any insights about this? Thanks in advance.