Problems with training an HMM-GMM model for whistled speech recognition.

Maciej Jakubiak

Jun 12, 2022, 8:48:00 PM
to kaldi-help
Hello,

I am writing a thesis on speech recognition of whistled speech (a rare type of speech that replaces the regular voice with whistling). I’m using a standard recipe for building an HMM-GMM model from the examples; it goes from monophone training, through several phases of triphone training, to SGMM training. I changed the MFCC parameters to fit whistled speech, and extracted both MFCCs and pitch during feature extraction, using the script written for that. Apart from that, I didn’t change much.
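For reference, the extraction step is roughly the following (the job count and directory names are specific to my setup; steps/make_mfcc_pitch.sh is the standard Kaldi script that computes MFCCs and appends pitch features):

  # compute MFCC + pitch features for the training set
  steps/make_mfcc_pitch.sh --cmd "$train_cmd" --nj 4 \
    data/train exp/make_mfcc/train mfcc
  # then per-speaker CMVN statistics on the combined features
  steps/compute_cmvn_stats.sh data/train exp/make_mfcc/train mfcc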

I used that recipe on data in Silbo Gomero, the whistled form of Spanish. I also ran the unmodified recipe on the same amount of data in spoken Spanish, in order to compare the results. I used the same language model in both cases, since both are Spanish.

Both cases return very similar WER and CER after monophone training; however, after the triphone and SGMM training phases, the Spanish model gets more accurate, while the Silbo model loses accuracy. I’m a better linguist than I am a programmer, but from what I’ve gathered, this might mean that phoneme alignment is difficult for my Silbo model; I am not sure how to check that, however.
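The closest I’ve gotten to checking is dumping the alignments so I can at least look at them (assuming exp/tri1 and data/lang are the usual recipe directories), though I don’t know what a "bad" alignment is supposed to look like:

  # phone-level alignments in CTM format (utterance, channel, start, duration, phone)
  ali-to-phones --ctm-output exp/tri1/final.mdl \
    "ark:gunzip -c exp/tri1/ali.*.gz|" - | head
  # or a human-readable per-frame view
  show-alignments data/lang/phones.txt exp/tri1/final.mdl \
    "ark:gunzip -c exp/tri1/ali.1.gz|" | less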

I haven’t been able to find anyone with a similar issue online. My problem is that this could be either an inherent property of Kaldi applied to whistled speech data, or just a bug I’ve been unable to fix; I’d rather not mistake a bug for a discovery, obviously.

If anyone has encountered a situation where a monophone model works well but none of the later models do, or has an idea what might be causing that, please let me know. Also, please let me know if any additional information from me would help.

Thank you for your help,
Maciej Jakubiak

Maciej Jakubiak

Jun 15, 2022, 10:18:00 PM
to kaldi-help
I managed to solve this issue myself. Here's the solution, in case anyone comes here in the future searching for help, or is simply curious:

Whistled speech consonants are, by their nature, a change in pitch over time, and the default MFCC settings in Kaldi are not great at capturing that. I found that setting frame_length to 100 milliseconds and frame_shift to 40 milliseconds gives a good result; larger frames make it easier to detect change over a longer time span. Tests run with these settings improved with triphone training, and gave results comparable to tests run on spoken Spanish with the default settings and the same amount of data.
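Concretely, the relevant part of my conf/mfcc.conf ended up looking something like this (the remaining options are whatever the recipe already used):

  # conf/mfcc.conf -- wider analysis window for whistled speech
  --use-energy=false   # standard in most recipes
  --frame-length=100   # window size in ms (Kaldi's default is 25)
  --frame-shift=40     # hop size in ms (Kaldi's default is 10)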

arlo...@gmail.com

May 16, 2023, 3:35:30 AM
to kaldi-help
This is a fascinating task -- I'm curious to know how your thesis turned out!

My initial thoughts:

1. You might not want to use MFCC for a language that is so highly characterized by pitch.  MFCC was designed to largely discard pitch information in speech, focusing instead on the spectral envelope; that tends to work particularly well for a non-tonal language such as English.  I would expect that you might do better with explicit pitch features, or perhaps a high-resolution log filterbank.
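For instance (untested, and the bin count is just a guess), a filterbank-plus-pitch front end in Kaldi would look something like:

  # conf/fbank.conf -- high-resolution log mel filterbank
  --num-mel-bins=80    # default is 23; more bins give finer spectral detail

  # in the run script, instead of the MFCC+pitch extraction:
  steps/make_fbank_pitch.sh --cmd "$train_cmd" --nj 4 \
    data/train exp/make_fbank/train fbank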

2.  How much data do you have?  I would expect that for such a minority language it would be considered extremely low-resource w.r.t. supervised training data.  In that sense, you might well expect that a simple monophone model could outperform more complex models with more context and parameters.

3. Increasing the frame length and frame shift could help, as you noted empirically.  I think the rationale might be that bigger frames provide greater frequency resolution, while longer frame shift helps to model long-term dynamics.  The latter could also be modeled with delta features or spliced features -- but that increases the feature dimensionality which probably isn't a good idea if you have a very limited amount of data.
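To make the dimensionality point concrete (paths assume the usual data/train layout; feat-to-dim just prints the feature dimension):

  # deltas + delta-deltas: 13-dim MFCC -> 39 dims
  add-deltas --delta-order=2 scp:data/train/feats.scp ark:- | feat-to-dim ark:- -
  # splicing +/-3 frames: 13 dims -> 91 dims
  splice-feats --left-context=3 --right-context=3 scp:data/train/feats.scp ark:- | feat-to-dim ark:- -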

[Sorry this reply is a year late!]