Hello,
I am writing a thesis on speech recognition of whistled speech (a rare type of speech that replaces the regular voice with whistling). I’m using a standard example recipe for building an HMM-GMM model; it goes from monophone training, through several phases of triphone training, to SGMM training. I changed the MFCC parameters to fit whistled speech and extracted both MFCC and pitch during feature extraction, using the standard script provided for that. Apart from that, I didn’t change much.
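For context, my feature setup looks roughly like the following; the frequency cutoffs below are placeholders that only illustrate the kind of change I made, not my exact values:

  # conf/mfcc.conf -- options passed to compute-mfcc-feats
  --use-energy=false
  --sample-frequency=16000
  --low-freq=1000    # placeholder: raised well above the default 20 Hz
  --high-freq=4000   # placeholder: the whistle energy sits in a fairly narrow band
  --num-mel-bins=23

  # MFCC + pitch extraction
  steps/make_mfcc_pitch.sh --nj 8 --cmd "$train_cmd" \
    --mfcc-config conf/mfcc.conf --pitch-config conf/pitch.conf \
    data/train exp/make_mfcc_pitch/train mfcc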
I used that recipe on data in Silbo Gomero, the whistled form of Spanish, and ran the unmodified recipe on the same amount of spoken Spanish data to compare the results. I used the same language model in both cases, since both are Spanish.
Both cases give very similar WER and CER after monophone training; however, after the triphone and SGMM training phases, the Spanish model gets more accurate, while the Silbo model loses accuracy. I’m a better linguist than I am a programmer, but from what I’ve gathered, this might mean that phoneme alignment is difficult for my Silbo model; I’m not sure how to check that, however.
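From what I’ve read, one way to check this would be to dump the alignments in readable form and compare them between the two setups, with something like the commands below (the paths are my guesses for the first triphone stage; please correct me if this is not the right approach):

  # human-readable frame-level alignments from the tri1 stage
  gunzip -c exp/tri1/ali.1.gz | \
    show-alignments data/lang/phones.txt exp/tri1/final.mdl ark:- | head

  # or phone-level CTM with start times and durations
  ali-to-phones --ctm-output exp/tri1/final.mdl \
    "ark:gunzip -c exp/tri1/ali.1.gz|" - | head

If the phone boundaries in the Silbo alignments look arbitrary compared to the spoken-Spanish ones, would that confirm the alignment problem?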
I haven’t been able to find anyone with a similar issue online. My problem is that this could be either an inherent property of Kaldi applied to whistled speech data, or just a bug I’ve been unable to find; I’d rather not mistake a bug for a discovery, obviously.
If anyone has encountered a situation where a monophone model works well but the later models do not, or has an idea of what might be causing this, please let me know. I’ll gladly provide any additional information that might help.
Thank you for your help,
Maciej Jakubiak