I'm building an ASR system for a low-resource Semitic language. I have very limited data, 2.5 hours in total, of which I used 2 hours for training and 0.5 hours for testing.
Monophone training went okay; I'm getting around 50% WER on my test set.
But when I move to triphone training, the WER goes up to 70%. I'm not sure whether this is really overfitting to my limited training data or whether I'm missing something obvious.
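For completeness, this is roughly how I score (a minimal sketch of word-level edit distance; my real numbers come from my toolkit's scoring script, so treat this as illustration only):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: (subs + ins + dels) / len(ref), via Levenshtein distance."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # deleting i words
    for j in range(len(h) + 1):
        d[0][j] = j  # inserting j words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("a b c d", "a x c"))  # one substitution + one deletion over 4 ref words -> 0.5
```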
Here are some more details about my setup:
My lexicon is 5k words, and I have 30 phones in total, of which 6 are vowels (excluding the sil and spn phones).
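For context, here's a back-of-the-envelope count of the triphone space (rough arithmetic; the actual number of tied states depends on the decision-tree questions, so this is only an upper bound):

```python
phones = 30                       # my phone inventory, excluding sil/spn
logical_triphones = phones ** 3   # every left-context / phone / right-context combination
hmm_states = logical_triphones * 3  # assuming the usual 3 emitting states per triphone

print(logical_triphones, hmm_states)  # 27000 logical triphones, 81000 states before tying
```

With only ~2 hours of audio, the vast majority of those 27k logical triphones never occur in training, which is why I assumed the number of leaves (tied states) is the critical knob here.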
I built a 3-gram LM (with Kneser-Ney discounting) from the training set's transcripts. I know this corpus is far too small for an LM, but I couldn't find much other written material in this language to estimate the LM from.
PPL on the training set is around 50; on the test set it is 400. (I know I should build a better LM, but I don't think this is the cause of my poor triphone performance relative to monophone.)
My data is segmented and I have silence in boundaries.
During triphone training, aligning data looks fine.
I tried lowering number of leaves and total number of Gaussian components, they did not help much, also tried lowering number of training iterations (in case of over fitting), no luck.
In monophone training, avg likelihood goes from around -105 to -86
In triphone training, from around -90 to -82
So my question is: what could be wrong in my setup? Is it simply over-fitting to my limited training data?
Should I even try more with this setup or I should focus on getting more data, at least 10h?