Building a Kaldi model dedicated to spelling words letter by letter


Iztok Lebar Bajec

May 7, 2021, 2:44:20 PM
to kaldi-help
I have googled for this but I am unable to find any useful advice, so I would like to ask if anyone has ever attempted something like this and, if so, what approach they took and how successful they were. If not, what would the best approach be? Could a tdnn-f nnet3 model be suitable for something like this at all?

Thank you for the help,
Iztok

nshm...@gmail.com

May 7, 2021, 6:36:23 PM
to kaldi-help
There is not much difference between recognizing letters and recognizing generic speech. The more data you have for training, the better your accuracy. A tdnn-f nnet3 model is ok.

Letters are hard to recognize on their own because they are short. The hardest is the "e" set: b/p/d/t/e. Accuracy is not going to be high, and it is even worse in noise.

You are better off designing the whole system to avoid recognizing individual letters, or expanding them into names/radio words.

Iztok Lebar Bajec

May 8, 2021, 9:06:38 AM
to kaldi-help
I understand that some letters are hard, and that it is better to avoid a system based on individual letter recognition, but in certain scenarios this is unavoidable. One example is a dictation system, where an ASR capable of recognising individual letters can help the user fill in unrecognised words, i.e. words that are not part of the initial lexicon. I am also aware that using names/radio words (the so-called NATO alphabet) helps, as it provides a larger phonetic context, but Nuance's Dragon dictation ASR, for example, does not seem to have this requirement, and it still seems quite successful at recognising letters on an individual letter level.

I am working with a language where the graphemic and phonetic forms are quite similar, meaning that individual letters pronounced as part of a word are mostly represented by the same phoneme as when pronounced on their own (as a single letter). I have rather good results with low WER for generic speech, but I am having difficulty with spelling letter by letter.

I guess the questions I am left with start at: Is it really just the amount of data? What would be a minimal amount of data? And, regardless of the language, is it better to have separate phones dedicated just to spelling?

Any help will be appreciated,
Iztok

david.e....@gmail.com

May 8, 2021, 10:28:30 AM
to kaldi-help
Hey, I think, if I understand your mission correctly, that this would be quite easy to achieve, especially if you have a phonemic language, as you mention. What I would suggest is to train an acoustic model using the graphemes (letters) of the language. That can be done by altering the pronunciations in the lexicon so that each entry is just the word's letters expanded. E.g.

word -> w o r d 
bubblegum -> b u b b l e g u m
...
and so forth.

You can set all the individual letters as the non-silence phones in the dictionary. Other files in the dict would, I think, remain just the same.
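The two steps above (a grapheme lexicon plus a non-silence phone list) can be sketched as a small script. This is a minimal illustration, not part of the original recipe: the word list and file names follow the standard Kaldi `data/local/dict` layout, but the paths and the in-memory word list are assumptions you would adapt to your own setup.

```python
# Sketch: build a grapheme lexicon where each word's "pronunciation"
# is just its letters, and collect those letters as non-silence phones.
# The word list here is a stand-in; in practice read it from your corpus.

def grapheme_pron(word):
    """Expand a word into space-separated letters, e.g. 'word' -> 'w o r d'."""
    return " ".join(word)

words = ["word", "bubblegum"]

# lexicon.txt: one "word  pronunciation" entry per line.
with open("lexicon.txt", "w", encoding="utf-8") as lex:
    for w in sorted(set(words)):
        lex.write(f"{w} {grapheme_pron(w)}\n")

# nonsilence_phones.txt: every letter that occurs becomes a phone.
phones = sorted({ch for w in words for ch in w})
with open("nonsilence_phones.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(phones) + "\n")
```

The remaining dict files (silence phones, optional silence) are unaffected by this change, as noted above.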

You can think of the G-fst as just a heavily subword-tokenized language model, down to each individual character. Some research has been done on this for Kaldi, most notably by Peter Smit at Aalto University. You would need to alter the L-fst to handle subword-tokenized units, which can be done with the code here -> https://github.com/aalto-speech/subword-kaldi. You would then have to alter the text corpus: say your corpus is the words "I am a dog", you would add a boundary marker after every letter that is not followed by a word boundary, giving "I a+ m a d+ o+ g". This is just one of four possible marking styles; the others are listed in the paper that is in the Git repository. To train the LM you can use KenLM or any other Kaldi-supported tool. Just note that you would need a longer n-gram context than normal, perhaps at least a 6-gram.
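The boundary-marking step for the LM training text can be sketched as follows. This is my own illustrative snippet (the function name is made up), implementing only the one marking style shown above; the subword-kaldi paper describes the other three.

```python
def mark_boundaries(sentence):
    """Split each word into letters and append '+' to every letter that
    is NOT the last letter of its word, so the word boundary is the only
    unmarked position. One of the marking styles from subword-kaldi."""
    out = []
    for word in sentence.split():
        letters = list(word)
        out.extend(ch + "+" for ch in letters[:-1])  # word-internal letters
        out.append(letters[-1])                      # word-final letter, unmarked
    return " ".join(out)

# Example: "I am a dog" -> "I a+ m a d+ o+ g"
```

You would run every line of your LM training corpus through something like this before feeding it to KenLM.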

If you successfully compile a decoding graph with this setup, it will output characters along with the boundary markers, and you can alter the "wer_output_filter" file found in the local dir to glue them back into words. This is a list of sed commands that are applied to the hypothesis; it is called by score.sh.
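For the marking style above, the filter only has to join any marker-carrying letter onto the next token. The actual wer_output_filter would be sed expressions; this Python sketch (my own, for illustration) shows the equivalent transformation:

```python
def restore_words(hypothesis):
    """Undo the '+' boundary marking: a letter ending in '+' is glued to
    the following letter, while unmarked letters end a word.
    Equivalent in spirit to a sed rule like 's/+ //g' in wer_output_filter."""
    return hypothesis.replace("+ ", "").replace("+", "")

# Example: "I a+ m a d+ o+ g" -> "I am a dog"
```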

In about two weeks I will be posting my thesis code as a user-friendly script in this repository -> https://github.com/cadia-lvl/samromur-asr. It already has most of these steps implemented for subword ASR modelling but requires some cleanup.

Hope this helps.
DEM
