alphabet recognition

Zhe LI

May 27, 2021, 11:14:22 AM

to kaldi-help

Hello,

I have a model trained with a large vocabulary in French. Now I need the model to transcribe only letters of the alphabet and numbers. But with L.fst, it maps the phones to words when two letters are spoken quickly.

How can I do this? I wonder if I can work at the level of the L.fst graph. For example, I would keep the lattice with a length of 1 if the lattice is not in my list of numbers.

Thanks in advance for your reply,

Best Regards,

Zhe LI

May 27, 2021, 11:20:00 AM

to kaldi-help

Can I reduce the L.fst graph so that the model doesn't match phones to words?

Ho Yin Chan

May 28, 2021, 12:38:48 AM

to kaldi-help

Your lexicon for testing just needs the alphabet and numbers.

Zhe LI

May 28, 2021, 3:37:06 AM

to kaldi-help

Thanks for your reply. I can create a new HCLG.fst with a lexicon containing only the alphabet and numbers, right?

Will there be a problem with the mapping between pdf-ids and these new symbols? I didn't find any information about this.

I trained my model with HMM-DNN.

Zhe

lali...@gmail.com

May 28, 2021, 3:47:08 AM

to kaldi-help

Yes, right, you can.

The training and test lexicons can differ: for example, you may have 20k unique words in the training lexicon while the test lexicon contains 200k words. Note that the phone set must be the same in both lexicons, i.e. the phone symbols must be identical.
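A quick way to check the "same phone set" condition is to compare the phones used by the two lexicons. The following is only a minimal sketch: it assumes plain-text lexicon lines of the form `WORD p1 p2 ...`, and the function names and toy entries are invented for illustration, not taken from any Kaldi script.

```python
def lexicon_phones(lexicon_lines):
    """Collect every phone used in lexicon lines of the form 'WORD p1 p2 ...'."""
    phones = set()
    for line in lexicon_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        phones.update(fields[1:])
    return phones


def unknown_phones(new_lexicon_lines, train_lexicon_lines):
    """Return phones in the new lexicon that the training lexicon never uses."""
    return lexicon_phones(new_lexicon_lines) - lexicon_phones(train_lexicon_lines)


# Toy data, invented for illustration only (not a real French lexicon).
train_lexicon = ["bonjour b o~ Z u R", "six s i s", "a a"]
new_lexicon = ["A a", "B b e", "6 s i s"]

# Any phone printed here would have to be fixed before building the new graph.
print(sorted(unknown_phones(new_lexicon, train_lexicon)))
```

An empty result suggests the new lexicon only uses phones the acoustic model already knows. In a real setup the reference set would more likely come from the phones.txt of the original lang directory than from the training lexicon itself.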

Zhe LI

May 28, 2021, 4:09:57 AM

to kaldi-help

Thanks a lot for your reply. I'm still new to Kaldi :)

So I need to copy the phone symbols for the alphabet and numbers into a new phones.txt file, then pass this new phones.txt, containing the same phone symbols, as an argument to utils/prepare_lang.sh, for example, right?

Best regards,

lali...@gmail.com

May 28, 2021, 10:00:20 AM

to kaldi-help

A simple way is to copy your current dict_dir (usually in data/local/) to a new dict_dir and replace the lexicon in the new dict_dir with your custom one.

Then run the usual steps to create the graph and decode. See https://kaldi-asr.org/doc/data_prep.html#data_prep_lang_creating

1- create a new dict_dir with the new lexicon and the same phones.

2- create lang_dir using utils/prepare_lang.sh

3- create lang_dir_test using utils/format_lm.sh

4- create graph_dir by utils/mkgraph.sh
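Steps 2 to 4 above can be sketched as a small helper that only assembles the three command lines without running them. Everything passed in is an assumption for illustration: the directory names, the `<unk>` OOV word (it has to exist in the new lexicon), and the gzipped ARPA file are hypothetical, and the commands are assumed to be run from a recipe directory where `utils/` is available. Chain models typically also need `--self-loop-scale 1.0` for `utils/mkgraph.sh`, which is left out here.

```python
import shlex


def graph_build_commands(dict_dir, lang_dir, lang_test_dir, arpa_lm_gz,
                         model_dir, graph_dir, oov_word="<unk>"):
    """Return the three Kaldi commands, in order, as argument lists."""
    return [
        # Step 2: dict_dir -> lang_dir (usage: <dict-dir> <oov-word> <tmp-dir> <lang-dir>)
        ["utils/prepare_lang.sh", dict_dir, oov_word, f"{lang_dir}_tmp", lang_dir],
        # Step 3: add G.fst from a gzipped ARPA LM (usage: <lang-dir> <arpa-LM> <lexicon> <out-dir>)
        ["utils/format_lm.sh", lang_dir, arpa_lm_gz, f"{dict_dir}/lexicon.txt", lang_test_dir],
        # Step 4: compose HCLG.fst with the existing acoustic model (usage: <lang-dir> <model-dir> <graph-dir>)
        ["utils/mkgraph.sh", lang_test_dir, model_dir, graph_dir],
    ]


# Hypothetical paths; adjust to your own setup.
commands = graph_build_commands(
    dict_dir="data/local/dict_alpha",
    lang_dir="data/lang_alpha",
    lang_test_dir="data/lang_alpha_test",
    arpa_lm_gz="data/local/lm_alpha/lm.arpa.gz",
    model_dir="exp/my_model",
    graph_dir="exp/my_model/graph_alpha",
)
for cmd in commands:
    print(shlex.join(cmd))  # shell-quoted, so "<unk>" is safe to paste
```

The acoustic model directory is reused unchanged; only the lang directories and the graph are rebuilt.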

Your new alphabet lexicon must contain the letters and their pronunciations written with phone symbols.

e.g.

A e y

B b i:

C s i:
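If the letters and numbers already appear as entries in the large lexicon, the reduced lexicon could be extracted rather than typed by hand. This is a rough sketch under that assumption; the function name, the toy entries, and the `<unk> SPN` line are illustrative only (the OOV entry is kept because the lang preparation step expects an OOV word in the lexicon).

```python
def filter_lexicon(lexicon_lines, wanted_words):
    """Keep only entries whose word (first field) is in wanted_words."""
    wanted = {w.lower() for w in wanted_words}
    kept = []
    for line in lexicon_lines:
        fields = line.split()
        # An entry needs a word plus at least one phone.
        if len(fields) >= 2 and fields[0].lower() in wanted:
            kept.append(" ".join(fields))
    return kept


# Toy entries reusing the phone symbols from the example above.
full_lexicon = ["A e y", "B b i:", "C s i:", "CAT k a t", "<unk> SPN"]
wanted = ["A", "B", "C", "<unk>"]

for entry in filter_lexicon(full_lexicon, wanted):
    print(entry)
```

Words with several pronunciations keep all of their lines, since each matching line is copied as-is.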

For good results, you must train your language model on an alphabet-based corpus.
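To make the idea concrete, here is a toy sketch that estimates a maximum-likelihood unigram model from a few letter sequences and writes it in ARPA text format. It is an illustration only: the corpus is invented, there is no smoothing, and in practice a proper LM toolkit and a higher n-gram order would normally be used. The resulting text would still need to be gzipped before being handed to utils/format_lm.sh.

```python
import math
from collections import Counter


def unigram_arpa(sentences):
    """Build a maximum-likelihood unigram LM in ARPA text format."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.split())
        counts["</s>"] += 1  # every sentence ends exactly once
    total = sum(counts.values())

    lines = ["\\data\\", f"ngram 1={len(counts) + 1}", "", "\\1-grams:"]
    lines.append("-99\t<s>")  # conventional placeholder: <s> is never predicted
    for word in sorted(counts):
        # ARPA stores base-10 log probabilities.
        lines.append(f"{math.log10(counts[word] / total):.6f}\t{word}")
    lines += ["", "\\end\\"]
    return "\n".join(lines) + "\n"


# Invented letter/digit sequences standing in for a real alphabet corpus.
corpus = ["A B C", "C 1 2", "B A"]
print(unigram_arpa(corpus), end="")
```

Every word in the corpus has to be present in the new lexicon, otherwise it cannot be mapped to a symbol when the grammar FST is built.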
