Using $KALDI_ROOT/tools/sctk/bin/hubscr.pl for a new language other than arabic | english | german | mandarin | spanish


manjunath ke

Sep 1, 2016, 2:57:00 AM
to kaldi-help

Hi Dan/kaldiers

I am trying to build a system for Bengali, an Indian language. I am modifying the TIMIT setup in egs/timit/s5 for Bengali.

Everything is OK, but I found that local/score.sh calls $KALDI_ROOT/tools/sctk/bin/hubscr.pl.

The hubscr.pl script expects the -l argument to be one of [ arabic | english | german | mandarin | spanish ]. How can I reuse this script for the Bengali language?

Is there any other way to do the scoring?

Thanks

Daniel Povey

Sep 1, 2016, 2:58:58 AM
to kaldi-help
TIMIT is the worst possible starting point you could have picked.
Almost any of the other setups would be a better place to start from.

manjunath ke

Sep 1, 2016, 3:04:47 AM
to kaldi-help, dpo...@gmail.com
Hi Dan

Thanks for the reply. 

Could you please suggest a workaround for this? Is it not possible at all to decode with this setup?

Would it be fine to pass english as the argument for -l and decode Bengali anyway? What would the consequences be?

If not TIMIT, what would you suggest I start with: RM or WSJ?
Thanks

Danijel Korzinek

Sep 1, 2016, 5:16:45 AM
to kaldi-help, dpo...@gmail.com
Instead of TIMIT try starting from something like Voxforge. Tedlium is also a good, free choice. Finally, you'll get the best results on Librispeech (for free), but that could take some time and resources as it's pretty huge.

manjunath ke

Sep 1, 2016, 6:00:38 AM
to kaldi-help, dpo...@gmail.com
Hi Danijel Korzinek,  

Thanks for the reply. 

I have a speech corpus with only phone-level transcriptions. I don't have word-level or utterance-level transcriptions. My Bengali database contains "file.wav" and "file.phn" files, similar to the TIMIT database, but no file.wrd or file.text files as in TIMIT.

I don't have a word-level pronunciation dictionary or LMs. I am only interested in phoneme recognition, not sentence-level recognition. I see that tedlium uses a word-level pronunciation dictionary.

Could you please let me know which setups support mono/triphone training (similar to TIMIT) and don't require an LM or a word-level pronunciation dictionary?

Thanks in advance.

Danijel Korzinek

Sep 1, 2016, 6:30:40 AM
to kaldi-help, dpo...@gmail.com
Kaldi is an advanced LVCSR system based on WFSTs. While you could do what you want in Kaldi, it's really not worth the effort IMO. Why not simply use something like scikit-learn or Keras?

If you really want to use Kaldi for some reason, then you should treat phonemes as words. Make a lexicon that maps each phoneme to itself and make a grammar/LM that simply copies the phonemes as they are. If you feel like it, you can even train a bigram LM on the phoneme sequences using SRILM or something similar.

BTW, this is the same procedure you would use with any LVCSR system (HTK, Julius, Sphinx), but phoneme recognition is a useless topic, apart from research.
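
A minimal sketch of the phonemes-as-words idea described above, in Python. It is only an illustration under stated assumptions: the phone labels and transcripts are made up, the function names are hypothetical, and the output is meant to resemble a "word phone" lexicon line, not a complete Kaldi dict directory (silence phones, optional silence and so on would still need handling). The bigram counts are just raw counts that a toolkit such as SRILM would normally estimate and smooth for you.

```python
from collections import Counter


def identity_lexicon(phones):
    """Return lexicon lines that map each phone label to itself."""
    return [f"{p} {p}" for p in sorted(set(phones))]


def bigram_counts(transcripts):
    """Count adjacent phone pairs, using <s> and </s> as utterance boundaries."""
    counts = Counter()
    for utt in transcripts:
        seq = ["<s>"] + list(utt) + ["</s>"]
        counts.update(zip(seq, seq[1:]))
    return counts


# Hypothetical phone-level transcripts, one list of labels per utterance
# (for example, what you might read out of the .phn files).
transcripts = [
    ["sil", "k", "a", "sil"],
    ["sil", "a", "k", "a", "sil"],
]
all_phones = [p for utt in transcripts for p in utt]

# Each printed line has the form "<word> <pronunciation>", with word == phone.
for line in identity_lexicon(all_phones):
    print(line)

# Raw bigram statistics over the phone sequences.
for (left, right), n in sorted(bigram_counts(transcripts).items()):
    print(left, right, n)
```

With a lexicon like this, the "words" in the reference and hypothesis transcripts are simply phone labels, so the usual word-level scoring gives a phone error rate.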