Kaldi trained model for 8kHz telephonic voice recognition


nishan....@gmail.com

Sep 28, 2015, 3:33:15 PM
to kaldi-help
Hello,

I'm trying to use Kaldi to build a voice recognition application that recognizes short utterances (about 5-10 seconds long) coming through a telephone line, so my audio is sampled at 8 kHz with 16 bits per sample.

I successfully compiled Kaldi and ran the samples in [1] and [2], and was able to reproduce the results documented in both. In particular, the transcription output I got for Bill Gates's TED talk in [2] was pretty impressive!

However, both of these models performed very poorly on my telephonic voice data. Are there any good trained Kaldi models for telephonic voice recognition, perhaps trained on the Switchboard dataset? Unfortunately, I don't have access to a large amount of good-quality transcribed data to train a model from scratch.



Any help would be highly appreciated.

Thanks very much!

Kind Regards,
Nishan

Daniel Povey

Sep 28, 2015, 3:55:32 PM
to kaldi-help
The Fisher-English models used in http://kaldi-asr.org/doc/online_decoding.html should perform OK for telephony speech if it's US-accented and otherwise similar to the Fisher data.  Fisher is very similar to Switchboard.

I'm adding a bit at the end of http://kaldi-asr.org/doc/online_decoding.html that explains how to downweight the silence in the iVector estimation; this sometimes helps with domain-mismatched data (we found it very helpful in the ASpIRE challenge).

Here is what I am adding (while it compiles):
+Note that for mismatched data, sometimes the iVector estimation can get confused and lead to bad results.
+Something that we have found useful is to weight down the silence in the iVector estimation.
+To do this you can set e.g. <code>--ivector-silence-weighting.silence-weight=0.001</code>; you need to set the silence
+phones as appropriate, e.g. <code>--ivector-silence-weighting.silence-phones=1:2:3:4</code>
+(this should be a list of silence or noise phones in your phones.txt; you can experiment with
+which ones to include).
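As a sketch, those options would be passed to the online decoder along these lines (the binary name is from the online-decoding page; the paths, config name, word list, and silence-phone list are placeholders to check against your own setup):

```shell
# Hypothetical invocation: online nnet2 decoding with silence down-weighted
# in the iVector estimation. All paths and the phone list are placeholders.
online2-wav-nnet2-latgen-faster \
  --online=true \
  --config=nnet_a_gpu_online/conf/online_nnet2_decoding.conf \
  --ivector-silence-weighting.silence-weight=0.001 \
  --ivector-silence-weighting.silence-phones=1:2:3:4 \
  --word-symbol-table=graph/words.txt \
  nnet_a_gpu_online/final.mdl graph/HCLG.fst \
  "ark:echo utt1 utt1|" "scp:echo utt1 test8k.wav|" ark:/dev/null
```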

Dan


--
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Nishan Wickrama

Sep 29, 2015, 12:53:33 AM
to kaldi...@googlegroups.com
Thanks very much for the generous help, Dan. That is incredibly useful to me!

Kind Regards,
Nishan


Nishan Wickrama

Oct 3, 2015, 4:58:49 PM
to kaldi-help
Hello again,

I tried using the Fisher-English models as per the instructions in http://kaldi-asr.org/doc/online_decoding.html. However, the accuracy on my own telephone voice data remains low. I had the same experience with Sphinx earlier: I could get only 10% accuracy with Sphinx's default telephone acoustic model and language model, but I could easily improve the accuracy to 55% using the following tricks:

1. MAP adaptation of the acoustic model (I have about 4,000 transcribed wave files, each about 5-10 s long, though the transcriptions are not of very good quality).
2. A new ARPA language model built from my transcriptions only. My vocabulary is very limited (about 1,000 words), so the language model helps a lot (about 20% of the accuracy improvement).
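For step 2, building a small ARPA language model from such transcriptions can be sketched with SRILM's ngram-count (file names here are placeholders; Witten-Bell discounting is a reasonable choice for a small corpus where Kneser-Ney counts are sparse):

```shell
# Sketch: train a trigram ARPA LM from one-sentence-per-line transcriptions.
# train.txt and lm.arpa are placeholder names.
ngram-count -order 3 -text train.txt -lm lm.arpa -wbdiscount
```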

I strongly believe that Kaldi should be able to easily outperform the accuracy I got with Sphinx; I just haven't figured out how to use it properly yet.

Could you point me to documentation that describes how to do acoustic model adaptation and language modelling with Kaldi, please?

Any help would be highly appreciated.

Kind Regards,
Nishan

Daniel Povey

Oct 3, 2015, 5:13:07 PM
to kaldi-help, Guoguo Chen
MAP adaptation doesn't apply to neural nets; however, you can do it
with regular acoustic models, see
steps/train_map.sh.
Many of the example scripts build language models, but we don't have
documentation for your specific use case, and we don't have the
bandwidth to support people like you as much as you need (Kaldi's
target audience is speech professionals more than downstream users, so
it's a bit different from Sphinx in that regard).
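A hypothetical invocation of that script, assuming an existing GMM system and alignments of the adaptation data (the directory names are placeholders and the argument order is an assumption; check the script's usage message):

```shell
# Sketch: MAP-adapt an existing GMM system (exp/tri3) to new-domain data.
# First align the adaptation data with the existing model, then run MAP training.
steps/align_fmllr.sh data/adapt data/lang exp/tri3 exp/tri3_ali_adapt
steps/train_map.sh data/adapt data/lang exp/tri3_ali_adapt exp/tri3_map
```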

Guoguo, if you have some free time at some point, it would be nice if
you could add a section to the online-decoding documentation on how
you would train your own language model using SRILM, build the graph,
and decode with it, since a lot of people seem to have this same
question.

Dan

Guoguo Chen

Oct 3, 2015, 5:47:34 PM
to Daniel Povey, kaldi-help
Sure will do.

Guoguo

Nishan Wickrama

Oct 4, 2015, 3:49:22 AM
to kaldi-help, Daniel Povey
Thanks Dan and Guoguo.

I will try to figure out how to build a language model with SRILM and use it with the provided Fisher-English online models. Guoguo, if you get time to write documentation on this, please let me know.

Kind Regards,
Nishan


Daniel Povey

Oct 4, 2015, 4:07:14 PM
to Nishan Wickrama, kaldi-help
You can follow the pull request at
https://github.com/kaldi-asr/kaldi/pull/203/files#diff-0
where we are working on this.
Dan



Nishan Wickrama

Oct 5, 2015, 4:34:25 PM
to Daniel Povey, kaldi-help
Thanks very much Dan. That's very helpful!

Nishan Wickrama

Oct 8, 2015, 12:33:41 PM
to kaldi-help, Daniel Povey, chengu...@gmail.com
Hello,

Thanks for adding additional documentation to the git repo.

I tried the "Example for using your own language model with existing online-nnet2 models" section of the new documentation, which provides instructions for building a new language model with the same vocabulary. However, during the conversion from ARPA to WFST, I got the error shown in the log below.

Could you give some hints on how to fix this error, please? My train.txt file contains only sentences, one per line, without punctuation or special tokens such as <s> and </s>.

--> generating a 28 word sequence
--> resulting phone sequence from L.fst corresponds to the word sequence
--> L.fst is OK
--> generating a 19 word sequence
--> resulting phone sequence from L_disambig.fst corresponds to the word sequence
--> L_disambig.fst is OK

Checking data/lang_own/oov.{txt, int} ...
--> 1 entry/entries in data/lang_own/oov.txt
--> data/lang_own/oov.int corresponds to data/lang_own/oov.txt
--> data/lang_own/oov.{txt, int} are OK

--> data/lang_own/L.fst is olabel sorted
--> data/lang_own/L_disambig.fst is olabel sorted
ERROR: FstHeader::Read: Bad FST header: data/lang_own/G.fst
--> ERROR: data/lang_own/G.fst is not ilabel sorted
awk: cmd. line:1: BEGIN{while((getline<disambig)>0) is_disambig[]=1; is_disambig[0] = 1; while((getline<words)>0){ if($1=="<s>"||$1=="</s>") is_forbidden[$2]=1;}} {if(NF<3 || is_disambig[$3]) print; else if(is_forbidden[$3] || is_forbidden[$4]) { print "Error: line " $0 " in G.fst contains forbidden symbol <s> or </s>" | "cat 1>&2"; exit(1); }}
awk: cmd. line:1:                                               ^ syntax error
awk: cmd. line:1: error: invalid subscript expression
ERROR: FstHeader::Read: Bad FST header: data/lang_own/G.fst
--> ERROR: failure running command to check for disambig-sym loops [possibly G.fst contained the forbidden symbols <s> or </s>, or possibly some other error..  Output was: 
fst type                                          vector
arc type                                          standard
input symbol table                                none
output symbol table                               none
# of states                                       0
# of arcs                                         0
initial state                                     -1
# of final states                                 0
# of input/output epsilons                        0
# of input epsilons                               0
# of output epsilons                              0
# of accessible states                            0
# of coaccessible states                          0
# of connected states                             0
# of connected components                         0
# of strongly conn components                     0
input matcher                                     y
output matcher                                    y
input lookahead                                   n
output lookahead                                  n
expanded                                          y
mutable                                           y
error                                             n
acceptor                                          y
input deterministic                               y
output deterministic                              y
input/output epsilons                             n
input epsilons                                    n
output epsilons                                   n
input label sorted                                y
output label sorted                               y
weighted                                          n
cyclic                                            n
cyclic at initial state                           n
top sorted                                        y
accessible                                        y
coaccessible                                      y
string                                            y
--> G.fst did not contain cycles with only disambig symbols or epsilon on the input, and did not contain
the forbidden symbols <s> or </s> (if present in vocab) on the input or output.
--> ERROR (see error messages above)

Kind Regards,
Nishan

Guoguo Chen

Oct 8, 2015, 1:24:21 PM
to Nishan Wickrama, kaldi-help, Daniel Povey
It looks like your data/lang_own/G.fst is empty. Could you check the size of G.fst and confirm this? What is the output from the conversion (the step where you create data/lang_own/G.fst)? You only provided the log from the validation script, which is the last step.
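Two quick checks along those lines (fstinfo is OpenFst's inspection tool; the path follows the thread):

```shell
# An empty or truncated G.fst shows up immediately here:
ls -l data/lang_own/G.fst            # file size should be well above zero
fstinfo data/lang_own/G.fst | head   # should report a nonzero number of states and arcs
```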

Guoguo

Daniel Povey

Oct 8, 2015, 5:20:09 PM
to Guoguo Chen, Nishan Wickrama, kaldi-help
It seems to have actually been an error in the awk script: a $1
was disappearing because it needed to be escaped at the shell level.
I just fixed it. Maybe his awk implementation is a little stricter
than others.
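The failure mode is easy to reproduce in isolation. This minimal example (not the actual validation script) shows how double quotes let the shell rewrite an awk program before awk ever sees it:

```shell
# In a script run without positional arguments, $1 is unset, so the shell
# substitutes nothing for it inside double quotes:
echo "hello world" | awk "{print $1}"    # awk receives {print }, prints the whole line
echo "hello world" | awk "{print \$1}"   # escaped: awk receives $1, prints "hello"
echo "hello world" | awk '{print $1}'    # single quotes: no expansion, prints "hello"
```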
Dan

Nishan Wickrama

Oct 11, 2015, 6:50:19 AM
to Daniel Povey, Guoguo Chen, kaldi-help
Hello,

Sorry, as Guoguo guessed, there was an error in a previous step and G.fst was empty. I have fixed it now and successfully completed all the steps up to the "compile the decoding graph" step. In that step, the documentation says:

$model_dir is the model directory which contains the model "final.mdl" and the tree "tree".

Now, the trained model available under http://kaldi-asr.org/downloads/build/5/trunk/egs/fisher_english/s5/exp/nnet2_online/nnet_a_gpu_online/ does have "final.mdl" but not "tree", so I copied the "tree" file from http://kaldi-asr.org/downloads/build/2/sandbox/online/egs/fisher_english/s5/exp/nnet2_online/nnet_a_gpu/tree. Is that a reasonable thing to do? The results were better than what I got with the pre-trained language model.
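For reference, the graph-compilation step with the copied tree might look like this (a sketch; $model_dir and the output directory name are placeholders):

```shell
# Sketch: compile the decoding graph from the custom lang directory and the
# downloaded model directory (which must contain final.mdl and tree).
utils/mkgraph.sh data/lang_own $model_dir $model_dir/graph_own
```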

Regards,
Nishan

Guoguo Chen

Oct 11, 2015, 10:51:59 AM
to Nishan Wickrama, kaldi-help, Daniel Povey

Yes, that is the same tree.
