Kaldi vs DeepSpeech


orum farhang

Aug 25, 2019, 12:01:24 PM
to kaldi-help
Hi All,

Is there any paper/experiment comparing the accuracy of Baidu's DeepSpeech (Mozilla implementation) and Kaldi? Also, would you have any advice on which of them to use in which situations, and what the pros/cons of these tools are?

Thanks,

JW van Leussen

Aug 25, 2019, 12:27:15 PM
to kaldi-help
I don't have a paper to back it up, but we made this comparison at my former employer. Training on a dataset of about 50 hours of telephone speech, partly Switchboard and partly collected in-house, we got an error rate of about 80% with DeepSpeech and about 58% with Kaldi. While neither result is impressive, it showed that Kaldi could learn more from this small amount of data. Subsequent experiments with Kaldi showed spectacular improvements when we increased the size of our training data set.
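For readers unfamiliar with how these percentages are computed: they are word error rates (WER), the word-level edit distance between the system's hypothesis and the reference transcript, divided by the reference length. A minimal sketch (not Kaldi's own `compute-wer` tool, just an illustration of the metric):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dp[i - 1][j] + 1
            insertion = dp[i][j - 1] + 1
            dp[i][j] = min(substitution, deletion, insertion)
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why error rates like 80% on hard telephone speech are plausible.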

One pro of DeepSpeech is that it's "end-to-end", so you don't need to worry about a language model, pronunciation dictionary, etc. However, for English these are not so hard to come by, and you can just adapt an existing recipe in Kaldi (we used Switchboard). A con of Kaldi is that it's a little harder to set up and takes some getting used to. But all in all, I would recommend investing time in learning to use Kaldi; it will pay dividends in ASR performance.


On Sunday, August 25, 2019 at 18:01:24 UTC+2, orum farhang wrote:

orum farhang

Aug 25, 2019, 12:34:10 PM
to kaldi...@googlegroups.com
Many thanks, Van,

I also found this github page, which provides a WER comparison between different tools/papers:




--
Go to http://kaldi-asr.org/forums.html find out how to join
---
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kaldi-help/2054f853-81d4-4382-8b5b-25a555fcd0f4%40googlegroups.com.

Itai Peer

Sep 4, 2019, 3:19:12 AM
to kaldi-help
Thanks, Orum. I had not seen this page so far; quite helpful.

I wonder, do the main companies like Google, Apple, and Facebook publish their ASR results on these corpora?
I don't mean the open-source ones like wav2letter or DeepSpeech, but the actual engines used when Google transcribes, or when Siri or Alexa do.

These engines should be more focused and optimized for narrated text, I guess, but it might still be nice to see how they handle SWB or ASpIRE.

P.S. I didn't see any ASpIRE results there in wer_are_we, even though it is one of the more challenging corpora.

On Sunday, August 25, 2019 at 19:34:10 UTC+3, orum farhang wrote:

Daniel Povey

Sep 4, 2019, 5:52:18 AM
to kaldi-help
I think the ASpIRE data may not have been fully publicly released; plus, it was only a test set, not a training set.


Itai Peer

Sep 4, 2019, 8:25:42 AM
to kaldi...@googlegroups.com
I guess you're right, but for the big engines, lack of training data should not matter.



--
Itai Peer

Shujian Liu

Sep 5, 2019, 12:57:35 AM
to kaldi-help
Hi orum, there is a paper comparing results on the TED-LIUM 3 dataset with Kaldi and Deep Speech 2: https://arxiv.org/abs/1805.04699

Itai Peer

Sep 5, 2019, 4:18:30 AM
to kaldi-help
Thanks, Liu. These are very interesting results.

Kaldi "beats" Deep Speech 2, but what I find interesting is that with double the training data size, Kaldi's accuracy almost does not improve at all.

I wonder why that is. Is this ASR not scalable with respect to training-data size? Did Kaldi reach the upper accuracy boundary for this corpus? It might have, since TED-LIUM has good audio quality, a mostly normal and uniform speech rate and volume, and a single speaker in each recording.

Maybe the "chain" model is optimized for a medium training data size (200 hours), so when the corpus is increased it is outside its optimal zone.

It could also be that they did not tune well?




[Attachment: capture(1).png]



On Thursday, September 5, 2019 at 07:57:35 UTC+3, Shujian Liu wrote:

Daniel Povey

Sep 6, 2019, 7:28:23 PM
to kaldi-help
Probably either something about the new data, or because the model size is not that large.

