Alternative "Ready to Use" ASR Models Other than Aspire?

Denis

unread,

Oct 24, 2018, 11:43:19 AM10/24/18

to kaldi-help

Hi Users (or probably Dan),

I'm a "medium level" user, where I've used Kaldi for data generation at word/phone level and have converted to sausages, etc, but don't intimately understand how to build my own models. Spent an hour going through past papers/kaldi models/ and searched past forum questions before posting this.

Is there anything as straightforward to use for ASR as Aspire (e.g. download and run)? And ideally something trained on a larger vocabulary (so lower WER).

On http://kaldi-asr.org/models.html , nothing else seems to be geared at decoding English audio data. If so can you point me to this documentation?

I define straightforward as, where I can run cmd.sh and path.sh and then run the below with the appropriate exp files. Or maybe change parameters slightly.

online2-wav-nnet3-latgen-faster \
        --online=false \
        --do-endpointing=false \
          --frame-subsampling-factor=3 \
          --config=exp/tdnn_7b_chain_online/conf/online.conf \
          --max-active=7000 \
          --beam=15.0 \
          --lattice-beam=6.0 \
          --acoustic-scale=1.0 \
          --word-symbol-table=exp/tdnn_7b_chain_online/graph_pp/words.txt \
          exp/tdnn_7b_chain_online/final.mdl \
          exp/tdnn_7b_chain_online/graph_pp/HCLG.fst \
          'ark:echo utterance-id1 utterance-id1|' \
          'scp:echo utterance-id1 {file_name_wav}|' \
          'ark,t:{file_name_lat}'

(Alternatively, would there be hyperparameters to tweak in the above that would lead to NOTABLY different transcription results, either better or worse)

Logical candidates seems like LibriSpeech and Ted-Lium, but:

The LibriSpeech model downloads as a .dms file and doesn't seem to be the same thing.
http://www.openslr.org/11/

Ted-Lium is missing the exp folder
https://github.com/kaldi-asr/kaldi/tree/master/egs/tedlium/s5

Thank you!

joseph.an...@gmail.com

unread,

Oct 24, 2018, 1:25:38 PM10/24/18

to kaldi-help

You should be able to use your own language models with Aspire acoustic models.

Daniel Povey

unread,

Oct 24, 2018, 2:06:33 PM10/24/18

to kaldi...@googlegroups.com

There is a tutorial somewhere on how to build your own LMs for the Aspire acoustic models.

https://chrisearch.wordpress.com/2017/03/11/speech-recognition-using-kaldi-extending-and-using-the-aspire-model/

--
Go to http://kaldi-asr.org/forums.html find out how to join
---
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+...@googlegroups.com.
To post to this group, send email to kaldi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kaldi-help/31e64f93-d34f-4f25-be4d-60ee084c9fbd%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Denis

unread,

Oct 24, 2018, 2:49:35 PM10/24/18

to kaldi-help

Err. Yes, that link is actually the reason I'm relatively comfortable using the Aspire Model. It's "straightforward" since I, as not Kaldi expert, can download the model from right here (already trained on Fischer):

http://kaldi-asr.org/models.html

Is there any other model out there (e.g. on LibriSpeech or Ted-lium) that would be a simple plug and chug rather than involved process?

Training your own language model is forewarned as "taking multiple days" and dependent on Sphinx, etc and I don't have a use-case for it.

Unlike for Aspire, which has a detailed READMEs and how to prepare the model, LibriSpeech only talks about size of data and Ted-Lium doesn't have one. I feel like this should be documented somewhere, but I don't know where to look. I assume I can't just take the Language Model from LibriSpeech and stick it into Aspire?

Daniel Povey

unread,

Oct 24, 2018, 7:47:57 PM10/24/18

to kaldi...@googlegroups.com

If there is no readme, the run.sh documents what you have to do.

Sorry, we aren't really geared towards downloadable models right now-- the idea is you train them yourself using the provided scripts.

Regarding slotting in language models, you'd have to figure out how to get the dictionary in the right form. All these tools assume you are basically a speech scientist and this is what you do.

To view this discussion on the web visit https://groups.google.com/d/msgid/kaldi-help/bd38368e-0c9c-41a8-9bb6-a7f439cd6fb2%40googlegroups.com.

Denis

unread,

Oct 25, 2018, 4:10:37 PM10/25/18

to kaldi-help

Thanks Dan; appreciate your responsiveness!

To contribute, I'll summarize this thread for posterity (so hopefully you get less questions in future):

Aspire model is the only one available in a ready-to-go fashion for Audio-Speech Recognition (ASR) English tasks. This is trained on the Fischer English model, which is recorded phone conversations. So decent for general conversational English, but lacking in technical terms.

http://kaldi-asr.org/models.html

There is a nice summary of how to use that here (although the README) is equally detailed.

https://chrisearch.wordpress.com/2017/03/11/speech-recognition-using-kaldi-extending-and-using-the-aspire-model/

Basically, download file, run a~4 scripts to prepare it, then run really long command to decode.

Rest can be trained but require downloading the data and running the scripts to train the model yourself. I personally haven't found pedantic READMEs about how to do this for LibriSpeech or Ted-Lium at a non-speech-scientist level.

If somebody wants to tack on with documentation if this changes in the future, feel free to update the link in this thread.

Nickolay Shmyrev

unread,

Oct 25, 2018, 4:44:33 PM10/25/18

to kaldi-help

There is also 16khz prebuilt English model (common voice + librispeech + voxforge + tedlium I think) available at

https://github.com/gooofy/zamia-speech#download

Reply all

Reply to author

Forward