Decoding audio files without having the text available


Iraklis Gougousis

Sep 9, 2016, 4:59:43 AM
to kaldi-help
I can't seem to find anything about this.

Let's assume you have a large dataset and have successfully trained and tested on it. All the necessary files (language models, acoustic models, graphs, MFCCs, etc.) are stored somewhere you have access to. How can you use them to decode new audio files (.wav, for example) for which no transcription is available? In the various run.sh scripts I've seen, steps/decode.sh expects various directories as arguments, including one containing text, dict, utt2spk, and other files that can't easily be generated when you want to decode new audio data. How can I use trained models to decode new audio files? Thanks in advance.
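(For reference, the per-utterance files such a data directory needs can be written by hand; a minimal sketch, where "utt1", "spk1", and the wav path are made-up placeholders:)

```shell
# Build a minimal Kaldi data directory for one untranscribed wav.
# "utt1"/"spk1" are invented ids; replace the wav path with a real file.
mkdir -p data/new
echo "utt1 /path/to/audio.wav" > data/new/wav.scp
echo "utt1 spk1" > data/new/utt2spk
echo "spk1 utt1" > data/new/spk2utt   # or use utils/utt2spk_to_spk2utt.pl
echo "utt1" > data/new/text           # dummy (empty) transcript line
```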

Danijel Korzinek

Sep 9, 2016, 7:36:03 AM
to kaldi-help
The text in the decode scripts is used only for the last portion of the script, i.e. the scoring. If you do not have a transcription, you can simply skip the scoring and everything should be okay. If you're really lazy, you can provide an empty text file and just ignore the scoring output to get your result.
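(A sketch of skipping the scoring: steps/decode.sh does accept a --skip-scoring flag, but the data and exp paths below are placeholders from a typical recipe layout.)

```shell
# Decode a data dir with no real transcription, skipping scoring.
# exp/tri2b and data/new are example paths; adjust to your setup.
steps/decode.sh --skip-scoring true --nj 1 \
  exp/tri2b/graph data/new exp/tri2b/decode_new

# The best path can then be pulled out of the generated lattices
# (the output words are integer ids from words.txt):
lattice-best-path --word-symbol-table=exp/tri2b/graph/words.txt \
  "ark:gunzip -c exp/tri2b/decode_new/lat.1.gz|" ark,t:-
```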

What models are you using (and which decoding script do you use with them)? You don't really have to use the scripts for anything. For many of the models, decoding a single WAV file comes down to running a single program, but it will really depend on the type of model you are using...

Danijel

Iraklis Gougousis

Sep 9, 2016, 8:03:59 AM
to kaldi-help

I have trained the models on Librispeech and, out of curiosity, I wanted to use them to decode a single WAV file, just to see what happens and to better familiarize myself with the decoding process. So basically you recommend taking a look at decode.sh and running it without the scoring?

Danijel Korzinek

Sep 10, 2016, 5:22:00 AM
to kaldi-help
The models in exp/mono through exp/tri2b use the "decode.sh" script. The models in "exp/tri3b" use "decode_fmllr.sh". The fMLLR decoding is a bit involved (it requires several steps). The tri2b models are much easier to use: you generally need a single program (once you have the features extracted).

In fact, for tri2b, you can use the "online-wav-gmm-decode-faster" (from src/onlinebin) to decode directly from the WAV.
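(A rough sketch of such a call, modeled on the online-decoding demos; the model paths, option values, and the silence-phone list '1:2:3:4:5' are all assumptions you would replace with your own:)

```shell
# Decode wavs listed in input.scp directly with the online GMM decoder.
# All paths and the silence-phone ids are placeholders.
online-wav-gmm-decode-faster --rt-min=0.5 --rt-max=0.7 \
  --max-active=4000 --beam=12.0 --acoustic-scale=0.0769 \
  scp:input.scp exp/tri2b/final.mdl exp/tri2b/graph/HCLG.fst \
  exp/tri2b/graph/words.txt '1:2:3:4:5' \
  ark,t:trans.txt ark,t:ali.txt exp/tri2b/final.mat
```

(The trailing final.mat is the optional LDA/MLLT feature-transform matrix, which a tri2b-style model needs.)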

There are also methods to decode the SAT models in online mode (the online2bin binaries), but you will have to look around for some scripts on how to prepare the proper models for that.

Finally, you can simply reconstruct "decode.sh" and similar files by hand to see how they work. There are a lot of "if" statements in these scripts, since they are designed to work with many different configurations, but for a specific model it's usually just a couple of steps that need to be performed for a successful decode: a couple of programs have to be run, one by one. Looking at the decode logs for Librispeech will also tell you exactly which programs were run and in what order.
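(For instance, a hand-reconstructed tri2b (LDA+MLLT) decode might look roughly like this; the splice context, beams, acoustic scale, and paths are assumptions, so check your own decode logs for the exact values used in your setup:)

```shell
# Assumed Librispeech-style paths; verify every option against your logs.
compute-mfcc-feats --config=conf/mfcc.conf scp:data/new/wav.scp ark:mfcc.ark
compute-cmvn-stats ark:mfcc.ark ark:cmvn.ark      # per-utterance CMVN stats
apply-cmvn ark:cmvn.ark ark:mfcc.ark ark:- | \
  splice-feats --left-context=3 --right-context=3 ark:- ark:- | \
  transform-feats exp/tri2b/final.mat ark:- ark:feats.ark
gmm-latgen-faster --max-active=7000 --beam=13.0 --lattice-beam=6.0 \
  --acoustic-scale=0.083333 --allow-partial=true \
  --word-symbol-table=exp/tri2b/graph/words.txt \
  exp/tri2b/final.mdl exp/tri2b/graph/HCLG.fst \
  ark:feats.ark "ark:|gzip -c > lat.gz" ark,t:words_out.txt
```

(The word output is integer ids; utils/int2sym.pl -f 2- with words.txt maps them back to words.)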

Danijel

Iraklis Gougousis

Sep 11, 2016, 6:33:12 AM
to kaldi-help
Nice, thank you very much :)