Decoding .wav files

filmingi...@gmail.com

Dec 11, 2015, 11:37:03 PM

to kaldi-help

Hello everyone,

In my quest to get Kaldi working for our needs, I've posted a few times and you've been generous enough to help. This week I received a much more powerful server (quad-socket, quad-core, with 64 GB RAM), got it online with a Tesla S1070 external quad-GPU unit, and was able to complete the main run.sh tedlium script in just over a day, which is quite acceptable.

So now I'm trying to achieve proof of concept. All I want to do now is perform ASR on a single .wav file using the models I've trained with the tedlium recipe, ideally one of the tri3 models. What's the simplest way to run a decoder against a .wav file? I've looked and looked for command-line examples, but I keep running into examples that rely on missing config files or on directories that don't exist. Is there a place I can find a command-line example that will work with what was trained? I'm not looking to do anything fancy, just achieve proof of concept right now. The logs show very acceptable accuracy, so I'm anxious to try it out.

This is what's available to me in exp:

ali_train_pdf.counts  mono0a               train_sorted.scp  tri3_ali
cmvn-g.stats          mono0a_ali           tree              tri3_denlats
cv.scp                nnet                 tri1              tri3_mmi_b0.1
cv.scp_non_local      splice5.proto        tri1_ali          tr_splice5.nnet
feat_type             train.scp            tri2
final.mdl             train.scp.10k        tri2_ali
log                   train.scp_non_local  tri3

I realize there are numerous options, but in this instance, I just want to do some basic tests using the most accurate models trained by the main, basic tedlium recipe. As time goes on, I'll set up a GRID so I have a cluster for training, and do more advanced stuff with neural nets, but baby steps first.

Thanks!

RB

Daniel Povey

Dec 11, 2015, 11:44:48 PM

to kaldi-help

At some point we will add documentation on this as we keep getting asked this question.

To decode one .wav file you'd generally have to go through a similar sequence of programs to what you ran on the training data to prepare the features prior to decoding, e.g. compute-mfcc-feats and compute-cmvn-stats. Rather than figuring out the exact command lines, which is hard without understanding quite a lot about Kaldi, it may be easier to prepare the data in a directory as described in http://kaldi-asr.org/doc/data_prep.html, and run scripts like steps/make_mfcc.sh and steps/compute_cmvn_stats.sh, as done in run.sh.
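
As a rough illustration of the data-preparation step: per the data_prep page, a minimal data directory holds wav.scp (utterance ID and audio path), utt2spk, and spk2utt. The sketch below writes those three files for a single recording. The function name, the IDs spk1 and spk1-utt1, and the paths are hypothetical placeholders, not anything from the tedlium recipe, and it assumes a plain .wav file at the same sample rate as the training audio.

```python
import os


def write_single_wav_data_dir(wav_path, data_dir,
                              spk_id="spk1", utt_id="spk1-utt1"):
    """Write wav.scp, utt2spk and spk2utt for one recording.

    All IDs and paths are illustrative assumptions. The utterance ID is
    prefixed with the speaker ID, following the usual Kaldi convention.
    """
    os.makedirs(data_dir, exist_ok=True)
    wav_abs = os.path.abspath(wav_path)

    # One line per file, since there is only one utterance and one speaker.
    files = {
        "wav.scp": f"{utt_id} {wav_abs}\n",   # <utterance-id> <audio path>
        "utt2spk": f"{utt_id} {spk_id}\n",    # <utterance-id> <speaker-id>
        "spk2utt": f"{spk_id} {utt_id}\n",    # <speaker-id> <utterance-id(s)>
    }
    for name, content in files.items():
        with open(os.path.join(data_dir, name), "w") as handle:
            handle.write(content)
    return files
```

Calling write_single_wav_data_dir("test.wav", "data/single") would create data/single with the three files. The feature scripts named above would then presumably take that directory as their data-directory argument; the exact arguments are best copied from the recipe's run.sh.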

If you can't figure it out from this (maybe with a small clarification or two from me), there are various people who do speech consulting and who may be willing to take on a very small job; they have been mentioned in a previous thread (e.g. in the US, try Nagendra Goel/GoVivace or Jeff Adams/Cobalt). Kaldi was never intended for use by anyone other than speech recognition professionals, and it's growing big enough that I can't provide free consulting to all and sundry.

Dan

filmingi...@gmail.com

Dec 13, 2015, 6:00:46 PM

to kaldi-help, dpo...@gmail.com

Hi Dan,

Some more documentation would be good. There are apparently lots of people experimenting with Kaldi, but what I've noticed is a lack of high-level documentation that just describes how Kaldi works at a basic level, without getting way down into the weeds: which files and directories are for what, which scripts should be used for training different types of models or for simple decoding, etc. So many of the scripts are similar in name, and the comments just don't differentiate enough what is used for what. And many of us don't need to delve into the nitty-gritty of everything; we just want to run some basic experiments. I've had a long IT career designing and building complex networks and figuring out complex systems on my own, and I'll eventually figure all this out, but little pointers here and there on how to do basic things without digging too deep really help. I know Kaldi was designed for speech recognition professionals (that seems to be mentioned with every post on the forums), but there are enough of us who aren't speech recognition professionals experimenting with it that I think some high-level documentation would be worth it, and would keep a lot of us from asking so many questions. Once I learn this stuff, I'll gladly contribute back with documentation help. I had already read that link you provided, but I'm still trying to figure it out. I've probably read almost all of the documentation and many forum posts by this point as well.

One thing that would be perfect is something I found in an Amazon VM with pre-built models that's all ready to go (and unfortunately extremely limited in accuracy). It's called "Offline transcription system for Estonian using Kaldi" (though with English models, too; I'm sure you're familiar with it), and it has *exactly* what many of us are looking for: speech2text.sh. I'm trying to modify it to use the models I've trained, but I don't yet know what all of the files do, so it's not working yet.

And I have spoken with many consultants. The ones willing to provide consulting (and not just sell access to their own API) want to sell a complete, turnkey package, which I may go with down the road, but we can't get more funding until I achieve proof of concept, and for that I just want to put my trained models to use decoding some .wav files. None seem willing to provide just a couple of hours of hand-holding to get what I have working; we'd gladly pay for two or three hours of consulting for that, which should be more than enough.

I know Kaldi is a complex product in a complex field, but it seems like there should be some simpler ways of using it for some basic testing, unless I'm somehow over-complicating things.

Thanks for your time,

RB

Daniel Povey

Dec 13, 2015, 6:39:11 PM

to filmingi...@gmail.com, kaldi-help

Hm.

We could probably add a page with documentation for people who don't know about speech recognition, but I don't put a high priority on it right now, as I am working on speed and accuracy improvements that are likely to have much more impact (and would also invalidate some of the documentation).

Also, without the kind of deep understanding that can only come from years of experience (or access to such a person), people like you are bound to run into problems that can't be fixed without help from someone more experienced. That becomes unsustainable for me because there are simply too many questions.

There are plenty of students or other people out there who would be quite able to show you what to do and give you the help you need; they just don't have experience in setting up consulting arrangements, and in addition the bulk of them are either overseas or don't have the type of visa that would allow them to officially work in the US. You would have to help them navigate these issues somehow. I can give you some names if you want, or vet people who contact you.

Dan

hhiy...@eqratech.com

Jun 22, 2016, 4:57:06 PM

to kaldi-help

Hi,
Did you solve your problem?
If not, I might be able to help you.
