How could I integrate real-time recording stream with Kaldi online decoding ?

5,032 views

emily...@gmail.com

unread,
Aug 22, 2015, 11:54:26 PM
to kaldi-help

First many thanks to Kaldi developers.

It has been a great experience working with offline decoding. Now I am moving on to the online decoding features. I have fed 5 hours of clean audio files from the LibriSpeech corpus into the online decoder online2-wav-gmm-latgen-faster. The WER is slightly higher than for the same input passed through the offline SAT model, but acceptable. I am wondering, if I'd like to connect real-time recording to the Kaldi online decoder, how do I feed the stream into the decoder? For example, could I use the sox package to do the recording and trimming, and have Kaldi produce each transcription within a reasonable delay, sourcing from the folder which contains all the required features extracted from the trimmed files (MFCCs, wav.scp)? In my mind these two processes should be executed in parallel: 1) recording + extracting MFCCs, 2) decoding. But I guess it is not as simple as sending a couple of processes into the background and using the wait command until they finish before moving on to the next step. Prior to working with Kaldi, I didn't have any experience with command-line programming. I hope my enquiry makes sense.
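Concretely, the naive two-stage scheme I have in mind would look something like the sketch below (the chunk names and commands are just placeholders standing in for the real recording, feature-extraction, and decoding steps, not actual Kaldi invocations):

```shell
# Naive batch approach: for each recorded chunk, run feature extraction
# and decoding as one background job, then wait for all of them.
# This is not true streaming -- every chunk pays full startup cost.
for chunk in chunk_001 chunk_002; do
  # 1) recording + trimming would happen here (producing $chunk.wav)
  # 2) feature extraction + decoding run in the background per chunk
  (
    echo "extract MFCC for $chunk.wav"   # stand-in for compute-mfcc-feats
    echo "decode $chunk"                 # stand-in for the online decoder
  ) > "$chunk.log" 2>&1 &
done
wait   # block until all background jobs finish
cat chunk_*.log
```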

Cheers

Daniel Povey

unread,
Aug 22, 2015, 11:59:02 PM
to kaldi-help


I'm not aware that sox can be used for that type of thing.
You'd probably need to use OS-specific capabilities for audio capture.  If you can do audio capture from C++, then you can do the decoding directly in the same thread, by modifying the code in online2-wav-gmm-latgen-faster to operate on features captured in real time instead of ones read from disk.

Dan




emily...@gmail.com

unread,
Aug 25, 2015, 10:58:15 PM
to kaldi-help, dpo...@gmail.com
Thanks for your reply Dan.

I got an error when running steps/online/prepare_online_decoding.sh:


steps/online/prepare_online_decoding.sh: Accumulating statistics for basis-fMLLR computation
run.pl: 24 / 24 failed, log is in /users/5/a1616275/kaldi-trunk/egs/adl/exp/feed1/2015-08-26_12-13-54/log/basis_acc.*.log


But I do have a pre-trained SAT model. The log file contains errors like those below. Hopefully you can give me a clue. Cheers

LOG (apply-cmvn:main():apply-cmvn.cc:146) Applied cepstral mean normalization to 3 utterances, errors on 0
LOG (transform-feats:main():transform-feats.cc:158) Overall average [pseudo-]logdet is -94.419 over 194 frames.
LOG (transform-feats:main():transform-feats.cc:161) Applied transform to 3 utterances; 0 had errors.
transform-feats --utt2spk=ark:/users/5/a1616275/kaldi-trunk/egs/adl/data/feed1/2015-08-26_12-13-54/split24/1/utt2spk ark:/users/5/a1616275/kaldi-trunk/egs/adl/exp/tri3b/trans.1 ark:- ark:-
WARNING (transform-feats:main():transform-feats.cc:87) No fMLLR transform available for utterance 2015-08-26_12-13-54.flac, producing no output for this utterance
WARNING (transform-feats:main():transform-feats.cc:87) No fMLLR transform available for utterance one_1.flac, producing no output for this utterance
WARNING (transform-feats:main():transform-feats.cc:87) No fMLLR transform available for utterance one_10.flac, producing no output for this utterance
LOG (transform-feats:main():transform-feats.cc:161) Applied transform to 0 utterances; 3 had errors.
LOG (gmm-post-to-gpost:main():gmm-post-to-gpost.cc:124) Done 0 files, 0 with no posteriors, 0 with other errors.
LOG (gmm-post-to-gpost:main():gmm-post-to-gpost.cc:128) Overall avg like per frame (Gaussian only) = -nan over 0 frames.
LOG (gmm-post-to-gpost:main():gmm-post-to-gpost.cc:131) Done converting post to gpost
WARNING (gmm-post-to-gpost:Close():kaldi-io.cc:446) Pipe apply-cmvn  --utt2spk=ark:/users/5/a1616275/kaldi-trunk/egs/adl/data/feed1/2015-08-26_12-13-54/split24/1/utt2spk scp:/users/5/a1616275/kaldi-trunk/egs/adl/data/feed1/2015-08-26_12-13-54/split24/1/cmvn.scp scp:/users/5/a1616275/kaldi-trunk/egs/adl/data/feed1/2015-08-26_12-13-54/split24/1/feats.scp ark:- | splice-feats  ark:- ark:- | transform-feats /users/5/a1616275/kaldi-trunk/egs/adl/exp/feed1/2015-08-26_12-13-54/final.mat ark:- ark:- | transform-feats --utt2spk=ark:/users/5/a1616275/kaldi-trunk/egs/adl/data/feed1/2015-08-26_12-13-54/split24/1/utt2spk ark:/users/5/a1616275/kaldi-trunk/egs/adl/exp/tri3b/trans.1 ark:- ark:- | had nonzero return status 256
WARNING (gmm-basis-fmllr-accs-gpost:main():gmm-basis-fmllr-accs-gpost.cc:141) Did not find posts for utterance 2015-08-26_12-13-54.flac
WARNING (gmm-basis-fmllr-accs-gpost:main():gmm-basis-fmllr-accs-gpost.cc:141) Did not find posts for utterance one_1.flac
WARNING (gmm-basis-fmllr-accs-gpost:main():gmm-basis-fmllr-accs-gpost.cc:141) Did not find posts for utterance one_10.flac

Daniel Povey

unread,
Aug 26, 2015, 8:38:13 PM
to emily...@gmail.com, kaldi-help
You probably gave it the alignment directory for the wrong/mismatched data.
dan

maitua...@gmail.com

unread,
Mar 26, 2016, 9:10:03 PM
to kaldi-help
Hello, I have the same question as Emily. I wonder if there is any C++ library that supports recording audio and saving it as a .wav file for online decoding?
Can anyone tell me?
I need your help. Thank you!

On Sunday, August 23, 2015 at 10:54:26 UTC+7, emily...@gmail.com wrote:

Daniel Povey

unread,
Mar 26, 2016, 9:17:17 PM
to kaldi-help
Not really; it's very platform-dependent.  I think PortAudio tries to do it, but since installation is tricky and it's not the best solution for all platforms, we decided to leave Kaldi as a speech recognition library and not continue to support audio-capture code.
Dan



Ruoho Ruotsi

unread,
Mar 29, 2016, 4:52:24 PM
to kaldi-help, dpo...@gmail.com
Message has been deleted

Daniel Povey

unread,
Mar 30, 2016, 3:08:10 PM
to Zhang Tan, kaldi-help
There is no tutorial as such, but there is a page with some documentation.
Actually I'm not sure at this point if the lack of good "for dummies" documentation in Kaldi is a feature or a bug.  The danger with making it too easy for people who are not primarily speech recognition researchers to start using Kaldi, is that they will sink effort into it and then get stuck and ask a lot of questions and take up our time.  Kaldi was always designed for people who aim to spend years doing speech recognition, not for casual users.  Maybe at some point we can "can" it in a way that's easily usable by others, but right now that's not the case.

Dan


On Wed, Mar 30, 2016 at 3:01 PM, Zhang Tan <tandy...@gmail.com> wrote:
Is there a tutorial for real-time recognition? Thanks.

Zhang Tan

unread,
Mar 30, 2016, 3:16:58 PM
to kaldi-help, tandy...@gmail.com, dpo...@gmail.com
Thanks. I am a beginner. There are few clear documents "for dummies" indeed...

Zhang Tan

unread,
Mar 30, 2016, 3:22:04 PM
to kaldi-help, tandy...@gmail.com, dpo...@gmail.com
I am trying to measure the speed of online recognition. I am not sure how long it will take to find out.
Is it faster than Sphinx?

Daniel Povey

unread,
Mar 30, 2016, 3:33:19 PM
to Zhang Tan, kaldi-help
I'm not sure, but it's probably not faster than Sphinx.  Sphinx was optimized for speed; Kaldi is probably twice as accurate in terms of WER (or even more), but may not be as fast as Sphinx, because we normally use different models, like neural nets, that are slower to evaluate, or at least bigger models.  The online-nnet2 setup can be tuned to run at around real time if you have a good (fast) machine with one thread.  The 'chain' models can be even faster (maybe twice as fast as real time on a fast machine with one thread, if you tune the beam), but the online decoding for them is not ready yet (at least, not in the official Kaldi repository).
You could also look at the online GMM-based decoding (there is something in online2/, and there are scripts somewhere, look at egs/rm/s5/local/online/run_gmm.sh).  That can be faster, depending on the beams, model sizes, etc.
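A quick way to sanity-check decoding speed is the real-time factor (RTF): wall-clock decoding time divided by audio duration, where RTF < 1 means faster than real time. A minimal sketch, with placeholder numbers standing in for values you would read off the decoder's logs and your wav files:

```shell
# Real-time factor = decoding time / audio duration.
audio_seconds=120.0     # placeholder: total duration of the test audio
decode_seconds=90.0     # placeholder: wall-clock time the decoder took
rtf=$(awk -v d="$decode_seconds" -v a="$audio_seconds" \
      'BEGIN { printf "%.2f", d / a }')
echo "RTF = $rtf"       # here: 90 / 120 = 0.75, faster than real time
```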

Dan

Zhang Tan

unread,
Mar 30, 2016, 4:04:52 PM
to kaldi-help, tandy...@gmail.com, dpo...@gmail.com
Hi Dan, thank you very much for your clarification.