Decode a whole file using online2-tcp-nnet3-decode-faster


Sergei Tushev

May 29, 2019, 2:17:03 PM5/29/19
to kaldi-help

Hello.

I have already trained an nnet3 model that works successfully with the nnet3-latgen-faster binary at 18% WER.
Now I want to use online2-tcp-nnet3-decode-faster in my application. What options should I pass to the decoder to get only the final recognition result, rather than piece by piece as it works by default?

Something like (pseudocode, roughly as Python):

TCP client:

import socket, wave

s = socket.socket()
s.connect(("localhost", 5050))
wav = wave.open(wavfile, "rb")
# send the whole file's audio at once
s.sendall(wav.readframes(wav.getnframes()))
# wait for the result
recognized_text = s.recv(4096)

TCP server (decoder):

srv = socket.socket()
srv.bind(("", 5050))
srv.listen(1)
conn, addr = srv.accept()
data = b""
while chunk := conn.recv(4096):  # read the whole file
    data += chunk
text = decode(data)              # decode once, over the full audio
conn.send(text)

Thank you.

Daniel Povey

May 29, 2019, 2:21:54 PM5/29/19
to kaldi-help, Hossein Hadian, Danijel Korzinek
Not sure; maybe Hossein or Danijel would know?




--
Go to http://kaldi-asr.org/forums.html find out how to join
---
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+...@googlegroups.com.
To post to this group, send email to kaldi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kaldi-help/1667e70e-617d-4954-8bd1-1482d4983fd0%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Danijel Korzinek

May 29, 2019, 3:57:46 PM5/29/19
to kaldi-help
People often ask this, or a similar-sounding question. It's usually something in the form of "how can I decode the whole file" or "how can I get the lattice of the whole file". This feature (IMO) lies in the domain of "offline" and not "online" processing.

The purpose of online processing is to provide recognition results as soon as possible. This has been the case for many use cases in the past, starting from dictation and going to real-time dialog systems and voice assistants these days. Other service providers (e.g. Google) also deliver this feature; they call it something like "streaming recognition".

Getting the recognition result of the whole file (often together with the lattice) is something that is often used in offline speech analytics, usually in the call-center setting. The reason people are drawn towards the TCP programs (instead of the regular decoding scripts) is convenience. Since you don't know the number and location of all the files ahead of time, you would need to spawn the decoding process from scratch for each file, which can be quite costly. Having a TCP program resident in memory with all the models loaded all the time is very convenient, because you can just send files to it as needed.

All of these are engineering problems, however, not scientific ones. In fact, it is my personal opinion that Kaldi, as a research tool and an open-source project, doesn't benefit much from solving these particular engineering issues. The scripts Kaldi provides for offline processing are absolutely sufficient for research purposes, i.e. processing speech corpora. It feels to me that solving this problem benefits mostly large organizations, or commercial companies serving these organizations, and if that is the case, they should hire proper engineers for a fair wage to solve it internally, rather than relying on the open-source/research community to do it for them for free.

Maybe I'm wrong on this? Maybe you have a legitimate scenario where this would benefit the community as a whole, and if so I'd love to hear it. Also, maybe there are better ways to solve this problem than using TCP/IP? Can you share what exactly you are trying to solve?

Danijel

Sergei Tushev

May 29, 2019, 5:17:44 PM5/29/19
to kaldi-help
My task is to build a voice-command interface for a program. It is a kind of STT application: the user says a command, the speech-detection module records a small segment, which is then fed to the speech recognition system. It is therefore convenient to keep the acoustic model loaded in memory. However, if I use online2-tcp-nnet3-decode-faster with the default values, I get much worse recognition results than with offline decoding, or even with online2-wav-nnet3-latgen-faster.
If you could help me with parameters that make online2-wav-nnet3-latgen-faster and online2-tcp-nnet3-decode-faster work similarly, it would be great. If not, I will understand.


Daniel Povey

May 29, 2019, 9:34:40 PM5/29/19
to kaldi-help
I suspect the interface of the TCP program is quite similar to the regular online2-wav-nnet3-latgen-faster program, so you should probably start out by using the same options.
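For reference, a starting invocation might be sketched as below. The model paths and option values here are placeholders, not tested settings; check `online2-tcp-nnet3-decode-faster --help` and the online.conf produced for your model (e.g. by steps/online/nnet3/prepare_online_decoding.sh) for the options that actually apply:

```shell
# Hypothetical invocation; paths and values are assumptions.
online2-tcp-nnet3-decode-faster \
  --samp-freq=16000 \
  --frames-per-chunk=20 \
  --config=exp/chain/tdnn/conf/online.conf \
  --min-active=200 --max-active=7000 \
  --beam=15.0 --lattice-beam=6.0 \
  --acoustic-scale=1.0 \
  --port-num=5050 \
  exp/chain/tdnn/final.mdl \
  exp/chain/tdnn/graph/HCLG.fst \
  exp/chain/tdnn/graph/words.txt
```

The idea is simply to copy the decoder options (beam, acoustic scale, feature config) you already use with online2-wav-nnet3-latgen-faster, so that any remaining quality difference comes from the TCP front end rather than the decoder settings.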


Danijel Korzinek

May 31, 2019, 2:30:25 PM5/31/19
to kaldi-help
As Dan pointed out, it may be worthwhile to debug the exact issue using other programs to pinpoint which component could be at fault.

Assuming you use the "steps/**/decode.sh" scripts for offline decoding, you should also try "steps/online/decode.sh" or "steps/online/nnet3/decode.sh" to see the effect of online decoding on the quality.

Online decoding is always going to be less flexible than offline decoding (although not always worse), but there are some things we don't do, although we theoretically could:

- a good online VAD (we only use AM-based endpointing)
- speaker adaptation, because we don't have online speaker diarization
- lattice rescoring

Danijel

Akbar

Jun 1, 2019, 7:56:18 AM6/1/19
to kaldi-help
Hi
Many thanks all.
I recently evaluated the TCP text output. The results are like my online results.
My test set contains 10 hours and about 10k wave files; the WER is about 10%.
The decoding time is:
real    166m52.395s
user    0m56.048s
sys     1m13.615s

But it has a bug: it keeps printing the decoding text at every step. I mean that the output for one wave file is like this:
I recently
I recently evaluate
I recently evaluate TCP text
I recently evaluate TCP text output.

Only the final text should be shown, but it shows all the lines.

Daniel Povey

Jun 1, 2019, 1:51:21 PM6/1/19
to kaldi-help
That's likely not a bug but the intended behavior. It might be easier to modify the client code, although you could also modify the tcp-server code to only output the text at utterance end.
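A client-side filter along those lines can be sketched as below. This assumes the server terminates partial hypotheses with '\r' and the final hypothesis of each segment with '\n'; that convention is worth confirming against the online2-tcp-nnet3-decode-faster source for your Kaldi version before relying on it:

```python
def final_hypotheses(raw: bytes) -> list:
    """Split a decoder byte stream and keep only final hypotheses,
    i.e. segments terminated by '\n'; partials ending in '\r' are discarded."""
    finals = []
    buf = b""
    for byte in raw:
        ch = bytes([byte])
        if ch == b"\r":
            buf = b""  # partial result: discard and start accumulating again
        elif ch == b"\n":
            if buf.strip():
                finals.append(buf.decode("utf-8"))  # final result for this segment
            buf = b""
        else:
            buf += ch
    return finals

# example stream: two partial hypotheses followed by one final hypothesis
stream = b"I recently\rI recently evaluate\rI recently evaluate TCP text output.\n"
print(final_hypotheses(stream))  # ['I recently evaluate TCP text output.']
```

This keeps the server untouched, so the same decoder instance can still serve clients that do want the streaming partials.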



Itai Peer

Jun 2, 2019, 6:18:01 AM6/2/19
to kaldi-help
Hi Akbar 
The text format that you get is expected. Since it is "online", the next decoded word can always change the context, so the previous words may need to be fixed.
For example, let's say that after decoding the word "text", the decoder briefly has a better probability for the letter X, and only after decoding further output does it change back to "text", so the output will look like:

I recently evaluate
I recently evaluate TCPX
I recently evaluate TCP text output.
  
I use a script to filter the text; maybe in the future I will try to hack the C++ code so that it prints only the final line for each segment. (I'm not a Kaldi developer, and C++ programming is not one of my best skills; at best I get a +4, and Kaldi requires a 20 roll, so I will probably fail.)

Regarding performance: I decode with one standard CPU at 3x-5x real time (decoding 5 minutes of speech in 1 minute of processing time), so decoding 10 hours in 166 minutes sounds reasonable. Remember that, unlike offline decoding, you do not use multiple CPUs here.


Yu Beomgon

Aug 6, 2019, 2:48:51 AM8/6/19
to kaldi-help

Hey Sergei,
could you please share the script file for online decoding using online2-tcp-nnet3-decode-faster?
Thanks.

