Split the long audio file into utterances

sabr

Feb 9, 2017, 7:22:03 AM
to kaldi-help

Hello,

I'm sorry, but I couldn't find the answer to this simple question in the forum. I tried searching for "make utterances" and "split into utterances" with no results.

I'm currently using the "online2-wav-nnet2-latgen-faster" online decoder with the Fisher English recipe. I have a service that runs speech recognition as a job from a message queue.

Here are my issues:
  • Currently I don't split into utterances, so each entire long audio file is decoded as one utterance.
  • The default sample rate in mfcc.conf in that recipe is 8000. I tried 16000, but it fails ("exceeded max. memory 5000000 bytes", etc.) and doesn't recognize the full speech, only some random part of it.
  • Obviously, the longer the audio file, the longer it takes to decode it as a single utterance (it also consumes 3-4 GB of RAM). For example, a 44-minute file takes 1 hour 24 minutes.
So I have several questions:
  • I believe I should split the audio into utterances and have a "wav.scp" file with lines of the form <utterance-id> <path-to-audio-chunk>. How do I do this properly? AFAIK there is something for this somewhere in steps/cleanup. (In the worst case I can split the file manually with ffmpeg, but that would be an abrupt split.)
  • Is online decoding actually the right solution for my case? Maybe I should build the pipeline myself, e.g. "make mfcc" -> "decode", etc.? If so, please point me in the right direction if you have time.
If I could split the audio into utterances, then I believe I could return the recognition results utterance by utterance (via an API GET request); currently there is no result until the entire audio file has been recognized.

Thanks for reading this question :)

Here is my command:

kaldi_dir="$ASR_MODEL_DIR"
online_nnet2="$kaldi_dir/src/online2bin/online2-wav-nnet2-latgen-faster"
recipe_dir="$kaldi_dir/egs/$recipe/s5"

online_nnet2_decoding()
{
  local decoding_conf=$1
  local word_symbol_table=$2
  local mdl=$3
  local fst=$4

  "$online_nnet2" --do-endpointing=true \
    --online=false \
    --config="$decoding_conf" \
    --max-active=7000 \
    --beam=15.0 \
    --lattice-beam=6.0 \
    --acoustic-scale=0.1 \
    --word-symbol-table="$word_symbol_table" \
    "$mdl" \
    "$fst" \
    "ark:echo utterance-id1 utterance-id1|" \
    "scp:echo utterance-id1 $input_wav_file|" \
    "ark:|$kaldi_dir/src/latbin/lattice-best-path --acoustic-scale=0.1 ark:- ark,t:- | $recipe_dir/utils/int2sym.pl -f 2- $word_symbol_table > $output_txt_file" || exit 1
}

Sabr

Vimal Manohar

Feb 9, 2017, 2:50:50 PM
to kaldi...@googlegroups.com

You can use the approach used in Aspire to create uniform segments and decode them.

https://github.com/kaldi-asr/kaldi/blob/master/egs/aspire/s5/local/multi_condition/prep_test_aspire.sh

local/multi_condition/create_uniform_segments.py

Vimal

Daniel Povey

Feb 9, 2017, 2:54:56 PM
to kaldi-help
BTW, you can't mix and match sampling rates. If your data is sampled at 16kHz but the system is built for 8kHz, then you should use sox or a similar tool to downsample your data.
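For example, something like this (illustrative filenames):

sox input_16k.wav -r 8000 -c 1 input_8k.wav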
Dan

sabr

Feb 13, 2017, 5:49:39 AM
to kaldi-help
Thanks,

I was able to reuse that Python script (~/kaldi/egs/aspire/local/multi_condition/create_uniform_segments.py) to split the audio into uniform segments. Here is what I did:

  1. I created a folder and put my WAV file (~44 min) there.
  2. My service doesn't know anything about the speakers, etc., so I had to create a "wav.scp" containing only one utterance:
    $ echo "utt1 <abs-path-to-wav-file>" > wav.scp

  3. The script then created the "segments", "spk2utt", and "utt2spk" files (formats sketched below).
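These follow the standard Kaldi data-directory formats; with a single recording and uniform segments they look roughly like this (the IDs and times are illustrative, not the script's exact naming):

# segments: <utterance-id> <recording-id> <start-time> <end-time>
utt1-0001 utt1 0.0 10.0
utt1-0002 utt1 10.0 20.0

# utt2spk: <utterance-id> <speaker-id>
utt1-0001 utt1
utt1-0002 utt1

# spk2utt: <speaker-id> <utterance-id1> <utterance-id2> ...
utt1 utt1-0001 utt1-0002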
Now I have two ways of decoding "online", and here are my concerns:
  • Segmented audio, async output: with 10-second segments and 0-second overlap, I'm getting worse recognition results. I also wanted to get the recognized text asynchronously (that was the reason for choosing segmentation), but even with the last argument in the command below, I only get results after online2-wav-nnet2-latgen-faster has completed.

online2-wav-nnet2-latgen-faster --online=false \
  --do-endpointing=false \
  --config=/opt/kaldi/egs/fisher_english/s5/exp/nnet2_online/nnet_a_gpu_online/conf/online_nnet2_decoding.conf \
  --max-active=7000 \
  --beam=15.0 \
  --lattice-beam=6.0 \
  --acoustic-scale=0.1 \
  --word-symbol-table=/opt/kaldi/egs/fisher_english/s5/exp/tri5a/graph/words.txt \
  /opt/kaldi/egs/fisher_english/s5/exp/nnet2_online/nnet_a_gpu_online/final.mdl \
  /opt/kaldi/egs/fisher_english/s5/exp/tri5a/graph/HCLG.fst \
  ark:/tmp/wav/spk2utt \
  'ark,s,cs:extract-segments scp,p:/tmp/wav/wav.scp /tmp/wav/segments ark:- |' \
  'ark:|/opt/kaldi/src/latbin/lattice-best-path --acoustic-scale=0.1 ark:- ark,t:- | /opt/kaldi/egs/fisher_english/s5/utils/int2sym.pl -f 2- /opt/kaldi/egs/fisher_english/s5/exp/tri5a/graph/words.txt > /tmp/kaldi_output.txt'

  • Unsegmented audio: with the command below (--do-endpointing=true) and only the single WAV file specified, I'm getting satisfying results, but it consumes a rather large amount of RAM.

online2-wav-nnet2-latgen-faster --online=false \
  --do-endpointing=true \
  --config=/opt/kaldi/egs/fisher_english/s5/exp/nnet2_online/nnet_a_gpu_online/conf/online_nnet2_decoding.conf \
  --max-active=7000 \
  --beam=15.0 \
  --lattice-beam=6.0 \
  --acoustic-scale=0.1 \
  --word-symbol-table=/opt/kaldi/egs/fisher_english/s5/exp/tri5a/graph/words.txt \
  /opt/kaldi/egs/fisher_english/s5/exp/nnet2_online/nnet_a_gpu_online/final.mdl \
  /opt/kaldi/egs/fisher_english/s5/exp/tri5a/graph/HCLG.fst \
  'ark:echo utterance-id1 utterance-id1|' \
  'scp:echo utterance-id1 /tmp/audio_file.wav|' \
  'ark:|/opt/kaldi/src/latbin/lattice-best-path --acoustic-scale=0.1 ark:- ark,t:- | /opt/kaldi/egs/fisher_english/s5/utils/int2sym.pl -f 2- /opt/kaldi/egs/fisher_english/s5/exp/tri5a/graph/words.txt > /tmp/kaldi_output.txt'


Question: for my application, should I actually use online decoding with segmented audio, as in the first option above? (I don't strictly need asynchronous recognized text; I just want to reduce RAM usage so that I can run at least 3 ASR jobs.)

Thanks,
Sabr

Daniel Povey

Feb 13, 2017, 2:36:28 PM
to kaldi-help
 
> Segmented audio, async output: with 10-second segments and 0-second overlap, I'm getting worse recognition results.

You will definitely get worse results like this, because you get errors near the boundaries of the segments. We normally make the segments overlap by a few seconds, generate ctm output, and then merge the ctm outputs using a cutoff at the midpoint of the overlap between the segments (so that words are not duplicated, and the words closest to the split points are not used).
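A minimal sketch of that merging logic, assuming hypothetical per-segment ctm files named seg_0001.ctm, seg_0002.ctm, ... from uniform 30-second windows shifted by 25 seconds (5 seconds of overlap), with times in each ctm relative to its own segment:

win=30; step=25
for f in seg_*.ctm; do
  i=$(echo "$f" | sed 's/^seg_0*//; s/\.ctm$//')      # 1-based segment index
  awk -v off="$(( (i - 1) * step ))" -v win="$win" -v step="$step" '
    {
      mid = off + $3 + $4 / 2                         # word midpoint in recording time
      lo = (off == 0) ? 0 : off + (win - step) / 2    # midpoint of overlap with previous segment
      hi = off + step + (win - step) / 2              # midpoint of overlap with next segment
      if (mid >= lo && mid < hi)                      # keep only words on this segment side of the cutoffs
        printf "%s %s %.2f %.2f %s\n", $1, $2, off + $3, $4, $5
    }' "$f"
done > merged.ctm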
 
> I also wanted to get the recognized text asynchronously (that was the reason for choosing segmentation), but even with the last argument in my command, I only get results after online2-wav-nnet2-latgen-faster has completed.

The program online2-wav-nnet2-latgen-faster was never intended to be used in production; it was more intended to demonstrate the interfaces so that you could write your own production code.
 

> Unsegmented audio: now with the second command (--do-endpointing=true) and only the single WAV file specified, I'm getting satisfying results, but it consumes a rather large amount of RAM.

I do not recommend --do-endpointing=true, because its purpose is to cut off decoding whenever it encounters a long-ish silence. So only the first part of your audio will be decoded.

I can't really recommend which way to go here; it's your choice.


Dan

 