CTM from transcript file and audio file


Mikel Esparza

Jan 30, 2019, 8:16:24 AM
to kaldi-help
Hi!

I would like to get a CTM file from an audio file and its transcript. I have already prepared and tested acoustic models (one GMM model and one NNet model) and language models (n-gram and LSTM models).

I've read that GMM models work better for aligning transcripts with audio. For my task I started from segment_long_utterances.sh and changed some code to transform the CTM_edits file into a final CTM with the insertions and silences removed.

It already works well with some audio files, but with others I get problems in the alignment step. Some transcripts are not well cleaned and decode_segmentation.sh fails.

I want to know if there is some magical Kaldi recipe that does the alignment while taking care of all the possible problems. Or maybe there is a better script to start from than segment_long_utterances.sh.

If not, which values of "beam" and "lattice_beam" are enough to force an output from decode_segmentation.sh? I've already tried --beam=20.0 --lattice-beam=6.0 and it still fails at some points.

Thanks in advance!!
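
[Editor's sketch] The CTM_edits cleanup described in this message could look roughly like the following. It is not the poster's actual code. It assumes the ctm_edits column layout <file-id> <channel> <start> <duration> <hyp-word> <conf> <ref-word> <edit-type>, with edit types such as cor, sub, ins, del and sil; the sample rows and file names are made up for illustration.

```shell
work=$(mktemp -d)

# Hypothetical ctm_edits sample (assumed layout:
# file-id channel start duration hyp-word conf ref-word edit-type).
cat > "$work/sample.ctm_edits" <<'EOF'
rec1 1 0.00 0.50 <eps> 1.0 <eps> sil
rec1 1 0.50 0.30 hello 1.0 hello cor
rec1 1 0.80 0.20 uh 1.0 <eps> ins
rec1 1 1.00 0.40 world 1.0 word sub
EOF

# Drop insertion and silence rows, then keep the six standard CTM fields
# (for substitutions this prints the hypothesis word, not the reference word).
awk '$8 != "ins" && $8 != "sil" { print $1, $2, $3, $4, $5, $6 }' \
  "$work/sample.ctm_edits" > "$work/sample.ctm"

cat "$work/sample.ctm"
```

Deletion rows are left untouched here because the message only mentions insertions and silences; adding `&& $8 != "del"` to the condition would drop them as well.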

Daniel Povey

Jan 30, 2019, 1:07:06 PM
to kaldi-help
It's possible that all you need is steps/get_train_ctm.sh.


--
Go to http://kaldi-asr.org/forums.html find out how to join
---
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+...@googlegroups.com.
To post to this group, send email to kaldi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kaldi-help/4a1e7dea-b825-43fa-9220-648af4eaec01%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Mikel Esparza

Jan 31, 2019, 7:21:47 AM
to kaldi...@googlegroups.com
Thanks for your quick response! I have tried the script steps/get_train_ctm.sh and it works fine!

Now I have problems with long audio files. The script steps/nnet3/align.sh fails if the audio is too long. What should I do to be able to handle long audio files?

Thanks in advance.

Daniel Povey

Jan 31, 2019, 1:35:08 PM
to kaldi-help
How does it fail?


Mikel Esparza

Feb 1, 2019, 5:30:05 AM
to kaldi...@googlegroups.com
Hi Dan, thank you for your quick response.

I get this error:

LOG (nnet3-align-compiled[5.5]:CheckAndFixConfigs():nnet-am-decodable-simple.cc:294) Increasing --frames-per-chunk from 50 to 51 to make it a multiple of --frame-subsampling-factor=3
WARNING (nnet3-align-compiled[5.5]:AlignUtteranceWrapper():decoder-wrappers.cc:501) Retrying utterance AudioCortesAragon1-00002500-00005500-1 with beam 40
WARNING (nnet3-align-compiled[5.5]:AlignUtteranceWrapper():decoder-wrappers.cc:510) Did not successfully decode file AudioCortesAragon1-00002500-00005500-1, len = 825
WARNING (nnet3-align-compiled[5.5]:AlignUtteranceWrapper():decoder-wrappers.cc:501) Retrying utterance AudioCortesAragon1-00010000-00013000-2 with beam 40
WARNING (nnet3-align-compiled[5.5]:AlignUtteranceWrapper():decoder-wrappers.cc:510) Did not successfully decode file AudioCortesAragon1-00010000-00013000-2, len = 365
WARNING (nnet3-align-compiled[5.5]:AlignUtteranceWrapper():decoder-wrappers.cc:501) Retrying utterance AudioCortesAragon1-00047500-00050500-1 with beam 40
WARNING (nnet3-align-compiled[5.5]:AlignUtteranceWrapper():decoder-wrappers.cc:510) Did not successfully decode file AudioCortesAragon1-00047500-00050500-1, len = 1000
LOG (compile-train-graphs[5.5]:main():compile-train-graphs.cc:147) compile-train-graphs: succeeded for 19 graphs, failed for 0
LOG (apply-cmvn[5.5]:main():apply-cmvn.cc:81) Copied 19 utterances.
LOG (nnet3-align-compiled[5.5]:main():nnet3-align-compiled.cc:198) Overall log-likelihood per frame is 4.60517 over 12988 frames.
LOG (nnet3-align-compiled[5.5]:main():nnet3-align-compiled.cc:201) Retried 3 out of 19 utterances.
LOG (nnet3-align-compiled[5.5]:main():nnet3-align-compiled.cc:203) Done 16, errors on 3
LOG (nnet3-align-compiled[5.5]:~CachingOptimizingCompiler():nnet-optimize.cc:710) 0.103 seconds taken in nnet3 compilation total (breakdown: 0.0323 compilation, 0.0541 optimization, 0 shortcut expansion, 0.0114 checking, 0.000498 computing indexes, 0.00514 misc.) + 0 I/O.

Since I'm not going to train any model with this data, I would like to align the full transcript text with the audio file.

Thanks in advance.

Daniel Povey

Feb 1, 2019, 1:30:17 PM
to kaldi-help
You could try increasing the beam and retry-beam. It could indicate a problem with the data or transcripts, though.
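
[Editor's sketch] To check whether the data or transcripts are at fault, it can help to list exactly which utterances failed so they can be inspected by hand. A minimal sketch, assuming the alignment log has been saved to a file (the name align.1.log is hypothetical) and contains warnings in the format quoted earlier in this thread:

```shell
work=$(mktemp -d)

# Warning lines copied from the log posted above (abridged to the failures).
cat > "$work/align.1.log" <<'EOF'
WARNING (nnet3-align-compiled[5.5]:AlignUtteranceWrapper():decoder-wrappers.cc:501) Retrying utterance AudioCortesAragon1-00002500-00005500-1 with beam 40
WARNING (nnet3-align-compiled[5.5]:AlignUtteranceWrapper():decoder-wrappers.cc:510) Did not successfully decode file AudioCortesAragon1-00002500-00005500-1, len = 825
WARNING (nnet3-align-compiled[5.5]:AlignUtteranceWrapper():decoder-wrappers.cc:510) Did not successfully decode file AudioCortesAragon1-00010000-00013000-2, len = 365
WARNING (nnet3-align-compiled[5.5]:AlignUtteranceWrapper():decoder-wrappers.cc:510) Did not successfully decode file AudioCortesAragon1-00047500-00050500-1, len = 1000
LOG (nnet3-align-compiled[5.5]:main():nnet3-align-compiled.cc:203) Done 16, errors on 3
EOF

# Print "<utt-id> <len>" for every utterance that could not be aligned.
sed -n 's/.*Did not successfully decode file \([^,]*\), len = \([0-9]*\).*/\1 \2/p' \
  "$work/align.1.log" > "$work/failed_utts.txt"

cat "$work/failed_utts.txt"
```

The resulting utterance IDs can then be looked up in the data directory's text and segments files to see whether the transcript plausibly matches the audio in that time range.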

Mikel Esparza

Feb 5, 2019, 7:14:29 AM
to kaldi...@googlegroups.com
Analyzing the data, I've realized that there are 2 minutes of silence in the audio. To try to solve this, I thought of using a SAD model to remove the silence from the audio, and I used a pretrained one. Now I have a segments file which contains only the speech parts.
The problem is that I have the transcript of the full audio file. Is there a way to tell Kaldi to align the full transcript, but only against the parts where the SAD model found speech?

The only solution I see is to create a new audio file with the silent parts cropped out and align that with the full transcript, but maybe there is a way to specify the same thing to Kaldi through metadata.

Thanks in advance!

Daniel Povey

Feb 5, 2019, 11:55:04 AM
to kaldi-help
I don't know of an easy way to do that.

Vimal Manohar

Feb 5, 2019, 12:26:26 PM
to kaldi-help
You can use segment_long_utterances.sh, giving the whole recording-level transcript as the text for the segments. You can do this manually or by using the <text-in> <utt2text> arguments.
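
[Editor's sketch] One possible reading of the "manual" option is to write a text file in which every SAD segment carries the full transcript of its recording. The sketch below assumes a segments file with lines of the form <utt-id> <rec-id> <start> <end> and a per-recording transcript file with lines <rec-id> <words...>; the file name rec_text, the IDs and the transcript words are placeholders.

```shell
work=$(mktemp -d)

# Hypothetical SAD output: <utt-id> <rec-id> <start> <end>
cat > "$work/segments" <<'EOF'
rec1-0000000-0003000 rec1 0.00 30.00
rec1-0015000-0018000 rec1 150.00 180.00
EOF

# Hypothetical recording-level transcript: <rec-id> <words...>
cat > "$work/rec_text" <<'EOF'
rec1 hello world this is the full transcript
EOF

# First pass stores each recording's transcript; second pass emits
# "<utt-id> <full transcript of its recording>" for every segment.
awk 'NR == FNR { id = $1; sub(/^[^ ]+ +/, ""); trans[id] = $0; next }
     ($2 in trans) { print $1, trans[$2] }' \
  "$work/rec_text" "$work/segments" > "$work/text"

cat "$work/text"
```

The resulting text file would sit alongside the segments file in the data directory passed to segment_long_utterances.sh.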

Mikel Esparza

Feb 8, 2019, 3:32:09 AM
to kaldi...@googlegroups.com
It worked perfectly fine!

Thank you so much
