Creating new training material


Armin Oliya

Aug 16, 2017, 10:39:04 AM
to kaldi-help
Hi Guys,


I'm planning to add more in-domain training data for my acoustic model, 
and I'm exploring options to achieve this with as little manual intervention (hand labeling) as possible. 


So assume we have a long audio recording with multiple speakers, and a corresponding transcription file with no strict structure (no metadata; think of it as a plain txt file). The question is: what are some ways to get the segments and utt2spk files (semi-)automatically?

I've experimented with clean_and_segment_data.sh, and I hope there is a combination of similar tools that can help with this.


Steps I can think of: 
  • Speaker diarization (LIUM) with aggressive segmentation (short segments)
  • Create a biased LM based on the reference transcription
  • Decode each segment with the biased LM and find the most probable chunk of text in the reference transcription
  • Clean up segments with clean_and_segment_data.sh

Of course the results of diarization won't be highly accurate, so I wonder what the effect of bad speaker labels is, 
and whether certain recipes are more robust against that. 


Appreciate your feedback :)

Vimal Manohar

Aug 16, 2017, 2:04:44 PM
to kaldi-help
There is a recipe for this in:
https://github.com/kaldi-asr/kaldi/blob/master/egs/tedlium/s5_r2_wsj/local/tuning/run_segmentation_wsj_e.sh

You can use the 7-argument format of steps/cleanup/segment_long_utterances.sh if you want to make use of the diarized segments.

The exact speaker labels or diarization are not required if you are doing i-vector based speaker adaptation for neural network training. You can experiment with utils/data/modify_speaker_info.sh to merge nearby utterances into pseudo-speakers if some utterances / clusters are too small.
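
For reference, a hypothetical invocation of that script (using the --utts-per-spk-max option that also appears later in this thread; note Dan's correction below that the script currently only splits speakers, it does not merge them):

utils/data/modify_speaker_info.sh --utts-per-spk-max 2 data/train data/train_max2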

Vimal


--
Go to http://kaldi-asr.org/forums.html find out how to join
---
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
--
Vimal Manohar
PhD Student
Electrical & Computer Engineering
Johns Hopkins University

Daniel Povey

Aug 16, 2017, 2:10:44 PM
to kaldi-help, David Snyder
[Noticed Vimal replied, but I had mostly written my reply so will send it anyway.]

There is a script for this; Vimal did it not long ago. It's
steps/cleanup/segment_long_utterances.sh.
That will take your long reference transcript and find the spoken
segments, aligning the audio with the text. Look in
egs/wsj/s5/local/run_segmentation_long_utts.sh for an example of its
usage. You can follow it with steps/cleanup/clean_and_segment_data.sh
to further clean the data (since segment_long_utterances.sh does not
take much care to remove segments where the transcript might have
errors).
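
A minimal sketch of that two-step flow, with hypothetical directory
names (see the WSJ/tedlium example scripts for realistic settings):

steps/cleanup/segment_long_utterances.sh --nj 20 --cmd "$train_cmd" \
  exp/tri3 data/lang data/train_long data/train_reseg exp/segment_long_utts_train
steps/cleanup/clean_and_segment_data.sh --nj 20 --cmd "$train_cmd" \
  data/train_reseg data/lang exp/tri3 exp/tri3_cleanup data/train_cleaned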

This does not give you any speaker information, though. You could
just have the utt2spk be a one-to-one map for initial experiments.
For neural net training it's not important to have accurate speaker
information (it's not used except to the extent that it affects the
ivectors, and for training we anyway split up the speakers into groups
of at most 2 utterances; search for max2). To get basic speaker
information you could arbitrarily group pairs of successive segments
as the same speaker. If you extract ivectors for each utterance you
could then split apart those segment-pairs where the ivectors are
unusually far apart; that will eliminate most of the gross
problems/inaccuracies that this causes.
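
As a crude sketch of that pairing, assuming a standard sorted segments
file whose second field is the recording id (the file names here are
hypothetical):

# Assign each pair of successive segments within a recording to one pseudo-speaker.
awk '{if ($2 != prev) {n = 0; prev = $2} printf("%s %s-pspk%03d\n", $1, $2, int(n/2)); n++}' \
  data/train_reseg/segments > data/train_reseg/utt2spk
utils/utt2spk_to_spk2utt.pl data/train_reseg/utt2spk > data/train_reseg/spk2utt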

David, I'm cc'ing you because we should consider this as one of the
potential applications of diarization. It's actually an easy case
because for this application we don't care much about splitting
speakers up (for neural net training we do this anyway).

[Note to Vimal: modify_speaker_info.sh only splits up speakers, it
doesn't merge utterances that were previously from different speakers.
Unless there are changes we haven't checked in yet].

Armin Oliya

Sep 1, 2017, 9:48:34 AM
to kaldi-help, dpo...@gmail.com
Thank you both! Really helpful comments. 


I did follow Vimal's recipe and managed to get accurate alignments. 
So just to be clear: even if, after segmenting/cleaning, multiple speakers end up speaking in a single (short) utterance, that shouldn't really affect the results (using nnet3)?



So far I've been training/extracting i-vectors on a dataset with accurate speaker info, as follows: 

steps/online/nnet2/train_ivector_extractor.sh ...
utils/data/modify_speaker_info.sh --utts-per-spk-max ...  # for more diversity
steps/online/nnet2/extract_ivectors_online.sh ... splitted_speakers ...


Can I assume I can keep using the above as is, then? If I'm not mistaken, I should even leave --respect-speaker-info as true: although my utterance<>speaker info is wrong, each recording does have a unique speaker id assigned. 

Daniel Povey

Sep 1, 2017, 2:28:52 PM
to Armin Oliya, kaldi-help
Yes, leave that the same.  It's not important for the speakers to be 'pure'.

Armin Oliya

Sep 1, 2017, 4:08:16 PM
to kaldi-help, dpo...@gmail.com
Got it, thanks!

Armin Oliya

Sep 8, 2017, 5:55:38 AM
to kaldi-help
So I'm trying the recipe on a new data set, and I'm stopped at the highlighted line below: 


steps/cleanup/make_biased_lm_graphs.sh: creating utterance-group-specific decoding graphs with biased LMs
steps/cleanup/segment_long_utterances.sh: Decoding with biased language models...
steps/cleanup/decode_fmllr_segmentation.sh --beam 15.0 --lattice-beam 1.0 --nj 20 --cmd run.pl --mem 20G --mem 4G --skip-scoring true --allow-partial false exp/segment_long_utts_1e_train/graphs_uniform_seg exp/segment_long_utts_1e_train/train_uniform_seg exp/segment_long_utts_1e_train/lats
filter_scps.pl: warning: some input lines were output to multiple files [OK if splitting per utt]
steps/cleanup/decode_segmentation.sh --scoring-opts  --num-threads 1 --skip-scoring true --acwt 0.083333 --nj 20 --cmd run.pl --mem 20G --mem 4G --beam 10.0 --model exp/segment_long_utts_1e_train/final.alimdl --max-active 2000 exp/segment_long_utts_1e_train/graphs_uniform_seg exp/segment_long_utts_1e_train/train_uniform_seg exp/segment_long_utts_1e_train/lats.si
steps/cleanup/decode_segmentation.sh: feature type is lda
steps/cleanup/decode_fmllr_segmentation.sh: feature type is lda
steps/cleanup/decode_fmllr_segmentation.sh: getting first-pass fMLLR transforms.
steps/cleanup/decode_fmllr_segmentation.sh: doing main lattice generation phase
steps/cleanup/decode_fmllr_segmentation.sh: estimating fMLLR transforms a second time.
steps/cleanup/decode_fmllr_segmentation.sh: doing a final pass of acoustic rescoring.
steps/cleanup/internal/get_ctm.sh --lmwt 10 --cmd run.pl --mem 20G --mem 4G --print-silence true exp/segment_long_utts_1e_train/train_uniform_seg data/langp exp/segment_long_utts_1e_train/lats
steps/cleanup/segment_long_utterances.sh: using default values of non-scored words...
run.pl: 1 / 20 failed, log is in exp/segment_long_utts_1e_train/lats/log/get_ctm_edits.*.log


# steps/cleanup/internal/stitch_documents.py --query2docs=exp/segment_long_utts_1e_train/query_docs/split20/relevant_docs.3.txt --input-documents=exp/segment_long_utts_1e_train/docs/split20/docs.3.txt --output-documents=- | steps/cleanup/internal/align_ctm_ref.py --eps-symbol="<eps>" --oov-word='<unk>' --symbol-table=data/langp/words.txt --hyp-format=CTM --align-full-hyp=false --hyp=exp/segment_long_utts_1e_train/lats/score_10/train_uniform_seg.ctm.3 --ref=- --output=exp/segment_long_utts_1e_train/lats/score_10/train_uniform_seg.ctm_edits.3 
# Started at Tue Sep  5 19:43:52 CEST 2017
#
2017-09-05 19:44:09,006 [steps/cleanup/internal/align_ctm_ref.py:599 - main - ERROR ] Failed to align ref and hypotheses; got exception 
Traceback (most recent call last):
  File "steps/cleanup/internal/align_ctm_ref.py", line 596, in main
    run(args)
  File "steps/cleanup/internal/align_ctm_ref.py", line 540, in run
    for reco, ref_text in read_text(args.ref_in_file):
  File "steps/cleanup/internal/align_ctm_ref.py", line 128, in read_text
    "".format(line, text_file.name))
RuntimeError: Did not get enough columns; line epspk1219-epf18bc63a.1-000000-003000 
 in <stdin>
2017-09-05 19:44:09,047 [steps/cleanup/internal/stitch_documents.py:136 - run - ERROR ] Error processing line epspk1230-ep18a49f30.1-010000-012200 epspk1230-ep18a49f30.1,1.00,1.00
 in file exp/segment_long_utts_1e_train/query_docs/split20/relevant_docs.3.txt
2017-09-05 19:44:09,047 [steps/cleanup/internal/stitch_documents.py:147 - main - ERROR ] Failed to stictch document; got error 
Traceback (most recent call last):
  File "steps/cleanup/internal/stitch_documents.py", line 144, in main
    run(args)
  File "steps/cleanup/internal/stitch_documents.py", line 133, in run
    file=args.output_documents)
IOError: [Errno 32] Broken pipe



Could it be that segment_long_utterances.sh is over-segmenting, resulting in segments with no transcript?

Vimal Manohar

Sep 8, 2017, 3:15:59 PM
to kaldi...@googlegroups.com
This should now be fixed with the latest commit. It seems that your recording epspk1219-epf18bc63a.1 had zero or one word. 

Vimal

Armin Oliya

Sep 8, 2017, 6:27:01 PM
to kaldi-help
Yup, sorry, my bad. I checked and the transcript was indeed empty. Thanks for the fix :)

Armin Oliya

Sep 20, 2017, 7:09:50 AM
to kaldi-help
Hi Dan, 

I finished running the recipe on data from a TV show (the text file derived from subtitles), and the results look generally OK. 

However, a notable number of segments include a few unspoken words, miss spoken words (usually at the beginning or end of the segment, but also in the middle), or end before the last word is completely spoken. There are also a few segments that are quite inaccurate, not matching the spoken audio at all or missing a notable number of words. 

My question is: is this level of quality good enough to include this new material in training? 


Thanks!

Vimal Manohar

Sep 20, 2017, 11:33:35 AM
to kaldi-help


On Wed, Sep 20, 2017, 07:09 Armin Oliya <armin...@gmail.com> wrote:
Hi Dan, 

I finished running the recipe on data from a TV show (the text file derived from subtitles), and the results look generally OK. 

However, a notable number of segments include a few unspoken words, miss spoken words (usually at the beginning or end of the segment, but also in the middle), or end before the last word is completely spoken. There are also a few segments that are quite inaccurate, not matching the spoken audio at all or missing a notable number of words. 

There is an option --align-full-hyp. You have to set it to true in order to always align the full hypothesis rather than ignoring the beginning and end even if they have some errors. The default might be false. 
Segments might be wrong if the decoding result is bad, such as with an out-of-domain model. Redoing this with an in-domain model would help. Also, you can try increasing --max-words to a very large value so that the whole closed caption is considered instead of retrieving only the closest 1000 words. 

My question is: is this level of quality good enough to include this new material in training? 

It can be used as a first pass. And you can add a pass of cleanup or redo the alignment using newly trained in-domain models. 
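
Putting those two suggestions together, a re-run might look something
like this (hypothetical paths, and assuming both options are accepted by
segment_long_utterances.sh as Vimal's reply suggests):

steps/cleanup/segment_long_utterances.sh --nj 20 --cmd "$train_cmd" \
  --align-full-hyp true --max-words 1000000 \
  exp/tri4 data/lang data/train_long data/train_reseg exp/segment_long_utts_train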

Daniel Povey

Sep 20, 2017, 12:44:17 PM
to kaldi-help
Also, after steps/cleanup/segment_long_utterances.sh you are supposed
to use steps/cleanup/clean_and_segment_data.sh; I don't know if you
ran that stage.

Armin Oliya

Sep 21, 2017, 6:01:49 AM
to kaldi-help
Thank you both for the feedback. 

@Vimal, I'll try --align-full-hyp then. 

The tri4 model that I'm using is based on 400 hours of audio of mixed acoustic types (phone, TV shows, news, interviews, different accents, ...), but I'd say people speak much faster and more naturally in the TV show that I'm trying to clean. 

Reading your comment actually makes me think: if the output quality depends on the model used, how much would the alignment output help improve the model as 'new' training material? 

@Dan, yes I did that. My steps: 
  1. segment_long_utterances
  2. align_fmllr
  3. train_sat
  4. above steps, one more time
  5. clean_and_segment_data
(I skipped the get_prons step, as I already used a langp directory with pronunciation probabilities from my large corpus.)

Daniel Povey

Sep 21, 2017, 2:38:10 PM
to kaldi-help
I think it's better if you run clean_and_segment_data.sh directly
after segment_long_utterances.sh without retraining the model (you may
need to align the data first, though; I forget the exact workflow). If you
are training the models on dirty data, they will learn to align that
dirty data, and clean_and_segment_data.sh won't work as well.
Dan
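
If that alignment step is needed, it would look something like this
(hypothetical paths, reusing the tri4 system mentioned earlier in the
thread):

steps/align_fmllr.sh --nj 20 --cmd "$train_cmd" \
  data/train_reseg data/lang exp/tri4 exp/tri4_ali_reseg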

Jaskaran Singh Puri

May 29, 2019, 9:07:09 AM
to kaldi-help
I'm trying to run an audio file through steps/cleanup/segment_long_utterances.sh, and it gives me the following error:

steps/cleanup/segment_long_utterances.sh: using default values of non-scored words...
2019-05-29 06:19:28,145 [steps/cleanup/internal/get_non_scored_words.py:98 - read_lang - ERROR ] problem reading file data/lang//words.txt.int

Traceback (most recent call last):
  File "steps/cleanup/internal/get_non_scored_words.py", line 108, in <module>
    read_lang(args.lang)
  File "steps/cleanup/internal/get_non_scored_words.py", line 93, in read_lang
    for line in open(lang_dir + '/words.txt', encoding='utf-8').readlines():
  File "/usr/local/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc9 in position 6792: invalid continuation byte

While checking the terminal output, I also see my words getting replaced by <SIL> by sym2int.pl; I'm not sure if this is what's causing the problem.

utils/data/get_utt2dur.sh: computed data/train_long_segmented/workdir/train_long_uniform_seg.temp/utt2dur
utils/data/modify_speaker_info.sh: copied data from data/train_long_segmented/workdir/train_long_uniform_seg.temp to data/train_long_segmented/workdir/train_long_uniform_seg, number of speakers changed from 1 to 2
utils/validate_data_dir.sh: Successfully validated data-directory data/train_long_segmented/workdir/train_long_uniform_seg
fix_data_dir.sh: kept all 2 utterances.
fix_data_dir.sh: old files are kept in data/train_long_segmented/workdir/train_long_uniform_seg/.backup
steps/compute_cmvn_stats.sh data/train_long_segmented/workdir/train_long_uniform_seg/
Succeeded creating CMVN stats for train_long_uniform_seg
steps/cleanup/segment_long_utterances.sh: Stage 3 (Building biased-language-model decoding graphs)
steps/cleanup/make_biased_lm_graphs.sh --nj 1 --cmd run.pl data/train_long//text data/lang/ data/train_long_segmented/workdir data/train_long_segmented/workdir/graphs
---------------------
data/train_long//text
data/lang/
sym2int.pl: replacing i with 696
sym2int.pl: replacing would with 696
sym2int.pl: replacing like with 696
sym2int.pl: replacing to with 696
sym2int.pl: replacing have with 696
sym2int.pl: replacing a with 696
sym2int.pl: replacing corrected with 696
sym2int.pl: replacing statement with 696
sym2int.pl: replacing from with 696
sym2int.pl: replacing aarp with 696
sym2int.pl: replacing concerning with 696
sym2int.pl: replacing my with 696
sym2int.pl: replacing prescription with 696
sym2int.pl: replacing drug with 696
sym2int.pl: replacing summary with 696
sym2int.pl: replacing that with 696
sym2int.pl: replacing is with 696
sym2int.pl: replacing sent with 696
sym2int.pl: replacing to with 696
sym2int.pl: replacing me with 696
sym2int.pl: not warning for OOVs any more times
** Replaced 137 instances of OOVs with 696

steps/cleanup/make_biased_lm_graphs.sh: creating utterance-group-specific decoding graphs with biased LMs
steps/cleanup/segment_long_utterances.sh: Decoding with biased language models...

696 is <SIL> in my words.txt. What could be wrong here?

Vimal Manohar

May 29, 2019, 11:34:27 AM
to kaldi-help
What language is this in? Maybe there is a non-UTF-8 character in words.txt.
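
One quick way to look for such bytes, assuming GNU grep under a UTF-8
locale (it prints, with line numbers, the lines that are not valid UTF-8):

grep -naxv '.*' data/lang/words.txt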




--
Vimal Manohar
PhD Student
Center for Language and Speech Processing
Johns Hopkins University
Baltimore, MD

Jaskaran Singh Puri

May 29, 2019, 11:43:41 AM
to kaldi-help
The language is English.
I'll check for non-UTF-8 characters, but isn't the replacement of all words by <SIL> an issue?

Vimal Manohar

May 29, 2019, 12:34:37 PM
to kaldi-help
Usually there will be an OOV token like <unk>, and words not in the vocabulary will be replaced by that. You can set this when using prepare_lang.sh to prepare the lang directory. But in your case you are using <SIL>, which I assume is the silence word, as the OOV token. This may give worse results.
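
For reference, the OOV word is the second positional argument to
prepare_lang.sh; a hypothetical invocation with <unk> would be (the OOV
word must also appear in your lexicon):

utils/prepare_lang.sh data/local/dict "<unk>" data/local/lang_tmp data/lang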


Daniel Povey

May 29, 2019, 12:38:01 PM
to kaldi-help
We definitely need to fix get_non_scored_words.py in case it's not handling utf-8 properly.
Vimal, perhaps you could take a look and see if the issue is an obvious one?


Vimal Manohar

May 29, 2019, 2:12:06 PM
to kaldi-help
The script already reads the file as UTF-8. I made a PR to write the output file as UTF-8 too, but the error seen here is not related to that. There is probably some non-UTF-8 or other invalid character in words.txt.

Vimal



Daniel Povey

May 29, 2019, 2:19:10 PM
to kaldi-help, Jan Trmal
Hm.  Perhaps we should include UTF-8 validation in validate_data_dir.sh and fix_data_dir.sh?
We seem to be basically accepting at this point that UTF-8 is the universal encoding, so it makes sense
to have that reflected in those scripts.  If we had done that, this error would have been caught earlier.
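
A minimal sketch of such a check, in the same spirit (hypothetical; not
something the scripts currently do; iconv exits nonzero on invalid input):

# Fail if any text-like file in a data directory contains invalid UTF-8.
for f in text utt2spk spk2utt wav.scp; do
  if [ -f data/train/$f ] && ! iconv -f utf-8 -t utf-8 data/train/$f >/dev/null 2>&1; then
    echo "validate: data/train/$f contains invalid UTF-8" && exit 1
  fi
done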

Yenda, do you have any time for this kind of thing?  

Jaskaran Singh Puri

Jun 12, 2019, 10:36:48 AM
to kaldi-help
So I've been using steps/cleanup/clean_and_segment_data.sh and segment_long_utterances.sh for some time now.
However, since these scripts take a SAT model, the decoding part is very slow: my file count is around 300,000, and I estimate it'd take a couple of days to decode all the files.

Is there some way I can make these scripts use the GPU for the decoding part, or in fact make the entire script use the GPU?

Daniel Povey

Jun 12, 2019, 10:38:11 AM
to kaldi-help
No, that's not possible.
