I am trying to wrap my head around the file utt2uniq. It is not well documented (there is no mention in src/doc, so the scripts are the only reference).
I understand that it assigns a unique ID to a group of utterances that originate, via some perturbation or augmentation, from a single original utterance. For example, steps/nnet2/get_perturbed_feats.sh has
# In the combined feature directory, create a file utt2uniq which maps
# our extended utterance-ids to "unique utterances". This enables the
# script steps/nnet2/get_egs.sh to hold out data in a more proper way.
when applying VTLN. I cannot find where it would be created automatically in the same manner for nnet3 models; the nnet3/chain scripts go to some lengths to maintain the file, but do not create it when perturbing data (e.g. utils/perturb_data_dir_speed.sh). Some chain recipes do create it, though (chime4/s5_1ch, the swbd e2e examples), and steps/data/reverberate_data_dir.py creates it, too.
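For concreteness, this is what I understand the format to be (the utterance ids here are made up): each line maps an extended utterance id to the id of the "real" utterance it was derived from, and originals, when kept in the set, map to themselves:

dsp-spk1_utt001 spk1_utt001
dsp-spk1_utt002 spk1_utt002
spk1_utt001 spk1_utt001
spk1_utt002 spk1_utt002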
The file is ultimately consumed in steps/nnet3/chain/get_egs.sh, with this comment:
if [ -f $data/utt2uniq ]; then # this matters if you use data augmentation.
# because of this stage we can again have utts with lengths less than
# frames_per_eg
echo "File $data/utt2uniq exists, so augmenting valid_uttlist to"
echo "include all perturbed versions of the same 'real' utterances."
Why does it matter? What is the "more proper" way, i.e. what is wrong with treating a perturbed utterance just like a separate, bona fide novel utterance from a different speaker?
When I'm adding modified speech (some DSP, like transmission line modeling etc., nothing like reverberation) as if it came from new
speakers to the original training set, I'm getting no improvement at best, and more often a degradation, on both unperturbed and even similarly perturbed dev sets ("online" ivectors only, no per-speaker ivectors). I'm running quick experiments on a small dataset, ~15 hours, but I have confirmed that I am not overfitting. I'm now limiting myself to signal processing that does not shift the signal much, say 200 µs of group delay at most (not hard to compensate for, I just have not gotten there yet), using the corresponding "clean" alignments, and I am still seeing a degradation! And I'm not using the utt2uniq grouping. So this mapping may well matter in the end, and DNN training is not as omnivorous as I had come to think, but I cannot understand why. This is the only thing that seems to differ between my pipeline and the egs recipes that use multiple mics or reverberation.
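In case the grouping is the culprit, creating utt2uniq for my pipeline looks easy enough. Something like this is what I would try (the directory layout and the "dsp-" id prefix are made up for illustration):

# originals map to themselves
awk '{print $1, $1}' data/train/utt2spk > data/train/utt2uniq
# each DSP'd copy maps back to the utterance it was derived from
awk '{orig = $1; sub(/^dsp-/, "", orig); print $1, orig}' \
  data/train_dsp/utt2spk > data/train_dsp/utt2uniq

and then combine the directories as usual; from a quick look, utils/combine_data.sh appears to carry utt2uniq through.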
I'd be grateful for any help getting an intuition for this mapping. I haven't tried it yet, but I normally prefer to understand what I am doing...
-kkm