I am trying to wrap my head around the file utt2uniq. It is not well documented (there is no mention in src/doc, so the scripts are the only reference).
I understand that it assigns a unique ID to a group of utterances that originate, via some perturbation or augmentation, from a single original utterance. For example, steps/nnet2/get_perturbed_feats.sh has
# In the combined feature directory, create a file utt2uniq which maps
# our extended utterance-ids to "unique utterances". This enables the
# script steps/nnet2/get_egs.sh to hold out data in a more proper way.
when applying VTLN. I cannot find where it would be created automatically in the same manner for nnet3 models; the nnet3/chain scripts go to some lengths to maintain the file, but do not create it when perturbing data (e.g. utils/perturb_data_dir_speed.sh). Some chain recipes do create it, though (chime4/s5_1ch, the swbd e2e examples), and steps/data/reverberate_data_dir.py creates it, too.
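For concreteness, this is what I understand the format to be (the utterance ids here are made up): each line maps an extended utterance id to the id of the "real" utterance it was derived from, and originals, when kept in the set, map to themselves:

dsp-spk1_utt001 spk1_utt001
dsp-spk1_utt002 spk1_utt002
spk1_utt001 spk1_utt001
spk1_utt002 spk1_utt002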
The file is ultimately consumed in steps/nnet3/chain/get_egs.sh, with this comment:
if [ -f $data/utt2uniq ]; then # this matters if you use data augmentation.
# because of this stage we can again have utts with lengths less than
# frames_per_eg
echo "File $data/utt2uniq exists, so augmenting valid_uttlist to"
echo "include all perturbed versions of the same 'real' utterances."
Why does it matter? What is the "more proper" way, i.e. what is wrong with treating a perturbed utterance just like a separate, bona fide novel utterance from a different speaker?
When I'm adding modified speech (some DSP, like transmission line modeling etc., nothing like reverberation) as if it came from new
speakers to the original training set, I'm getting no improvement at best, and more often a degradation, on both unperturbed and even similarly perturbed dev sets ("online" ivectors only, no per-speaker ivectors). I'm running quick experiments on a small dataset, ~15 hours, but I have confirmed that I am not overfitting. I'm now limiting myself to signal processing that does not shift the signal much, say 200 µs of group delay at most (not hard to compensate for, I just have not gotten there yet), using the corresponding "clean" alignments, and I am still seeing a degradation! And I'm not using the utt2uniq grouping. So this mapping may well matter in the end, and DNN training is not as omnivorous as I had come to think, but I cannot understand why. This is the only thing that seems to differ between my pipeline and the egs recipes that use multiple mics or reverberation.
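In case the grouping is the culprit, creating utt2uniq for my pipeline looks easy enough. Something like this is what I would try (the directory layout and the "dsp-" id prefix are made up for illustration):

# originals map to themselves
awk '{print $1, $1}' data/train/utt2spk > data/train/utt2uniq
# each DSP'd copy maps back to the utterance it was derived from
awk '{orig = $1; sub(/^dsp-/, "", orig); print $1, orig}' \
  data/train_dsp/utt2spk > data/train_dsp/utt2uniq

and then combine the directories as usual; from a quick look, utils/combine_data.sh appears to carry utt2uniq through.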
I'd be grateful for any help getting an intuition for this mapping. I haven't tried it yet, but I normally prefer to understand what I am doing...
-kkm