Looking for help with making a recipe for very different acoustic datasets.

Anton

Jun 25, 2024, 10:42:24 PM
to kaldi-help

I have 3 very different acoustic datasets that I’m trying to combine to train a single chain model. 


  1. Dataset1 — LibriSpeech; I’m using it as a source of well-behaved general English. Reasonably long utterances. 1000 hours.
  2. Dataset2 — native English children’s speech. Reasonably long utterances. 100 hours.
  3. Dataset3 — accented English children’s speech with very short utterances (5 seconds or shorter); some end in the middle of a word. 3000 hours.


All datasets have speaker information.


I’ve put together a training script that roughly follows the LibriSpeech and multi recipes:


  1. Build a language model from the combined text of all three datasets.
  2. Train a monophone model on a 10k subset of Dataset2.
  3. Align (si) and train a deltas model on full Dataset2.
  4. Align (si) and train LDA+MLLT on full Dataset2.
  5. Align (si) and train SAT on full Dataset2.
  6. Align (fMLLR) and train SAT on Dataset2 + Dataset3.
  7. Recompute pronunciations and the language model.
  8. Align (fMLLR) and train SAT on Dataset2 + Dataset3 + a 100-hour subset of Dataset1.
  9. Align (fMLLR) and train SAT on Dataset2 + Dataset3 + full Dataset1.
  10. Train a chain model.
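
For concreteness, steps 2–9 above can be sketched as a dry-run command list. The steps/ and utils/ script names and argument orders below follow the standard Kaldi recipes, but the data paths, leaf/Gaussian counts, and exp/ directory names are illustrative assumptions on my part, not values anyone has verified against my setup:

```python
# Dry-run sketch of the staged GMM training: builds the command sequence
# without executing anything, so the ordering can be inspected.
def build_stages(d1="data/dataset1", d2="data/dataset2", d3="data/dataset3"):
    d23 = "data/dataset2_3"      # combined via utils/combine_data.sh
    d123 = "data/dataset1_2_3"
    return [
        # 2. monophone model on a 10k-utterance subset of Dataset2
        ["utils/subset_data_dir.sh", d2, "10000", f"{d2}_10k"],
        ["steps/train_mono.sh", f"{d2}_10k", "data/lang", "exp/mono"],
        # 3. align (speaker-independent) and train a deltas model
        ["steps/align_si.sh", d2, "data/lang", "exp/mono", "exp/mono_ali"],
        ["steps/train_deltas.sh", "2500", "15000", d2, "data/lang",
         "exp/mono_ali", "exp/tri1"],
        # 4. LDA+MLLT on top of the deltas model
        ["steps/align_si.sh", d2, "data/lang", "exp/tri1", "exp/tri1_ali"],
        ["steps/train_lda_mllt.sh", "4000", "50000", d2, "data/lang",
         "exp/tri1_ali", "exp/tri2"],
        # 5. first SAT pass, still on Dataset2 only
        ["steps/align_si.sh", d2, "data/lang", "exp/tri2", "exp/tri2_ali"],
        ["steps/train_sat.sh", "5000", "100000", d2, "data/lang",
         "exp/tri2_ali", "exp/tri3"],
        # 6. add Dataset3 and retrain SAT from fMLLR alignments
        ["utils/combine_data.sh", d23, d2, d3],
        ["steps/align_fmllr.sh", d23, "data/lang", "exp/tri3", "exp/tri3_ali"],
        ["steps/train_sat.sh", "7000", "150000", d23, "data/lang",
         "exp/tri3_ali", "exp/tri4"],
        # 8./9. fold in Dataset1 and repeat
        ["utils/combine_data.sh", d123, d23, d1],
        ["steps/align_fmllr.sh", d123, "data/lang", "exp/tri4", "exp/tri4_ali"],
        ["steps/train_sat.sh", "10000", "300000", d123, "data/lang",
         "exp/tri4_ali", "exp/tri5"],
    ]
```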


I feel seriously out of my depth here, and I see at least a couple of issues with my current setup:

  1. 20% of Dataset3 utterances do not end in silence. I’m planning to append 0.2 seconds of silence to those recordings. Questions:
    1. Is it worth doing, or should I just remove those utterances from the training data?
    2. Can I append plain (zero-amplitude) silence, or should I extract a silent segment from each recording (e.g., using sox) so the padding matches the recording’s background?
  2. I see a lot of warnings when trying to use fMLLR on Dataset3 utterances; they are just too short. Looking through the group archive, Dan recommends basis_fmllr for such utterances. How can I incorporate that into the training process? Should I simply replace fMLLR with basis_fmllr everywhere?
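
For question 1.2, here is a stdlib-only sketch of both padding options. The function names and the 0.2 s default are mine, and it assumes 16-bit PCM WAV input:

```python
# Sketch of the two padding options: pure zeros vs. a quiet segment
# copied from the recording itself. Assumes 16-bit PCM WAV files.
import array
import wave

def pad_with_zeros(in_path, out_path, seconds=0.2):
    """Append `seconds` of digital (all-zero) silence to a WAV file."""
    with wave.open(in_path, "rb") as win:
        params = win.getparams()
        frames = win.readframes(win.getnframes())
    n_pad = int(params.framerate * seconds) * params.nchannels
    pad = array.array("h", [0] * n_pad).tobytes()  # "h" = 16-bit samples
    with wave.open(out_path, "wb") as wout:
        wout.setparams(params)  # nframes in the header is fixed up on close
        wout.writeframes(frames + pad)

def quietest_window(in_path, seconds=0.2):
    """Return the lowest-energy window of the file as raw frames,
    usable as recording-matched 'silence' instead of pure zeros."""
    with wave.open(in_path, "rb") as win:
        params = win.getparams()
        samples = array.array("h", win.readframes(win.getnframes()))
    win_len = int(params.framerate * seconds) * params.nchannels
    best_start, best_energy = 0, float("inf")
    # non-overlapping windows keep the scan cheap; good enough here
    for start in range(0, max(1, len(samples) - win_len), win_len):
        energy = sum(s * s for s in samples[start:start + win_len])
        if energy < best_energy:
            best_start, best_energy = start, energy
    return samples[best_start:best_start + win_len].tobytes()
```

With sox installed, the zero-padding case is simply `sox in.wav out.wav pad 0 0.2`.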


Any other suggestions on how I should arrange the training process?
