Looking for help with making a recipe for very different acoustic datasets.

Anton

Jun 25, 2024, 10:42:24 PM
to kaldi-help

I have 3 very different acoustic datasets that I’m trying to combine to train a single chain model. 


  1. Dataset1 — LibriSpeech; I’m using it as a source of well-behaved general English. Reasonably long utterances. 1000 hours.
  2. Dataset2 — native English children’s speech. Reasonably long utterances. 100 hours.
  3. Dataset3 — accented English children’s speech with very short utterances (5 seconds or shorter); some end in the middle of a word. 3000 hours.


All datasets have speaker information.


I’ve put together a training script that roughly follows the LibriSpeech and multi recipes:


  1. Build a language model from the combined text of all three datasets.
  2. Train a monophone model on a 10k subset of Dataset2.
  3. Align (si) and train a deltas model on full Dataset2.
  4. Align (si) and train LDA+MLLT on full Dataset2.
  5. Align (si) and train SAT on full Dataset2.
  6. Align (fMLLR) and train SAT on Dataset2 + Dataset3.
  7. Recompute pronunciations and the language model.
  8. Align (fMLLR) and train SAT on Dataset2 + Dataset3 + a 100-hour subset of Dataset1.
  9. Align (fMLLR) and train SAT on Dataset2 + Dataset3 + full Dataset1.
  10. Train a chain model.
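
For concreteness, steps 2–9 above can be sketched as a dry-run command list. The steps/ and utils/ script names and argument orders below follow the standard Kaldi recipes, but the data paths, leaf/Gaussian counts, and exp/ directory names are illustrative assumptions on my part, not values anyone has verified against my setup:

```python
# Dry-run sketch of the staged GMM training: builds the command sequence
# without executing anything, so the ordering can be inspected.
def build_stages(d1="data/dataset1", d2="data/dataset2", d3="data/dataset3"):
    d23 = "data/dataset2_3"      # combined via utils/combine_data.sh
    d123 = "data/dataset1_2_3"
    return [
        # 2. monophone model on a 10k-utterance subset of Dataset2
        ["utils/subset_data_dir.sh", d2, "10000", f"{d2}_10k"],
        ["steps/train_mono.sh", f"{d2}_10k", "data/lang", "exp/mono"],
        # 3. align (speaker-independent) and train a deltas model
        ["steps/align_si.sh", d2, "data/lang", "exp/mono", "exp/mono_ali"],
        ["steps/train_deltas.sh", "2500", "15000", d2, "data/lang",
         "exp/mono_ali", "exp/tri1"],
        # 4. LDA+MLLT on top of the deltas model
        ["steps/align_si.sh", d2, "data/lang", "exp/tri1", "exp/tri1_ali"],
        ["steps/train_lda_mllt.sh", "4000", "50000", d2, "data/lang",
         "exp/tri1_ali", "exp/tri2"],
        # 5. first SAT pass, still on Dataset2 only
        ["steps/align_si.sh", d2, "data/lang", "exp/tri2", "exp/tri2_ali"],
        ["steps/train_sat.sh", "5000", "100000", d2, "data/lang",
         "exp/tri2_ali", "exp/tri3"],
        # 6. add Dataset3 and retrain SAT from fMLLR alignments
        ["utils/combine_data.sh", d23, d2, d3],
        ["steps/align_fmllr.sh", d23, "data/lang", "exp/tri3", "exp/tri3_ali"],
        ["steps/train_sat.sh", "7000", "150000", d23, "data/lang",
         "exp/tri3_ali", "exp/tri4"],
        # 8./9. fold in Dataset1 and repeat
        ["utils/combine_data.sh", d123, d23, d1],
        ["steps/align_fmllr.sh", d123, "data/lang", "exp/tri4", "exp/tri4_ali"],
        ["steps/train_sat.sh", "10000", "300000", d123, "data/lang",
         "exp/tri4_ali", "exp/tri5"],
    ]
```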


I feel seriously out of my depth here, and I see at least a couple of issues with my current setup:

  1. 20% of Dataset3 utterances do not end in silence. I’m planning to append 0.2 seconds of silence to those recordings. Questions:
    1. Is it worth doing, or should I just remove those utterances from the training data?
    2. Can I append plain (zero-amplitude) silence, or should I extract a silent segment from each recording (e.g., using sox) so the padding matches the recording’s background?
  2. I see a lot of warnings when trying to use fMLLR on Dataset3 utterances; they are just too short. Looking through the group archive, Dan recommends basis_fmllr for such utterances. How can I incorporate that into the training process? Should I simply replace fMLLR with basis_fmllr everywhere?
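
For question 1.2, here is a stdlib-only sketch of both padding options. The function names and the 0.2 s default are mine, and it assumes 16-bit PCM WAV input:

```python
# Sketch of the two padding options: pure zeros vs. a quiet segment
# copied from the recording itself. Assumes 16-bit PCM WAV files.
import array
import wave

def pad_with_zeros(in_path, out_path, seconds=0.2):
    """Append `seconds` of digital (all-zero) silence to a WAV file."""
    with wave.open(in_path, "rb") as win:
        params = win.getparams()
        frames = win.readframes(win.getnframes())
    n_pad = int(params.framerate * seconds) * params.nchannels
    pad = array.array("h", [0] * n_pad).tobytes()  # "h" = 16-bit samples
    with wave.open(out_path, "wb") as wout:
        wout.setparams(params)  # nframes in the header is fixed up on close
        wout.writeframes(frames + pad)

def quietest_window(in_path, seconds=0.2):
    """Return the lowest-energy window of the file as raw frames,
    usable as recording-matched 'silence' instead of pure zeros."""
    with wave.open(in_path, "rb") as win:
        params = win.getparams()
        samples = array.array("h", win.readframes(win.getnframes()))
    win_len = int(params.framerate * seconds) * params.nchannels
    best_start, best_energy = 0, float("inf")
    # non-overlapping windows keep the scan cheap; good enough here
    for start in range(0, max(1, len(samples) - win_len), win_len):
        energy = sum(s * s for s in samples[start:start + win_len])
        if energy < best_energy:
            best_start, best_energy = start, energy
    return samples[best_start:best_start + win_len].tobytes()
```

With sox installed, the zero-padding case is simply `sox in.wav out.wav pad 0 0.2`.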


Any other suggestions on how I should arrange the training process?
