What should an example data/train or data/test directory look like? The same files as in the WSJ example (utt2spk, spk2utt, wav.scp, text, cmvn.scp, segments, and feats.scp), plus utt2lang?
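For concreteness, this is my current guess at a minimal data/train layout for language ID; the file roles are from the standard Kaldi data-prep conventions, and whether `text` can be a dummy file for LID is my assumption:

```
data/train/
  wav.scp    # <utt-id> <wav-path or command pipe>
  utt2spk    # <utt-id> <spk-id>
  spk2utt    # <spk-id> <utt-id> <utt-id> ...
  utt2lang   # <utt-id> <lang-code>
  text       # <utt-id> <transcript>  (possibly dummy for LID?)
  feats.scp  # generated by steps/make_mfcc.sh
  cmvn.scp   # generated by steps/compute_cmvn_stats.sh
```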
- Is the key for the language mappings in local/general_lr_closed_set_langs.txt or in local/lang_map.txt?
- Do I need a data/lang directory? It doesn't seem like the language model, lexicon, etc. are being used at all.
- Does anyone have more details about why the run_logistic_regression.sh script calls compute-wer with the text files? I'm not following how this fits into evaluating the language identification.
Secondly, I'm interested in any guidance on altering the nnet/run_dnn.sh script to be a 3-way classifier (three languages) rather than a speech recognizer. I've been using the WSJ version of run_dnn, since I have a functioning WSJ GMM system. I think I need to edit train.sh: altering the $labels and/or $num_tgt variables might work, but perhaps there is a more straightforward way to use the existing Kaldi functions.
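For the $labels side, I was imagining something like the following sketch to turn utt2lang into per-utterance integer class targets (the language codes and the first-occurrence numbering scheme are my own invention, not from the recipe):

```shell
# Hypothetical example utt2lang (utterance IDs and language codes made up).
cat > utt2lang <<'EOF'
utt1 english
utt2 mandarin
utt3 spanish
EOF

# Assign each language a 0-based class index in order of first occurrence,
# producing one "<utt-id> <class>" line per utterance.
awk '{ if (!($2 in id)) id[$2] = n++; print $1, id[$2] }' utt2lang > utt2label
cat utt2label
# utt1 0
# utt2 1
# utt3 2
```

Then $num_tgt would presumably just be 3, and utt2label (or some posterior format derived from it) would replace the usual alignment-based targets.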
Since the system will not use the segments file - how does it determine the relationship between a wav file and utterance ID?
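My assumption is that, without a segments file, each wav.scp entry is treated as a single whole-file utterance, so the key on each line is itself the utterance ID (paths below are made up):

```
utt_001 /path/to/audio/utt_001.wav
utt_002 /path/to/audio/utt_002.wav
```

Is that right, or is there some other mechanism tying utterance IDs to recordings?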
The VAD script runs off wav files, so my original plan was to split the wav files, as this guarantees no overlap between the train and test sets.
Did these versions of the test set come as part of the corpus, or was I meant to create them at some point during the pipeline?