language identification


Emily Boggs

Jul 13, 2016, 4:15:42 PM
to kaldi-help
Hello - 

I'm building and experimenting with language identification systems in Kaldi for my master's dissertation work. I have test/train data in three languages, so my goal is to build and optimize two 3-way classifiers (logistic regression and DNN) and then experiment with iVectors and bottleneck features. 

I'm using the lre07 example directory as a guide, but I don't have access to all the data sources, so I haven't been able to reproduce the data preparation and I have a few questions.
  • What should an example data/train or data/test directory look like? Same files as the WSJ example: utt2spk, spk2utt, wav.scp, text, cmvn.scp, segments, and feats.scp plus utt2lang?
  • What is the format of the utt2lang file? 
  • Is the key of language mappings in local/general_lr_closed_set_langs.txt or local/lang_map.txt? 
  • Do I need a data/lang directory? It doesn't seem like the language model, lexicon, etc are being used at all. 
  • Does anyone have more details about why the run_logistic_regression.sh script calls compute-wer with the text files? I'm not following how this fits in with evaluating the language identification.
Secondly, I'm interested in any guidance on altering the nnet/run_dnn.sh script to be a 3-way classifier (three languages), rather than a speech recognizer. I've been using the WSJ version of run_dnn since I have a functioning WSJ GMM system. I think I need to edit train.sh. It seems like altering the $labels and/or $num_tgt variables might work, but perhaps there is a more straightforward way to use the existing Kaldi functions.

Thank you,
Emily

David Snyder

Jul 13, 2016, 5:23:24 PM
to kaldi-help
Hi Emily,

What should an example data/train or data/test directory look like? Same files as the WSJ example: utt2spk, spk2utt, wav.scp, text, cmvn.scp, segments, and feats.scp plus utt2lang?

You won't need a text or segments file. You just need the utt2spk, wav.scp, and utt2lang files initially. The feats.scp file is created from the wav.scp file. A segments file is not needed, since we use a frame-level VAD to remove silence. You'll see that the first few lines of the run.sh script prepare the VAD and MFCCs.

The utt2lang file is of the form:

utt1 language1
utt2 language2
etc
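To make the layout concrete, here is a minimal sketch of such a data directory. The utterance IDs, wav paths, and language names are made up for illustration, and the standard Kaldi helpers (utils/utt2spk_to_spk2utt.pl, utils/fix_data_dir.sh) are only referenced in comments rather than run:

```shell
# Sketch: create a minimal data/train directory for LID.
# All utterance IDs, paths, and languages here are hypothetical.
mkdir -p data/train

cat > data/train/wav.scp <<'EOF'
eng_utt1 /corpus/eng/utt1.wav
fra_utt1 /corpus/fra/utt1.wav
spa_utt1 /corpus/spa/utt1.wav
EOF

# With no per-speaker information, a common fallback in LID setups
# is to let each utterance be its own "speaker".
awk '{print $1, $1}' data/train/wav.scp > data/train/utt2spk

cat > data/train/utt2lang <<'EOF'
eng_utt1 english
fra_utt1 french
spa_utt1 spanish
EOF

# In a real Kaldi checkout you would now run (not executed here):
#   utils/utt2spk_to_spk2utt.pl data/train/utt2spk > data/train/spk2utt
#   utils/fix_data_dir.sh data/train
```

The files must be sorted on the first field for Kaldi's table readers, which is why fix_data_dir.sh is usually the last step.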

  • Is the key of language mappings in local/general_lr_closed_set_langs.txt or local/lang_map.txt? 

The file local/lang_map.txt maps various ways of writing the same language to some standard form. It is used in the data prep scripts in local/ to create the utt2lang files. For example, some multilingual LDC corpora might represent German as DEU and some other corpus might list it as GER or something like that. We want both to be mapped to "german." The other file, local/general_lr_closed_set_langs.txt, just gives a numeric index for each language we want to use in the evaluation. So in the logistic regression model that we create, output index 0 corresponds to Arabic, 1 to Bengali, etc.

  • Do I need a data/lang directory? It doesn't seem like the language model, lexicon, etc are being used at all. 
No, this is just for ASR, and not needed here. 

  • Does anyone have more details about why the run_logistic_regression.sh script calls compute-wer with the text files? I'm not following how this fits in with evaluating the language identification.

This is really just calculating a classification error rate. It's not the WER in the ASR sense; it just turns out that compute-wer works here as well. When you get to the stage of training a logistic regression model, take a look at the input that goes into compute-wer. It should be clear why it works. You'll see that at the end of run.sh, there's an evaluation script that is run: https://github.com/kaldi-asr/kaldi/blob/master/egs/lre07/v1/local/lre07_eval/lre07_eval.sh . This uses both classification error rate and C_avg. You can replace this with some other error metric.
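To see why compute-wer reduces to classification error rate here: each per-utterance "transcript" is a single language label, so the word error rate is just the fraction of utterances labeled wrong. A small pure-shell sketch with hypothetical labels (not using the actual compute-wer binary):

```shell
# Sketch: why compute-wer gives classification error rate here.
# Each "transcript" is one language label per utterance, so WER equals
# the fraction of utterances labeled incorrectly. Labels are made up.
cat > ref.txt <<'EOF'
utt1 english
utt2 french
utt3 spanish
utt4 english
EOF

cat > hyp.txt <<'EOF'
utt1 english
utt2 spanish
utt3 spanish
utt4 english
EOF

# Join on utterance ID and count mismatching labels.
paste ref.txt hyp.txt |
  awk '$2 != $4 { e++ } END { print "misclassified:", e, "of", NR }'
# prints "misclassified: 1 of 4", i.e. a 25% error rate
```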

Secondly, I'm interested in any guidance on altering the nnet/run_dnn.sh script to be a 3-way classifier (three languages), rather than a speech recognizer. I've been using the WSJ version of run_dnn since I have a functioning WSJ GMM system. I think I need to edit train.sh. It seems like altering the $labels and/or $num_tgt variables might work, but perhaps there is a more straightforward way to use the existing Kaldi functions.

I think you have the right idea. AFAIK, there's no really straightforward way to do this in Kaldi. You're going to have to understand an ASR training script well enough to modify it to do what you want.

Let me know if you have further questions.

Thanks,
David

Emily Boggs

Jul 25, 2016, 11:57:22 AM
to kaldi-help
Hello again - 

Thank you for your help. I have a few follow-up questions.
  1. Since the system will not use the segments file, how does it determine the relationship between a wav file and an utterance ID? My original wav files are rather long, so I was using a segments file to map utterance IDs to intervals within the wav files.
    • Would it be preferable to split the wav files, so each wav file corresponds to an utterance ID?
    • Unlike the NIST data, my data has been transcribed, so I have intervals (utterance IDs) that are labelled as non-speech ("junk"). The VAD will likely find similar segments, but since I have the labels it would be worth using them. Would you recommend removing the junk segments from the data, so the VAD runs on already known speech segments? This would require first splitting the wav files.
    • The wav-file-to-utterance-ID relationship is also relevant to how I need to subset my data for experiments. I have three languages, so three classes, and I want to maintain an equal ratio of each language in the testing and training sets. Should I split the data across utterances or wav files? The VAD script runs off wav files, so my original plan was to split the wav files, as this will guarantee no overlap between the train and test sets. If I split by utterances, on the other hand, then there is overlap in the wav files because each wav file contains many utterances.
  2. I have successfully run almost the entire LRE07 pipeline in run.sh on a small training and test set of my data. However, I got an error in the final script, lre07_eval.sh, when I realized I don't have 3, 10, and 30 second versions of my test data as apparently exist in data/lre07. Did these versions of the test set come as part of the corpus, or was I meant to create them at some point during the pipeline? I couldn't find a script that created varying durations.

Thank you! 

Emily

Daniel Povey

Jul 25, 2016, 1:44:00 PM
to kaldi-help
If you don't use a segments file then the keys in the wav.scp are
interpreted as utterance ids. But if you really want to split the
wave files into multiple regions then you should use a segments file.
The scripts will work fine with that.
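For reference, a Kaldi segments file maps utterance IDs to time intervals (in seconds) within a recording named in wav.scp. A minimal sketch, with hypothetical recording IDs, path, and times:

```shell
# Sketch: one long recording carved into utterances via a segments file.
# Recording ID, path, and times below are hypothetical.
cat > wav.scp <<'EOF'
rec1 /corpus/eng/rec1.wav
EOF

# Format: <utterance-id> <recording-id> <start-seconds> <end-seconds>
cat > segments <<'EOF'
rec1-0001 rec1 0.00 8.50
rec1-0002 rec1 9.10 15.75
rec1-0003 rec1 16.20 30.00
EOF

# utt2spk and utt2lang are then keyed on the segment-level utterance
# IDs, not the recording IDs.
awk '{print $1, "english"}' segments > utt2lang
```

With a segments file present, the keys in wav.scp are recording IDs rather than utterance IDs, and the feature extraction scripts cut out the listed intervals.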

Forget the parts of the script that relate to 3, 10 and 30 second
utterances, they won't be relevant to you.

Dan

David Snyder

Jul 25, 2016, 1:48:00 PM
to kaldi-help
Hi Emily,

Since the system will not use the segments file - how does it determine the relationship between a wav file and utterance ID?

In that case, the wav file ID is the utterance ID.  

I think you're right, that the best thing to do is to use your preexisting segments file. I think this shouldn't change much in the pipeline. Once you've split your wav file, you can still use the frame-level VAD to remove low energy frames (this is done by default in the scripts in lid/).

The VAD script runs off wav files, so my original plan was to split the wav files as this will guarantee no overlap in train and test sets.

How many wav files do you have? If possible, I think it would be better to use separate recordings for train and test, rather than splitting the same recordings across both. 

Did these versions of the test set come as part of the corpus, or was I meant to create them at some point during the pipeline?

These lists are specified somewhere in the test set. You don't need to create them, but ultimately you'll have to modify the recipe to get it to work for you (e.g., you'll need to remove the dependence on 3s, 10s, 30s lists). 


Best,
David

Emily Boggs

Jul 25, 2016, 5:02:37 PM
to kaldi-help
Hi, 

I see now that I am okay to use the segments file, since make_mfcc.sh reads it to build feats.scp, which is the input to the VAD; thus vad.scp will reflect whichever utterances/segments I want to have in my test or train set. I was confused in thinking that the VAD took the wav.scp file as its input.

I have around 400 wav files, but I think it will be much faster to manipulate the segments file (creating separate segments files with different utterances for train and test) than to split my wav files. There will be overlap in the wav files, but not in the utterances, if I create the segments file and the other files (utt2spk, utt2lang, etc.) appropriately.
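One way to do that split while preserving the per-language ratio is to round-robin the utt2lang entries within each language. A rough shell sketch with made-up utterances; in a real Kaldi checkout the resulting lists would then be used to filter segments, utt2spk, and wav.scp (e.g. with utils/filter_scp.pl):

```shell
# Sketch: hold out every 4th utterance of each language for testing,
# so train/test keep the same language ratio. Entries are hypothetical.
cat > utt2lang <<'EOF'
u1 english
u2 english
u3 english
u4 english
u5 french
u6 french
u7 french
u8 french
EOF

# n[lang] counts utterances seen per language; every 4th goes to test.
awk '{ n[$2]++;
       if (n[$2] % 4 == 0) print > "utt2lang.test";
       else                print > "utt2lang.train" }' utt2lang
```

This yields a 6-utterance train set and a 2-utterance test set, each with the same english:french ratio as the full data.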

I was able to modify the evaluation recipe to bypass the duration variations and print results. 

Cheers,
Emily