Hi, Dan,
I tried your Switchboard nnet2 recipe, training the model on the Switchboard corpus and testing on the hub2000 data.
Now I want to split the Switchboard corpus into two parts: one for training a model and the other for testing.
I made speaker lists for training and testing, and applied subset_data_dir.sh to data/train (the one created in the Switchboard/hub2000 experiment) to create directories for both purposes.
(The training data contains about 241,000 utterances and the test data about 22,000.)
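Just to make sure I understand what the script does: my split is essentially filtering utt2spk by a speaker list, something like the sketch below (file contents and names are only illustrative, not from the recipe; subset_data_dir.sh additionally fixes up text, wav.scp, etc.):

```python
# Sketch of a speaker-based split of a Kaldi-style utt2spk mapping.
# utils/subset_data_dir.sh --spk-list does this, plus filtering the
# other files in the data directory consistently.

def split_by_speakers(utt2spk, train_speakers):
    """utt2spk: list of (utt_id, spk_id) pairs.
    train_speakers: set of speaker ids to keep for training.
    Returns (train_utts, test_utts)."""
    train, test = [], []
    for utt, spk in utt2spk:
        (train if spk in train_speakers else test).append(utt)
    return train, test

# Toy example with made-up utterance/speaker ids:
utt2spk = [("sw1-utt1", "sw1"), ("sw1-utt2", "sw1"), ("sw2-utt1", "sw2")]
train_utts, test_utts = split_by_speakers(utt2spk, {"sw1"})
print(train_utts)  # ['sw1-utt1', 'sw1-utt2']
print(test_utts)   # ['sw2-utt1']
```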
Then, following run.sh, I made the subsequent subsets for training, dev, monophone, nodup, and so on.
Now I am about to train a monophone model, but I have a question:
Do I need to modify the data/lang_nosp directory so that it contains no information about the test part of the data?
For example, would the fact that data/lang_nosp contains words that appear in the test data but not in the training data make the experiment unfair?
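To make the concern concrete, what I would want to measure is how many test words never occur in the training transcripts (and are therefore only covered through the lexicon). A quick sketch of that check, with made-up toy data:

```python
# Hypothetical check: which test-set words are absent from the
# training transcripts?

def oov_in_training(train_texts, test_texts):
    """Each argument is an iterable of transcript strings (one utterance
    per string). Returns the set of words seen in test but never in
    training."""
    train_vocab = {w for line in train_texts for w in line.split()}
    test_vocab = {w for line in test_texts for w in line.split()}
    return test_vocab - train_vocab

train = ["hello how are you", "fine thanks"]
test = ["hello there", "how are things"]
print(sorted(oov_in_training(train, test)))  # ['there', 'things']
```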
I am afraid that creating a lang directory or lexicon file on my own might lead to poor ones and bad results.