Hi, Dan,
I tried your Switchboard nnet2 recipe, training the model on the Switchboard corpus and testing on the hub2000 data.
Now I want to split the Switchboard corpus into two parts: one for training a model and the other for testing.
I made speaker lists for training and testing, and applied subset_data_dir.sh to data/train (the one created in the Switchboard/hub2000 experiment) to create directories for both purposes.
(The training data contains about 241,000 utterances and the test data about 22,000.)
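Just to make sure I understand what the script does: my split is essentially filtering utt2spk by a speaker list, something like the sketch below (file contents and names are only illustrative, not from the recipe; subset_data_dir.sh additionally fixes up text, wav.scp, etc.):

```python
# Sketch of a speaker-based split of a Kaldi-style utt2spk mapping.
# utils/subset_data_dir.sh --spk-list does this, plus filtering the
# other files in the data directory consistently.

def split_by_speakers(utt2spk, train_speakers):
    """utt2spk: list of (utt_id, spk_id) pairs.
    train_speakers: set of speaker ids to keep for training.
    Returns (train_utts, test_utts)."""
    train, test = [], []
    for utt, spk in utt2spk:
        (train if spk in train_speakers else test).append(utt)
    return train, test

# Toy example with made-up utterance/speaker ids:
utt2spk = [("sw1-utt1", "sw1"), ("sw1-utt2", "sw1"), ("sw2-utt1", "sw2")]
train_utts, test_utts = split_by_speakers(utt2spk, {"sw1"})
print(train_utts)  # ['sw1-utt1', 'sw1-utt2']
print(test_utts)   # ['sw2-utt1']
```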
Then, following run.sh, I made the subsequent subsets for training, dev, monophone, nodup, and so on.
Now I am about to train a monophone model, but I have a question:
Do I need to modify the data/lang_nosp directory so that it contains no information about the test part of the data?
For example, would the fact that data/lang_nosp contains words that appear in the test data but not in the training data make the experiment unfair?
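To make the concern concrete, what I would want to measure is how many test words never occur in the training transcripts (and are therefore only covered through the lexicon). A quick sketch of that check, with made-up toy data:

```python
# Hypothetical check: which test-set words are absent from the
# training transcripts?

def oov_in_training(train_texts, test_texts):
    """Each argument is an iterable of transcript strings (one utterance
    per string). Returns the set of words seen in test but never in
    training."""
    train_vocab = {w for line in train_texts for w in line.split()}
    test_vocab = {w for line in test_texts for w in line.split()}
    return test_vocab - train_vocab

train = ["hello how are you", "fine thanks"]
test = ["hello there", "how are things"]
print(sorted(oov_in_training(train, test)))  # ['there', 'things']
```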
I am afraid that creating a lang directory or lexicon file on my own might lead to poor ones and bad results.