Problem training a DNN with i-vectors on a low-resource dataset


peng...@gmail.com

Mar 18, 2016, 3:42:09 PM
to kaldi-help
Hi,

I am running steps/nnet2/train_pnorm_fast.sh to train a model on a low-resource dataset (only 3 hours of transcribed audio) and tried to improve its performance by appending i-vectors to spliced fMLLR features, but it doesn't work. As the i-vector dimension decreases, performance gets better, but it is still worse than the baseline.

The i-vector extraction mostly follows swbd/s5c/local/nnet2/run_ivector_common.sh, except that:
1. I use fMLLR features as the features for the supervectors.
2. I use my own fMLLR-supported version of steps/online/nnet2/extract_ivectors.sh rather than steps/online/nnet2/extract_ivectors_online.sh.
This is meant to match the setup in IBM's ASRU 2013 paper.
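
For reference, the stock extractor-training steps this is based on look roughly like the following (just a sketch from memory -- the paths, job counts and UBM size are placeholders, and in my setup the features fed to these stages are fMLLR):

  # Train a diagonal UBM on top of an existing system (the source dir supplies
  # the feature transform in the stock recipe), then train the i-vector extractor.
  steps/online/nnet2/train_diag_ubm.sh --cmd "$train_cmd" --nj 30 \
    data/train 512 exp/tri3 exp/nnet2_online/diag_ubm
  steps/online/nnet2/train_ivector_extractor.sh --cmd "$train_cmd" --nj 10 \
    data/train exp/nnet2_online/diag_ubm exp/nnet2_online/extractor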

Running 15+5 epochs (the default config), I noticed that the average log probability on the validation set rises rapidly and then keeps falling until the end of training. I guess there might be overfitting, since the number of speakers is limited, but decoding with the model at the maximum validation log probability didn't give a better result. Maybe the fMLLR-related parameters had not been trained well?
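
(In case it matters, I'm reading this off the per-iteration diagnostics, roughly like below -- the log names assume the default train_pnorm_fast.sh layout, the experiment directory is a placeholder, and the exact log message may differ slightly between versions.)

  # Average log-probability per iteration on the held-out and training subsets.
  grep "average probability" exp/nnet2_pnorm_ivec/log/compute_prob_valid.*.log
  grep "average probability" exp/nnet2_pnorm_ivec/log/compute_prob_train.*.log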

I tuned some parameters and training strategies, but still can't get a gain from the i-vectors.

I also tested this recipe on a 40-hour set and on the 300-hour Switchboard set. On the 40-hour set I got similar results. On Switchboard, appending i-vectors brought a slight improvement, from 14.7% to 14.5% (Hub5'00 SWB subset), which is far less than expected.

Any suggestions? Thank you.



Daniel Povey

Mar 18, 2016, 4:25:24 PM
to kaldi-help
If you are doing fMLLR, it's not really expected that iVectors will give much (if any) additional improvement, as fMLLR is quite a powerful adaptation method.  The reason we preferred to use iVectors and not fMLLR is that it makes it much easier to build a single-pass, real-time system.
Dan



peng...@gmail.com

Mar 19, 2016, 2:57:31 AM
to kaldi-help, dpo...@gmail.com
Thank you, Dan.
Actually I also tried using LDA+MLLT as input features on the 3-hour data and got similar results. Now I am waiting for the results of the same training on the 40-hour data.

I notice that ivector-extractor-est does not accept a spk2utt option and appears to accumulate statistics per utterance. In run_ivector_common.sh, the i-vectors for the training data are extracted per sub-speaker (at most 2 utterances each) for better generalization, while those for the test data are extracted per speaker. I wonder if this mismatch affects performance.
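
(The training-data side I mean is the "max2" trick in run_ivector_common.sh, roughly like this -- a sketch with placeholder paths:)

  # Split each training speaker into pseudo-speakers of at most 2 utterances,
  # then extract i-vectors per pseudo-speaker; test data keeps its real speakers.
  steps/online/nnet2/copy_data_dir.sh --utts-per-spk-max 2 data/train data/train_max2
  steps/online/nnet2/extract_ivectors_online.sh --cmd "$train_cmd" --nj 30 \
    data/train_max2 exp/nnet2_online/extractor exp/nnet2_online/ivectors_train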

By the way, is there any way to check the correctness of the extracted i-vectors? There might be bugs in my code, but I can't find any by reviewing it.

On Saturday, March 19, 2016 at 4:25:24 AM UTC+8, Dan Povey wrote:

Daniel Povey

Mar 19, 2016, 2:50:08 PM
to peng...@gmail.com, kaldi-help

Thank you, Dan.
Actually I also tried using LDA+MLLT as input features on the 3-hour data and got similar results. Now I am waiting for the results of the same training on the 40-hour data.

On very tiny amounts of data, like 3 hours, iVectors don't always give an improvement -- there is not enough variability to learn the things it needs to learn.

I notice that ivector-extractor-est does not accept a spk2utt option and appears to accumulate statistics per utterance.

This is how iVectors have always been estimated, even in speaker-id; it never uses the speaker labels.

In run_ivector_common.sh, the i-vectors for the training data are extracted per sub-speaker (at most 2 utterances each) for better generalization, while those for the test data are extracted per speaker. I wonder if this mismatch affects performance.

No, this way is better than estimating the iVectors over the entire speakers of the training data; it's pretty robust.

By the way, is there any way to check the correctness of the extracted i-vectors? There might be bugs in my code, but I can't find any by reviewing it.

Not really, but if there are gross errors and you run at a high verbose level you'll usually see higher than normal objective function improvements in the iVector estimation.  [normal == 5 to 10].
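
Something along these lines (a sketch -- the log-file names and exact message wording depend on which extraction script and Kaldi version you are using):

  # Look at the reported objective-function improvement per frame in the
  # extraction logs; gross feature/posterior mismatches usually show up as
  # much larger values than usual.
  grep -i "improvement" exp/nnet2_online/ivectors_train/log/extract_ivectors.*.log

  # Or re-run one job with more verbosity:
  # ivector-extract --verbose=2 <extractor> <feats-rspecifier> <post-rspecifier> <ivector-wspecifier>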


Dan

peng...@gmail.com

Mar 19, 2016, 4:55:27 PM
to kaldi-help, peng...@gmail.com, dpo...@gmail.com
Thank you very much for your explanation. Now I believe it's really difficult to train an i-vector extractor with only 3 hours of data. I tried "extracting i-vectors" by randomly sampling from a standard Gaussian distribution and got an almost identical log-probability curve in DNN training.
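
(For what it's worth, the dummy i-vectors were generated with something like the quick hack below -- a sketch with a placeholder dimension and output path, producing one random vector per utterance in Kaldi text-archive format:)

  # Draw each i-vector component from N(0,1) via Box-Muller, keyed by utterance id.
  dim=40
  awk -v d=$dim 'BEGIN{srand()} {printf("%s  [", $1);
    for(i=0;i<d;i++){u1=1-rand(); u2=rand();
      printf(" %.4f", sqrt(-2*log(u1))*cos(6.2831853*u2))}
    print " ]"}' data/train/utt2spk > random_ivectors.txt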

In my experiment on the 40-hour data using LDA+MLLT as input features, the i-vectors also gave a slightly worse WER. The i-vector extraction setup is almost the same as run_ivector_common.sh, except that I use PLP+pitch features rather than high-resolution MFCCs for the LDA training, and extract one i-vector per utterance rather than per sub-speaker of 2 utterances. But I don't think these should be the problem, right?

I just checked the log files of ivector-extractor-est (update.*.log), but couldn't find an objective-function improvement between 5 and 10. Did you mean Update():ivector-extractor.cc:1191?
In my swbd results, the "overall objective-function improvement per frame" is around 100 in update.0.log and around 1.0 in update.1.log. I compared these logs with those of the standard swbd/s5c/run_tdnn.sh (where I got an improvement from 16.4% to 14.6%, without speed perturbation), but didn't see much difference.

Thank you again for your replies. I've been working on this problem for weeks but can't get the result I expect.





On Sunday, March 20, 2016 at 2:50:08 AM UTC+8, Dan Povey wrote:

Daniel Povey

Mar 19, 2016, 5:06:33 PM
to peng...@gmail.com, kaldi-help

Thank you very much for your explanation. Now I believe it's really difficult to train an i-vector extractor with only 3 hours of data. I tried "extracting i-vectors" by randomly sampling from a standard Gaussian distribution and got an almost identical log-probability curve in DNN training.

That is probably too little data. You'd have to reduce the iVector dimension, e.g. to 50, as we do in the RM setup. Even then, it may give little or no improvement when training with so little data. If you just want the best results and don't care about runtime, fMLLR is probably your best bet.
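
(Concretely, in the standard script that would be something like the following -- a sketch with placeholder paths; --ivector-dim is the relevant option:)

  # Retrain the extractor with a smaller iVector dimension, e.g. 50 instead of 100.
  steps/online/nnet2/train_ivector_extractor.sh --cmd "$train_cmd" --nj 10 \
    --ivector-dim 50 data/train exp/nnet2_online/diag_ubm exp/nnet2_online/extractor_dim50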
 
In my experiment on the 40-hour data using LDA+MLLT as input features, the i-vectors also gave a slightly worse WER. The i-vector extraction setup is almost the same as run_ivector_common.sh, except that I use PLP+pitch features rather than high-resolution MFCCs for the LDA training, and extract one i-vector per utterance rather than per sub-speaker of 2 utterances. But I don't think these should be the problem, right?

That's probably not the issue.  However, I think when we used iVectors with pitch features we actually didn't include pitch in the iVector estimation.  Pitch is not very Gaussian.
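
i.e. keep a parallel copy of the data with pitch-less features just for the UBM / iVector extractor, roughly like this (a sketch; the paths and feature type are placeholders):

  # Features without pitch, used only for UBM / iVector extractor training;
  # the DNN input features keep PLP+pitch.
  utils/copy_data_dir.sh data/train data/train_nopitch
  steps/make_plp.sh --cmd "$train_cmd" --nj 30 data/train_nopitch exp/make_plp/train_nopitch plp_nopitch
  steps/compute_cmvn_stats.sh data/train_nopitch exp/make_plp/train_nopitch plp_nopitch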

I just checked the log files of ivector-extractor-est (update.*.log), but couldn't find an objective-function improvement between 5 and 10. Did you mean Update():ivector-extractor.cc:1191?

Not those logs: the logs where you extract the iVectors on the training or test data.

Dan

peng...@gmail.com

Mar 19, 2016, 7:09:46 PM
to kaldi-help, peng...@gmail.com, dpo...@gmail.com
My experiments on the 3-hour data always set the i-vector dimension to 40 (100 for the 40-hour data and swbd), but it's still difficult to train. I am planning to do some research on DNN+i-vector and am trying to build a strong baseline. I was expecting an improvement over fMLLR, but now I just hope it at least works on the 40-hour data with LDA features.

Actually I am mostly using ivector-extract, not ivector-extract-online2; sorry for not mentioning that. I am not sure whether the normal range (5 to 10) also applies to ivector-extract. The objective-function improvements for most jobs range from 3 to 10, although some are higher than 10. Just now I extracted i-vectors for the 40-hour data with ivector-extract-online2; all objective-function improvements range from 2.3 to 10.

Thank you for your suggestion. I will try building a system without pitch.



On Sunday, March 20, 2016 at 5:06:33 AM UTC+8, Dan Povey wrote:

Daniel Povey

Mar 19, 2016, 7:15:45 PM
to 陈智鹏, kaldi-help

My experiments on the 3-hour data always set the i-vector dimension to 40 (100 for the 40-hour data and swbd), but it's still difficult to train. I am planning to do some research on DNN+i-vector and am trying to build a strong baseline. I was expecting an improvement over fMLLR, but now I just hope it at least works on the 40-hour data with LDA features.

Actually I am mostly using ivector-extract, not ivector-extract-online2; sorry for not mentioning that. I am not sure whether the normal range (5 to 10) also applies to ivector-extract. The objective-function improvements for most jobs range from 3 to 10, although some are higher than 10. Just now I extracted i-vectors for the 40-hour data with ivector-extract-online2; all objective-function improvements range from 2.3 to 10.
Thank you for your suggestion. I will try building a system without pitch.

It's better to use the online iVector extraction because it leads to more variety in the iVectors, and hence better generalization to test data. This will be particularly important if the amount of training data is small.

Dan

peng...@gmail.com

Mar 20, 2016, 8:40:22 PM
to kaldi-help, peng...@gmail.com, dpo...@gmail.com
Thank you for your suggestion. I just got some new results.

On the 40-hour data, I use the same online i-vector extraction procedure as in swbd/run_tdnn.sh (training LDA+MLLT on spliced high-resolution MFCCs and training the UBM on the LDA features). This i-vector is appended to another spliced LDA feature (trained on spliced PLP+pitch) and then fed to the pnorm DNN. But I got the same WER, although the log probability and accuracy on both the validation set and the training subset are a little higher.

Also, I found this in the Kaldi documentation:
The adaptation philosophy is to give the neural net un-adapted and non-mean-normalized features (MFCCs, in our example recipes), and also to give it an iVector.

I notice that the example recipes for both RM and Switchboard do take raw MFCC features as the NN input. But my understanding is that, theoretically, a neural network can learn any feature transformation given enough data. For small amounts of data, techniques like LDA and MLLT should be helpful, right?

I am now running a DNN training with raw 40-dim high-resolution MFCCs and i-vectors as input features, just like run_tdnn.sh does. If it's better than using the raw features alone, at least I'll know the extracted i-vectors are correct. Is it possible that it just takes a lot of parameter tuning to get an improvement from i-vectors on top of a processed feature like LDA?

On Sunday, March 20, 2016 at 7:15:45 AM UTC+8, Dan Povey wrote:

Daniel Povey

Mar 20, 2016, 9:24:33 PM
to 陈智鹏, kaldi-help
Sometimes iVectors, when added to the input, can even give a degradation, for the same reason that adding random features to a neural net input gives a degradation -- the neural net will tend to mix them up with the 'useful' features in a random way and will then have to spend too much effort disentangling them. That's why in the nnet2 and nnet3 scripts we have the 'lda' stage (which is not really LDA and is not dimension-reducing): it decorrelates the features and scales down those features that are not correlated with the class labels (such as the iVector features).

If you don't do that, you should at least scale down the iVectors (e.g. multiply them by a number less than one) before appending them to the neural network input.
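
If you need to do that by hand, one crude way is to go through a text-format archive (a sketch -- it assumes one i-vector per utterance or speaker stored as a vector archive, and the paths and scale are placeholders):

  # Multiply every i-vector by 0.1 before it gets appended to the NN input.
  scale=0.1
  copy-vector ark:exp/ivectors_train/ivectors.ark ark,t:- | \
    awk -v s=$scale '{printf("%s [", $1); for(i=3;i<NF;i++) printf(" %s", $i*s); print " ]"}' | \
    copy-vector ark,t:- ark:exp/ivectors_train/ivectors_scaled.ark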

Dan

peng...@gmail.com

Mar 21, 2016, 1:26:59 AM
to kaldi-help, peng...@gmail.com, dpo...@gmail.com
That might be the issue! This decorrelation stage is also part of steps/nnet2/train_pnorm_fast.sh, but I removed it to simplify the network, as I thought it only offered an additional gain. Thank you very much! I will put it back and retry.

On Monday, March 21, 2016 at 9:24:33 AM UTC+8, Dan Povey wrote:

陈智鹏

Mar 21, 2016, 12:55:59 PM
to dpo...@gmail.com, kaldi-help
Unfortunately, my experiments showed that the i-vectors gave an even larger degradation when the decorrelating LDA was inserted as the first layer. I really have no idea how to make it work.

Some diagnostic information is attached as HTML, including nnet-am-info outputs, accuracy and log-probability curves, and WERs.

There are two groups of results inside: one is LDA + i-vectors on MFCC, the other is MFCC + i-vectors on MFCC.

When training, I set both num_epochs_reduce and num_epochs_extra to 5.

If you have time to check these results, I would really appreciate it.

---------------------------------------------------------
陈智鹏
DNN_ivector_diagnostics.zip

Daniel Povey

Mar 21, 2016, 5:43:38 PM
to 陈智鹏, kaldi-help
I can't see anything very specifically wrong, except that there seems to be a lot of overtraining and you might want to reduce the number of parameters slightly. iVectors are tricky to work with. I'd recommend using the nnet2-online or nnet3 scripts, as they are configured well and we've resolved various issues. For example, the fact that we multiply the data by 3 is very helpful.

E.g. see egs/swbd/s5c/local/online/run_nnet2_ms_perturbed.sh or egs/swbd/s5c/local/nnet3/run_tdnn.sh.
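
(The "multiply the data by 3" part is 3-way speed perturbation; roughly, with placeholder paths, and script names that may differ across Kaldi versions:)

  # Create 0.9x and 1.1x speed-perturbed copies and combine them with the original data.
  utils/perturb_data_dir_speed.sh 0.9 data/train data/train_sp0.9
  utils/perturb_data_dir_speed.sh 1.1 data/train data/train_sp1.1
  utils/combine_data.sh data/train_sp data/train data/train_sp0.9 data/train_sp1.1
  # Alignments are then regenerated on data/train_sp before the NN training.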


Dan

peng...@gmail.com

Mar 22, 2016, 9:58:41 PM
to kaldi-help, peng...@gmail.com, dpo...@gmail.com
Thank you. I reduced the number of parameters from 8M to 2M and got a 2.6% relative improvement with LDA + i-vector features (48.4%), compared to an LDA-only model of approximately the same size (49.7%). It seemed there was some underfitting, so I set the number of parameters to 5M and got a better WER (47.2%). But none of these beats the LDA-only NN with 8M parameters (47.2%).

Also, the best performance I've achieved so far is 45.0%, from a model taking fMLLR-only features, with 8M parameters. It's really difficult to improve on that with i-vectors.

I have tried run_nnet2_ms_perturbed.sh and run_tdnn.sh on Switchboard, but got a high WER (> 50%) when applying them to my 40-hour data (with a simplified NN structure). Perhaps it takes much more effort to make them work when the data is not that rich. I believe speed perturbation is quite useful, especially in low-resource scenarios; I skipped it only because I am focusing on i-vectors right now and want to save training time.

Thank you very much for spending so much time replying and giving suggestions. I will do some further exploration.

On Tuesday, March 22, 2016 at 5:43:38 AM UTC+8, Dan Povey wrote: