Can't get accurate phoneme level log-likelihoods by using nnet3 raw model

Michael

unread,

Dec 27, 2019, 11:59:51 PM12/27/19

to kaldi-help

Hi all,

I want to train an acoustic model which can compute log-likelihood of every phoneme for each frame. There are 42 phonemes (all non-silence phones are defined in CMU dict) in my phone set and I want to use monophones so that I can make the AM as simple as possible. So the number of this acoustic model's output nodes is 42, meaning that each output node represents one specific phoneme. I use nnet3 as backend and choose a five layer TDNN structure. The total left context is 27 and the right context is 8.

For the training labels, I first get the alignments from a tri3 GMM model and use "ali-to-phones --per-frame=true" to convert all the alignments into phonemes in text format. Then I create a phone2int table and map all the alignments in text phoneme format into integers ranging from 0 to 41 and save these alignments as targets to train the TDNN model. Since it doesn't involve transition model, I thought I can directly use steps/nnet3/train_raw_dnn.py to train a raw neural network instead of a traditional acoustic model. So I did it and got the final.raw file. The training corpus include wsj0&1, Librispeech_clean300h and Tedlium_v2. The input features are 40-d MFCCs. And I've done the speed&volume perturbation for data augmentation. In the interest of keeping model small enough, I choose to use online-cmvn rather than ivectors. And I'm sure that the online-cmvn has been applied on both train and test data with the same configurations.

For testing, I use "nnet3->DecodableNnetLoopedOnline(int32 frame, int32 index)" to compute log likelihoods of 42 phones per frame. The class that I provide for constructing the decodable interface is 'Nnet', which means there is no prior division process involved. I have picked some audios which are recorded by native speakers in clean environment and I would say that there is not that much mismatch between training and testing. But during my debugging, I found out that the log-likelihoods are not accurate enough. For example, in CMU dict, there is an entry 'SPELL S P EH L'. When testing with people saying something contains 'spell', the log-likelihood of 'B' can be much bigger than 'P' for many successive frames even though the log-likelihood of 'P' is the second largest among all 42 phones. I thought maybe this is due to the missing priors. So this time I first construct a "AmNnetSimple" by using "Nnet" and use "SetPriors(posterior_vec);" to set priors in it. The posteriors vector is generated from "steps/nnet3/train_raw_dnn.py" by supplying "--compute-average-posteriors=true". And it has been renormalized by "posterior_vec.Scale(1.0/posterior_vec.Sum());" before I do the prior setting. In this way, the log-likelihood will be normalized by priors. Then I ran the test again. But the similar issue still exists.

Then I thought maybe I shouldn't train a raw model for this specific scenario. So I compared "steps/nnet3/train_dnn.py" and "steps/nnet3/train_raw_dnn.py". It seems to me that there are two big differences.

1. In train_dnn.py, it will compute initial vector for FixedScaleComponent before softmax, using priors^{prior_scale} and rescaling to average 1, which doesn't exist in train_raw_dnn.py.

2. When preparing the initial acoustic model, train_dnn.py will add priors into the initial model, which are collected by counting the pdf-ids from alignments. And train_raw_dnn.py doesn't have this procedure.

I haven't digged up dipper about how these two differences are going to influence the training of the model, especially for the second one. I assume the priors added during model initialization will be used for each iteration during the training of the traditional acoustic model, which doesn't apply for the raw model. Before I look into more lower level code, I'd like to receive any advice from you guys on whether it's worthy to make a corresponding transition model so that I can use "steps/nnet3/train_dnn.py" for training or modify the "steps/nnet3/train_raw_dnn.py" so that it can also add priors during the model initializing. Or the inaccuracy issue has nothing to do with whether using "steps/nnet3/train_dnn.py" or "steps/nnet3/train_raw_dnn.py" and I need to go some other ways to solve my current problem.

Any suggestion will be highly appreciated, thanks.

Michael

Daniel Povey

unread,

Dec 29, 2019, 4:08:58 AM12/29/19

to kaldi-help

What you did with the priors should have worked. Did you at least verify that the results were different?

The FixedScaleComponent and prior_scale are just to help convergence of training, they don't affect the probabilities in any consistent way.

There may have been some problem such as using a decodable object that wasn't picking up the priors for some reason, e.g. by using the wrong constructor or something.

Dan

--
Go to http://kaldi-asr.org/forums.html find out how to join
---
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kaldi-help/17c6d58a-879b-4c37-a4a4-5a9fb0f4dfb0%40googlegroups.com.

Michael

unread,

Dec 29, 2019, 10:31:38 PM12/29/19

to kaldi-help

Yes, by running the decoding process on the same audio, before adding priors, the acoustic_cost which equals "-decodable->LogLikelihood(frame, arc.ilabel)" would be a very tiny positive floating point for the target phoneme, like '0.0012'. After I setting up the priors, the acoustic_cost for the target phoneme on the same frame would be a negative floating point, like '-0.175'. So I'm pretty sure that the added priors has taken effect.

Anyway, I have managed to train the model using "steps/nnet3/train_dnn.py" by creating a new lang/topo and also new tree and final.mdl in align folder. I will update the status after I get the test result from this new model. Thanks a lot.

Michael

To unsubscribe from this group and stop receiving emails from it, send an email to kaldi...@googlegroups.com.

Reply all

Reply to author

Forward