Hi all,
I want to train an acoustic model that can compute the log-likelihood of every phoneme for each frame. There are 42 phonemes in my phone set (all the non-silence phones defined in the CMU dict), and I want to use monophones to keep the AM as simple as possible. So the acoustic model has 42 output nodes, each representing one specific phoneme. I use nnet3 as the backend and chose a five-layer TDNN structure. The total left context is 27 and the right context is 8.
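(In case it helps to be concrete, this is the quick sanity check I run on the trained network; a minimal sketch, assuming the model is saved as final.raw and the output node has the default name "output".)

#include "base/kaldi-common.h"
#include "util/common-utils.h"
#include "nnet3/nnet-nnet.h"
#include "nnet3/nnet-utils.h"

int main() {
  using namespace kaldi;
  using namespace kaldi::nnet3;

  // Load the raw network produced by training.
  Nnet nnet;
  ReadKaldiObject("final.raw", &nnet);

  // Total left/right context needed by the network.
  int32 left_context, right_context;
  ComputeSimpleNnetContext(nnet, &left_context, &right_context);

  KALDI_LOG << "output-dim = " << nnet.OutputDim("output")  // expect 42
            << ", left-context = " << left_context          // expect 27
            << ", right-context = " << right_context;       // expect 8
  return 0;
}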
For the training labels, I first get alignments from a tri3 GMM model and use "ali-to-phones --per-frame=true" to convert them into phonemes in text format. Then I create a phone2int table, map the text-format phone alignments to integers ranging from 0 to 41, and save these as targets for training the TDNN model. Since no transition model is involved, I thought I could directly use steps/nnet3/train_raw_dnn.py to train a raw neural network instead of a traditional acoustic model. So I did, and got a final.raw file. The training corpora are WSJ0&1, Librispeech_clean300h and Tedlium_v2. The input features are 40-dimensional MFCCs, and I've applied speed and volume perturbation for data augmentation. To keep the model small, I chose online-cmvn rather than i-vectors, and I've made sure online-cmvn is applied to both training and test data with the same configuration.
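(For completeness, the mapping step is essentially the following; a rough sketch, where phone2int.txt, phone_ali.ark and targets.ark are my placeholder names, and phone_ali.ark holds the per-frame output of ali-to-phones.)

#include <unordered_map>
#include <vector>
#include "base/kaldi-common.h"
#include "util/common-utils.h"

int main() {
  using namespace kaldi;

  // Build the Kaldi-phone-id -> 0..41 map from the two-column table.
  std::unordered_map<int32, int32> phone2int;
  {
    Input ki("phone2int.txt");
    int32 phone, target;
    while (ki.Stream() >> phone >> target)
      phone2int[phone] = target;
  }

  SequentialInt32VectorReader ali_reader("ark:phone_ali.ark");
  Int32VectorWriter target_writer("ark:targets.ark");

  for (; !ali_reader.Done(); ali_reader.Next()) {
    std::vector<int32> targets(ali_reader.Value());
    for (size_t i = 0; i < targets.size(); i++)
      targets[i] = phone2int.at(targets[i]);  // throws if a phone is missing
    target_writer.Write(ali_reader.Key(), targets);
  }
  return 0;
}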
For testing, I use nnet3::DecodableNnetLoopedOnline's LogLikelihood(int32 frame, int32 index) to compute the log-likelihoods of the 42 phones per frame. The object I provide when constructing the decodable interface is a plain 'Nnet', which means no prior division is involved. I picked some audio recorded by native speakers in a clean environment, so I would say there is not much mismatch between training and testing. But during debugging, I found that the log-likelihoods are not accurate enough. For example, the CMU dict has the entry 'SPELL S P EH L'. When testing with recordings of people saying something containing 'spell', the log-likelihood of 'B' can be much bigger than that of 'P' for many successive frames, even though 'P' has the second-largest log-likelihood among all 42 phones.

I thought maybe this was due to the missing priors. So this time I first constructed an "AmNnetSimple" from the "Nnet" and called "SetPriors(posterior_vec);" to set priors in it. The posterior vector is generated by "steps/nnet3/train_raw_dnn.py" when supplying "--compute-average-posteriors=true", and it is renormalized with "posterior_vec.Scale(1.0/posterior_vec.Sum());" before I set the priors. This way, the log-likelihoods are normalized by the priors. Then I ran the test again, but the same issue persists.
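(To rule out me wiring this up wrongly, the essence of what I do is below; a minimal sketch, where post.vec is my placeholder name for the averaged-posterior vector.)

#include "base/kaldi-common.h"
#include "util/common-utils.h"
#include "nnet3/am-nnet-simple.h"
#include "nnet3/decodable-simple-looped.h"

int main() {
  using namespace kaldi;
  using namespace kaldi::nnet3;

  Nnet nnet;
  ReadKaldiObject("final.raw", &nnet);

  // Averaged posteriors from --compute-average-posteriors=true,
  // renormalized to sum to 1 before being used as priors.
  Vector<BaseFloat> posterior_vec;
  ReadKaldiObject("post.vec", &posterior_vec);
  posterior_vec.Scale(1.0 / posterior_vec.Sum());

  AmNnetSimple am_nnet;
  am_nnet.SetNnet(nnet);
  am_nnet.SetPriors(posterior_vec);

  // Building the looped-computation info from the AmNnetSimple (rather
  // than from a bare Nnet) is what makes the decodable subtract the
  // log-priors, i.e. LogLikelihood(frame, i) = log p(i|frame) - log p(i).
  NnetSimpleLoopedComputationOptions opts;
  DecodableNnetSimpleLoopedInfo info(opts, &am_nnet);

  // 'info' then goes into the same DecodableNnetLoopedOnline as before.
  return 0;
}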
Then I thought maybe I shouldn't train a raw model for this specific scenario, so I compared "steps/nnet3/train_dnn.py" and "steps/nnet3/train_raw_dnn.py". It seems to me that there are two big differences:
1. train_dnn.py computes an initial vector for the FixedScaleComponent before the softmax, using priors^{prior_scale} and rescaling it to average 1 (see the sketch after this list); this step doesn't exist in train_raw_dnn.py.
2. When preparing the initial acoustic model, train_dnn.py adds priors to the initial model, collected by counting the pdf-ids in the alignments; train_raw_dnn.py doesn't have this procedure.
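(For reference, the vector from point 1 is computed roughly as below; this is my own sketch of what I believe train_dnn.py does, not the verbatim Kaldi code, which I think also smooths small counts.)

#include <cmath>
#include <cstddef>
#include <vector>

// scales[i] is proportional to priors[i]^prior_scale; the whole vector
// is then rescaled so that the scales average to 1.
std::vector<double> PresoftmaxScales(const std::vector<double> &priors,
                                     double prior_scale) {
  std::vector<double> scales(priors.size());
  double sum = 0.0;
  for (std::size_t i = 0; i < priors.size(); i++) {
    scales[i] = std::pow(priors[i], prior_scale);
    sum += scales[i];
  }
  double average = sum / scales.size();
  for (std::size_t i = 0; i < scales.size(); i++)
    scales[i] /= average;
  return scales;
}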
I haven't dug deeper into how these two differences influence training, especially the second one. I assume the priors added during model initialization are used at every iteration when training the traditional acoustic model, which doesn't apply to the raw model. Before I look into the lower-level code, I'd like to hear your advice: is it worth creating a corresponding transition model so that I can train with "steps/nnet3/train_dnn.py", or should I modify "steps/nnet3/train_raw_dnn.py" so that it also adds priors during model initialization? Or does the inaccuracy have nothing to do with which of the two scripts I use, so that I need to solve this problem some other way?
Any suggestions would be highly appreciated, thanks.
Michael