Understanding fMLLR transformation


suhas pillai

Feb 5, 2017, 3:31:40 PM
to kaldi-help
Hi all,

I want to understand how the fMLLR transformation matrix is estimated. For comparison (just to help me understand), consider MAP estimation: we train the model parameters on the training data, and to adapt to a new speaker we use some portion of the test data (say 25%) to estimate new model parameters (i.e. weights, means and covariances).

I want to know which of the following is correct:

1. fMLLR is a linear transformation of the features; we estimate the transformation matrix from the training data and use it on the test data, or
2. we estimate the transformation matrix from the training data, re-estimate it on some portion of the test data, and use the re-estimated matrix for testing (this looks similar to MAP estimation).

(tri3b)
When we do (LDA+MLLT+SAT) training, we estimate the fMLLR transformation matrix from the training data using train_sat.sh.

Then when I run decode_fmllr.sh, do we use any test-speaker data (like 25% of the test data) to re-estimate the transformation matrix, or do we just use the same transformation matrix from training?
I went through the script, and as far as I can tell, I am never asked to specify how much of the test data is used for adaptation.
I want to know what happens when I run:

steps/train_sat.sh
steps/decode_fmllr.sh
steps/align_fmllr.sh


-Suhas

Daniel Povey

Feb 6, 2017, 12:07:29 AM
to kaldi-help
Traditionally when fMLLR is used, it's adapted on *test* data, not training data (since in most tasks of interest, the test speakers do not appear in the training data).  But it's *unsupervised* adaptation, meaning we don't have access to the transcripts.  We do an initial decoding pass without fMLLR and use the text from that decoding pass as the transcripts in the forward-backward to get an initial fMLLR transformation.  Then, often, we decode again, get new transcripts, and re-estimate the fMLLR matrix again, before doing a final decoding pass.
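For context, fMLLR (also known as constrained MLLR) is an affine transform of the feature vectors, x' = Ax + b, usually written x' = W[x; 1] with W = [A b]. A minimal numpy sketch of *applying* such a transform to a speaker's features — the matrix below is an identity placeholder for illustration, not a transform estimated by Kaldi:

```python
import numpy as np

# fMLLR applies one affine transform per speaker to every frame:
#   x' = A x + b, i.e. x' = W [x; 1] where W = [A b].
# Toy 3-dimensional features; W here is a placeholder identity transform,
# NOT something estimated from data as Kaldi's accumulators would do.
dim = 3
feats = np.random.randn(10, dim)                   # 10 frames of features
W = np.hstack([np.eye(dim), np.zeros((dim, 1))])   # [A b] with A=I, b=0

# Append a 1 to each frame and apply the transform to all frames at once.
feats_ext = np.hstack([feats, np.ones((feats.shape[0], 1))])
adapted = feats_ext @ W.T                          # x' = W [x; 1]
```

With the identity placeholder the adapted features equal the originals; in real decoding, W is estimated per test speaker from the first-pass hypotheses, as described above.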


Dan


--
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

suhas pillai

Feb 6, 2017, 10:14:30 PM
to kaldi-help, dpo...@gmail.com
Thanks for the reply.


I am just rephrasing to get this right in my head.

For Training

We estimate an initial global fMLLR matrix (by which I mean using all the training speakers), using the transcripts obtained from alignments with a previous model, e.g. tri2a (LDA + MLLT). Then we decode, get new transcripts, and re-estimate the global fMLLR matrix; we repeat this for the number of fMLLR iterations given by the user (the fmllr_iter variable).

For Testing

I think I understood what you said, but I have one doubt: why don't we use the fMLLR matrix estimated from the training speakers (a "global" fMLLR) for the initial decoding pass, and then use those transcripts in the forward-backward algorithm to get the initial fMLLR transformation matrix for the test speaker? I am trying to relate this to MAP adaptation. Do you think estimating the initial fMLLR for a test speaker starting from a training-derived (global) fMLLR matrix is fine when the test speaker is somewhat similar to the training speakers, and, when the test speaker is really different, doing the initial decoding pass without fMLLR and using the text from that pass as the transcripts in the forward-backward to get the initial transformation?

I would really appreciate your help in understanding this.

-Suhas

Daniel Povey

Feb 6, 2017, 10:46:17 PM
to suhas pillai, kaldi-help
> I am just rephrasing to get this right in my head.
>
> For Training
>
> We estimate an initial global fMLLR matrix (by which I mean using all the training speakers), using the transcripts obtained from alignments with a previous model, e.g. tri2a (LDA + MLLT). Then we decode, get new transcripts, and re-estimate the global fMLLR matrix; we repeat this for the number of fMLLR iterations given by the user (the fmllr_iter variable).

No, there is nothing like this.  If you were treating all your training data as one speaker, fMLLR would do nothing (at least if you are doing MLLT), so you wouldn't do it.  But in fact the practice is to estimate fMLLR matrices for the individual training speakers and train on the adapted features; this is called Speaker Adapted Training (SAT).
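To illustrate the per-speaker aspect, here is a toy numpy sketch of SAT-style feature adaptation: each speaker s gets their own matrix W_s = [A_s b_s], and training then proceeds on the adapted features. The speaker names and transforms below are made up for illustration, not estimated the way Kaldi would:

```python
import numpy as np

# In SAT, each *training* speaker has their own fMLLR matrix W_s = [A_s b_s],
# and the acoustic model is trained on the adapted features x' = W_s [x; 1].
# These placeholder transforms just scale and shift; real ones are estimated
# by accumulating per-speaker statistics against the current model.
dim = 2
transforms = {
    "spk1": np.hstack([2.0 * np.eye(dim), np.ones((dim, 1))]),   # x' = 2x + 1
    "spk2": np.hstack([0.5 * np.eye(dim), np.zeros((dim, 1))]),  # x' = 0.5x
}

def adapt(feats, W):
    """Apply x' = W [x; 1] to every frame (row) of feats."""
    ext = np.hstack([feats, np.ones((feats.shape[0], 1))])
    return ext @ W.T

# One toy utterance of all-ones frames per speaker.
utts = {"spk1": np.ones((4, dim)), "spk2": np.ones((4, dim))}
adapted = {spk: adapt(f, transforms[spk]) for spk, f in utts.items()}
```

The same feature frame ends up in different places for different speakers, which is exactly what lets SAT training factor out speaker variation.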


> For Testing
>
> I think I understood what you said, but I have one doubt: why don't we use the fMLLR matrix estimated from the training speakers (global fMLLR)

Because there is no such thing.  However, if you are doing Speaker Adapted Training, for the first pass of decoding you would typically use a model that had not been subject to Speaker Adapted Training (i.e. with no transforms for the training speakers).

I don't have time to follow the rest.
Dan

suhas pillai

Feb 6, 2017, 11:06:27 PM
to dpo...@gmail.com, kaldi-help
Thanks a lot Dan....I understand now...

Suhas