GMM Training


Francesco Tuveri

Sep 26, 2016, 11:11:37 AM
to bob-devel
Hi,

I'm trying to train a GMM (GMMMachine) with standard EM (ML_GMMTrainer). I'm following the strategy of this script (https://www.idiap.ch/software/bob/docs/latest/bioidiap/bob.bio.gmm/master/_modules/bob/bio/gmm/algorithm/GMM.html), where a KMeans machine is trained first and then used to initialize the GMM. I'm having the following problems:

- if I initialize the KMeans with 'RANDOM', then at the first GMM-training iteration I get:
RuntimeError: bob.learn.em.ML_GMMTrainer - cannot perform the e_step method: C++ exception caught: 'logadd: minusdif (-nan) log_b (-nan) or log_a (185.999011) is nan'

- if I don't use KMeans initialization (so I guess the GMM is randomly initialized), I get this strange behaviour:
bob.learn.em@2016-09-26 17:03:48,861 -- INFO: Iteration = 0/50
bob.learn.em@2016-09-26 17:04:35,594 -- INFO: log likelihood = -3.670474
bob.learn.em@2016-09-26 17:04:35,594 -- INFO: convergence value = 0.945636
bob.learn.em@2016-09-26 17:04:35,594 -- INFO: Iteration = 1/50
bob.learn.em@2016-09-26 17:05:22,325 -- INFO: log likelihood = -3.670474
bob.learn.em@2016-09-26 17:05:22,325 -- INFO: convergence value = 0.000000

I read some other posts, and found out that with KMEANS_PLUS_PLUS everything is fine. The problem is that this kind of initialization is too slow on my full dataset (around 100 hours of speech); for example, after 3 days the initialization was still running.
I have also tried setting ubm.set_variance_thresholds(1e-6), but it doesn't help in my case.
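
For reference, here is roughly what my setup looks like (a minimal sketch in the spirit of the linked script; the sizes and the random data are placeholders for my real features):

import numpy
import bob.learn.em

# Placeholders: in reality `data` is the stacked 2D array of MFCC frames
gaussians, dim = 512, 60
data = numpy.random.randn(10000, dim)

# 1) Train the KMeans machine
kmeans = bob.learn.em.KMeansMachine(gaussians, dim)
kmeans_trainer = bob.learn.em.KMeansTrainer('RANDOM')
bob.learn.em.train(kmeans_trainer, kmeans, data,
                   max_iterations=25, convergence_threshold=1e-5)

# 2) Use the KMeans result to initialize the GMM
variances, weights = kmeans.get_variances_and_weights_for_each_cluster(data)
ubm = bob.learn.em.GMMMachine(gaussians, dim)
ubm.means = kmeans.means
ubm.variances = variances
ubm.weights = weights
ubm.set_variance_thresholds(1e-6)

# 3) ML training of the GMM
gmm_trainer = bob.learn.em.ML_GMMTrainer(update_means=True,
                                         update_variances=True,
                                         update_weights=True)
bob.learn.em.train(gmm_trainer, ubm, data,
                   max_iterations=50, convergence_threshold=1e-5)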

Pavel Korshunov

Sep 26, 2016, 11:22:07 AM
to bob-...@googlegroups.com
Hi Francesco,

I would go in the direction of lowering ubm.set_variance_thresholds. I had a similar problem with some of my features, and setting ubm.set_variance_thresholds to 1e-7 helped me.

The way to figure it out was to check the extracted features (open the HDF5 files that you have after the extraction step and look into them) and see the difference between features from different speech samples. In my case, the difference was in the range of 1e-6, so setting ubm.set_variance_thresholds to 1e-7 made sense for me.
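
For example, something along these lines (the file names are just examples, and I'm assuming the features were written with bob.io.base):

import numpy
import bob.io.base

# Load the features extracted from two different speech samples
f1 = bob.io.base.load('features/sample1.hdf5')
f2 = bob.io.base.load('features/sample2.hdf5')

# Largest per-dimension difference between the two samples,
# and the smallest per-dimension variance
print(numpy.abs(f1.mean(axis=0) - f2.mean(axis=0)).max())
print(f1.var(axis=0).min())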

But maybe someone will suggest a 'better' approach :)

cheers,
-pavel





--
Dr. Pavel Korshunov
Biometric group
Idiap Research Institute
Rue Marconi 19
CH - 1920 Martigny
Switzerland

Room: 207

Tiago Freitas Pereira

Sep 26, 2016, 11:24:13 AM
to bob-...@googlegroups.com
Hi Francesco,

First thing, check your input data.
Are you sure that you don't have any `NaN` or `inf` values?

Cheers

Tiago

Francesco Tuveri

Sep 26, 2016, 12:36:02 PM
to bob-devel
Yes, I checked the cepstral parameters and there are no inf and no NaN values. I also tried decreasing the variance threshold down to 1e-32, and the problem is still there.
One thing that I noticed is that I have a lot of zero-valued MFCCs, but that shouldn't be a problem, right?
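
This is roughly the check I ran (`all_mfccs` stands for my list of per-utterance feature arrays):

import numpy

features = numpy.vstack(all_mfccs)
print(numpy.isnan(features).any())  # -> False
print(numpy.isinf(features).any())  # -> False
print((features == 0).sum())        # -> a large number of exact zeros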


Pavel Korshunov

Sep 26, 2016, 1:07:24 PM
to bob-...@googlegroups.com
Maybe the zero-valued MFCCs are the problem. The error message says 'minusdif (-nan) log_b (-nan)', so the log likelihood for some of the Gaussians (probably the initial outputs from KMeans) is NaN. Since you are at the GMM step, your KMeans has finished, so you can check the HDF5 files generated by KMeans. I suspect the last completed file (I guess kmeans.hdf5) will have some NaN values. You need to trace the problem to where it occurs first.

If you want to check deeper, here is the code where it breaks: https://www.idiap.ch/software/bob/docs/releases/last/doxygen/html/GMMMachine_8cc_source.html#l00241
The actual error message is raised inside the LogAdd implementation that this code calls.
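
A quick way to do that check (assuming the KMeans machine was indeed saved to a file called kmeans.hdf5):

import numpy
import bob.io.base
import bob.learn.em

# Load the machine written by the KMeans step and look for NaNs
machine = bob.learn.em.KMeansMachine(bob.io.base.HDF5File('kmeans.hdf5'))
print(numpy.isnan(machine.means).any())  # True would confirm the suspicion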



Tiago Freitas Pereira

Sep 26, 2016, 3:50:03 PM
to bob-...@googlegroups.com
Hi Francesco, 

The zeros shouldn't be a major issue.
As Pavel pointed out, the `variance_threshold` property should solve the issue when you have clusters with very small variances.

Could you please check the output of the KMeans (right before running the ML_GMMTrainer)?
What do the means, variances and weights look like?
Do you see anything abnormal?
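
For example, something like this right after the KMeans training (`kmeans` and `data` as in your first message):

import numpy

# Inspect the KMeans result before it is handed to the GMM
variances, weights = kmeans.get_variances_and_weights_for_each_cluster(data)
print('NaNs in means:    ', numpy.isnan(kmeans.means).any())
print('NaNs in variances:', numpy.isnan(variances).any())
print('min/max weight:   ', weights.min(), weights.max())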

Cheers






--
Tiago

Francesco Tuveri

Sep 27, 2016, 5:13:01 AM
to bob-devel
Yeah, something strange is happening with the KMeans training. At the end of it, I end up with a lot of NaN values in the means and variances of the KMeans machine; the weights are fine. The number of NaN elements does not change if I decrease the variance threshold.

Francesco Tuveri

Sep 27, 2016, 5:28:48 AM
to bob-devel
With 'RANDOM_NO_DUPLICATE' initialization it looks like there are no problems, but could you explain the difference between this method and the regular 'RANDOM'?

Tiago Freitas Pereira

Sep 27, 2016, 8:09:59 AM
to bob-...@googlegroups.com
Hi Francesco,

The RANDOM initialization selects the initial means randomly from your data.
RANDOM_NO_DUPLICATE does the same job, but does not allow the same sample to be picked twice.
KMEANS_PLUS_PLUS is more sophisticated; you can find more info at http://en.wikipedia.org/wiki/K-means%2B%2B and http://pythonhosted.org/bob.learn.em/py_api.html#bob.learn.em.KMeansTrainer.initialization_method
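
For example (using the KMeansTrainer API from the docs linked above):

import bob.learn.em

# Select the initialization method at construction time...
kmeans_trainer = bob.learn.em.KMeansTrainer('RANDOM_NO_DUPLICATE')
# ...or via the property
kmeans_trainer.initialization_method = 'RANDOM_NO_DUPLICATE'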

You said that you have some features with zeros in your dataset. Are there a lot of them?
I suspect that the RANDOM initialization is picking two equal samples and, as a consequence, you end up with an empty cluster.
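
You can reproduce the NaNs that come from an empty cluster with plain numpy:

import numpy

# Statistics over a cluster with no samples assigned to it
empty_cluster = numpy.empty((0, 3))
print(empty_cluster.mean(axis=0))  # -> [nan nan nan] (with a RuntimeWarning)
print(empty_cluster.var(axis=0))   # -> [nan nan nan]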


You can find more info about the EM-based algorithms here: http://pythonhosted.org/bob.learn.em/index.html

Cheers






--
Tiago

Manuel Günther

Sep 27, 2016, 12:00:26 PM
to bob-devel
Hmm... if RANDOM is not working properly and generates NaN values when it picks the same mean twice, maybe we should remove RANDOM and make RANDOM_NO_DUPLICATE the default. As a consequence, you cannot have more means than data samples, which should be the right thing to enforce anyway.

Manuel