Question about training data

Harry

Oct 9, 2014, 12:29:34 PM

to bob-...@googlegroups.com

Hello there,

I have a question about the training data.

I have a database, and I use part of it to train my UBM. Should I use the same data to train the TVS?

What about enrollment data?

I am a little confused. We have 3 different parts: Train, enroll and test.

Train: used to train the UBM and T.

Test is clear, but what about enroll?

What is the difference between training data and enrollment data?

Thanks!

Elie Khoury

Oct 9, 2014, 12:40:37 PM

to bob-...@googlegroups.com

Hello,

Typically, to evaluate a biometric system (say speaker recognition or face recognition), the data is split into 3 parts.

a Training set: used to train the parameters of your algorithm, e.g., to train your UBM
a Development (DEV) set: used to tune the hyper-parameters of your algorithm, e.g., the number of Gaussians, the threshold that corresponds to your EER, etc.
an Evaluation (EVAL) set: used to evaluate the generalization performance of your algorithm on previously unseen data

The Development and Evaluation sets are each split further into enrollment samples, which are used to enroll client models, and probe samples, which are tested against all client models.
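
The three-way split and the enroll/probe split described above can be sketched as follows. This is only a minimal illustration, not the actual protocol linked below: the 50/25/25 proportions, the choice to keep speakers disjoint across the three sets, the number of enrollment files per client, and all function and file names are assumptions made for the example.

```python
import random

def split_speakers(speakers, seed=0, fractions=(0.5, 0.25, 0.25)):
    """Partition speaker IDs into disjoint train/dev/eval groups.

    The fractions are an assumption for illustration only.
    """
    ids = sorted(speakers)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * fractions[0])
    n_dev = int(len(ids) * fractions[1])
    return {
        "train": ids[:n_train],
        "dev": ids[n_train:n_train + n_dev],
        "eval": ids[n_train + n_dev:],
    }

def split_enroll_probe(files_by_speaker, speakers, n_enroll=2):
    """For each client, use the first n_enroll files to enroll; the rest are probes."""
    enroll, probe = {}, {}
    for spk in speakers:
        files = sorted(files_by_speaker[spk])
        enroll[spk] = files[:n_enroll]
        probe[spk] = files[n_enroll:]
    return enroll, probe

# Toy corpus: 8 hypothetical speakers with 5 recordings each.
files_by_speaker = {
    f"spk{i:02d}": [f"spk{i:02d}_utt{j}.wav" for j in range(5)]
    for i in range(8)
}
groups = split_speakers(files_by_speaker)
dev_enroll, dev_probe = split_enroll_probe(files_by_speaker, groups["dev"])
eval_enroll, eval_probe = split_enroll_probe(files_by_speaker, groups["eval"])
```

Only the files of the `train` speakers would go into UBM and T training; every probe file in DEV or EVAL is then scored against every enrolled client model of the same set.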

I hope this helps.
An example protocol can be found here:
https://github.com/bioidiap/spear/tree/master/protocols/banca/G

Best regards,
Elie

-- 
-------------------
Dr. Elie Khoury
Post Doctorant
Biometric Person Recognition Group
IDIAP Research Institute (Switzerland)
Tel : +41 27 721 77 23

Harry

Oct 9, 2014, 6:44:31 PM

to bob-...@googlegroups.com, elie....@idiap.ch

Thank you Elie,

But my main question is still here:

Should I train UBM and T matrix with the same set of data?

If not, should I, for example, use half of my dataset to train my UBM and a small portion for T?

What percentage of the data should I allocate to training the UBM, training T, and testing?

Harry

Oct 9, 2014, 6:48:40 PM

to bob-...@googlegroups.com, elie....@idiap.ch

Besides, I should add that when I used 50% of my dataset to train the UBM, the same data to train T, and the rest for testing, it didn't give good results. So I might need to change it somehow. I just want to know:

1. What is the standard allocation of the data for training the UBM and the T matrix?

2. Should I use another dataset to train my UBM, or should I use the same dataset?

Manuel Günther

Oct 10, 2014, 3:53:55 AM

to bob-...@googlegroups.com, elie....@idiap.ch

Dear Harry,

We haven't answered your question satisfactorily mainly because this is an open issue. So, there is no other way than trying it out yourself!

Usually, we use the same data to train the UBM and the subsequent matrices. However, it is not entirely clear whether this is a good choice.

I am not an expert, but I've heard that the UBM and, particularly, the T matrix need quite a lot of data to train. Just think about it: the UBM is the "universal background model", which means that it collects all the variability that might occur in anyone. If you train it with few features from few identities, it will most likely not be able to capture the full variability.
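
To get a feeling for why the T matrix in particular needs a lot of data, one can simply count parameters. The sketch below assumes a diagonal-covariance UBM; the sizes (512 Gaussians, 60-dimensional features, 400-dimensional i-vectors) are illustrative assumptions, not values taken from this thread.

```python
def ubm_free_parameters(n_gaussians, feat_dim):
    """Free parameters of a diagonal-covariance GMM."""
    weights = n_gaussians - 1              # the weights must sum to one
    means = n_gaussians * feat_dim
    variances = n_gaussians * feat_dim
    return weights + means + variances

def t_matrix_parameters(n_gaussians, feat_dim, ivector_dim):
    """Entries of T, which maps an i-vector to a supervector offset."""
    return n_gaussians * feat_dim * ivector_dim

# Assumed sizes, for illustration only.
C, D, R = 512, 60, 400
print(ubm_free_parameters(C, D))      # 61951
print(t_matrix_parameters(C, D, R))   # 12288000
```

With these assumed sizes, T has roughly 200 times as many parameters as the UBM. In addition, the UBM is estimated from individual feature frames, whereas T is estimated from one set of statistics per utterance, so T has far fewer training examples available per parameter.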

Happy researching!
Manuel

Harry

Oct 10, 2014, 5:26:01 AM

to bob-...@googlegroups.com, elie....@idiap.ch

Hey Manuel,

Thank you. I see... So, what if we train the UBM on another database with enough data, but with a different microphone and totally different speakers? Will it be a very different model compared with the test dataset?

What about the T matrix, if the noise and channel characteristics of the other database are different from those of the test dataset? What I'm trying to say is:

1. If we use another database that is big enough, is it better to train just the UBM on it, or both the UBM and T?

2. Is it better to train T on the same database that we want to test on, to have the same kind of channel variability?

And ...

3. A) In the i-vector approach, we don't have any enrollment step. Yes?

B) But in JFA, we have an enrollment step. Correct?

4. A) We don't use any labels while training the UBM and T in the i-vector approach. We only use labels to train PLDA and LDA. Correct?

B) But we use labels in the training step in JFA. Am I right?

5. If we don't use labels to train T, can we use the test data to train T? Or even the UBM?

Thank you!

Harry

Oct 10, 2014, 5:29:59 AM

to bob-...@googlegroups.com, elie....@idiap.ch

Hello Elie, I am just adding your answer here:

Hello again,
Here are some quick answers. Please check more research papers to get a full understanding.

On 10/10/2014 11:04 AM, Harry wrote:

> Hey Manuel,
>
> Thank you. I see... So, what if we train the UBM on another database with enough data, but with a different microphone and totally different speakers? Will it be a very different model compared with the test dataset?
> What about the T matrix, if the noise and channel characteristics of the other database are different from those of the test dataset? What I'm trying to say is:
> 1. If we use another database that is big enough, is it better to train just the UBM on it, or both the UBM and T?

I would say: train your UBM with the external database and your T matrix with your own database (or a balanced mixture of both).

> 2. Is it better to train T on the same database that we want to test on, to have the same kind of channel variability?

Yes, indeed, that is how it works.

> And ...
> 3. A) In the i-vector approach, we don't have any enrollment step. Yes?

It depends on how you see it and on what additional post-processing you apply after i-vector extraction, but indeed there is no explicit enrollment step such as MAP adaptation or projection. However, averaging the i-vectors of the same speaker (which works best for speaker recognition) can be seen as a kind of enrollment.

> B) But in JFA, we have an enrollment step. Correct?

Right.

> 4. A) We don't use any labels while training the UBM and T in the i-vector approach. We only use labels to train PLDA and LDA. Correct?

Indeed, T and the UBM don't need labels, whereas PLDA, LDA and WCCN do.

> B) But we use labels in the training step in JFA. Am I right?

Indeed, training JFA or ISV requires the subject labels in order to train the within- and between-subject variabilities.
Best,
Elie
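
The "averaging as enrollment" idea from the answer to 3A can be sketched in a few lines. This is a toy illustration, assuming the i-vectors have already been extracted. Scoring by cosine similarity is an assumption made for the example (it is not stated in this thread), and the three-dimensional vectors are made up.

```python
import math

def average_ivectors(ivectors):
    """Enroll a speaker by averaging that speaker's enrollment i-vectors."""
    n = len(ivectors)
    return [sum(column) / n for column in zip(*ivectors)]

def cosine_score(a, b):
    """Cosine similarity between a speaker model and a probe i-vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up enrollment i-vectors of one speaker (real ones have hundreds of dimensions).
enroll_ivectors = [[1.0, 0.0, 1.0], [3.0, 0.0, 1.0]]
model = average_ivectors(enroll_ivectors)

same_direction_probe = [4.0, 0.0, 2.0]
orthogonal_probe = [0.0, 1.0, 0.0]
high_score = cosine_score(model, same_direction_probe)
low_score = cosine_score(model, orthogonal_probe)
```

Here `high_score` is close to 1 and `low_score` is 0. In a real system, post-processing such as LDA, WCCN or PLDA would typically be applied before scoring.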

Elie Khoury

Oct 10, 2014, 5:36:58 AM

to bob-...@googlegroups.com

On 10/10/2014 11:29 AM, Harry wrote:
> 5. If we don't use labels to train T, can we use test data to train T?
> Or even UBM?

Many "No"s!!! A test set is meant to simulate a real-world scenario
where you only have access to one utterance at a time, and your system
is already pre-trained!!
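
A simple way to guard against this mistake is to check that no test file appears in any training list before training starts. The following is a minimal sketch; the function name and the file names are hypothetical.

```python
def find_leaks(training_lists, test_files):
    """Return, per training stage, the files that also occur in the test set."""
    test = set(test_files)
    leaks = {}
    for stage, files in training_lists.items():
        overlap = sorted(test & set(files))
        if overlap:
            leaks[stage] = overlap
    return leaks

# Hypothetical file lists: "x.wav" has wrongly ended up in the T training list.
training_lists = {
    "ubm": ["a.wav", "b.wav", "c.wav"],
    "t_matrix": ["b.wav", "c.wav", "x.wav"],
}
test_files = ["x.wav", "y.wav"]

leaks = find_leaks(training_lists, test_files)
print(leaks)  # {'t_matrix': ['x.wav']}
```

An empty result means that the UBM and T training lists are disjoint from the test set. The check covers only file names, so it would not catch the same recording stored under two different names.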

Harry

Oct 10, 2014, 8:37:34 AM

to bob-...@googlegroups.com, elie....@idiap.ch

Well, thank you Elie, you're so helpful.

I think my UBM is trained well enough, but I still have a problem with my T matrix: I use just half of my data to train T, and the other half is used for testing. It seems the T matrix doesn't work well. I don't want to use the UBM training data to train T. To emphasize channel variability, I think it is better to train T only on my own database, not on other databases.

What do you think?

1. Should I add my UBM training data to my current T training data?

2. Is it fine if I use exactly the same data to train the UBM and T?

Elie Khoury

Oct 10, 2014, 9:23:57 AM

to bob-...@googlegroups.com

On 10/10/2014 02:37 PM, Harry wrote:

> Well, thank you Elie, you're so helpful.
> I think my UBM is trained well enough, but I still have a problem with my T matrix: I use just half of my data to train T, and the other half is used for testing. It seems the T matrix doesn't work well. I don't want to use the UBM training data to train T. To emphasize channel variability, I think it is better to train T only on my own database, not on other databases.
>
> What do you think?
>
> 1. Should I add my UBM training data to my current T training data?

If your UBM training data is from the same database (i.e. same recording conditions), I don't see a reason not to add it to the TV training data.

> 2. Is it fine if I use exactly the same data to train the UBM and T?

Yes, as Manuel hinted, the more data (especially "in-domain" data) you have to train the UBM and T, the better you train them. For example, for the NIST SREs and MOBIO, I tend to use the same data to train both the UBM and T. Sometimes, I train a gender-independent UBM followed by a gender-dependent T matrix.

Please check the discussion about it in this paper:
cs.uef.fi/odyssey2014/program/pdfs/56.pdf
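
The "gender-independent UBM, gender-dependent T" setup mentioned above amounts to building the training file lists differently for the two stages. The following is a minimal sketch of that bookkeeping only (no actual training). It assumes each training recording comes with a gender tag; the function and file names are hypothetical.

```python
from collections import defaultdict

def build_training_lists(recordings):
    """recordings: iterable of (path, gender) pairs from the training set only."""
    ubm_list = []
    t_lists = defaultdict(list)
    for path, gender in recordings:
        ubm_list.append(path)          # one UBM trained on all training data
        t_lists[gender].append(path)   # one T matrix per gender
    return ubm_list, dict(t_lists)

# Hypothetical training recordings.
recordings = [
    ("f1.wav", "female"),
    ("m1.wav", "male"),
    ("f2.wav", "female"),
]
ubm_list, t_lists = build_training_lists(recordings)
```

Every file used for a T matrix is also in the UBM list, so both stages draw on the same data. Each gender-dependent T matrix sees only part of that data, which matters when the training set is small.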

Best,
Elie

H

Oct 10, 2014, 9:32:47 AM

to bob-...@googlegroups.com, elie....@idiap.ch

> If your UBM training data is from the same database (i.e. same recording conditions), I don't see a reason not to add it to the TV training data.

The problem is that I don't have enough data to train the UBM, so I am using another dataset, with totally different recording conditions, speakers, etc., to train it. For T, I am using half of my dataset for training and the other half for testing. Should I add the other dataset (which I've already used for training the UBM) to the training half of my dataset, to have more data for training T?