1. First of all, I don't know how you computed the FRR results, so I cannot tell you why one is worse than the other. If you compute the threshold for the FRR on the development set and apply it to both dev and eval, the threshold might not be well suited to the eval set, and hence the algorithm might appear worse than it actually is.
OK, this command evaluates the FRR at FAR=0.001. As mentioned above, on small datasets the low-FAR region is not very stable. Hence the threshold might fluctuate a lot, leading to quite different FRR values for different algorithms, especially on the eval set. Otherwise the approach is correct. You might want to use a different FAR value (this is a parameter of evaluate.py).
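In case it helps, here is a minimal NumPy sketch of that kind of evaluation, assuming similarity scores (higher = more genuine) and hypothetical score arrays; evaluate.py does the equivalent internally, so this is purely for illustration:

    import numpy as np

    def far_threshold(negatives, far=0.001):
        # Lowest threshold at which at most a fraction ``far`` of the
        # impostor (negative) scores is still accepted; ties between
        # equal scores are ignored for simplicity.
        negatives = np.sort(negatives)
        k = min(int(np.ceil(len(negatives) * (1.0 - far))), len(negatives) - 1)
        return negatives[k]

    def frr(positives, threshold):
        # Fraction of genuine (positive) scores falsely rejected.
        return float(np.mean(np.asarray(positives) < threshold))

    # hypothetical scores, just to make the sketch runnable
    rng = np.random.default_rng(0)
    dev_negatives = rng.normal(0.0, 1.0, 5000)
    dev_positives = rng.normal(3.0, 1.0, 500)
    eval_positives = rng.normal(2.8, 1.0, 500)

    # the threshold is fixed on the development set ...
    thr = far_threshold(dev_negatives, far=0.001)
    # ... and applied unchanged to the eval scores
    print("dev  FRR @ FAR=0.001:", frr(dev_positives, thr))
    print("eval FRR @ FAR=0.001:", frr(eval_positives, thr))

Note that with only 5000 impostor scores, FAR=0.001 means the threshold is determined by the 5 highest negative scores alone, which is exactly why this region is so unstable.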
How to select a threshold depends strongly on the application at hand. One common choice is to select the threshold at the equal error rate (EER) on the development set.
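For example, reusing the helpers and score arrays from the sketch above, a simple (again hypothetical) EER-based threshold selection could look like this:

    def eer_threshold(negatives, positives):
        # Scan all observed scores and return the threshold where FAR
        # and FRR on this (development) set are closest to each other.
        candidates = np.sort(np.concatenate([negatives, positives]))
        gaps = [abs(float(np.mean(negatives >= t)) - float(np.mean(positives < t)))
                for t in candidates]
        return candidates[int(np.argmin(gaps))]

    thr = eer_threshold(dev_negatives, dev_positives)
    print("eval FRR at dev EER threshold:", frr(eval_positives, thr))

The EER threshold is usually much more stable than a low-FAR threshold, since it lies in a region where both score distributions still contain plenty of samples.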
Manuel