Interpreting Results


Neve Stagy

Jun 12, 2017, 5:11:42 PM
to bob-devel
Hello all,

I've been playing around with SPEAR and have a few questions. Below are my results on Voxforge:

           GMM-UBM    ISV       JFA
FRR-dev    0.0633     0.0533    0.4167
FRR-eval   0.29       0.2633    0.7033
EER-dev    0.0198     0.0104    0.0332
HTER-eval  0.0247     0.0237    0.0893
RR-dev     0.9967     1.0       0.99
RR-eval    0.9967     1.0       0.91

1. Why is there such a large difference between the FRR on the dev and eval sets? For example, since gmm_voxforge.py has the number of Gaussians hardcoded, what data-dependent factors cause such dramatic underfitting on the eval set?

2. What corpus size makes JFA or i-vectors perform optimally? I could not identify any literature describing specific optimal conditions.

3. My current project aims to verify whether certain voice synthesis attacks could be misclassified as a genuine user's speech. For such an experiment, is it even necessary to split the data into dev and eval sets? I am more familiar with ordinary machine-learning train/test routines than with how biometric algorithms are evaluated.

4. Are these classifiers to be compared by FAR/FRR/EER alone? What are the deciding factors used in standard practice?

5. How does the recognition rate help evaluate performance? Specifically for JFA: since both FRRs are so high, how does the RR remain near perfect?


Any guidance would be appreciated!

Best,
-Steve

Manuel Günther

Jun 12, 2017, 7:45:42 PM
to bob-devel
Hi Steve,

Although I am not an expert in speaker recognition, here are some thoughts.

1. First of all, I don't know how you computed the FRR results, so I cannot tell you why one is worse than the other. If you compute the threshold for the FRR on the development set and apply it to both dev and eval, the threshold might not transfer well to the eval data and, hence, the FRR on the eval set can look much worse than on dev (see the sketch after point 5).

2. I don't know how large the corpus needs to be for JFA and i-vectors, but I have learned that i-vectors in particular need much, much more data than is present in the Voxforge (toy) database.

3. For voice presentation attack detection (PAD) you might want to consider different metrics. I am not sure whether bob.bio.spear provides such metrics, or even evaluation protocols for this; bob.bio.spear is designed for biometric recognition, not for PAD.

4. The best way to compare two classifiers is the HTER on the eval set, as only this measure is unbiased: the threshold is fixed on the development set, so the error reported on the eval set has not been tuned on that data (the sketch after point 5 prints this number as well).

5. The recognition rate evaluates identification, while EER/HTER evaluate verification, so both have their place; which one matters depends on the task at hand. That the recognition rates are so high is also due to the very small dataset: with more enrolled models, the recognition rate will naturally decrease.
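
To make points 1 and 4 a bit more concrete, here is a minimal sketch of how I would compute these numbers myself. This is not code from bob.bio.spear; it assumes the usual four-column score files written by the toolchain (claimed-model-id, real-id, probe-label, score), placeholder file names, and the bob.measure functions as far as I remember them (split_four_column, eer_threshold, farfrr), so please check the names against the bob.measure documentation:

import bob.measure
from bob.measure.load import split_four_column

# Placeholder paths -- point these at the score files of one experiment.
dev_neg, dev_pos = split_four_column("scores-dev")     # impostor, genuine scores
eval_neg, eval_pos = split_four_column("scores-eval")

# Fix the threshold on the development set only (here: at its EER point).
threshold = bob.measure.eer_threshold(dev_neg, dev_pos)

# Apply the *dev* threshold to both sets; the eval errors are therefore unbiased.
far_dev, frr_dev = bob.measure.farfrr(dev_neg, dev_pos, threshold)
far_eval, frr_eval = bob.measure.farfrr(eval_neg, eval_pos, threshold)

print("dev : FAR=%.4f FRR=%.4f" % (far_dev, frr_dev))
print("eval: FAR=%.4f FRR=%.4f HTER=%.4f" % (far_eval, frr_eval, (far_eval + frr_eval) / 2.0))

The HTER printed for the eval set is the number I would use to rank your three systems; the gap between the two FRR print-outs is exactly the effect from point 1.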

I hope this helps with at least some of the points.
Manuel

Neve Stagy

Jun 12, 2017, 8:09:55 PM
to bob-devel
Thank you so much!

1. First of all, I don't know how you computed the FRR results, so I cannot tell you why one is worse than the other. If you compute the threshold for the FRR on the development set and apply it to both dev and eval, the threshold might not transfer well to the eval data and, hence, the FRR on the eval set can look much worse than on dev.
The command I am using is evaluate.py -r -d scores-dev -e scores-eval -c FAR. What is the correct way to go about computing thresholds for both dev and eval?

Neve Stagy

Jun 12, 2017, 8:32:57 PM
to bob-devel
Also, how exactly is the recognition rate determined? For SPEAR, is it just the total number of samples that were not discarded by the system as meaningless noise? Any elaboration would be helpful.

Manuel Günther

Jun 13, 2017, 12:41:46 PM
to bob-devel
OK, this command evaluates the FRR at FAR=0.001. As mentioned above, in small datasets, low FAR regions are not very stable. Hence the threshold might fluctuate a lot, leading to variable FRR values for different algorithms, especially on the eval set. Otherwise the approach is correct. You might want to use a different FAR value (this is a parameter of evaluate.py).
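
If you want to see how sensitive the numbers are, you can sweep a few FAR targets yourself. Again only a sketch, with placeholder file names and the bob.measure calls as far as I remember them (far_threshold, farfrr):

import bob.measure
from bob.measure.load import split_four_column

dev_neg, dev_pos = split_four_column("scores-dev")      # placeholder paths
eval_neg, eval_pos = split_four_column("scores-eval")

for far_target in (0.1, 0.01, 0.001):
    # Threshold chosen on dev so that (at most) far_target of the impostor
    # scores are accepted, then applied unchanged to the eval scores.
    threshold = bob.measure.far_threshold(dev_neg, dev_pos, far_target)
    _, frr_dev = bob.measure.farfrr(dev_neg, dev_pos, threshold)
    _, frr_eval = bob.measure.farfrr(eval_neg, eval_pos, threshold)
    print("FAR target %.3f: FRR dev=%.4f  eval=%.4f" % (far_target, frr_dev, frr_eval))

On a database as small as Voxforge I would expect the dev/eval gap to grow as the FAR target shrinks; that is the instability I mean.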

The recognition rate is simply the number of correctly identified probe samples divided by the total number of probe samples; see for example Section 3.1 in http://publications.idiap.ch/downloads/reports/2013/Gunther_Idiap-RR-13-2017.pdf
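
In case it helps, a rank-1 recognition rate boils down to something like this sketch (plain Python, assuming the four-column score format and a placeholder file name; bob.measure also ships CMC/recognition-rate helpers, but check the documentation for the exact names):

from collections import defaultdict

# probe-label -> list of (score, is_genuine) over all models this probe was scored against
scores_per_probe = defaultdict(list)

with open("scores-dev") as f:                       # placeholder path
    for line in f:
        claimed, real, probe, score = line.split()  # four-column format
        scores_per_probe[probe].append((float(score), claimed == real))

# A probe counts as correctly identified if its highest score comes from the true model.
correct = sum(1 for trials in scores_per_probe.values() if max(trials)[1])
print("rank-1 recognition rate: %.4f" % (correct / float(len(scores_per_probe))))

Note that this only compares the scores of one probe against each other; no threshold is involved. That is why the recognition rate can stay near 1.0 for JFA even though the FRR at a fixed threshold is high: the genuine model still outranks the handful of impostor models, even when its absolute score drops below the verification threshold.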

Best regards
Manuel

Neve Stagy

Jun 13, 2017, 7:37:52 PM
to bob-devel
OK, this command evaluates the FRR at FAR=0.001. As mentioned above, in small datasets, low FAR regions are not very stable. Hence the threshold might fluctuate a lot, leading to variable FRR values for different algorithms, especially on the eval set. Otherwise the approach is correct. You might want to use a different FAR value (this is a parameter of evaluate.py).
That makes sense. What I'm now curious about is how to identify the ideal FAR threshold. 

In scores-dev (for example), I see scores ranging anywhere from 0.05 to 1.05. Are these percentages, or probabilities?

Manuel Günther

Jun 25, 2017, 7:38:20 PM
to bob-devel
These are scores and can be anything. The only required property is that larger scores are correlated with higher similarities.

How to select a threshold depends highly on the application at hand. One common choice is to take the threshold at the equal error rate (EER) point of the development set.
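
As a sketch (same caveats as before about the bob.measure names and the score-file format), turning that choice into accept/reject decisions looks like this:

import bob.measure
from bob.measure.load import split_four_column

dev_neg, dev_pos = split_four_column("scores-dev")        # placeholder path
threshold = bob.measure.eer_threshold(dev_neg, dev_pos)   # operating point fixed on dev

def accept(score):
    # Raw, uncalibrated scores: only the comparison against the chosen
    # threshold is meaningful, not the absolute value.
    return score >= threshold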

Manuel

Neve Stagy

Jul 2, 2017, 2:41:52 PM
to bob-devel
Sounds good, thanks!