Suitability of SPEAR for use in non-critical 'practical scenario'


QA Collective

Jan 9, 2017, 1:48:36 AM
to bob-devel
Hi All!

This is my first post so g'day from Australia! :)

As a sub-task of a project I'm working on, I want to use the audio from studio-recorded video (TV talk shows mainly) to record the identities of the speakers involved and subsequently identify them in other shows.  I intend to do this after I've diarized the audio using LIUM [http://www-lium.univ-lemans.fr/diarization/doku.php/welcome], which seems to do a very good job in my experience thus far.  I've read the post 'Use bob spear for speaker recognition in app' [ https://groups.google.com/d/msg/bob-devel/EeSRzhGJmHk/pRn5KLRfAgAJ ] with much interest.

While I have a Computer Science degree (10 years old), I'm certainly no academic in this area, and of all the types of AI/tech I've been looking at applying to my videos (face & pose recognition, speech recognition etc.) I've found speaker recognition to be the least polished for use by non-academics.  Fortunately, from my understanding of the documentation, Bob & SPEAR seem to strike a good balance between having to understand and apply each element of a toolchain individually (e.g. ALIZE) and being able to select pre-calibrated tools in a somewhat custom toolchain at the command line after a database has been prepared.  Is my understanding of SPEAR correct thus far?

At present, I'm still trying to determine the suitability of SPEAR over other tools for my purposes.  My foremost question at the moment is: understanding that SPEAR is currently used mostly for conducting research experiments, what aspects of the toolkit may be inefficient, limiting, missing or otherwise deficient for use in more practical scenarios?  I say 'more practical scenarios' because I'm not planning on using it to grant/deny access to any production system, but to gather metadata on audio to identify who speaks and when.  So the consequences of failure aren't huge.  Basically, in this instance I'm wondering if the 'research tool' focus is a disclaimer to protect against someone relying too much on SPEAR, or whether there is some other tangible aspect/functionality it's missing that I haven't yet foreseen the need for.  Otherwise, SPEAR seems to be quite high in terms of its recognition accuracy in the given experiments, yes?  Again, even assessing this has been difficult, as it's not been easy to get clear definitions (> 1 short sentence) of what EER / HTER mean in a practical sense! :-/

Thank you all in advance,
Andrew

Tiago Freitas Pereira

Jan 9, 2017, 4:41:47 AM
to bob-...@googlegroups.com
Hi Andrew, good morning from Switzerland.
I will try to answer your questions inline.

---


> As a sub-task of a project I'm working on, I want to use the audio from studio-recorded video (TV talk shows mainly) to record the identities of the speakers involved and subsequently identify them in other shows.  I intend to do this after I've diarized the audio using LIUM [http://www-lium.univ-lemans.fr/diarization/doku.php/welcome], which seems to do a very good job in my experience thus far.  I've read the post 'Use bob spear for speaker recognition in app' [ https://groups.google.com/d/msg/bob-devel/EeSRzhGJmHk/pRn5KLRfAgAJ ] with much interest.

> While I have a Computer Science degree (10 years old), I'm certainly no academic in this area, and of all the types of AI/tech I've been looking at applying to my videos (face & pose recognition, speech recognition etc.) I've found speaker recognition to be the least polished for use by non-academics.  Fortunately, from my understanding of the documentation, Bob & SPEAR seem to strike a good balance between having to understand and apply each element of a toolchain individually (e.g. ALIZE) and being able to select pre-calibrated tools in a somewhat custom toolchain at the command line after a database has been prepared.  Is my understanding of SPEAR correct thus far?

Yes, you understood correctly. You can use the bob.bio.spear API directly for your purposes (have a look at bob.bio.base too).

> At present, I'm still trying to determine the suitability of SPEAR over other tools for my purposes.  My foremost question at the moment is: understanding that SPEAR is currently used mostly for conducting research experiments, what aspects of the toolkit may be inefficient, limiting, missing or otherwise deficient for use in more practical scenarios?  I say 'more practical scenarios' because I'm not planning on using it to grant/deny access to any production system, but to gather metadata on audio to identify who speaks and when.  So the consequences of failure aren't huge.  Basically, in this instance I'm wondering if the 'research tool' focus is a disclaimer to protect against someone relying too much on SPEAR, or whether there is some other tangible aspect/functionality it's missing that I haven't yet foreseen the need for.  Otherwise, SPEAR seems to be quite high in terms of its recognition accuracy in the given experiments, yes?  Again, even assessing this has been difficult, as it's not been easy to get clear definitions (> 1 short sentence) of what EER / HTER mean in a practical sense! :-/

About efficiency, you tell me.
As far as I understood, you are dealing with speaker identification (1-N). bob.bio.spear is a tool for speaker verification (1-1). What I'm trying to say is that bob.bio.spear does not have functions to index audio template identities. If you want to use it for identification, it is necessary to do a brute-force search over your dataset.
About efficiency again, another important variable is the size of your input data and the algorithm that you will use. bob.bio.spear has a lot of GMM-based algorithms implemented, such as GMM itself, ISV, JFA and iVector (including some of the things that researchers usually do on top of them: PLDA, WCCN, whitening, LDA...). Which one will you use?
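
Just to make the brute-force point concrete, here is a rough, generic sketch (plain Python, not the bob.bio.spear API; `score` and the enrolled models are hypothetical placeholders for whatever scoring backend you end up using):

# Generic illustration of brute-force 1-N identification built on a 1-1 verifier.
# 'score' is a hypothetical callable (model, probe_features) -> float; it is not
# a bob.bio.spear function, just a stand-in for your chosen backend.

def identify(probe_features, enrolled_models, score, threshold):
    """Score the probe against every enrolled model; return the best-scoring
    identity if it clears the acceptance threshold, otherwise None."""
    best_id, best_score = None, float("-inf")
    for identity, model in enrolled_models.items():
        s = score(model, probe_features)      # one 1-1 verification trial
        if s > best_score:
            best_id, best_score = identity, s
    if best_score >= threshold:
        return best_id, best_score
    return None, best_score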

About the accuracy, it is really difficult to give an opinion about this. There are so many variables to consider. Just a hint, if your background models are very uncorrelated with your target scenario, you can expect a very low accuracy.
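
Since you mentioned EER / HTER: both summarise the two error rates of a verification system, the false acceptance rate (FAR, impostor trials wrongly accepted) and the false rejection rate (FRR, genuine trials wrongly rejected). The EER is the error rate at the decision threshold where FAR and FRR are equal; the HTER is simply (FAR + FRR) / 2 at whatever threshold you actually operate at (typically a threshold fixed on a development set and then applied unchanged to the evaluation set). A minimal sketch, assuming you have arrays of genuine and impostor scores:

import numpy as np

def far_frr(genuine, impostor, threshold):
    """FAR: fraction of impostor scores accepted; FRR: fraction of genuine scores rejected."""
    far = float(np.mean(np.asarray(impostor) >= threshold))
    frr = float(np.mean(np.asarray(genuine) < threshold))
    return far, frr

def eer(genuine, impostor):
    """Approximate EER: sweep candidate thresholds, take the point where FAR and FRR are closest."""
    thresholds = np.sort(np.concatenate([np.asarray(genuine), np.asarray(impostor)]))
    far, frr = min((far_frr(genuine, impostor, t) for t in thresholds),
                   key=lambda r: abs(r[0] - r[1]))
    return (far + frr) / 2.0

def hter(genuine, impostor, threshold):
    far, frr = far_frr(genuine, impostor, threshold)
    return (far + frr) / 2.0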


Hope I have answered your questions

Cheers





--
Tiago

QA Collective

Jan 9, 2017, 7:48:55 PM
to bob-...@googlegroups.com
Hi Tiago,

Many thanks for a prompt and straightforward reply.  I've snipped and replied inline too.

>> At present, I'm still trying to determine the suitability of SPEAR over other tools for my purposes.  My foremost question at the moment is: understanding that SPEAR is currently used mostly for conducting research experiments, what aspects of the toolkit may be inefficient, limiting, missing or otherwise deficient for use in more practical scenarios?  I say 'more practical scenarios' because I'm not planning on using it to grant/deny access to any production system, but to gather metadata on audio to identify who speaks and when.  So the consequences of failure aren't huge.  Basically, in this instance I'm wondering if the 'research tool' focus is a disclaimer to protect against someone relying too much on SPEAR, or whether there is some other tangible aspect/functionality it's missing that I haven't yet foreseen the need for.  Otherwise, SPEAR seems to be quite high in terms of its recognition accuracy in the given experiments, yes?  Again, even assessing this has been difficult, as it's not been easy to get clear definitions (> 1 short sentence) of what EER / HTER mean in a practical sense! :-/

> About efficiency, you tell me.
> As far as I understood, you are dealing with speaker identification (1-N). bob.bio.spear is a tool for speaker verification (1-1). What I'm trying to say is that bob.bio.spear does not have functions to index audio template identities. If you want to use it for identification, it is necessary to do a brute-force search over your dataset.

Thanks for highlighting this in terms of cardinality.  I'd not thought about it quite like that before.  Does this 1-1 cardinality run through the entire toolchain then?  e.g. Pre-processing > Training > Enrolling > Scoring > Evaluating

Or is the 1-1 aspect just part of the scoring?  I get the impression that in a 1-1 verification scenario, you would have to nominate an identity (one of many you may have) along with an unseen voice recording, and you'd get a likelihood score back?  So implementing 1-N speaker identification would involve iterating through stored identities, scoring each against the unseen sound sample, and the one with the lowest/highest score would be the most likely identity of the speaker (provided that score is also above a set acceptance threshold).

This brings to mind a couple of other, related questions:
1. Is the 1-N scenario I've envisaged above computationally expensive (for, say, 500 identities)? That is, is there a more efficient way to do 1-N?
2. Is the idea of a 'voice print' out of date?  I've not come across the concept much at all in my reading.  After considering what I'm doing from a 1-N perspective, I can now understand the 'lure' of having a voice print number to use, perhaps as a way to find the top 5 identities to score.
 
> About efficiency again, another important variable is the size of your input data and the algorithm that you will use. bob.bio.spear has a lot of GMM-based algorithms implemented, such as GMM itself, ISV, JFA and iVector (including some of the things that researchers usually do on top of them: PLDA, WCCN, whitening, LDA...). Which one will you use?

From a fair amount of reading, I've concluded that iVectors seem to suit my purposes the best (although, annoyingly, I can't recall the reasons I thought that now) - perhaps my comments on accuracy below will hint at my conclusion.  I realise there are also a host of other things you can do on top as well, which is one reason I'm interested in SPEAR - already having Python hooks means that if SPEAR itself can't do something, I'll more likely be able to find a way to do it with the vast array of Python kit.

My approach in this so far is to pick the tool that is likely to give me the most options and the best accuracy and I'll give it a try.  If it doesn't work well enough in the first instance, I'll begin tinkering with likely input problems and potentially including more complexity in the toolchain, but I'm not expecting that just yet - see below details on accuracy.
 

> About the accuracy, it is really difficult to give an opinion about this. There are so many variables to consider. Just a hint, if your background models are very uncorrelated with your target scenario, you can expect a very low accuracy.



I understand that you're hesitant to give any expectations about accuracy with the huge number of computational variables involved - let alone the human ones!  Luckily, I think my scenario is relatively conducive to accuracy(?).  I'd be using studio-recorded audio from TV talk shows, diarized by LIUM.  There would likely be between 30 seconds and 5+ minutes of speaking per speaker per show.  The less a speaker talks (i.e. a 30-second file), the less I care about speaker recognition accuracy for my purposes.  There are 5 speakers per show, and each show will have different speakers, but there are a fair few repeat appearances by speakers.  Audio quality should be quite consistent between shows, except where a TV show may have been recorded in an auditorium or hall (slight echo) for a 'special edition' show.  I may later expand this system to look at other shows, some of which have people 'dial in' on their mobile phones - this is the audio that I'm expecting especially bad accuracy on.  If I want to improve that, I'm expecting to have to do some 'tinkering' as I mentioned above, or potentially recognize the quality change and train another model for the same speaker on lower quality audio.

I am also trying to determine if there would be benefit in re-training a speaker recognition model with the growing amount of audio that is identified as likely belonging to a speaker - 'active learning'?  I realise this could be dangerous if diarization/recognition failed and you started training a single-speaker model with audio from multiple speakers!  I think this is where a good reporting mechanism and occasional manual intervention and re-training would be an appropriate countermeasure, along with correlation against other information I'm gathering.
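
To make my own thinking concrete, the kind of gate I have in mind before any new audio feeds back into a speaker's enrolment pool looks roughly like this (names and thresholds are hypothetical, just sketching the logic):

# Hypothetical confidence gate for 'active learning' style re-enrolment: only
# accept a new segment if the best identification score is high AND clearly
# separated from the runner-up, to limit the risk of polluting a speaker's model.

def should_reenroll(scores_by_identity, accept_threshold, min_margin):
    """scores_by_identity: dict of identity -> verification score for one audio segment.
    Returns the identity to re-enrol the segment against, or None to flag it for manual review."""
    if len(scores_by_identity) < 2:
        return None
    ranked = sorted(scores_by_identity.items(), key=lambda kv: kv[1], reverse=True)
    (best_id, best), (_, runner_up) = ranked[0], ranked[1]
    if best >= accept_threshold and (best - runner_up) >= min_margin:
        return best_id
    return None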

 
> Hope I have answered your questions

Thank you, yes - the more I speak to people, the clearer things become!

Tiago Freitas Pereira

Jan 11, 2017, 2:30:59 AM
to bob-...@googlegroups.com, Alain Komaty

>>> At present, I'm still trying to determine the suitability of SPEAR over other tools for my purposes.  My foremost question at the moment is: understanding that SPEAR is currently used mostly for conducting research experiments, what aspects of the toolkit may be inefficient, limiting, missing or otherwise deficient for use in more practical scenarios?  I say 'more practical scenarios' because I'm not planning on using it to grant/deny access to any production system, but to gather metadata on audio to identify who speaks and when.  So the consequences of failure aren't huge.  Basically, in this instance I'm wondering if the 'research tool' focus is a disclaimer to protect against someone relying too much on SPEAR, or whether there is some other tangible aspect/functionality it's missing that I haven't yet foreseen the need for.  Otherwise, SPEAR seems to be quite high in terms of its recognition accuracy in the given experiments, yes?  Again, even assessing this has been difficult, as it's not been easy to get clear definitions (> 1 short sentence) of what EER / HTER mean in a practical sense! :-/

>> About efficiency, you tell me.
>> As far as I understood, you are dealing with speaker identification (1-N). bob.bio.spear is a tool for speaker verification (1-1). What I'm trying to say is that bob.bio.spear does not have functions to index audio template identities. If you want to use it for identification, it is necessary to do a brute-force search over your dataset.

> Thanks for highlighting this in terms of cardinality.  I'd not thought about it quite like that before.  Does this 1-1 cardinality run through the entire toolchain then?  e.g. Pre-processing > Training > Enrolling > Scoring > Evaluating

> Or is the 1-1 aspect just part of the scoring?  I get the impression that in a 1-1 verification scenario, you would have to nominate an identity (one of many you may have) along with an unseen voice recording, and you'd get a likelihood score back?  So implementing 1-N speaker identification would involve iterating through stored identities, scoring each against the unseen sound sample, and the one with the lowest/highest score would be the most likely identity of the speaker (provided that score is also above a set acceptance threshold).


Have a look at the bob.bio.gmm and bob.bio.spear APIs. You can create your application on top of them and do anything you want.


> This brings to mind a couple of other, related questions:
> 1. Is the 1-N scenario I've envisaged above computationally expensive (for, say, 500 identities)? That is, is there a more efficient way to do 1-N?
> 2. Is the idea of a 'voice print' out of date?  I've not come across the concept much at all in my reading.  After considering what I'm doing from a 1-N perspective, I can now understand the 'lure' of having a voice print number to use, perhaps as a way to find the top 5 identities to score.

Well, I'm not a speaker recognition specialist, so I can't help you much on this topic. Maybe someone in the group knows something.
@Alain may have something to contribute :-)
What I can say for sure is that we have no code in this direction in bob.bio.spear.
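
Just as a generic illustration (again, this is not bob.bio.spear code): if you end up with one fixed-length embedding per speaker (an i-vector is exactly that kind of 'voice print'), a cheap shortlist for your 1-N case could be a plain cosine-similarity search, with only the top few candidates going through the full scoring afterwards. Everything below (names, shapes) is hypothetical:

import numpy as np

# Generic top-K shortlisting over fixed-length speaker embeddings (e.g. i-vectors).
# 'embeddings' is an (N, D) matrix of enrolled speakers, 'probe' a length-D vector.

def top_k_candidates(probe, embeddings, identities, k=5):
    """Return the k identities whose embeddings are most cosine-similar to the probe."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    prb = probe / np.linalg.norm(probe)
    sims = emb @ prb                       # cosine similarities, shape (N,)
    order = np.argsort(sims)[::-1][:k]     # indices of the k most similar speakers
    return [(identities[i], float(sims[i])) for i in order]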



 
>> About efficiency again, another important variable is the size of your input data and the algorithm that you will use. bob.bio.spear has a lot of GMM-based algorithms implemented, such as GMM itself, ISV, JFA and iVector (including some of the things that researchers usually do on top of them: PLDA, WCCN, whitening, LDA...). Which one will you use?

> From a fair amount of reading, I've concluded that iVectors seem to suit my purposes the best (although, annoyingly, I can't recall the reasons I thought that now) - perhaps my comments on accuracy below will hint at my conclusion.  I realise there are also a host of other things you can do on top as well, which is one reason I'm interested in SPEAR - already having Python hooks means that if SPEAR itself can't do something, I'll more likely be able to find a way to do it with the vast array of Python kit.

> My approach in this so far is to pick the tool that is likely to give me the most options and the best accuracy and I'll give it a try.  If it doesn't work well enough in the first instance, I'll begin tinkering with likely input problems and potentially including more complexity in the toolchain, but I'm not expecting that just yet - see below details on accuracy.
 

>> About the accuracy, it is really difficult to give an opinion about this. There are so many variables to consider. Just a hint, if your background models are very uncorrelated with your target scenario, you can expect a very low accuracy.



> I understand that you're hesitant to give any expectations about accuracy with the huge number of computational variables involved - let alone the human ones!  Luckily, I think my scenario is relatively conducive to accuracy(?).  I'd be using studio-recorded audio from TV talk shows, diarized by LIUM.  There would likely be between 30 seconds and 5+ minutes of speaking per speaker per show.  The less a speaker talks (i.e. a 30-second file), the less I care about speaker recognition accuracy for my purposes.  There are 5 speakers per show, and each show will have different speakers, but there are a fair few repeat appearances by speakers.  Audio quality should be quite consistent between shows, except where a TV show may have been recorded in an auditorium or hall (slight echo) for a 'special edition' show.  I may later expand this system to look at other shows, some of which have people 'dial in' on their mobile phones - this is the audio that I'm expecting especially bad accuracy on.  If I want to improve that, I'm expecting to have to do some 'tinkering' as I mentioned above, or potentially recognize the quality change and train another model for the same speaker on lower quality audio.

Just keep in mind that very mismatched training and test conditions can degrade the error rates.

> I am also trying to determine if there would be benefit in re-training a speaker recognition model with the growing amount of audio that is identified as likely belonging to a speaker - 'active learning'?  I realise this could be dangerous if diarization/recognition failed and you started training a single-speaker model with audio from multiple speakers!  I think this is where a good reporting mechanism and occasional manual intervention and re-training would be an appropriate countermeasure, along with correlation against other information I'm gathering.


Cheers

QA Collective

Jan 18, 2017, 7:05:01 PM
to bob-...@googlegroups.com
Hi Tiago,

Thank you for your reply.  Sorry, I've not checked this email address as regularly as I should - the perils of using a work email for personal projects (I don't know if knowing this changes people's attitudes to me, but this project is not for corporate/government use!).

As you suggested, I'll take a detailed look at the APIs next and see what I think is achievable and probably come back with some more specific questions later.  In tandem, I'm developing a training / testing corpus of about 2500 speakers, so it will be interesting to see how that goes - especially if I end up trying it on multiple speaker recognition tools.

The idea of a voice print is enticing but possibly unrealistic, as it compresses the whole concept of speaker recognition down to a number/code.  I think voice prints are more often used in text-dependent speaker identification systems.  But still, whenever I read steps like those in Section III of https://github.com/guker/spear , I wonder if some form of voice print concept could be developed using the results of enrollment (Step 7) as input.

Andrew
