Current state and intended future of Recognito effort to assess suitability for use in my project


QA Collective

Jan 9, 2017, 1:52:33 AM
to Recognito
Hi,

First, I'd like to start by saying that I applaud the intent and implementation of your project!  While I have a Computer Science degree (10 years old), I'm certainly no academic in this area, and of all the types of AI/tech I've been looking at lately (face & pose recognition, speech recognition, etc.) I've found speaker recognition to be the least polished for use by non-academics.  I strongly identify with the comments made here: https://groups.google.com/d/msg/recognito/xNo6Av2jHxI/ix76KBIrIwAJ !  Quietly, I wonder how many possible benefits have been forfeited because more people aren't involved in the various speaker recognition projects, given how high their requirements for entry-level usage and understanding are.

So I'm interested in incorporating speaker recognition into a larger project I'm doing.  Recognito seems by far the easiest speaker recognition system to use, along with another interesting Chinese effort here: https://github.com/ppwwyyxx/speaker-recognition (I'm also considering Alize and BOB Spear - but aaaargh!).  I particularly love the idea of your voice print - it makes it so easy to gather a database of speakers together and see if new speaker samples match - which is exactly what I want to do.  However, I note that Recognito hasn't seen a lot of development in recent years.  Is this because the author has moved on / become distracted, as life tends to do to us?  Or perhaps the project has reached a level of maturity for the target audience that means there is not a lot to change?

I ask because this is a factor contributing to my decision as to whether I should use Recognito or not.  It seems very well suited to my needs programmatically, but I'm unsure of its accuracy and whether it will continue as an active project.  Sorry to be such a 'client user' - I'd usually treat an open source project as something I could become an active contributor to, but with the recent reading I've attempted (GMM, UBM, i-vectors, etc.!) I feel so out of my depth that I can't see that ever being likely!

Finally, I wonder if you know of (or have considered yourself :P) any efforts to 'figure out' a well-established speaker recognition toolkit (like ALIZE) and simply write an API/script to expose its functionality without exposing copious amounts of fiddly detail (probably with pre-configured settings for common circumstances like audio quality/length, number of identities, etc.)?  If I am 'forced' to figure out one of the more complex speaker recognition tools in all its detail and write an API for myself, I'd certainly be taking tips from the simplicity with which Recognito exposes its functionality, and would share my work with the community!
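To make that concrete, the kind of facade I'm imagining might look something like the sketch below. It's purely hypothetical: every class, method and preset name is mine, and nothing here maps to an existing library's API.

```python
# Hypothetical facade for a speaker recognition toolkit (e.g. ALIZE),
# hiding the fiddly detail behind pre-configured presets.
class SpeakerRecognizer:
    def __init__(self, preset="studio-quality"):
        # A preset would bundle feature/model settings tuned for a common
        # scenario (audio quality, utterance length, population size).
        self.preset = preset
        self.voiceprints = {}

    def enroll(self, speaker_id, wav_path):
        """Create or update a voiceprint from a recording."""
        raise NotImplementedError  # would delegate to the wrapped toolkit

    def identify(self, wav_path, top_n=5):
        """Return the top-n closest enrolled speakers for a sample."""
        raise NotImplementedError  # would delegate to the wrapped toolkit
```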

Thank you in advance,
Andrew

Amaury Crickx

Jan 11, 2017, 6:08:55 PM
to Recognito
Hi Andrew,

Thanks a lot for these kind words. Much appreciated :-)

I'm afraid Recognito won't be the end of your quest for the right lib...

Indeed, stuff happened and I moved on... The main reason for not continuing is that I don't have the math background required. Of all the scientific papers I read, I would understand the plain-English parts but would remain clueless about how to solve some of the provided formulas... I had hoped to meet someone with the required mathematical understanding, but it didn't happen...
Recognito uses outdated technology that works fine when the quality of the recordings is very good and the tone of the speaker is identical between recordings.

Now maybe I can help by sharing some ideas on the options you have
Thinking out loud... bear with me ^^

First, why definitely not Recognito
Using MFCC or LPCC should yield much better results than Recognito's LPC, according to the papers I read.
Also, the statistical models should improve the accuracy.
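For illustration only, here's a minimal Python sketch of extracting MFCCs with the librosa library. The 16 kHz rate and 13 coefficients are common conventions I've picked for the example, not anything Recognito does.

```python
# Minimal MFCC extraction sketch using librosa (not part of Recognito).
import librosa

def mfcc_features(path):
    samples, sample_rate = librosa.load(path, sr=16000)  # resample to 16 kHz
    # 13 coefficients per frame is a widely used convention
    mfcc = librosa.feature.mfcc(y=samples, sr=sample_rate, n_mfcc=13)
    # Averaging over time gives a crude fixed-size vector; the papers
    # instead fit statistical models (GMM/UBM, i-vectors) on the frames.
    return mfcc.mean(axis=1)
```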

One of the issues with Alize / Spear / speaker-recognition is that they provide a plethora of algorithms to extract features and create models.
It means you have to plug it all together and configure it using the combination that works best for your purpose.

The purpose of these apps is to let other researchers improve on the current state of the art.

How to compare these libs?
There are actually a few well-known and expensive test databases, called 'corpora' (and a very few free ones, like VoxForge).
Usually the Equal Error Rate (EER) value is used to compare results between libs.
EER is the value at which there are as many false positives as false negatives - the point where the two curves meet on the graph (Google it and you'll see what I mean :-) ).

Now an absolute EER value doesn't mean much. You have to be familiar with the test set in order to get an idea of what the EER value really means.
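To make the definition concrete, here is a small Python sketch of computing the EER from raw trial scores. The function and variable names are mine; none of the libs mentioned expose exactly this.

```python
# Sweep a decision threshold over all observed scores and return the
# error rate at the point where false rejections equal false acceptances.
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    genuine = np.asarray(genuine_scores)    # same-speaker trials
    impostor = np.asarray(impostor_scores)  # different-speaker trials
    best_gap, eer = float("inf"), 1.0
    for t in np.sort(np.concatenate([genuine, impostor])):
        frr = np.mean(genuine < t)    # false rejection rate (misses)
        far = np.mean(impostor >= t)  # false acceptance rate (false alarms)
        if abs(frr - far) < best_gap:
            best_gap, eer = abs(frr - far), (frr + far) / 2
    return eer

# Toy usage with made-up scores:
# equal_error_rate([0.9, 0.8, 0.7], [0.4, 0.6, 0.75])
```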

https://github.com/ppwwyyxx/speaker-recognition conducted tests based on a population of 100 speakers. This is already difficult to obtain but unfortunately not enough...

Quality of the recorded audio is everything. Removing noise without hurting the voice signal is extremely difficult. The most difficult noises are transient (short) ones; constant background noise is easier to reduce.
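For what it's worth, a classic (and crude) way to lower constant background noise is spectral subtraction. Below is a rough Python sketch; the assumption that the first half second of the recording contains noise only, and all the names, are mine.

```python
# Rough spectral subtraction sketch: estimate the average noise spectrum
# from a noise-only segment and subtract it from every frame. Helps with
# constant background noise, does nothing useful against transients.
import numpy as np
from scipy.signal import stft, istft

def subtract_constant_noise(samples, sample_rate, noise_seconds=0.5):
    f, t, spec = stft(samples, fs=sample_rate, nperseg=512)  # hop = 256
    magnitude, phase = np.abs(spec), np.angle(spec)
    noise_frames = max(1, int(noise_seconds * sample_rate / 256))
    noise_profile = magnitude[:, :noise_frames].mean(axis=1, keepdims=True)
    cleaned = np.maximum(magnitude - noise_profile, 0.0)
    _, result = istft(cleaned * np.exp(1j * phase), fs=sample_rate, nperseg=512)
    return result
```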
If I were to add speaker recognition to an existing project, I suppose I'd first try to gather test data from the user base in order to be as close as possible to the actual audio quality I can expect.
I'd also download all the free test databases I can and listen to them to get an idea of how usable they are in my context.

After data collection, I'd compare all 3 implementations mentioned above using various combinations (there are papers published for these algorithms explaining how they were used together and the results they yield)
I'm pretty sure that taking the time to contact their mailing lists, explaining your goal and where you're stuck, will get you the help you need to set this up.

This is a long and tedious process...

Once I knew which combination currently worked best for me, I'd consider:
  • forking their project and re-architecting it my way. Kind of the biggest refactoring of my life ^^
  • writing a wrapper as you suggested (which I also considered before writing Recognito)
Depending on your requirements, you might be forced to opt for a rewrite. Response time is the first thing that comes to mind.

If you have some understanding of digital audio and reading quite a few scientific papers doesn't frighten you (they contain explanations on how to best tune and use the algos), I believe this is quite doable.

Please keep me informed on how you're doing or provide a link where I can check status :-)

Cheers

Amaury

QA Collective

Jan 18, 2017, 6:08:36 AM
to Amaury Crickx, Recognito
Hi Amaury,

Thanks so much for such a detailed and well-considered response.  Having read a large number of scientific papers over the last year, I can most certainly empathise.  There was a while, perhaps a year, when I could understand the notation and general intent of simple equations, but without regular use I've forgotten it, and I figure that even if I tried to re-learn it, much of this stuff would be over my head anyway because I lack other assumed knowledge of this rather specialised field.

Still, I see no reason that academics or other teams cannot make this technology more accessible to people who don't intimately understand the details, in a similar way to what Kaldi or CMU Sphinx do for speech recognition (or Tesseract for OCR, or OpenCV for facial recognition, etc.).

Thank you for eloquently directing me away from Recognito for my specific purposes.  As you've already articulated, what I face now is the prospect of having to chain together a lot of algorithms that I have only a blunt understanding of, in the hope that I won't run into too many problems of my own making, big problems that I don't understand, or poor results after an investment of effort that may or may not make the outcome worthwhile.

At the moment, I'm favouring Spear because it seems the most actively developed, modern and flexible in the way algorithms are chained.  It's also native to Python which is my language of choice for this wider project.

Thanks for all that detail on EER.  That's probably the most I've been able to find on it so far!  I did Google it and read further on this page:

Although I'm not too sure what the x axis means - the similarity score?  I may well have to read that entire book on biometrics for a better general understanding.

Regarding your comments on the Chinese real-time speaker recognition system conducting tests based on a population of 100 speakers, how many speakers would you consider adequate for testing?  I also recall that their paper mentioned decreasing accuracy as the number of recognized speakers in their database increased!  That didn't seem adequate for my purposes (although their command-line interface is enticingly simple, similar to your API) unless I could use multiple databases and hash voiceprints to know which database(s) to check.

Your comments on audio quality I find interesting.  I always thought that TV studio recorded audio would be quite good; I didn't realise that most 'damaging' noises are transient.  Given many shows are recorded with participants' mics close to each other and in front of a studio audience, I'm now not so sure I should expect results as accurate as I'd hoped!  Luckily, volume (of data) is no problem - I think!  I've got at least 250 hours, with 10+ mins per speaker.

At the moment, I'm preparing a large corpus for this (and other speech recognition) work: 250 hours of audio, about 2000 speakers.  This is just a small sample, and this is why I've been so interested in the voice print idea - so I can quickly run through an ever-increasing database of speakers and pick out the top n candidates for detailed voice ID processing.  Otherwise, any voice ID system I use would have to cater for a huge number of speakers and be quick.  I'm not even sure if that's a possibility.

'Long and tedious' is exactly how I'd describe this!  Not to harp on, but I'm constantly surprised how 'purely academic' this field is and how there doesn't even exist a tool that comes pre-configured for common use with configurable parameters ... like every other piece of software on earth.  Sure, I might have to dive deeper if my use case is out of the ordinary or highly specific.  Perhaps there is a natural aversion to doing this for fear of legal pursuit?  Who knows...

I personally don't plan to re-write anyone's code.  If I do find a way to bring together a solution (preferably comprised of only one toolset), I'll go out of my way to make it configurable, documented and re-usable by others before publishing or contributing it.  But if this drains me of too much time or yields quite poor results, I'll simply discard it, as the cost-benefit trade-off won't be worth it.

I'll keep you posted as I progress over the next weeks/months.  In my travels, I may still refer to Recognito ... as a good example of how simple a speaker recognition API could be!

Thanks again for all your time on this,
Andrew



Michael Fair

Feb 15, 2017, 2:56:21 AM
to Recognito
Hi guys,

Amaury, what about adding speaker recognition to a VoIP project like Mumble?

You mentioned 'cleaning up the audio recording', and that's exactly what a project like Mumble does.
They filter out everything that isn't the user's voice because they're optimizing their packets for transmitting voice conversations.

They have an audio processing pipeline they use to clean up and isolate the user's voice.
Do you think that might be effective enough at 'limiting the field' of voice recordings, both for capturing voiceprints and for playing the audio source back through for detection?

Thanks,
Mike

Amaury Crickx

Feb 15, 2017, 7:47:05 PM
to Recognito
Hello Mike

I'm not sure how speaker recognition would be beneficial to a VoIP project like Mumble
Could you give an example use case to clarify my understanding of the intent?

Would the audio enhancements help with the quality of the voice prints?
The Opus codec seems to have a very wide range of precision.
I guess audio is much better at a LAN party than over the internet... just guessing.

One thing to note: gaming is an emotional rollercoaster.
So I'd expect the voice samples to have a wide spectrum of different voices (shouting, laughing, mumbling ^^, ...).
The speaker recognition accuracy would be impacted by this... again, just guessing, as it depends on the ratios as well: if it's mainly one voice type, the averaged voiceprint shouldn't be much affected.

Cheers

Amaury

Amaury Crickx

Feb 16, 2017, 2:35:13 AM
to Recognito, amaury...@gmail.com
> At the moment, I'm favouring Spear because it seems the most actively developed, modern and flexible in the way algorithms are chained.  It's also native to Python which is my language of choice for this wider project.
+1

Although I should mention that Jean-Philippe Bonastre (lead on Alize) commented on this forum that "We (a set of french academic labs) are planning to relaunch very soon ALIZE with new stuff, including Python top programs"
So I'd consider contacting him as well (either by replying here or through LinkedIn).
 

> Thanks for all that detail on EER.  That's probably the most I've been able to find on it so far!  I did Google it and read further on this page:
>
> Although I'm not too sure what the x axis means - the similarity score?  I may well have to read that entire book on biometrics for a better general understanding.
similarity score -> the calculated "distance"

False positives and false negatives make more sense in the context of speaker verification (authentication), where there is a claimed identity.
Speaker verification vs identification: http://www.scholarpedia.org/article/Speaker_recognition
 

> Regarding your comments on the Chinese real-time speaker recognition system conducting tests based on a population of 100 speakers, how many speakers would you consider adequate for testing?
Obviously, the larger the better. When statistical models are included, you need to be able to identify the outliers.
Gut feeling: fewer than 1000 speakers is not a serious test set.


> I also recall that their paper mentioned decreasing accuracy as the number of recognized speakers in their database increased!
I'd expect most (all?) libraries attempting speaker identification to face issues with a very large corpus.
The more speakers, the higher the likelihood of having speakers with very similar voices, and the false positive error rate will rise due to the variability of human voices.

Also note that a very large corpus (millions) will make it more difficult to get a response in real time, as all voiceprints need to be compared individually with the given sample.
 
> That didn't seem adequate for my purposes (although their command-line interface is enticingly simple, similar to your API) unless I could use multiple databases and hash voiceprints to know which database(s) to check.
I don't think it's possible to create a hash function that would always direct you to the correct bucket
Voiceprints are sometimes grouped into so-called 'cohorts', which are groups of similar voices.
By grouping similar voices together, we can create several Universal Background Models based on these subgroups and improve accuracy (see the search sketch further down).

In case you have no clue what I'm talking about:
The problem with the 'distance' calculation is that there is no maximum value, so you can't really tell how close you are to the verified voiceprint.
You can sort on this value, but it doesn't allow you to derive any confidence level.
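One common workaround in the literature (not something Recognito implements) is to normalise the raw distance against a cohort of impostor distances, Z-norm style, to get a relative score that is at least comparable across queries. A minimal sketch, with my own naming:

```python
import numpy as np

def relative_score(raw_distance, cohort_distances):
    """Express a raw distance in standard deviations below the mean
    distance to a cohort of other speakers; higher means 'closer than
    a typical impostor'."""
    cohort = np.asarray(cohort_distances)
    return (cohort.mean() - raw_distance) / cohort.std()
```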
 

> Your comments on audio quality I find interesting.  I always thought that TV studio recorded audio would be quite good; I didn't realise that most 'damaging' noises are transient.  Given many shows are recorded with participants' mics close to each other and in front of a studio audience, I'm now not so sure I should expect results as accurate as I'd hoped!  Luckily, volume (of data) is no problem - I think!  I've got at least 250 hours, with 10+ mins per speaker.
That's impressive ^^
 

> At the moment, I'm preparing a large corpus for this (and other speech recognition) work: 250 hours of audio, about 2000 speakers.  This is just a small sample, and this is why I've been so interested in the voice print idea - so I can quickly run through an ever-increasing database of speakers and pick out the top n candidates for detailed voice ID processing.  Otherwise, any voice ID system I use would have to cater for a huge number of speakers and be quick.  I'm not even sure if that's a possibility.
By definition, a system that has to go over its entire database to check each individual voiceprint will take increasingly more time.
Creating groups of similar voices would help, I guess: first determine which cohort UBM is the closest, then only search within that one (or the closest 2, 3, ... depending on the returned confidence level? Needs testing :-) ).
I have no clue how many buckets should/can be created before it starts negatively impacting accuracy. I would guess there is a limit.
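To sketch the two-stage idea in Python (all names are mine; this comes from none of the toolkits discussed): cluster the voiceprint vectors into cohorts once with k-means, then at query time rank the cohort centroids and only scan the nearest one(s).

```python
# Toy two-stage lookup: rank k-means cohorts first, linear scan second.
import numpy as np
from sklearn.cluster import KMeans

class CohortIndex:
    def __init__(self, voiceprints, speaker_ids, n_cohorts=8):
        self.voiceprints = np.asarray(voiceprints)
        self.speaker_ids = np.asarray(speaker_ids)
        self.kmeans = KMeans(n_clusters=n_cohorts, n_init=10)
        self.labels = self.kmeans.fit_predict(self.voiceprints)

    def top_candidates(self, query, n=5, cohorts_to_scan=2):
        # Stage 1: rank cohort centroids by distance to the query vector
        centroid_dist = np.linalg.norm(self.kmeans.cluster_centers_ - query, axis=1)
        nearest = np.argsort(centroid_dist)[:cohorts_to_scan]
        # Stage 2: linear scan only inside the selected cohorts
        mask = np.isin(self.labels, nearest)
        dists = np.linalg.norm(self.voiceprints[mask] - query, axis=1)
        order = np.argsort(dists)[:n]
        return list(zip(self.speaker_ids[mask][order], dists[order]))
```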
 

> 'Long and tedious' is exactly how I'd describe this!  Not to harp on, but I'm constantly surprised how 'purely academic' this field is and how there doesn't even exist a tool that comes pre-configured for common use with configurable parameters ... like every other piece of software on earth.  Sure, I might have to dive deeper if my use case is out of the ordinary or highly specific.  Perhaps there is a natural aversion to doing this for fear of legal pursuit?  Who knows...
It's only a gut feeling, but I'm under the impression that text-independent speaker recognition based on the voice print alone is very difficult to achieve because of:
- the variability of the human voice (we all have a wide spectrum of "different" voices depending on mood)
- the difficulty of capturing audio in a consistent/clean way without studio equipment

 

> I personally don't plan to re-write anyone's code.  If I do find a way to bring together a solution (preferably comprised of only one toolset), I'll go out of my way to make it configurable, documented and re-usable by others before publishing or contributing it.  But if this drains me of too much time or yields quite poor results, I'll simply discard it, as the cost-benefit trade-off won't be worth it.
>
> I'll keep you posted as I progress over the next weeks/months.  In my travels, I may still refer to Recognito ... as a good example of how simple a speaker recognition API could be!
>
> Thanks again for all your time on this,
> Andrew


Looking forward to your findings!

HTH

Amaury

 
