How to store VoicePrints?

Denis D

Oct 19, 2014, 12:47:33 AM

to reco...@googlegroups.com

Hi Amaury!

First of all, I'd like to thank you for your effort. Recognito is a pretty impressive project.

I've just downloaded the sources and am trying to run some tests, and the first thing that isn't clear to me is how to store voiceprints.

I'm going to develop additional security functionality for our call centre. Users will be required to pronounce a password (always the same one), and the system will compare the master print (the one the user recorded during registration) with the current one. There will probably be thousands of users, and it's impossible to load and keep all the master voiceprints in memory by reading wav files, as I would run out of memory. I'm also thinking about storing only the keys and the arrays of doubles named voiceSample. Although I don't know yet how to store these arrays, it looks better than rereading the files anyway.

Amaury, could you kindly advise me on how you would store the voice prints?

Thank you!

Amaury Crickx

Oct 20, 2014, 5:40:28 PM

to reco...@googlegroups.com

Hi Denis,

Thanks for your kind words!

You're right, it would be pure nonsense to reload all the wav files whenever you need to authenticate someone. The VoicePrint class (containing id and array of doubles) is indeed meant to be stored for later use.

How you do this depends on the storage system of your choice. I've made the class serializable and it should be easy to configure some ORM framework like Hibernate to store them if you wish to.

I must admit I haven't made it easy to store them using pure JDBC (no getters/setters/public constructors) as I'd prefer users do not directly access the internals of this class and start building custom logic around it.

For the use cases I've had so far, I serialised a HashMap containing <Id, VoicePrint> to disk.
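A minimal sketch of that approach, using only standard Java serialization. `StoredPrint` and `PrintStore` are hypothetical names introduced for illustration; with Recognito you would put its own serializable VoicePrint objects in the map instead of the stand-in class.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a serializable voice print (just an array of features).
class StoredPrint implements Serializable {
    private static final long serialVersionUID = 1L;
    final double[] features;

    StoredPrint(double[] features) {
        this.features = features.clone();
    }
}

class PrintStore {
    // Writes the whole map to a single file.
    static void save(Map<String, StoredPrint> prints, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeObject(new HashMap<>(prints));
        }
    }

    // Reads the map back; the cast is unchecked because generics are erased at runtime.
    @SuppressWarnings("unchecked")
    static Map<String, StoredPrint> load(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            return (Map<String, StoredPrint>) in.readObject();
        }
    }
}
```

A single file is the simplest option. For authentication only, one file per user id would presumably work just as well, since only the claimed user's print has to be loaded.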

When I started Recognito, I didn't intend to do authentication but rather identification. The difference might seem subtle but it's not the same thing :-)

- Identification means I don't know who's talking, so I check the current sample against every VoicePrint I've encountered so far to see which one it's close to. A bit like in "CSI Manhattan" when they give a fingerprint to the system and the system tries to match it against all previously stored fingerprints. This scenario doesn't scale very well, as more processing is required with each VoicePrint added to the system.

- Authentication means I have both a claimed identity and something to prove that "the user is indeed who he says he is". In this case I don't need to compare against each and every VoicePrint ever created; I just need to compare the new VoicePrint with the reference one. This scenario scales well: each check costs a single comparison, however many users are enrolled.
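The difference in cost between the two scenarios can be sketched as follows. This is an illustration only: voice prints are represented as plain `double[]` feature vectors, the distance is plain Euclidean, and `MatchModes` is a hypothetical class, not part of Recognito.

```java
import java.util.Map;

class MatchModes {
    // Euclidean distance between two feature vectors of equal length.
    static double distance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Identification: scan every stored print and return the id of the closest one
    // (null if nothing is stored). Work grows with the number of enrolled users.
    static String identify(Map<String, double[]> prints, double[] sample) {
        String best = null;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, double[]> e : prints.entrySet()) {
            double d = distance(e.getValue(), sample);
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return best;
    }

    // Authentication: one lookup by claimed id and one comparison.
    static double distanceToClaimed(Map<String, double[]> prints, String claimedId, double[] sample) {
        double[] reference = prints.get(claimedId);
        if (reference == null) {
            throw new IllegalArgumentException("unknown id: " + claimedId);
        }
        return distance(reference, sample);
    }
}
```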

In order to be able to provide a likelihood ratio, you also need a third ("max distance") reference. This is called a Universal Background Model (UBM).

Say there is a reference VoicePrint R created during the enrolment phase and a new sample S coming along with the claimed identity. The ratio checks how far S is from R compared to the UBM:

R----S----------UBM

You then have to decide above which threshold the sample is deemed close enough to validate the identity. The higher the threshold, the more false negatives; the lower the threshold, the more false positives.
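One way to turn the R / S / UBM picture into a number is sketched below. The Euclidean distance and the normalisation `toUbm / (toRef + toUbm)` are assumptions chosen for illustration and are not claimed to be Recognito's own formula; `RatioSketch` is a hypothetical class.

```java
class RatioSketch {
    // Euclidean distance between two feature vectors of equal length.
    static double distance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Close to 1.0 when the sample sits on the reference, close to 0.0 when it sits on the UBM.
    // Returns 0.0 in the degenerate case where sample, reference and UBM all coincide.
    static double score(double[] sample, double[] reference, double[] ubm) {
        double toRef = distance(sample, reference);
        double toUbm = distance(sample, ubm);
        double total = toRef + toUbm;
        return total == 0.0 ? 0.0 : toUbm / total;
    }

    // Accepts the claimed identity when the score reaches the chosen threshold.
    static boolean accept(double[] sample, double[] reference, double[] ubm, double threshold) {
        return score(sample, reference, ubm) >= threshold;
    }
}
```

With this convention a higher score means "closer to the reference", so raising the threshold rejects more genuine users and lowering it accepts more impostors, as described above.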

The UBM is an average of a large portion of VoicePrints (like everyone talking at the same time). The UBM is language dependent.
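Taking "average" literally, a UBM could be sketched as the element-wise mean of many feature vectors. This assumes every print is a `double[]` of the same length; `UbmSketch` is a hypothetical helper, not a Recognito class.

```java
import java.util.List;

class UbmSketch {
    // Element-wise mean of the given feature vectors.
    static double[] average(List<double[]> prints) {
        if (prints.isEmpty()) {
            throw new IllegalArgumentException("at least one print is required");
        }
        int length = prints.get(0).length;
        double[] mean = new double[length];
        for (double[] print : prints) {
            if (print.length != length) {
                throw new IllegalArgumentException("all prints must have the same length");
            }
            for (int i = 0; i < length; i++) {
                mean[i] += print[i];
            }
        }
        for (int i = 0; i < length; i++) {
            mean[i] /= prints.size();
        }
        return mean;
    }
}
```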

I can't provide a relevant UBM for just any language, so it's up to you to build it. (I'd hope that, working within a call center, this is not a major issue for you.)

Now, with authentication comes another difficulty: fraud detection (or "is this a live recording of the user?").
This might be quite tricky to assert, and that's the reason why I first didn't intend to go that route. This said, I realize this is probably the predominant use case...
There are numerous things to check. E.g. the recording was never sent before (nobody can say the same thing twice in exactly the same way), and the recording was not edited (no cuts in the background noise, no time stretching or pitch shifting).
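The crudest form of the "never sent before" check can be sketched with a digest of the raw audio bytes. This only catches byte-identical resubmissions; a replay that was re-recorded or re-encoded would produce different bytes and needs comparison at the signal level, which this sketch does not attempt. `ReplayGuard` is a hypothetical class, and the in-memory set stands in for whatever persistent store you would actually use.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.HashSet;
import java.util.Set;

class ReplayGuard {
    // Digests of every recording seen so far (in memory only in this sketch).
    private final Set<String> seen = new HashSet<>();

    // Returns true the first time these exact bytes are submitted,
    // false when a byte-identical recording was submitted before.
    boolean firstTimeSeen(byte[] audio) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(audio);
            return seen.add(Base64.getEncoder().encodeToString(digest));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation.
            throw new IllegalStateException(e);
        }
    }
}
```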

In the case of a passphrase (password is way too short for a VoicePrint), you'd also need to check what was said and combine speech recognition with speaker recognition.

Doing so within a single library allows for more checks: tone (melody of the phrase), speed (length of the syllables).

Unfortunately, afaict, there is no open source lib offering passphrase authentication out-of-the-box. But, as you probably know, there's a plethora of vendors providing such a service and they all integrate with major call center platforms (Avaya, ...). I'm afraid that, at this very moment, it's the only way to guarantee a high level of security for your users using a passphrase.

Continuously checking the customer's voice while they talk to the agent is another possibility (text-independent verification). The agent can assert that the person is real, as it's extremely difficult to counterfeit a full conversation in real time, and the system gets more time to decide whether the user's voice matches the previously extracted reference (20 seconds of speech should provide very accurate results).

Whatever solution you choose, you'll always need an alternative way of authenticating the users, as there will be errors. The biggest enemy of speaker recognition is background noise.

HTH

Cheers

Amaury

vikram sareen

Aug 28, 2016, 4:47:19 AM

to Recognito, denis.w...@gmail.com

hi dennis,

Did you crack this problem? We also have a customer asking for the same. Google speech-to-text is good for getting the passphrase, but voice recognition is still important. When I match voice prints against a positive or a negative sample, it always passes... :(

Haritha Burra

Jun 26, 2018, 6:53:30 AM

to Recognito

Hi Vikram and Dennis,

Were you able to do the authentication? I too have the same goal and the same problem. The voice prints match even though they are different.