Hi Denis,
Thanks for your kind words!
You're right, it would be pure nonsense to reload all the wav files whenever you need to authenticate someone. The VoicePrint class (containing id and array of doubles) is indeed meant to be stored for later use.
How you do this depends on the storage system of your choice. I've made the class serializable and it should be easy to configure some ORM framework like Hibernate to store them if you wish to.
I must admit I haven't made it easy to store them using pure JDBC (no getters/setters/public constructors) as I'd prefer users do not directly access the internals of this class and start building custom logic around it.
For the use cases I've had so far, I serialised a HashMap containing <Id, VoicePrint> to disk.
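As a minimal sketch of that last approach: the helper below writes a HashMap to a file with standard Java serialization and reads it back. `PrintStore`, `save` and `load` are hypothetical names, not part of Recognito, and the generic value type stands in for VoicePrint (any serializable value works the same way).

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.HashMap;

// Hypothetical helper, not part of Recognito: persists a HashMap of
// serializable values (e.g. VoicePrint instances) keyed by user id.
final class PrintStore {

    private PrintStore() {}

    // Writes the whole map to a file using standard Java serialization.
    static <K extends Serializable, V extends Serializable> void save(HashMap<K, V> prints, File file) {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(prints);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the map back; the caller must ask for the same types that were saved.
    @SuppressWarnings("unchecked")
    static <K extends Serializable, V extends Serializable> HashMap<K, V> load(File file) {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (HashMap<K, V>) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Class of a stored object is missing", e);
        }
    }
}
```

Calling `PrintStore.save(prints, file)` once after enrolment and `PrintStore.load(file)` at startup avoids reprocessing the wav files. Only load files you wrote yourself, as Java deserialization of untrusted data is unsafe.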
When I started Recognito, I didn't intend to do authentication but rather identification. The difference might seem subtle but it's not the same thing :-)
- Identification means I don't know who's talking and I'll check against every VoicePrint I've encountered so far to see if it's close to the current one. A bit like in "CSI Manhattan" when they give the fingerprint to the system and the system tries to match it against all previously stored fingerprints. This scenario doesn't scale very well as more processing is required with each VoicePrint added to the system.
- Authentication means I have both a claimed identity and something to prove that "the user is indeed who he says he is". In this case I don't need to compare against each and every VoicePrint ever created, I just need to compare the new VoicePrint with the reference one. This scenario scales well, since the cost of a check doesn't depend on how many VoicePrints are stored.
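The difference in cost between the two scenarios can be sketched as follows. This is illustrative only: plain `double[]` feature vectors and Euclidean distance are assumptions standing in for VoicePrint and for whatever distance the library really uses, and `MatchSketch` is a hypothetical class name.

```java
import java.util.Map;

// Illustrative only: double[] vectors and Euclidean distance are stand-ins,
// not Recognito's actual VoicePrint or distance measure.
final class MatchSketch {

    private MatchSketch() {}

    static double distance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Feature vectors differ in length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Identification: the speaker is unknown, so every stored print is compared.
    // The work grows with the size of the store.
    static String identify(Map<String, double[]> store, double[] sample) {
        String bestId = null;
        double best = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, double[]> e : store.entrySet()) {
            double d = distance(e.getValue(), sample);
            if (d < best) {
                best = d;
                bestId = e.getKey();
            }
        }
        return bestId; // null when the store is empty
    }

    // Authentication: the claimed id selects one reference, so a single
    // comparison is enough however many prints are stored.
    static double distanceToClaimed(Map<String, double[]> store, String claimedId, double[] sample) {
        double[] reference = store.get(claimedId);
        if (reference == null) {
            throw new IllegalArgumentException("Unknown id: " + claimedId);
        }
        return distance(reference, sample);
    }
}
```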
In order to be able to provide a likelihood ratio, you also need a third ("max distance") reference. This is called a Universal Background Model.
Say there is a Reference VoicePrint "R" created during the enrolment phase and a new sample "S" coming along with the claimed identity. The ratio checks how far S is from R compared to how far it is from the UBM:
R----S----------UBM
You then have to decide above which threshold the sample is deemed close enough to validate the identity. The higher the threshold, the more false negatives; the lower the threshold, the more false positives.
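One way to picture the ratio and the threshold is the sketch below. The formula (distance to the UBM divided by the sum of both distances) is an assumption chosen for illustration, not necessarily what Recognito computes, and `RatioSketch`, `score` and `accept` are hypothetical names. Feature vectors are again plain `double[]` with Euclidean distance.

```java
// Illustrative scoring only: the formula is an assumption, not Recognito's.
final class RatioSketch {

    private RatioSketch() {}

    static double distance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Feature vectors differ in length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Returns a value in [0, 1]: 1 when the sample sits on the reference,
    // 0 when it sits on the UBM (or when all three vectors coincide).
    static double score(double[] reference, double[] sample, double[] ubm) {
        double toRef = distance(reference, sample);
        double toUbm = distance(ubm, sample);
        double total = toRef + toUbm;
        return total == 0.0 ? 0.0 : toUbm / total;
    }

    // Raising the threshold rejects more genuine users (false negatives);
    // lowering it accepts more impostors (false positives).
    static boolean accept(double score, double threshold) {
        return score >= threshold;
    }
}
```

The threshold itself has to be tuned on real recordings; no value can be suggested in the abstract.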
The UBM is an average of a large portion of VoicePrints (like everyone talking at the same time). The UBM is language dependent.
I can't provide a relevant UBM for just any language, so it's up to you to build it. (I'd hope that, working within a call center, this is not a major issue for you.)
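Read literally, "an average of VoicePrints" could be an element-wise mean of the feature vectors. The sketch below assumes exactly that, with plain `double[]` vectors of equal length. `UbmSketch` is a hypothetical name, and the library's own way of building a UBM may differ.

```java
import java.util.List;

// Assumption: a UBM is approximated by the element-wise mean of many
// feature vectors. This is an illustration, not Recognito's code.
final class UbmSketch {

    private UbmSketch() {}

    static double[] average(List<double[]> prints) {
        if (prints.isEmpty()) {
            throw new IllegalArgumentException("At least one print is required");
        }
        int n = prints.get(0).length;
        double[] mean = new double[n];
        for (double[] p : prints) {
            if (p.length != n) {
                throw new IllegalArgumentException("Feature vectors differ in length");
            }
            for (int i = 0; i < n; i++) {
                mean[i] += p[i];
            }
        }
        for (int i = 0; i < n; i++) {
            mean[i] /= prints.size();
        }
        return mean;
    }
}
```

Since the UBM is language dependent, the input should come from many different speakers of the language your callers use.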
Now, with authentication comes another difficulty: fraud detection (or "is this a live recording of the user")
This might be quite tricky to assert and that's the reason why I first didn't intend to go that route. This said, I realize this is probably the predominant use case...
Things to check are numerous, e.g. that the recording was never sent before (nobody can say the same thing twice in exactly the same way) and that the recording was not edited (no cuts in the background noise, no time stretching or pitch shifting).
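The simplest of these checks, "was this recording sent before", can be sketched by remembering a digest of every submission. `ReplayGuard` is a hypothetical class, and the check is deliberately naive: it only catches byte-identical resubmissions, so a re-encoded or edited copy would still pass and would need the other checks listed above.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.HashSet;
import java.util.Set;

// Naive replay check (illustrative): remembers a SHA-256 digest of each
// submitted recording and flags exact repeats only.
final class ReplayGuard {

    private final Set<String> seenDigests = new HashSet<>();

    // Returns true the first time these exact bytes are submitted,
    // false on any byte-identical repeat.
    boolean firstTimeSeen(byte[] audio) {
        return seenDigests.add(digest(audio));
    }

    private static String digest(byte[] data) {
        try {
            byte[] hash = MessageDigest.getInstance("SHA-256").digest(data);
            return Base64.getEncoder().encodeToString(hash);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 support is mandatory on every Java platform.
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```

In a real deployment the set of digests would have to be persisted and shared between servers; the in-memory set here is only for illustration.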
In the case of a passphrase (password is way too short for a VoicePrint), you'd also need to check what was said and combine speech recognition with speaker recognition.
Doing so within a single library allows for more checks: tone (melody of the phrase), speed (length of the syllables)
Unfortunately, afaict, there is no open source lib offering passphrase authentication out-of-the-box. But, as you probably know, there's a plethora of vendors providing such a service and they all integrate with major call center platforms (Avaya, ...). I'm afraid that, at this very moment, it's the only way to guarantee a high level of security for your users using a passphrase.
Continuously checking the customer's voice while they talk to the agent is another possibility (text independent verification). The agent can assert that the person is real, as it's extremely difficult to counterfeit a full conversation in real time, and the system gets more time to decide if the user's voice matches the previously extracted reference (20 seconds of speech should provide very accurate results).
Whichever solution you choose, you'll always need an alternate way of authenticating the users as there will be errors, the biggest enemy of speaker recognition being background noise.
HTH
Cheers
Amaury