Hi Andrew,
Thanks a lot for these kind words. Much appreciated :-)
I'm afraid Recognito won't be the end of your quest for the right lib...
Indeed, stuff happened and I moved on... the main reason for not continuing is that I don't have the math background required. In all the scientific papers I read, I could understand the plain English parts but remained clueless about how to solve some of the formulas... I had hoped to meet someone with the required mathematical understanding but it didn't happen...
Recognito uses outdated technology that works fine when the quality of the recordings is very good and the tone of the speaker is identical between recordings.
Now maybe I can help by sharing some ideas on the options you have
Thinking out loud... bear with me ^^
First, why definitely not Recognito
Using MFCC or LPCC features should yield much better results than Recognito's LPC, according to the papers I read
Also, the statistical models should improve the accuracy
One of the issues with Alize / Spear / speaker-recognition is that they provide a plethora of algorithms to extract features and create models.
It means you have to plug it all together and configure it using the combination that works best for your purpose.
The purpose of these apps is to let other researchers improve on the current state of the art.
How to compare these libs?
There are actually a few well-known and expensive test databases called "corpora" (and a very few free ones like VoxForge)
Usually the Equal Error Rate (EER) value is used to compare results between libs
EER is the error rate at the point where there are as many false positives as false negatives, i.e. where the 2 curves meet in the graph. (Google it, you'll see what I mean :-) )
Now an absolute EER value doesn't mean much. You have to be familiar with the test set in order to get an idea of what the EER value really means.
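To make the EER idea a bit more concrete, here is a minimal sketch in plain Python. It is not taken from any of the libs mentioned above: the function name, the "accept when score >= threshold" rule and the scores are all made up for illustration. It simply sweeps a threshold over the observed scores and keeps the one where the false acceptance rate and the false rejection rate are closest.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Assumption: a trial is accepted when its score >= threshold.
    Returns (eer, threshold) for the threshold where the false acceptance
    rate (FAR) and the false rejection rate (FRR) are closest.
    """
    if not genuine_scores or not impostor_scores:
        raise ValueError("need at least one genuine and one impostor score")

    best = None  # (gap between FAR and FRR, EER estimate, threshold)
    for threshold in sorted(set(genuine_scores) | set(impostor_scores)):
        # False rejection: the right speaker scored below the threshold.
        frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
        # False acceptance: an impostor scored at or above the threshold.
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, threshold)
    return best[1], best[2]


# Hypothetical similarity scores, for illustration only.
genuine = [0.9, 0.8, 0.7, 0.6, 0.4]      # same-speaker trials
impostor = [0.5, 0.3, 0.2, 0.1, 0.05]    # different-speaker trials

eer, threshold = equal_error_rate(genuine, impostor)
print(f"EER ~ {eer:.2f} at threshold {threshold}")
```

With these toy scores the two error rates balance at 20%. On a real corpus you would feed in the scores produced by whichever lib you are evaluating, and the result only means something relative to that particular test set.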
https://github.com/ppwwyyxx/speaker-recognition conducted tests based on a population of 100 speakers. That many speakers is already difficult to obtain but, unfortunately, not enough...
Quality of the recorded audio is everything. Removing noise without hurting the voice signal is extremely difficult. The most difficult noises are transient (short) ones. Constant background noise is easier to reduce.
If I were to add speaker recognition to an existing project, I suppose I'd first try to gather test data from the user base in order to be as close as possible to the actual audio quality I can expect.
I'd also download all free test dbs I can and listen to them to get an idea of how usable they are in my context.
After data collection, I'd compare all 3 implementations mentioned above using various combinations (there are papers published for these algorithms explaining how they were used together and the results they yield)
I'm pretty sure that taking the time to contact the mailing list explaining your goal and where you're stuck should provide the help you'll need to set this up.
This is a long and tedious process...
Once I know which combination would currently work best for me, I'd consider
- forking their project and re-architecting it my way. Kind of the biggest refactoring of my life ^^
- writing a wrapper as you suggested (which I also considered before writing Recognito)
Depending on your requirements, you might be forced to opt for a rewrite. Response time is the first one that comes to mind.
If you have some understanding of digital audio and reading quite a few scientific papers doesn't frighten you (they contain explanations on how to best tune and use the algos), I believe this is quite doable.
Please keep me informed on how you're doing or provide a link where I can check status :-)
Cheers
Amaury