* Alexey Panteleev
>
> My WeightedLevenshtein was simply increasing the l-distance for short strings:
> [...]
Ah, I see. You don't need the full weighted Levenshtein for that.
> I want my comparator for last names to be pretty strict but still not ExactComparator.
Note that in Duke 0.6 the probability calculation has changed, so all comparators (other than exact) are stricter now.
> For example my current comparator computes 0.5 for these two names:
> Decasper vs. Welanber whereas you can see they are completely different names.
>
> I encountered many examples like that recently.
>
> Decker vs. Tucker
> Dodson vs. Wilson
> Galligan vs. Saltzman
Weighted Levenshtein can help with this, by considering early edits and consonant edits to be more important.
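To make the idea concrete, here is a minimal standalone sketch of an edit distance where edits near the start of the string and edits involving consonants cost more. It does not use Duke's own WeightedLevenshtein class or its API. The class name, the weight values, and the "first two positions count as early" rule are all assumptions chosen for illustration, so treat the numbers as a starting point for tuning rather than recommended settings.

```java
public class WeightedEditDistance {
    // Hypothetical weights, picked only for illustration.
    static final double CONSONANT_WEIGHT = 1.0;
    static final double VOWEL_WEIGHT = 0.5;
    static final double EARLY_MULTIPLIER = 1.5; // applied to edits at early positions
    static final int EARLY_POSITIONS = 2;       // indexes 0 and 1 count as "early"

    static boolean isVowel(char c) {
        return "aeiouy".indexOf(Character.toLowerCase(c)) >= 0;
    }

    // Cost of inserting or deleting character c at zero-based position pos.
    static double weight(char c, int pos) {
        double base = isVowel(c) ? VOWEL_WEIGHT : CONSONANT_WEIGHT;
        return pos < EARLY_POSITIONS ? base * EARLY_MULTIPLIER : base;
    }

    // Standard dynamic-programming Levenshtein, with per-edit weights
    // instead of a fixed cost of 1.
    static double distance(String s1, String s2) {
        int n = s1.length();
        int m = s2.length();
        double[][] d = new double[n + 1][m + 1];

        // Deleting every character of s1, or inserting every character of s2.
        for (int i = 1; i <= n; i++) {
            d[i][0] = d[i - 1][0] + weight(s1.charAt(i - 1), i - 1);
        }
        for (int j = 1; j <= m; j++) {
            d[0][j] = d[0][j - 1] + weight(s2.charAt(j - 1), j - 1);
        }

        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                char a = s1.charAt(i - 1);
                char b = s2.charAt(j - 1);
                double wa = weight(a, i - 1);
                double wb = weight(b, j - 1);
                // A substitution costs as much as the heavier of the two characters.
                double substitute = d[i - 1][j - 1] + (a == b ? 0.0 : Math.max(wa, wb));
                double delete = d[i - 1][j] + wa;
                double insert = d[i][j - 1] + wb;
                d[i][j] = Math.min(substitute, Math.min(delete, insert));
            }
        }
        return d[n][m];
    }
}
```

With these assumed weights, distance("Decker", "Tucker") is 2.25 (an early consonant edit plus an early vowel edit), while distance("Decker", "Deckar") is 0.5 (a single late vowel edit). Plain Levenshtein gives 2 and 1 for the same pairs, so the weighting widens the gap between a different name and a spelling variant.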
> Or maybe I should change my overall config thresholds so that this 0.5 on a last name would result in the below “sure” threshold value.
> Any recommendations?
All of this is possible, but I think you should beware of focusing too much on any one field. The data in the other fields should contradict the name field when there's really no match, and that should take care of this kind of situation.
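A small sketch of why the other fields matter, assuming the per-field probabilities are combined with the naive Bayes rule. The class name and the example probabilities are made up for illustration. Under that rule a 0.5 on the last name is neutral: it neither raises nor lowers the overall score, so low probabilities from the other fields decide the outcome.

```java
public class BayesCombine {
    // Combines two match probabilities with the naive Bayes rule.
    // Inputs are expected to lie strictly between 0 and 1.
    static double combine(double p1, double p2) {
        return (p1 * p2) / ((p1 * p2) + ((1.0 - p1) * (1.0 - p2)));
    }

    // Folds the rule over any number of per-field probabilities,
    // starting from the neutral value 0.5.
    static double combineAll(double... probabilities) {
        double result = 0.5;
        for (double p : probabilities) {
            result = combine(result, p);
        }
        return result;
    }
}
```

For example, combineAll(0.5, 0.2, 0.3) gives the same result as combineAll(0.2, 0.3): the 0.5 from the name comparison drops out, and the two disagreeing fields pull the record pair well below 0.5.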
> Could you please explain how to run the config auto-generation? Basically I have close to a hundred test name pairs and the outcomes that I desire.
> I’d like to run your genetic algo to see what kind of config options it will suggest. Is there a doc for this?
There's no documentation, but it's actually pretty simple. I'm writing up a wiki page on it now:
http://code.google.com/p/duke/wiki/GeneticAlgorithm
--
Lars Marius Garshol | Consultant
Bouvet ASA Sandakerveien 24C D11 Postboks 4430 Nydalen NO-0403 Oslo
Phone: +47 23 40 60 00 | Fax: +47 23 40 60 01 | Mobile: +47 98 21 55 50
http://www.bouvet.no