We'll be making some changes to how you pick up Duke with Maven. I've gotten Duke into Maven Central, so that means the 0.6 release will be going there, and the local repository in Google Code will be taken away at some point.
Looking forward to this. I finally deployed the PersonNameCleaner and it
does improve matching for me, so I'll be updating the list of names going
forward.
I also would like to try your various new comparators. Will there be a short
description which one is good for what?
I am currently using a custom WeightedLevenshtein comparator which adjusts
distance for short strings. Will your WeightedLevenshtein be doing that also?
-Alexey
On 8/3/12 4:58 AM, "Lars Garshol" <lar...@gmail.com> wrote:
> We'll be making some changes to how you pick up Duke with Maven. I've gotten
> Duke into Maven Central, so that means the 0.6 release will be going there,
> and the local repository in Google Code will be taken away at some point.
> Looking forward to this. I finally deployed the PersonNameCleaner and it
> does improve matching for me, so I’ll be updating the list of names going
> forward.
Good to hear that it's also working for others.
> I also would like to try your various new comparators. Will there be a
> short description which one is good for what?
I'll add them to the documentation around release time.
Norphone is good for Norwegian names.
Metaphone is a rather coarse comparator for Anglo-Saxon names. Use it if you want to make sure relatively different names match.
The Jaccard index comparator is really a set comparator. It tokenizes strings, then compares the resulting sets of tokens. It can use other comparators to compare the tokens. It's good for when you can't trust the order of tokens in the strings.
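To make the set comparison concrete, here is a minimal standalone sketch. It is not Duke's actual implementation: the class and method names are made up for illustration, tokens are split on whitespace, and tokens are compared by exact equality rather than with a sub-comparator.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class JaccardSketch {
    // Assumed tokenizer: lower-case, split on whitespace.
    static Set<String> tokens(String s) {
        Set<String> out = new HashSet<>(Arrays.asList(s.trim().toLowerCase().split("\\s+")));
        out.remove(""); // an empty input yields a single empty token; drop it
        return out;
    }

    // Size of the intersection divided by size of the union.
    // Defined here as 1.0 when both token sets are empty.
    static double jaccard(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        if (ta.isEmpty() && tb.isEmpty()) return 1.0;
        Set<String> intersection = new HashSet<>(ta);
        intersection.retainAll(tb);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        // Same tokens in a different order: the score ignores order.
        System.out.println(jaccard("Whatever Street Homeowners Association",
                                   "Homeowners Association Whatever Street"));
    }
}
```

Because only the sets are compared, reordering the tokens does not change the score, which is the property that matters when token order can't be trusted.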
Weighted Levenshtein is really a better, slower Levenshtein where you can change how important you consider changes to various pairs of characters. For example, you can say that replacing "i" with "y" has a low cost, but replacing "k" with "u" has a high cost.
I've used it to deal with names that are almost the same, except for numbers, and where the numbers are crucially important. Many of the organizations in the database I'm dealing with are homeowners' associations for all the owners living in a certain city block. So I'll have "Homeowners Association Whatever Street 12" and "Homeowners Association Whatever Street 14", where the addresses are obviously almost entirely the same. Clearly, the 12 != 14 is really important, so I've used Weighted Levenshtein with a weight of 10.0 for digit edits. Works beautifully.
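A rough sketch of the idea, for illustration only: this is a standalone class, not Duke's WeightedLevenshtein API. The weight of 10.0 for digit edits comes from the description above; the default weight of 1.0, the rule of taking the larger of the two character weights for a substitution, and all names are assumptions.

```java
public class DigitWeightedLevenshtein {
    static final double DIGIT_WEIGHT = 10.0; // weight mentioned in the message
    static final double OTHER_WEIGHT = 1.0;  // assumed default for all other characters

    static double weight(char c) {
        return Character.isDigit(c) ? DIGIT_WEIGHT : OTHER_WEIGHT;
    }

    // Standard Levenshtein dynamic programming, but each edit costs the
    // weight of the character(s) involved instead of a flat 1.
    static double distance(String s1, String s2) {
        int n = s1.length();
        int m = s2.length();
        double[][] d = new double[n + 1][m + 1];
        for (int i = 1; i <= n; i++) d[i][0] = d[i - 1][0] + weight(s1.charAt(i - 1));
        for (int j = 1; j <= m; j++) d[0][j] = d[0][j - 1] + weight(s2.charAt(j - 1));
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                char a = s1.charAt(i - 1);
                char b = s2.charAt(j - 1);
                double substitute = d[i - 1][j - 1]
                        + (a == b ? 0.0 : Math.max(weight(a), weight(b)));
                double delete = d[i - 1][j] + weight(a);
                double insert = d[i][j - 1] + weight(b);
                d[i][j] = Math.min(substitute, Math.min(delete, insert));
            }
        }
        return d[n][m];
    }

    public static void main(String[] args) {
        System.out.println(distance("Whatever Street 12", "Whatever Street 14"));
        System.out.println(distance("Whatever Street 12", "Whatever Streat 12"));
    }
}
```

With these weights a single changed digit counts as much as ten ordinary typos, so two otherwise identical names with different street numbers end up far apart.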
> I am currently using a custom WeightedLevenshtein comparator which adjusts
> distance for short strings. Will your WeightedLevenshtein be doing that also?
It doesn't do that now, but if you explain what you mean, perhaps I can add it.
My WeightedLevenshtein was simply increasing the l-distance for short
strings:
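int sl = s1.length() + s2.length(); // combined length of both strings
if (d > 0 && sl <= 8) {
    // scale up a non-zero distance when the strings are short
    if (sl <= 4)
        d *= 4;
    else if (sl <= 6)
        d *= 3;
    else // sl is 7 or 8 here
        d *= 2;
}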
But I am finding that even that may not be good enough.
I want my comparator for last names to be pretty strict but still not
ExactComparator.
For example my current comparator computes 0.5 for these two names:
Decasper vs. Welanber whereas you can see they are completely different
names.
I encountered many examples like that recently.
Decker vs. Tucker
Dodson vs. Wilson
Galligan vs. Saltzman
I think I'd be OK with a few typos in longer last names (distance<0.3) but
when half of the string is different it should trigger a mismatch.
I guess I can adjust my comparator to do just that: if (distance<0.3) then
return 0.0
Or maybe I should change my overall config thresholds so that this 0.5 on a
last name would result in the below “sure” threshold value.
Any recommendations?
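One reading of the cutoff idea above, as a standalone sketch: treat a normalized Levenshtein distance above 0.3 as a definite mismatch and return 0.0. The class name, the normalization by the longer string's length, and the use of plain (unweighted) Levenshtein are assumptions for illustration; this is not tied to Duke's comparator interface.

```java
public class StrictNameSimilarity {
    static final double MAX_DISTANCE = 0.3; // cutoff discussed in the message

    // Plain Levenshtein distance using two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns 0.0 when more than 30% of the longer string would need editing,
    // otherwise 1 minus the normalized distance.
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) return 1.0;
        double normalized = (double) levenshtein(a, b) / longest;
        return normalized > MAX_DISTANCE ? 0.0 : 1.0 - normalized;
    }

    public static void main(String[] args) {
        System.out.println(similarity("Decker", "Tucker"));
        System.out.println(similarity("Dodson", "Wilson"));
        System.out.println(similarity("Galligan", "Galigan"));
    }
}
```

Under these assumptions Decker vs. Tucker needs 2 edits out of 6 characters (about 0.33), so it falls just over the cutoff, while a single typo in a longer name stays well under it.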
Could you please explain how to run the config auto-generation? Basically I
have close to a hundred test name pairs and the outcomes that I desire.
I'd like to run your genetic algo to see what kind of config options it will
suggest. Is there a doc for this?
Thank you,
Alexey
On 8/29/12 11:17 PM, "Lars Garshol" <lar...@gmail.com> wrote:
> [...]
> My WeightedLevenshtein was simply increasing the l-distance for short strings:
> [...]
Ah, I see. You don't need the full weighted Levenshtein for that.
> I want my comparator for last names to be pretty strict but still not ExactComparator.
Note that in Duke 0.6 the probability calculation has changed, so all comparators (other than exact) are more strict now.
> For example my current comparator computes 0.5 for these two names:
> Decasper vs. Welanber whereas you can see they are completely different names.
> I encountered many examples like that recently.
> Decker vs. Tucker
> Dodson vs. Wilson
> Galligan vs. Saltzman
Weighted Levenshtein can help with this, by considering early edits and consonant edits to be more important.
> Or maybe I should change my overall config thresholds so that this 0.5 on a last name would result in the below “sure” threshold value.
> Any recommendations?
All of this is possible, but I think you should beware of focusing too much on any one field. The data in the other fields should contradict the name field when there's really no match, and that should take care of this kind of situation.
> Could you please explain how to run the config auto-generation? Basically I have close to a hundred test name pairs and the outcomes that I desire.
> I’d like to run your genetic algo to see what kind of config options it will suggest. Is there a doc for this?
The comparison was based on 4 parameters (3 are enough for a match if the 4th
does not contradict): first name, last name, phone, and email. But what
happened is that in this database all records had the same bad phone number
'800' and many of those similar-sounding last names had the same first name.
So my comparison was firing "sure" matches for all of them, mostly because of
the '800' phone.
Since then I made a few changes:
1. Ignore any phone number of length <6
2. Make the name comparison much stricter. I basically now allow typos to be
<20% (Levenshtein distance .2 or less). Anything with more typos is not a
match for sure.
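The first rule could be sketched roughly like this. It is a standalone helper, not Duke's cleaner API, and it assumes that "length" means the number of digits after stripping punctuation and spaces; the class and method names are made up.

```java
public class PhoneFilter {
    static final int MIN_DIGITS = 6; // threshold from rule 1 above

    // Returns the digits of the phone number, or null when the value is
    // missing or too short to be a meaningful phone number (e.g. "800").
    static String usablePhone(String phone) {
        if (phone == null) return null;
        String digits = phone.replaceAll("\\D", ""); // keep digits only
        return digits.length() < MIN_DIGITS ? null : digits;
    }

    public static void main(String[] args) {
        System.out.println(usablePhone("800"));
        System.out.println(usablePhone("(555) 123-4567"));
    }
}
```

Treating a too-short value as missing, rather than as a real value, means it can neither support nor contradict a match.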
On 10/5/12 1:56 AM, "Lars Marius Garshol" <lars.gars...@bouvet.no> wrote:
> [...]
> I can't find it in there by searching, either. I must have done something
> wrong. Thanks for letting me know! I'll look into it now.
This was much harder than expected, but I finally managed to hit all the right buttons, and Duke is now on its way into Maven Central. I'm told it should be there within 2 hours.