Duke going into Maven Central


Lars Marius Garshol

Aug 3, 2012, 7:58:41 AM
to duke-...@googlegroups.com
We'll be making some changes to how you pick up Duke with Maven. I've gotten Duke into Maven Central, so that means the 0.6 release will be going there, and the local repository in Google Code will be taken away at some point.

This page describes how to use Duke with your Maven configuration:

Note that when 0.6 comes out you will be able to get it without having to declare a <repository> at all, because it will be in Maven Central.

I've made quite a few improvements, so I'm thinking of doing a new release soon.

--Lars M.

Alexey Panteleev

Aug 29, 2012, 10:54:29 PM
to duke-...@googlegroups.com
Looking forward to this. I finally deployed the PersonNameCleaner and it does improve matching for me, so I’ll be updating the list of names going forward.
I would also like to try your various new comparators. Will there be a short description of which one is good for what?
I am currently using a custom WeightedLevenshtein comparator which adjusts the distance for short strings. Will your WeightedLevenshtein do that too?

-Alexey

Lars Marius Garshol

Aug 30, 2012, 2:17:28 AM
to duke-...@googlegroups.com
* Alexey Panteleev
> Looking forward to this. I finally deployed the PersonNameCleaner and it does improve matching for me, so I’ll be updating the list of names going forward.

Good to hear that it's also working for others.
 
> I also would like to try your various new comparators. Will there be a short description which one is good for what?

I'll add them to the documentation around release time.

Norphone is good for Norwegian names.

Metaphone is a rather coarse comparator for Anglo-Saxon names. Use it if you want to make sure relatively different names match.

The Jaccard index comparator is really a set comparator. It tokenizes strings, then compares the resulting sets of tokens. It can use other comparators to compare the tokens. It's good for when you can't trust the order of tokens in the strings.
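
For illustration, here is a minimal sketch of the token-set idea described above. It is not Duke's Jaccard index comparator: the class and method names are hypothetical, tokens are split on whitespace, and tokens are compared by exact (case-insensitive) equality rather than with a pluggable sub-comparator.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; not Duke's own Jaccard comparator.
class JaccardSketch {

  // Lower-cases the string and splits it on whitespace into a set of tokens.
  static Set<String> tokens(String s) {
    Set<String> result = new HashSet<>();
    for (String t : s.trim().toLowerCase().split("\\s+")) {
      if (!t.isEmpty()) {
        result.add(t);
      }
    }
    return result;
  }

  // Jaccard index: size of the intersection divided by size of the union.
  static double similarity(String s1, String s2) {
    Set<String> a = tokens(s1);
    Set<String> b = tokens(s2);
    if (a.isEmpty() && b.isEmpty()) {
      return 1.0; // assumption: two empty strings count as identical
    }
    Set<String> intersection = new HashSet<>(a);
    intersection.retainAll(b);
    Set<String> union = new HashSet<>(a);
    union.addAll(b);
    return (double) intersection.size() / union.size();
  }
}
```

Because only the sets of tokens are compared, "Garshol Lars Marius" and "Lars Marius Garshol" come out as identical here, which is the point when token order can't be trusted.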

Weighted Levenshtein is really a better, slower Levenshtein where you can change how important you consider changes to various pairs of characters. For example, you can say that replacing "i" with "y" has a low cost, but replacing "k" with "u" has a high cost.

I've used it to deal with names that are almost the same, except for numbers, and where the numbers are crucially important. Many of the organizations in the database I'm dealing with are homeowner's associations for all the owners living in a certain city block. So I'll have "Homeowners Association Whatever Street 12" and "Homeowners Association Whatever Street 14", where the addresses are obviously almost entirely the same. Clearly, the 12 != 14 is really important, so I've used Weighted Levenshtein with a weight of 10.0 for digit edits. Works beautifully.
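
As a rough sketch of the digit-weighting idea above (not Duke's actual Weighted Levenshtein code), the edit distance below charges a configurable cost for any edit that touches a digit and 1.0 for everything else. The class name, the method names, and the rule that a substitution costs as much as the more expensive of the two characters are all assumptions made for this example.

```java
// Hypothetical sketch; not Duke's WeightedLevenshtein implementation.
class WeightedLevenshteinSketch {

  // Cost of inserting or deleting c: digits are expensive, everything else costs 1.
  static double weight(char c, double digitWeight) {
    return Character.isDigit(c) ? digitWeight : 1.0;
  }

  // Standard dynamic-programming edit distance with per-character costs.
  static double distance(String s1, String s2, double digitWeight) {
    int n = s1.length();
    int m = s2.length();
    double[][] d = new double[n + 1][m + 1];
    for (int i = 1; i <= n; i++) {
      d[i][0] = d[i - 1][0] + weight(s1.charAt(i - 1), digitWeight);
    }
    for (int j = 1; j <= m; j++) {
      d[0][j] = d[0][j - 1] + weight(s2.charAt(j - 1), digitWeight);
    }
    for (int i = 1; i <= n; i++) {
      for (int j = 1; j <= m; j++) {
        char c1 = s1.charAt(i - 1);
        char c2 = s2.charAt(j - 1);
        double w1 = weight(c1, digitWeight);
        double w2 = weight(c2, digitWeight);
        // Assumption: a substitution costs as much as the more expensive character.
        double substitute = d[i - 1][j - 1] + (c1 == c2 ? 0.0 : Math.max(w1, w2));
        double delete = d[i - 1][j] + w1;
        double insert = d[i][j - 1] + w2;
        d[i][j] = Math.min(substitute, Math.min(delete, insert));
      }
    }
    return d[n][m];
  }
}
```

With a digit weight of 10.0, "Whatever Street 12" vs. "Whatever Street 14" gets a distance of 10.0, while a one-letter typo in "Street" still costs only 1.0.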
 
> I am currently using a custom WeightedLevenstein comparator which adjusts distance for short strings, will your WeightedLevenstein be doing that also?

It doesn't do that now, but if you explain what you mean, perhaps I can add it.

--Lars M. 

Alexey Panteleev

Sep 26, 2012, 4:57:17 PM
to duke-...@googlegroups.com
My WeightedLevenshtein simply increased the Levenshtein distance for short strings:

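// d is the Levenshtein distance already computed for s1 and s2
int sl = s1.length() + s2.length();
if (d > 0) {
  // the shorter the combined length, the more each edit counts
  if (sl <= 4)
    d *= 4;
  else if (sl <= 6)
    d *= 3;
  else if (sl <= 8)
    d *= 2;
}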

 But I am finding that even that may not be good enough.
I want my comparator for last names to be pretty strict but still not ExactComparator.

 For example my current comparator computes 0.5 for these two names:
Decasper vs. Welanber whereas you can see they are completely different names.

 I encountered many examples like that recently.

Decker vs. Tucker
Dodson vs. Wilson
Galligan vs. Saltzman

 I think I’d be ok with a few typos in longer last names (distance<0.3) but when half of the string is different it should trigger a mismatch.
I guess I can adjust my comparator to do just that: if (distance<0.3) then return 0.0
Or maybe I should change my overall config thresholds so that this 0.5 on a last name would result in a value below the “sure” threshold.
Any recommendations?

 Could you please explain how to run the config auto-generation? Basically I have close to a hundred test name pairs and the outcomes that I desire.
I’d like to run your genetic algo to see what kind of config options it will suggest. Is there a doc for this?

Thank you,
Alexey

Alexey Panteleev

Sep 26, 2012, 6:32:19 PM
to duke-...@googlegroups.com
Strange, but my Maven does not find duke-0.6.



On 8/3/12 4:58 AM, "Lars Garshol" <lar...@gmail.com> wrote:

Lars Marius Garshol

Oct 5, 2012, 4:56:43 AM
to duke-...@googlegroups.com

* Alexey Panteleev
>
> My WeightedLevenshtein was simply increasing the l-distance for short strings:
> [...]

Ah, I see. You don't need the full weighted Levenshtein for that.

> I want my comparator for last names to be pretty strict but still not ExactComparator.

Note that in Duke 0.6 the probability calculation has changed, so all comparators (other than exact) are more strict now.

> For example my current comparator computes 0.5 for these two names:
> Decasper vs. Welanber whereas you can see they are completely different names.
>
> I encountered many examples like that recently.
>
> Decker vs. Tucker
> Dodson vs. Wilson
> Galligan vs. Saltzman

Weighted Levenshtein can help with this, by considering early edits and consonant edits to be more important.

> Or maybe I should change my overall config thresholds so that this 0.5 on a last name would result in a value below the “sure” threshold.
> Any recommendations?

All of this is possible, but I think you should beware of focusing too much on any one field. The data in the other fields should contradict the name field when there's really no match, and that should take care of this kind of situation.
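
To make the point about contradicting fields concrete, here is a small sketch that assumes the per-field probabilities are combined with a naive-Bayes style formula. That combination rule is an assumption for this example and is not stated in this thread; the class and method names are hypothetical, and the probabilities are made-up inputs.

```java
// Hypothetical sketch of combining per-field match probabilities.
class CombineSketch {

  // Naive-Bayes style combination of two probabilities.
  // Assumes both values are strictly between 0 and 1.
  static double computeBayes(double p1, double p2) {
    return (p1 * p2) / ((p1 * p2) + ((1.0 - p1) * (1.0 - p2)));
  }

  // Folds all field probabilities together, starting from a neutral 0.5.
  static double combine(double... fieldProbabilities) {
    double result = 0.5;
    for (double p : fieldProbabilities) {
      result = computeBayes(result, p);
    }
    return result;
  }

  public static void main(String[] args) {
    // Two fields that agree fairly well.
    System.out.println(combine(0.8, 0.8));
    // The same two fields plus a third field that clearly contradicts them.
    System.out.println(combine(0.8, 0.8, 0.1));
  }
}
```

Under this assumption, two fields at 0.8 combine to about 0.94, but adding one contradicting field at 0.1 pulls the result down to 0.64, so a too-generous name score alone need not produce a "sure" match.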

> Could you please explain how to run the config auto-generation? Basically I have close to a hundred test name pairs and the outcomes that I desire.
> I’d like to run your genetic algo to see what kind of config options it will suggest. Is there a doc for this?

There's no documentation, but it's actually pretty simple. I'm writing up a wiki page on it now:
http://code.google.com/p/duke/wiki/GeneticAlgorithm

--
Lars Marius Garshol | Consultant
Bouvet ASA Sandakerveien 24C D11 Postboks 4430 Nydalen NO-0403 Oslo
Phone: +47 23 40 60 00 | Fax: +47 23 40 60 01 | Mobile: +47 98 21 55 50
http://www.bouvet.no


Lars Marius Garshol

Oct 5, 2012, 5:51:25 AM
to duke-...@googlegroups.com

* Alexey Panteleev
>
> Strange, but my Maven does not find duke-0.6.

I can't find it in there by searching, either. I must have done something wrong. Thanks for letting me know! I'll look into it now.

Alexey Panteleev

Oct 5, 2012, 12:59:55 PM
to duke-...@googlegroups.com
The comparison was based on 4 parameters (3 are enough for a match if the
4th does not contradict): first name, last name, phone or email. But what
happened is that in this database all records had the same bad phone number
'800', and many of those similar-sounding last names had the same first
name. So my comparison was firing "sure" matches for all of them, mostly
because of the '800' phone.

Since then I made a few changes:

1. Ignore any phone number of length <6
2. Make the name comparison much stricter. I basically now allow typos to be
<20% (Levenshtein distance .2 or less). Anything with more typos is
definitely not a match.
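
The two changes above could be sketched roughly as follows. This is not the code used in the thread: the class and method names are hypothetical, "length" of a phone number is taken to mean its digit count, and the Levenshtein distance is normalized by the length of the longer name, both of which are assumptions.

```java
// Hypothetical sketch of the two rules; not the poster's actual code.
class StrictMatchSketch {

  // Rule 1: ignore phone numbers with fewer than 6 digits (such as "800").
  static boolean usablePhone(String phone) {
    if (phone == null) {
      return false;
    }
    String digits = phone.replaceAll("\\D", ""); // assumption: count digits only
    return digits.length() >= 6;
  }

  // Plain Levenshtein distance, keeping only two rows of the table.
  static int levenshtein(String a, String b) {
    int[] prev = new int[b.length() + 1];
    int[] curr = new int[b.length() + 1];
    for (int j = 0; j <= b.length(); j++) {
      prev[j] = j;
    }
    for (int i = 1; i <= a.length(); i++) {
      curr[0] = i;
      for (int j = 1; j <= b.length(); j++) {
        int substitute = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
        curr[j] = Math.min(substitute, Math.min(prev[j] + 1, curr[j - 1] + 1));
      }
      int[] tmp = prev;
      prev = curr;
      curr = tmp;
    }
    return prev[b.length()];
  }

  // Rule 2: names match only if at most 20% of the longer name was edited.
  static boolean namesMatch(String n1, String n2) {
    int longest = Math.max(n1.length(), n2.length());
    if (longest == 0) {
      return true; // assumption: two empty names count as equal
    }
    double normalized = (double) levenshtein(n1, n2) / longest;
    return normalized <= 0.2;
  }
}
```

With these rules, "Decker" vs. "Tucker" (2 edits out of 6 characters) and "Dodson" vs. "Wilson" (3 out of 6) are rejected, while a single typo in a nine-letter name still passes.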

Alexey Panteleev

Oct 5, 2012, 1:01:31 PM
to duke-...@googlegroups.com
How do I do that? I guess I'll have to study the code.


On 10/5/12 1:56 AM, "Lars Marius Garshol" <lars.g...@bouvet.no> wrote:

Alexey Panteleev

Oct 5, 2012, 1:02:16 PM
to duke-...@googlegroups.com
Thank you.


On 10/5/12 1:56 AM, "Lars Marius Garshol" <lars.g...@bouvet.no> wrote:

Lars Marius Garshol

Nov 8, 2012, 12:29:23 PM
to duke-...@googlegroups.com
* lars.garshol

> I can't find it in there by searching, either. I must have done something wrong. Thanks for letting me know! I'll look into it now.

This was much harder than expected, but I finally managed to hit all the right buttons, and Duke is now on its way into Maven Central. I'm told it should be there within 2 hours.

--Lars M.