Duke going into Maven central
From: Lars Marius Garshol <lar...@gmail.com>
Date: Fri, 3 Aug 2012 04:58:41 -0700 (PDT)
Local: Fri, Aug 3 2012 7:58 am
Subject: Duke going into Maven central

We'll be making some changes to how you pick up Duke with Maven. I've
gotten Duke into Maven Central, so that means the 0.6 release will be going
there, and the local repository in Google Code will be taken away at some
point.

This page describes how to use Duke with your Maven configuration:
http://code.google.com/p/duke/wiki/MavenSetup

Note that when 0.6 comes out you will be able to get it without having to
declare a <repository> at all, because it will be in Maven Central.

I've made quite a few improvements, so I'm thinking of doing a new release
soon.

--Lars M.


 
From: Alexey Panteleev <ale...@yoxel.com>
Date: Wed, 29 Aug 2012 19:54:29 -0700
Local: Wed, Aug 29 2012 10:54 pm
Subject: Re: Duke going into Maven central

Looking forward to this. I finally deployed the PersonNameCleaner and it
does improve matching for me, so I'll be updating the list of names going
forward.
I also would like to try your various new comparators. Will there be a short
description which one is good for what?
I am currently using a custom WeightedLevenstein comparator which adjusts
distance for short strings, will your WeightedLevenstein be doing that also?

-Alexey

On 8/3/12 4:58 AM, "Lars Garshol" <lar...@gmail.com> wrote:


 
From: Lars Marius Garshol <lar...@gmail.com>
Date: Wed, 29 Aug 2012 23:17:28 -0700 (PDT)
Local: Thurs, Aug 30 2012 2:17 am
Subject: Re: Duke going into Maven central

* Alexey Panteleev

>  Looking forward to this. I finally deployed the PersonNameCleaner and it
> does improve matching for me, so I’ll be updating the list of names going
> forward.

Good to hear that it's also working for others.

> I also would like to try your various new comparators. Will there be a
> short description which one is good for what?

I'll add them to the documentation around release time.

Norphone is good for Norwegian names.

Metaphone is a rather coarse comparator for Anglo-Saxon names. Use it if
you want to make sure relatively different names match.

The Jaccard index comparator is really a set comparator. It tokenizes
strings, then compares the resulting sets of tokens. It can use other
comparators to compare the tokens. It's good for when you can't trust the
order of tokens in the strings.
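
To illustrate the set-comparison idea described above, here is a minimal sketch. It is not Duke's own comparator class: the class name, the whitespace tokenizer, and the use of exact token equality (rather than a pluggable token comparator) are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only; this is not Duke's comparator class.
class JaccardSketch {

  // Assumed tokenizer: lowercase, then split on whitespace.
  static Set<String> tokens(String s) {
    String trimmed = s.trim().toLowerCase(Locale.ROOT);
    Set<String> result = new HashSet<>();
    if (!trimmed.isEmpty()) {
      result.addAll(Arrays.asList(trimmed.split("\\s+")));
    }
    return result;
  }

  // Jaccard index: size of the intersection divided by size of the union.
  // Tokens are compared with exact equality here.
  static double jaccard(String a, String b) {
    Set<String> ta = tokens(a);
    Set<String> tb = tokens(b);
    if (ta.isEmpty() && tb.isEmpty()) {
      return 1.0; // assumption: two empty strings count as identical
    }
    Set<String> intersection = new HashSet<>(ta);
    intersection.retainAll(tb);
    Set<String> union = new HashSet<>(ta);
    union.addAll(tb);
    return (double) intersection.size() / union.size();
  }

  public static void main(String[] args) {
    // Same tokens in a different order: prints 1.0
    System.out.println(jaccard("Lars Marius Garshol", "Garshol Lars Marius"));
    // One token missing on one side: a value between 0 and 1
    System.out.println(jaccard("Lars Garshol", "Lars Marius Garshol"));
  }
}
```

Because the strings are reduced to sets, token order has no effect on the score, which is the property that makes this approach useful when the order cannot be trusted.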

Weighted Levenshtein is really a better, slower Levenshtein where you can
change how important you consider changes to various pairs of characters.
For example, you can say that replacing "i" with "y" has a low cost, but
replacing "k" with "u" has a high cost.

I've used it to deal with names that are almost the same, except for
numbers, and where the numbers are crucially important. Many of the
organizations in the database I'm dealing with are homeowners' associations
for all the owners living in a certain city block. So I'll have "Homeowners
Association Whatever Street 12" and "Homeowners Association Whatever Street
14", where the addresses are obviously almost entirely the same. Clearly,
the 12 != 14 is really important, so I've used Weighted Levenshtein with a
weight of 10.0 for digit edits. Works beautifully.
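
The digit-weighting idea above can be sketched roughly as follows. This is a hypothetical standalone implementation, not Duke's WeightedLevenshtein code: the class and method names are invented, only the 10.0 digit weight comes from the post, and the default weight of 1.0 for all other characters is an assumption.

```java
// Illustrative sketch only; not Duke's WeightedLevenshtein implementation.
class WeightedEditSketch {

  static final double DIGIT_WEIGHT = 10.0; // weight mentioned in the post
  static final double OTHER_WEIGHT = 1.0;  // assumed weight for everything else

  static double weight(char c) {
    return Character.isDigit(c) ? DIGIT_WEIGHT : OTHER_WEIGHT;
  }

  // Edit distance in which inserting, deleting, or substituting a digit
  // costs DIGIT_WEIGHT and any other edit costs OTHER_WEIGHT.
  static double distance(String s1, String s2) {
    double[][] d = new double[s1.length() + 1][s2.length() + 1];
    for (int i = 1; i <= s1.length(); i++) {
      d[i][0] = d[i - 1][0] + weight(s1.charAt(i - 1));
    }
    for (int j = 1; j <= s2.length(); j++) {
      d[0][j] = d[0][j - 1] + weight(s2.charAt(j - 1));
    }
    for (int i = 1; i <= s1.length(); i++) {
      for (int j = 1; j <= s2.length(); j++) {
        char a = s1.charAt(i - 1);
        char b = s2.charAt(j - 1);
        // A substitution is as expensive as the heavier of the two characters.
        double subst = (a == b) ? 0.0 : Math.max(weight(a), weight(b));
        d[i][j] = Math.min(
            d[i - 1][j - 1] + subst,
            Math.min(d[i - 1][j] + weight(a),    // delete a from s1
                     d[i][j - 1] + weight(b)));  // insert b from s2
      }
    }
    return d[s1.length()][s2.length()];
  }

  public static void main(String[] args) {
    // A single digit change: prints 10.0
    System.out.println(distance("Whatever Street 12", "Whatever Street 14"));
    // A single letter typo: prints 1.0
    System.out.println(distance("Whatever Street 12", "Whatever Streat 12"));
  }
}
```

With these weights a one-digit difference counts ten times as much as a one-letter typo, so two otherwise identical addresses with different house numbers end up far apart.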

>  I am currently using a custom WeightedLevenstein comparator which adjusts
> distance for short strings, will your WeightedLevenstein be doing that also?

It doesn't do that now, but if you explain what you mean, perhaps I can add
it.

--Lars M.


 
From: Alexey Panteleev <ale...@yoxel.com>
Date: Wed, 26 Sep 2012 13:57:17 -0700
Local: Wed, Sep 26 2012 4:57 pm
Subject: Re: Duke going into Maven central

My WeightedLevenshtein was simply increasing the l-distance for short
strings:

// d is the plain Levenshtein distance already computed for s1 and s2
int sl = s1.length() + s2.length();
if (d > 0 && sl <= 8) {
  if (sl <= 4)
    d *= 4;
  else if (sl <= 6)
    d *= 3;
  else if (sl <= 8)
    d *= 2;
}

But I am finding that even that may not be good enough.
I want my comparator for last names to be pretty strict, but still not an
ExactComparator.

For example, my current comparator computes 0.5 for these two names:
Decasper vs. Welanber, even though you can see they are completely different
names.

I encountered many examples like that recently.

Decker vs. Tucker
Dodson vs. Wilson
Galligan vs. Saltzman

I think I'd be OK with a few typos in longer last names (distance<0.3), but
when half of the string is different it should trigger a mismatch.
I guess I can adjust my comparator to do just that: if (distance<0.3) then
return 0.0
Or maybe I should change my overall config thresholds so that this 0.5 on a
last name would result in a value below the "sure" threshold.
Any recommendations?
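
One way to read the rule above (tolerate a few typos, but treat the names as a mismatch once too much of the string differs) is sketched below. It is a hypothetical standalone helper, not a Duke comparator. It assumes the cutoff applies to the edit distance divided by the length of the longer name, and that anything above 0.3 should score 0.0; the class and method names are invented for the example.

```java
// Illustrative sketch only; not a Duke comparator.
class StrictNameSketch {

  static final double MAX_RATIO = 0.3; // assumed cutoff on distance / length

  // Plain Levenshtein distance using two rolling rows.
  static int levenshtein(String s1, String s2) {
    int[] prev = new int[s2.length() + 1];
    int[] curr = new int[s2.length() + 1];
    for (int j = 0; j <= s2.length(); j++) {
      prev[j] = j;
    }
    for (int i = 1; i <= s1.length(); i++) {
      curr[0] = i;
      for (int j = 1; j <= s2.length(); j++) {
        int cost = (s1.charAt(i - 1) == s2.charAt(j - 1)) ? 0 : 1;
        curr[j] = Math.min(prev[j - 1] + cost,
                           Math.min(prev[j] + 1, curr[j - 1] + 1));
      }
      int[] tmp = prev;
      prev = curr;
      curr = tmp;
    }
    return prev[s2.length()];
  }

  // Returns 0.0 once the share of edited characters exceeds MAX_RATIO,
  // otherwise 1 minus that share.
  static double similarity(String s1, String s2) {
    int longest = Math.max(s1.length(), s2.length());
    if (longest == 0) {
      return 1.0; // assumption: two empty names count as identical
    }
    double ratio = (double) levenshtein(s1, s2) / longest;
    return ratio > MAX_RATIO ? 0.0 : 1.0 - ratio;
  }

  public static void main(String[] args) {
    System.out.println(similarity("Decasper", "Welanber")); // 4 of 8 differ: prints 0.0
    System.out.println(similarity("Decker", "Tucker"));     // 2 of 6 differ: prints 0.0
    System.out.println(similarity("Panteleev", "Pantelev")); // one typo: stays high
  }
}
```

A hard cutoff like this makes the name field contribute nothing once the names diverge too far, instead of a middling score such as 0.5.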

Could you please explain how to run the config auto-generation? Basically, I
have close to a hundred test name pairs and the outcomes that I desire.
I'd like to run your genetic algorithm to see what kind of config options it will
suggest. Is there a doc for this?

Thank you,
Alexey

On 8/29/12 11:17 PM, "Lars Garshol" <lar...@gmail.com> wrote:


 
From: Alexey Panteleev <ale...@yoxel.com>
Date: Wed, 26 Sep 2012 15:32:19 -0700
Local: Wed, Sep 26 2012 6:32 pm
Subject: Re: Duke going into Maven central

Strange, but my Maven does not find duke-0.6.

On 8/3/12 4:58 AM, "Lars Garshol" <lar...@gmail.com> wrote:


 
From: Lars Marius Garshol <lars.gars...@bouvet.no>
Date: Fri, 5 Oct 2012 08:56:43 +0000
Local: Fri, Oct 5 2012 4:56 am
Subject: Re: Duke going into Maven central

* Alexey Panteleev

> My WeightedLevenshtein was simply increasing the l-distance for short strings:
> [...]

Ah, I see. You don't need the full weighted Levenshtein for that.

> I want my comparator for last names to be pretty strict but still not ExactComparator.

Note that in Duke 0.6 the probability calculation has changed, so all comparators (other than exact) are more strict now.

> For example my current comparator computes 0.5 for these two names:
> Decasper vs. Welanber whereas you can see they are completely different names.

> I encountered many examples like that recently.

> Decker vs. Tucker
> Dodson vs. Wilson
> Galligan vs. Saltzman

Weighted Levenshtein can help with this, by considering early edits and consonant edits to be more important.

> Or maybe I should change my overall config thresholds so that this 0.5 on a last name would result in the below “sure” threshold value.
> Any recommendations?

All of this is possible, but I think you should beware of focusing too much on any one field. The data in the other fields should contradict the name field when there's really no match, and that should take care of this kind of situation.

> Could you please explain how to run the config auto-generation? Basically I have close to a hundred test name pairs and the outcomes that I desire.
> I’d like to run your genetic algo to see what kind of config options it will suggest. Is there a doc for this?

There's no documentation, but it's actually pretty simple. I'm writing up a wiki page on it now:
  http://code.google.com/p/duke/wiki/GeneticAlgorithm

--
Lars Marius Garshol  |  Consultant
Bouvet ASA Sandakerveien 24C D11 Postboks 4430 Nydalen NO-0403 Oslo
Phone: +47 23 40 60 00 | Fax: +47 23 40 60 01 | Mobile: +47 98 21 55 50
http://www.bouvet.no


 
From: Lars Marius Garshol <lars.gars...@bouvet.no>
Date: Fri, 5 Oct 2012 09:51:25 +0000
Local: Fri, Oct 5 2012 5:51 am
Subject: Re: Duke going into Maven central

* Alexey Panteleev

> Strange, but my maven does not find duke-0.6.

I can't find it in there by searching, either. I must have done something wrong. Thanks for letting me know! I'll look into it now.



 
From: Alexey Panteleev <ale...@yoxel.com>
Date: Fri, 05 Oct 2012 09:59:55 -0700
Local: Fri, Oct 5 2012 12:59 pm
Subject: Re: Duke going into Maven central

The comparison was based on 4 parameters (3 are enough for a match if the 4th
does not contradict): first name, last name, phone or email. But what happened
is that in this database all records had the same bad phone number '800' and
many of those similar-sounding last names had the same first name. So my
comparison was firing "sure" matches for all of them, mostly because of the
'800' phone.

Since then I made a few changes:

1. Ignore any phone number of length <6
2. Make the name comparison much stricter. I basically now allow typos to be
<20% (Levenshtein distance .2 or less). Anything with more typos is
definitely not a match.
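
The first change in the list above could look something like the sketch below. It is a hypothetical helper, not part of Duke or of the poster's code. It assumes that "length" means the number of digits left after stripping punctuation and spaces, and that rejected values are dropped by returning null so the field is simply absent from the comparison.

```java
// Illustrative sketch only; not part of Duke.
class PhoneFilterSketch {

  static final int MIN_DIGITS = 6; // shorter values are ignored, per the list above

  // Returns the digits of the phone number, or null when the value is
  // missing or too short to be a meaningful identifier.
  static String usablePhone(String raw) {
    if (raw == null) {
      return null;
    }
    String digits = raw.replaceAll("\\D", ""); // keep digits only
    return digits.length() < MIN_DIGITS ? null : digits;
  }

  public static void main(String[] args) {
    System.out.println(usablePhone("800"));            // prints null
    System.out.println(usablePhone("(555) 010-0199")); // prints 5550100199
  }
}
```

Dropping the value before comparison means a placeholder such as '800' can no longer count as matching evidence across unrelated records.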

On 10/5/12 1:56 AM, "Lars Marius Garshol" <lars.gars...@bouvet.no> wrote:


 
From: Alexey Panteleev <ale...@yoxel.com>
Date: Fri, 05 Oct 2012 10:01:31 -0700
Local: Fri, Oct 5 2012 1:01 pm
Subject: Re: Duke going into Maven central

How do I do that? I guess I'll have to study the code.

On 10/5/12 1:56 AM, "Lars Marius Garshol" <lars.gars...@bouvet.no> wrote:


 
From: Alexey Panteleev <ale...@yoxel.com>
Date: Fri, 05 Oct 2012 10:02:16 -0700
Local: Fri, Oct 5 2012 1:02 pm
Subject: Re: Duke going into Maven central

Thank you.

On 10/5/12 1:56 AM, "Lars Marius Garshol" <lars.gars...@bouvet.no> wrote:


 
From: Lars Marius Garshol <lar...@gmail.com>
Date: Thu, 8 Nov 2012 09:29:23 -0800 (PST)
Local: Thurs, Nov 8 2012 12:29 pm
Subject: Re: Duke going into Maven central

* lars.garshol

> I can't find it in there by searching, either. I must have done something
>  wrong. Thanks for letting me know! I'll look into it now.

This was much harder than expected, but I managed to finally hit all the
right buttons, and Duke is now on its way into Maven central. I'm told it
should be there within 2 hours.

--Lars M.


 