Re: Dedupe for record linkage


Forest Gregg

Nov 23, 2013, 7:00:21 PM
to open-source-...@googlegroups.com
Hi Jordan,

So we have a couple of branches to do what you are looking for.

https://github.com/open-city/dedupe/tree/data_matching is for when you are sure that each record set is unique, which sounds like what you want. Check out the dataset matching example https://github.com/open-city/dedupe/blob/data_matching/examples/dataset_matching/dataset_matching.py

I've also been working on a softer solution, which basically lets you learn three different matching probabilities, depending on whether both records in a pair are from dataset A, both from dataset B, or one from each. It's not as far along. https://github.com/open-city/dedupe/tree/soft_data_matching

Both branches are definitely works in progress.

Best,

Forest



On Fri, Nov 22, 2013 at 2:08 PM, Jordan Bates <jtba...@asu.edu> wrote:
Hi Forest,

I'd like to make sure that rows from one source are not linked to other rows from the same source (I assume that the inputs do not have duplicates).  I'm not sure how the training in dedupe works, but if I include a UID column for each source that is blank for the other sources, and vice versa, would dedupe then learn in training not to link rows from the same source?

Thanks for your reply and for developing this software!

Best,
Jordan



On Fri, Nov 22, 2013 at 12:29 PM, Forest Gregg <fgr...@gmail.com> wrote:
Hi Jordan,

There's not an exact tutorial for record linkage, but you can look at the csv example for the general idea: http://open-city.github.io/dedupe/doc/csv_example.html

In that data we are combining data from different sources and then figuring out which records are referring to the same thing. Because the task is deduplication, we collapse that information, but if you were doing record linkage, everything could be the same up to the point where you identify the clusters of records. http://open-city.github.io/dedupe/doc/csv_example.html#section-25

This might even be easier to see in the README for csvdedupe https://github.com/datamade/csvdedupe/

Good luck.

Best,

Forest



On Fri, Nov 22, 2013 at 12:34 PM, Jordan Bates <jtba...@asu.edu> wrote:
Hello Forest and Derek,

Is there a tutorial on how dedupe could be used for record linkage or do you suggest another package for this task?  Thank you.

Best,
Jordan



--
773.888.2718
2231 N. Monticello Ave.
Chicago, IL 60647




--
773.888.2718
2231 N. Monticello Ave.
Chicago, IL 60647

Jordan T Bates

Nov 27, 2013, 8:49:03 PM
to open-source-...@googlegroups.com
Hi Forest,

Thank you so much for your help!  I did end up using the data_matching branch and it worked pretty well.

I had to replace the greedyMatching function in the clustering module to get things to work.  I'm not sure if this is a bug or if I did something incorrectly that created the problem.  It looks like the greedyMatching function is presuming that the first vertex of each pair in dupes comes from dataset 0 and the second comes from the other dataset.  This wasn't true for me, so the clusters it returned included repeats like (1,2) and (2,1).

I replaced

def greedyMatching(dupes, threshold=0.5):
    covered_vertex_A = set([])
    covered_vertex_B = set([])
    clusters = []

    sorted_dupes = sorted(dupes, key=lambda score: score[1], reverse=True)
    dupes_list = [dupe for dupe in sorted_dupes if dupe[1] >= threshold]

    for dupe in dupes_list:
        vertices = dupe[0]
        if vertices[0] not in covered_vertex_A and vertices[1] not in covered_vertex_B:
            clusters.append(set(vertices))
            covered_vertex_A.update([vertices[0]])
            covered_vertex_B.update([vertices[1]])

    return clusters

with
 
def greedyMatching(dupes, threshold=0.5):
    # One set covering matched records from both datasets, so a pair is
    # skipped if either vertex has already been assigned, regardless of
    # which dataset it came from.
    covered_vertex = set()
    clusters = []

    # Sort candidate pairs by score, highest first, and keep only those
    # at or above the threshold.
    sorted_dupes = sorted(dupes, key=lambda dupe: dupe[1], reverse=True)
    dupes_list = [dupe for dupe in sorted_dupes if dupe[1] >= threshold]

    for dupe in dupes_list:
        vertices = dupe[0]
        if vertices[0] not in covered_vertex and vertices[1] not in covered_vertex:
            clusters.append(set(vertices))
            covered_vertex.update(vertices)

    return clusters
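For anyone hitting the same problem, here's a small self-contained sanity check of the single-set version (the record IDs and scores below are made up for illustration), showing that a mirrored pair like (2, 1) no longer produces a repeated cluster:

```python
# Hypothetical scored pairs in the ((id_a, id_b), score) shape that the
# clustering code passes around. (1, 2) and (2, 1) describe the same
# link, just with the datasets swapped.
dupes = [((1, 2), 0.9), ((2, 1), 0.9), ((3, 4), 0.7), ((4, 5), 0.4)]

def greedyMatching(dupes, threshold=0.5):
    covered_vertex = set()
    clusters = []
    sorted_dupes = sorted(dupes, key=lambda dupe: dupe[1], reverse=True)
    dupes_list = [dupe for dupe in sorted_dupes if dupe[1] >= threshold]
    for dupe in dupes_list:
        vertices = dupe[0]
        if vertices[0] not in covered_vertex and vertices[1] not in covered_vertex:
            clusters.append(set(vertices))
            covered_vertex.update(vertices)
    return clusters

print(greedyMatching(dupes))
# [{1, 2}, {3, 4}] -- the mirrored (2, 1) pair is skipped because both
# of its vertices are already covered, and (4, 5) falls below threshold
```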

Forest Gregg

Dec 1, 2013, 7:52:54 PM
to open-source-...@googlegroups.com
On Wed, Nov 27, 2013 at 7:49 PM, Jordan T Bates <jtb...@gmail.com> wrote:
> It looks like the greedyMatching function is presuming that the first
> vertex of each pair in dupes comes from dataset 0 and the second comes from
> the other dataset. This wasn't true for me, so the clusters it returned
> included repeats like (1,2) and (2,1).


In what way was this assumption not true?

You said that "I'd like to make sure that rows from one source are not
linked to other rows from the same source." This would seem to imply
that you only want to consider possible links where one record was
from source A and the other record was from source B.

If you make your code available on github I can take a look.

Best,

Forest

2231 N. Monticello Ave
Chicago, IL 60647