Set-Similarity Join on Multiple Attributes

SSS

May 29, 2012, 12:05:31 PM

to cascadi...@googlegroups.com

I saw a previous post regarding set-similarity joins for deduplicating data. How would I implement this fuzzy join on multiple attributes? I am trying to implement an entity resolution solution using Hadoop. I could, for instance, use the set-similarity example for matching names by parsing the name into a set of q-grams and doing a set-similarity match on the q-gram sets. But, how would I do this sort of join using multiple attributes? I guess if I wanted to match on name, and say birth date, I could create one set of q-grams from the data in all of the attributes I'm matching, but I want to be able to report the similarity of each attribute. Any ideas on how I could do a multi-attribute similarity join using Cascading?
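To make the per-attribute requirement concrete, here is a minimal plain-Python sketch (not Cascading code) that builds a separate q-gram set for each attribute and reports one Jaccard score per attribute. The function names, the `#` padding character, and the choice of Jaccard as the similarity measure are all assumptions made for illustration.

```python
def qgrams(text, q=2):
    """Return the set of character q-grams of text, padded with '#'."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def attribute_similarities(rec1, rec2, attributes, q=2):
    """Score each attribute separately instead of pooling all q-grams."""
    return {
        attr: jaccard(qgrams(rec1[attr], q), qgrams(rec2[attr], q))
        for attr in attributes
    }


# Hypothetical records, used only to show the shape of the result.
a = {"name": "john smith", "dob": "1980-01-02"}
b = {"name": "jon smith", "dob": "1980-01-02"}
scores = attribute_similarities(a, b, ["name", "dob"])
```

Keeping one token set per attribute is what allows a separate score to be reported for name and for birth date.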

Ken Krugler

May 30, 2012, 4:51:02 PM

to cascadi...@googlegroups.com

On May 29, 2012, at 9:05am, SSS wrote:

> I saw a previous post regarding set-similarity joins for deduplicating data. How would I implement this fuzzy join on multiple attributes? I am trying to implement an entity resolution solution using Hadoop. I could, for instance, use the set-similarity example for matching names by parsing the name into a set of q-grams and doing a set-similarity match on the q-gram sets. But, how would I do this sort of join using multiple attributes? I guess if I wanted to match on name, and say birth date, I could create one set of q-grams from the data in all of the attributes I'm matching, but I want to be able to report the similarity of each attribute. Any ideas on how I could do a multi-attribute similarity join using Cascading?

Based on my reading of the paper, I assume you could generate a "join key" that included both types of prefix tokens - say user names and birth date information, along with a "join type" field.

Then in the GroupBy step you'd use both the join key and the join type fields, so that you're not mixing apples & oranges when applying the principle of "at least one token in the join key prefix must match, for two records to be similar".
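A rough sketch of that grouping idea, in plain Python rather than the Cascading API: each record emits (join type, prefix token) pairs, records are grouped on both fields, and any two records that land in the same group become a candidate pair for that attribute. The helper names, the alphabetical token order, and the Jaccard prefix length `len - ceil(t * len) + 1` are assumptions for illustration; a real job would typically order tokens by global frequency.

```python
import math
from collections import defaultdict
from itertools import combinations


def qgrams(text, q=2):
    """Return the set of character q-grams of text, padded with '#'."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def prefix_tokens(tokens, threshold):
    """Tokens in the prefix for a Jaccard threshold, under a fixed order."""
    ordered = sorted(tokens)  # assumption: alphabetical order for simplicity
    keep = len(ordered) - math.ceil(threshold * len(ordered)) + 1
    return ordered[:keep]


def candidate_pairs(records, attributes, threshold, q=2):
    """Group on (join type, join key) and collect candidate pairs."""
    groups = defaultdict(set)
    for rec_id, rec in records.items():
        for attr in attributes:  # attr plays the role of the "join type"
            for token in prefix_tokens(qgrams(rec[attr], q), threshold):
                groups[(attr, token)].add(rec_id)

    pairs = defaultdict(set)  # (id1, id2) -> attributes that produced it
    for (attr, _token), ids in groups.items():
        for id1, id2 in combinations(sorted(ids), 2):
            pairs[(id1, id2)].add(attr)
    return dict(pairs)


# Hypothetical records, used only to show the shape of the result.
records = {
    1: {"name": "john smith", "dob": "1980-01-02"},
    2: {"name": "jon smith", "dob": "1980-01-02"},
}
pairs = candidate_pairs(records, ["name", "dob"], threshold=0.6)
```

Because the attribute name is part of the group key, a name token never matches a birth-date token. The output is only a candidate list: each pair still needs its full per-attribute similarity computed before it is accepted or rejected.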

-- Ken

--------------------------

Ken Krugler

http://www.scaleunlimited.com

custom big data solutions & training

Hadoop, Cascading, Mahout & Solr
