Set-Similarity Join on Multiple Attributes


SSS

May 29, 2012, 12:05:31 PM
to cascadi...@googlegroups.com
I saw a previous post regarding set-similarity joins for deduplicating data.  How would I implement this fuzzy join on multiple attributes?  I am trying to implement an entity resolution solution using Hadoop.  I could, for instance, use the set-similarity example for matching names by parsing the name into a set of q-grams and doing a set-similarity match on the q-gram sets.  But, how would I do this sort of join using multiple attributes?  I guess if I wanted to match on name, and say birth date, I could create one set of q-grams from the data in all of the attributes I'm matching, but I want to be able to report the similarity of each attribute.  Any ideas on how I could do a multi-attribute similarity join using Cascading?

Ken Krugler

May 30, 2012, 4:51:02 PM
to cascadi...@googlegroups.com
On May 29, 2012, at 9:05am, SSS wrote:

I saw a previous post regarding set-similarity joins for deduplicating data.  How would I implement this fuzzy join on multiple attributes?  I am trying to implement an entity resolution solution using Hadoop.  I could, for instance, use the set-similarity example for matching names by parsing the name into a set of q-grams and doing a set-similarity match on the q-gram sets.  But, how would I do this sort of join using multiple attributes?  I guess if I wanted to match on name, and say birth date, I could create one set of q-grams from the data in all of the attributes I'm matching, but I want to be able to report the similarity of each attribute.  Any ideas on how I could do a multi-attribute similarity join using Cascading?

Based on my reading of the paper, I assume you could generate a "join key" that included both types of prefix tokens - say user names and birth date information, along with a "join type" field.

Then in the GroupBy step you'd use both the join key and the join type fields, so that you're not mixing apples & oranges when applying the principle of "at least one token in the join key prefix must match, for two records to be similar".
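
To make the idea concrete, here is a minimal plain-Python sketch of that grouping (not Cascading code, and not taken from the paper's implementation). It assumes padded q-grams as tokens, Jaccard as the similarity measure, and the usual Jaccard prefix length of |x| - ceil(t * |x|) + 1 with tokens ordered rarest-first within each attribute. The names `qgrams`, `similarity_join`, `RECORDS`, and `THRESHOLDS`, and the threshold values, are hypothetical. Each prefix token is emitted under the key (join type, token), so name tokens are only ever grouped with name tokens; every candidate pair is then scored per attribute, so the similarity of each attribute can be reported separately.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import ceil


def qgrams(text, q=2):
    """Return the set of padded q-grams of a lower-cased string."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def jaccard(a, b):
    """Jaccard similarity of two token sets (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def prefix_length(size, threshold):
    """Jaccard prefix filter length: size - ceil(threshold * size) + 1.

    The small epsilon guards against floating-point round-up.
    """
    return size - ceil(threshold * size - 1e-9) + 1


def similarity_join(records, thresholds, q=2):
    """records: {id: {attr: value}}, thresholds: {attr: Jaccard threshold}.

    Returns {(id_a, id_b): {attr: similarity}} for every pair that shares
    at least one prefix token in at least one attribute.
    """
    token_sets = {
        rid: {attr: qgrams(rec[attr], q) for attr in thresholds}
        for rid, rec in records.items()
    }

    # Token frequencies are counted per attribute (the "join type").
    freq = Counter(
        (attr, tok)
        for sets in token_sets.values()
        for attr, toks in sets.items()
        for tok in toks
    )

    # "GroupBy" on (join type, join key): record ids sharing a prefix token.
    groups = defaultdict(set)
    for rid, sets in token_sets.items():
        for attr, toks in sets.items():
            ordered = sorted(toks, key=lambda t: (freq[(attr, t)], t))
            keep = prefix_length(len(ordered), thresholds[attr])
            for tok in ordered[:keep]:
                groups[(attr, tok)].add(rid)

    # Any two records in the same group form a candidate pair.
    candidates = set()
    for rids in groups.values():
        candidates.update(combinations(sorted(rids), 2))

    # Verification step: score each attribute separately.
    return {
        (a, b): {
            attr: jaccard(token_sets[a][attr], token_sets[b][attr])
            for attr in thresholds
        }
        for a, b in candidates
    }


# Hypothetical sample data and thresholds, for illustration only.
RECORDS = {
    1: {"name": "John Smith", "dob": "1980-01-15"},
    2: {"name": "Jon Smith", "dob": "1980-01-15"},
    3: {"name": "Mary Jones", "dob": "1975-06-30"},
}
THRESHOLDS = {"name": 0.6, "dob": 0.8}

if __name__ == "__main__":
    for pair, sims in sorted(similarity_join(RECORDS, THRESHOLDS).items()):
        print(pair, sims)
```

How the per-attribute scores are combined into a final match decision (all attributes above threshold, a weighted sum, and so on) is left open here; the sketch only reports them. In a Cascading flow the emit step would presumably be a function ahead of the GroupBy and the pairing and scoring would happen in a buffer after it, but that mapping is my assumption.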

-- Ken

--------------------------
Ken Krugler
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
