Cascading to help dedupe data

ANithian

unread,

Feb 6, 2011, 12:16:27 AM2/6/11

to cascading-user

Hi all,

I have a dataset (say about 100K rows) that I wish to perform some
deduplication on and I may be thinking about this incorrectly but I
wish to understand how to use cascading/M-R to help with this task.
Here are my initial thoughts:
1) Use an aspect of the data to partition knowing that duplicates
won't happen across the partitions.
2) Assign a partition_id to each data element using an operation
3) Self Join the pipe on the partition_id so that you have effectively
a cartesian product of data per partition_id
4) Write a custom operation that operates on a tuple containing two
data elements in one "row" and either mark as duplicate, assign some
score etc
[partition_id, <LHS Data Elements Fields 1-N>, <RHS Data elements
Fields 1-N>]
5) Filter on pipe and emit necessary fields.

Problem is that I have long running reduce tasks and I want to
understand if this general approach is correct or not. Does each
reducer get a unique partition_id group worth of data?

Any insights would be helpful.

Thanks!
Amit

Chris K Wensel

unread,

Feb 6, 2011, 12:26:12 AM2/6/11

to cascadi...@googlegroups.com

The short answer is to use the Unique PipeAssembly.

http://www.cascading.org/1.2/javadoc/cascading/pipe/assembly/Unique.html

The long answer depends on the data and what defines a duplicate row.

that said, 100k rows really isn't worth using Hadoop or Cascading for.

cat file.txt | sort | uniq > result-file.txt

would be best i think.

ckw

> --
> You received this message because you are subscribed to the Google Groups "cascading-user" group.
> To post to this group, send email to cascadi...@googlegroups.com.
> To unsubscribe from this group, send email to cascading-use...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/cascading-user?hl=en.
>

--
Chris K Wensel
ch...@concurrentinc.com
http://www.concurrentinc.com

-- Concurrent, Inc. offers mentoring, support, and licensing for Cascading

ANithian

unread,

Feb 6, 2011, 12:59:51 AM2/6/11

to cascading-user

Chris,

I guess I wasn't framing the problem properly. When I say dedupe, I
don't mean literal duplicates but similar enough to be considered
duplicates (i.e. two pieces of user generated content are unique
entries in our data set but say differ by 1 edit distance).

I also want to use such a similar data flow to help compute similarity
scores between these two pieces of data (i.e. find out how similar X
is relative to everything else in my dataset).

100K isn't a lot you're right.. maybe I shoulda said 10M? :-)

Thanks!
Amit

> > For more options, visit this group athttp://groups.google.com/group/cascading-user?hl=en.
>
> --
> Chris K Wensel
> ch...@concurrentinc.comhttp://www.concurrentinc.com

Chris K Wensel

unread,

Feb 6, 2011, 1:37:10 AM2/6/11

to cascadi...@googlegroups.com

still small, but large if you were considering a cross join.

I recommend looking at
http://asterix.ics.uci.edu/fuzzyjoin-mapreduce/

I've implemented this in Cascading before, haven't tried their code.

ckw

> For more options, visit this group at http://groups.google.com/group/cascading-user?hl=en.

ANithian

unread,

Feb 6, 2011, 1:50:40 AM2/6/11

to cascading-user

yes it's the cross-join of the 10M that causing me problems hence why
I am trying to partition the dataset on some defined boundaries.

Regarding the co-grouping/self join on a partition_id, does each
unique partition_id's group of tuples get sent to a different reducer?
I remember a conversation about the cartesian product where using an
inserted join key would cause one reducer to process the entire joined
dataset which is what I am trying to avoid.

Any code you have regarding this fuzzy join implementation in
cascading you can share?

Thanks
Amit

On Feb 5, 10:37 pm, Chris K Wensel <ch...@wensel.net> wrote:
> still small, but large if you were considering a cross join.
>

> I recommend looking athttp://asterix.ics.uci.edu/fuzzyjoin-mapreduce/

Chris K Wensel

unread,

Feb 6, 2011, 6:43:29 PM2/6/11

to cascadi...@googlegroups.com

> Any code you have regarding this fuzzy join implementation in
> cascading you can share?

No sorry, I don't have any code I can share at the time.

ckw

--
Chris K Wensel
ch...@concurrentinc.com

ANithian

unread,

Feb 7, 2011, 1:15:57 AM2/7/11

to cascading-user

Thanks.. sorry to pester on this point but what about my understanding
of this

"Regarding the co-grouping/self join on a partition_id, does each
unique partition_id's group of tuples get sent to a different
reducer?
I remember a conversation about the cartesian product where using an
inserted join key would cause one reducer to process the entire
joined
dataset which is what I am trying to avoid. "

Thanks
Amit

On Feb 6, 3:43 pm, Chris K Wensel <ch...@wensel.net> wrote:
> > Any code you have regarding this fuzzy join implementation in
> > cascading you can share?
>
> No sorry, I don't have any code I can share at the time.
>
> ckw
>
> --
> Chris K Wensel

> ch...@concurrentinc.comhttp://www.concurrentinc.com

Ken Krugler

unread,

Feb 7, 2011, 10:02:31 AM2/7/11

to cascadi...@googlegroups.com

On Feb 6, 2011, at 10:15pm, ANithian wrote:

> Thanks.. sorry to pester on this point but what about my understanding
> of this
>
> "Regarding the co-grouping/self join on a partition_id, does each
> unique partition_id's group of tuples get sent to a different
> reducer?

if this "partition_id" is an integer, then its hash code is its value,
and that's what Hadoop uses (by default) for partitioning.

So if you have N unique partition_id values, you'll get N reduce tasks
that get executed by your reduce function, using (in parallel) the
reducers your cluster supports.

Which reducer executes a given reduce task is something that's not
under your control.

If you're asking whether each group of tuples for a given partition_id
value is processed by a separate reduce task, then that is true.

-- Ken

> I remember a conversation about the cartesian product where using an
> inserted join key would cause one reducer to process the entire
> joined
> dataset which is what I am trying to avoid. "
>
> Thanks
> Amit
>
> On Feb 6, 3:43 pm, Chris K Wensel <ch...@wensel.net> wrote:
>>> Any code you have regarding this fuzzy join implementation in
>>> cascading you can share?
>>
>> No sorry, I don't have any code I can share at the time.
>>
>> ckw
>>
>> --
>> Chris K Wensel
>> ch...@concurrentinc.comhttp://www.concurrentinc.com
>>
>> -- Concurrent, Inc. offers mentoring, support, and licensing for
>> Cascading
>

> --
> You received this message because you are subscribed to the Google
> Groups "cascading-user" group.
> To post to this group, send email to cascadi...@googlegroups.com.
> To unsubscribe from this group, send email to cascading-use...@googlegroups.com
> .
> For more options, visit this group at http://groups.google.com/group/cascading-user?hl=en
> .
>

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g

Reply all

Reply to author

Forward