http://www.cascading.org/1.2/javadoc/cascading/pipe/assembly/Unique.html
The long answer depends on the data and what defines a duplicate row.
That said, 100k rows really aren't worth using Hadoop or Cascading for.
cat file.txt | sort | uniq > result-file.txt
would be best, I think.
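
If the data is already in a JVM program, the same small-data approach can be sketched in plain Java. This is only an illustration of how "what defines a duplicate row" changes the result; the class name SmallDedup, the sample rows, and the "first CSV column" rule are all made up for the example and are not from Cascading.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class SmallDedup {
    // Keeps the first row seen for each duplicate key, preserving input order.
    // keyOf decides what counts as "the same row" (an assumption of this sketch).
    static List<String> dedupe(List<String> rows, Function<String, String> keyOf) {
        Map<String, String> firstByKey = new LinkedHashMap<>();
        for (String row : rows) {
            firstByKey.putIfAbsent(keyOf.apply(row), row);
        }
        return new ArrayList<>(firstByKey.values());
    }

    public static void main(String[] args) {
        // Hypothetical sample rows, for illustration only.
        List<String> rows = Arrays.asList("1,alice", "2,bob", "1,alice", "1,ALICE");

        // Whole-row equality, like sort | uniq but without the sorting.
        System.out.println(dedupe(rows, Function.identity()));

        // Hypothetical rule: rows are duplicates when the first CSV column matches.
        System.out.println(dedupe(rows, row -> row.split(",", 2)[0]));
    }
}
```

With whole-row equality, "1,ALICE" survives as a distinct row; with the first-column rule it is dropped as a duplicate of "1,alice".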
ckw
> --
> You received this message because you are subscribed to the Google Groups "cascading-user" group.
> To post to this group, send email to cascadi...@googlegroups.com.
> To unsubscribe from this group, send email to cascading-use...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/cascading-user?hl=en.
>
--
Chris K Wensel
ch...@concurrentinc.com
http://www.concurrentinc.com
-- Concurrent, Inc. offers mentoring, support, and licensing for Cascading
I recommend looking at
http://asterix.ics.uci.edu/fuzzyjoin-mapreduce/
I've implemented this in Cascading before, but I haven't tried their code.
ckw
No, sorry, I don't have any code I can share at this time.
ckw
> Thanks.. sorry to pester on this point but what about my understanding
> of this
>
> "Regarding the co-grouping/self join on a partition_id, does each
> unique partition_id's group of tuples get sent to a different
> reducer?
If this "partition_id" is an integer, then its hash code is its value,
and that's what Hadoop uses (by default) for partitioning: the hash
code modulo the number of reduce tasks picks the reducer.
So if you have N unique partition_id values, you'll get N groups, each
handled by its own call to your reduce function, spread (in parallel)
across the reducers your cluster supports.
Which reducer handles a given group is decided by the partitioner, so
it's not something you normally control.
If you're asking whether each group of tuples for a given partition_id
value is processed by a separate reduce call, then that is true.
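
As a rough illustration of the partitioning described above, here is a standalone Java sketch that applies the same formula as Hadoop's default HashPartitioner to integer keys. The class name PartitionSketch, the choice of 4 reduce tasks, and the sample ids are assumptions made for the example. It also ignores that Cascading wraps grouping values in a tuple, which may hash differently from a bare Integer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PartitionSketch {
    // Same formula as Hadoop's default HashPartitioner: mask off the sign bit,
    // then take the hash code modulo the number of reduce tasks.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Groups keys by the reduce task that would receive them.
    static Map<Integer, List<Integer>> assign(List<Integer> keys, int numReduceTasks) {
        Map<Integer, List<Integer>> byTask = new TreeMap<>();
        for (Integer key : keys) {
            byTask.computeIfAbsent(partitionFor(key, numReduceTasks), t -> new ArrayList<>())
                  .add(key);
        }
        return byTask;
    }

    public static void main(String[] args) {
        int numReduceTasks = 4; // assumed job setting, not taken from the thread

        // Ten distinct partition_id values spread over the four reduce tasks.
        List<Integer> partitionIds = new ArrayList<>();
        for (int id = 0; id < 10; id++) {
            partitionIds.add(id);
        }
        System.out.println(assign(partitionIds, numReduceTasks));

        // A single inserted join key: every tuple shares one key, so one
        // reduce task receives the whole dataset.
        List<Integer> constantKey = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            constantKey.add(1);
        }
        System.out.println(assign(constantKey, numReduceTasks).keySet());
    }
}
```

Several partition_id values can share a reduce task, but each value still forms its own group. Only the single-key case funnels everything into one reducer.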
-- Ken
> I remember a conversation about the cartesian product where using an
> inserted join key would cause one reducer to process the entire
> joined
> dataset which is what I am trying to avoid. "
>
> Thanks
> Amit
>
> On Feb 6, 3:43 pm, Chris K Wensel <ch...@wensel.net> wrote:
>>> Any code you have regarding this fuzzy join implementation in
>>> cascading you can share?
>>
>> No sorry, I don't have any code I can share at the time.
>>
>> ckw
--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g