CoGroup vs. HashJoin


Xinkun Nie

Jun 1, 2013, 11:49:08 AM
to cascadi...@googlegroups.com
I'm new to Cascading, and I'm confused when to use CoGroup/HashJoin. What's the difference between the two? Can somebody give me an example?

Thanks!

Ken Krugler

Jun 1, 2013, 2:34:50 PM
to cascadi...@googlegroups.com
Hi there,

On Jun 1, 2013, at 8:49am, Xinkun Nie wrote:

I'm new to Cascading, and I'm confused when to use CoGroup/HashJoin. What's the difference between the two? Can somebody give me an example?

These are roughly identical, except that CoGroup is a reduce-side join, while HashJoin does a map-side join.

This means that when using HashJoin, the values from the right side pipe(s) should fit in memory.
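
To make the memory point concrete, here is a rough plain-Java sketch of the general hash-join technique. It is not Cascading code and not how HashJoin is actually implemented; the class name `HashJoinSketch`, the method `innerJoin`, and the two-element `{key, value}` rows are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of a map-side (hash) join. Hypothetical names, not Cascading API.
class HashJoinSketch {

    // Inner-joins rows of the form {key, value} on their key.
    static List<String> innerJoin(List<String[]> left, List<String[]> right) {
        // The whole right side is loaded into a map up front,
        // which is why it has to fit in memory.
        Map<String, List<String>> rightByKey = new HashMap<>();
        for (String[] row : right) {
            rightByKey.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }

        List<String> joined = new ArrayList<>();
        // The left side is only streamed, one row at a time: no sort, no grouping.
        for (String[] row : left) {
            List<String> matches = rightByKey.get(row[0]);
            if (matches == null) {
                continue; // inner join: left rows without a match are dropped
            }
            for (String rightValue : matches) {
                joined.add(row[0] + "\t" + row[1] + "\t" + rightValue);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String[]> left = Arrays.asList(
            new String[] {"a", "L1"}, new String[] {"b", "L2"}, new String[] {"c", "L3"});
        List<String[]> right = Arrays.asList(
            new String[] {"a", "R1"}, new String[] {"b", "R2"});
        for (String line : innerJoin(left, right)) {
            System.out.println(line);
        }
    }
}
```

In this toy input the left row with key `c` is dropped because the right side has no matching key. The left side can be arbitrarily large, since only the right side is held in the map.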


-- Ken

--------------------------
Ken Krugler
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr





Xinkun Nie

Jun 1, 2013, 6:40:01 PM
to cascadi...@googlegroups.com
Thanks for the reply! I was looking at this example specifically: https://github.com/Cascading/Impatient/blob/master/part5/src/main/java/impatient/Main.java

I don't get why we used a HashJoin first then CoGroup. Can we invert the two, or maybe just use two HashJoin or two CoGroup?

Thanks again!

Ken Krugler

Jun 1, 2013, 9:57:41 PM
to cascadi...@googlegroups.com
Hi there,

On Jun 1, 2013, at 3:40pm, Xinkun Nie wrote:

Thanks for the reply! I was looking at this example specifically: https://github.com/Cascading/Impatient/blob/master/part5/src/main/java/impatient/Main.java

I don't get why we used a HashJoin first then CoGroup. Can we invert the two, or maybe just use two HashJoin or two CoGroup?

HashJoin is going to be more performant than a CoGroup in almost every case, because it's a map-side operation. You don't need a reduce phase, so it can be "chained" into other map operations as part of a single job.

The use of HashJoin in the code you referenced is basically joining a single count (number of documents) with the document count for every unique term found.
    // join to bring together all the components for calculating TF-IDF
    // the D side of the join is smaller, so it goes on the RHS
    Pipe idfPipe = new HashJoin( dfPipe, lhs_join, dPipe, rhs_join );

So the single count (number of documents) goes on the right side, and is a great use case for HashJoin.

The subsequent CoGroup is used to join the term frequency for each document/term combination with the doc count (& total docs) for each term.
    // the IDF side of the join is smaller, so it goes on the RHS
    Pipe tfidfPipe = new CoGroup( tfPipe, tf_token, idfPipe, df_token );

Here neither of the two pipes is likely to be small and well bounded in size, so a CoGroup makes sense.
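
For contrast, a reduce-side join can be sketched in plain Java roughly as follows. Again this is only an illustration of the general technique, not Cascading's CoGroup implementation; `CoGroupSketch`, `groupByKey`, and the sample term/count strings are invented for the example. The grouping step stands in for the shuffle and sort between map and reduce, and this toy version holds both grouped sides in memory purely for simplicity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of a reduce-side join. Hypothetical names, not Cascading API.
class CoGroupSketch {

    // Groups rows of the form {key, value} by key, in sorted key order.
    // This stands in for the shuffle/sort that happens between map and reduce.
    static Map<String, List<String>> groupByKey(List<String[]> rows) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] row : rows) {
            grouped.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        return grouped;
    }

    // Inner join: both sides are grouped by key, then joined one key at a time.
    static List<String> innerJoin(List<String[]> left, List<String[]> right) {
        Map<String, List<String>> leftGroups = groupByKey(left);
        Map<String, List<String>> rightGroups = groupByKey(right);

        List<String> joined = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : leftGroups.entrySet()) {
            String key = entry.getKey();
            List<String> rightValues = rightGroups.get(key);
            if (rightValues == null) {
                continue; // key appears only on the left side
            }
            // The "reduce" step: pair up the values that share this key.
            for (String leftValue : entry.getValue()) {
                for (String rightValue : rightValues) {
                    joined.add(key + "\t" + leftValue + "\t" + rightValue);
                }
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // Made-up sample data: per-document term counts on the left, per-term counts on the right.
        List<String[]> tf = Arrays.asList(
            new String[] {"rain", "doc1:2"}, new String[] {"sun", "doc1:1"},
            new String[] {"rain", "doc2:1"});
        List<String[]> idf = Arrays.asList(
            new String[] {"sun", "df=1"}, new String[] {"rain", "df=2"});
        for (String line : innerJoin(tf, idf)) {
            System.out.println(line);
        }
    }
}
```

The design difference from a hash join is that both sides pay for the grouping step, but a real reduce-side join does not have to hold either side in memory as a whole, which is what makes it the safer choice when the sizes are not known to be small.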

You could maybe argue that the number of unique terms should be in the 10K - 100K range, and thus it could be a HashJoin. But in my experience that assumption usually winds up hurting you in the end, when your workflow is used to process data with "unexpected" characteristics. With enough data, every possible edge case you didn't consider winds up having a finite probability of occurring.

-- Ken


Paco Nathan

Jun 2, 2013, 1:13:33 AM
to cascadi...@googlegroups.com
Hi Xinkun,

Are you talking about the HashJoin on line 121 of this example? https://github.com/Cascading/Impatient/blob/master/part5/src/main/java/impatient/Main.java

By definition the size of DF is going to be larger than the size of D. Even so, it could be the case that D grows large. If so, then that HashJoin on line 121 should be replaced with a CoGroup.

In any case, it's going to be more efficient to perform the join of DF and D first, before introducing the TF. In other words, we're assuming that each document has more than one keyword, and that keywords often repeat within a document.

Paco

