Should I avoid joins to improve performance?

Alexander Kehayias

Feb 27, 2013, 11:07:21 AM

to cascal...@googlegroups.com

Right now I'm using data sets that mirror the tables they were dumped from, and the queries I'm making do a bunch of joins. For performance reasons, should I dump the data into a more denormalized format where commonly used data sets are already joined? How well do inner joins perform?

Andy Xue

Feb 28, 2013, 12:59:12 AM

to cascal...@googlegroups.com

Yeah, joins in MapReduce aren't terribly performant. Basically, rows from both sides are grouped on the join keys, and the reducer does a cross product of the two sides for each key. I don't see inner joins being much more efficient either: I don't think you can figure out which rows to throw out during the map stage, so the same number of rows would still go into the reducer (I might be wrong on this).
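
To make the reduce-side join described above concrete, here is a minimal in-memory sketch in plain Python. It is not Cascalog or Hadoop code; the function name `reduce_side_join` and the `(key, value)` row shape are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import product


def reduce_side_join(left, right):
    """Inner join of (key, value) rows, mimicking a reduce-side join."""
    # "Shuffle" step: every row from both sides is grouped by its join key,
    # whether or not the key has a match on the other side.
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)
    for key, value in right:
        groups[key][1].append(value)

    # "Reduce" step: cross product of the two sides within each key.
    # Keys present on only one side produce no output (inner join).
    joined = []
    for key, (left_values, right_values) in groups.items():
        for lv, rv in product(left_values, right_values):
            joined.append((key, lv, rv))
    return joined
```

Note that unmatched rows are only discarded in the reduce step, after they have already been grouped, which is the cost being pointed out here.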

However, if one side of the data is very small, you can try to implement a map-only join, where all the data from the smaller side is fed into each mapper and the join happens there. The smaller side basically has to fit in each mapper's memory.
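
A rough sketch of that map-only join idea, again in plain Python rather than Cascalog. The name `map_side_join` is hypothetical, and the sketch assumes the small side fits comfortably in memory.

```python
def map_side_join(large, small):
    """Inner join where the small side is held entirely in memory."""
    # Build an in-memory lookup from the small side. In a real job each
    # mapper would load its own copy of this table.
    lookup = {}
    for key, value in small:
        lookup.setdefault(key, []).append(value)

    # Stream the large side row by row. No grouping or reducer is needed,
    # and rows without a match are dropped immediately.
    for key, value in large:
        for match in lookup.get(key, ()):
            yield (key, value, match)
```

For example, `list(map_side_join([(1, "a"), (2, "b")], [(1, "x")]))` keeps only the row with key `1`.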

Bragil Massoud

Feb 28, 2013, 5:46:40 PM

to cascal...@googlegroups.com

Depending on how many keys from your left-hand side survive the join, the Bloom-filter-based join from https://github.com/LiveRamp/cascading_ext could help speed things up.
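
The general idea behind a Bloom filter join can be sketched as follows. This is plain Python and not the cascading_ext implementation or its API; `BloomFilter`, `bloom_prefilter`, and the default sizes are all assumptions made for illustration. A filter built from the keys of one side is used to drop rows from the other side before the real join runs.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive num_hashes positions by salting a SHA-256 hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def bloom_prefilter(large, small_keys, **filter_args):
    """Drop rows of `large` whose key is definitely not in `small_keys`."""
    bloom = BloomFilter(**filter_args)
    for key in small_keys:
        bloom.add(key)
    # A few non-matching rows can slip through (false positives); the real
    # join that follows still has to remove them.
    return [(key, value) for key, value in large if bloom.might_contain(key)]
```

The fewer left-hand keys survive the join, the more rows this pre-filter removes before the expensive step, so the benefit depends on the data.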

cheers,

simon

Alexander Kehayias

Mar 2, 2013, 5:30:12 PM

to cascal...@googlegroups.com

Looks promising. Any idea how I would set that up with Cascalog? It looks like a drop-in replacement, but it needs some extra configuration: https://github.com/LiveRamp/cascading_ext/blob/master/src/main/java/com/liveramp/cascading_ext/example/BloomJoinExampleWithoutCascadingUtil.java
