The join process is explained in the code comments. For your convenience, I am pasting the comment from the code here:
----------------------
/**
* Executes Map/Reduce job to handle join (inner or outer join) between two
* tables/subqueries.
* <p>
 * It uses a semi-join approach to optimize the join. The algorithm for an
 * inner join is as follows: first, the smaller table is reduced using the
 * columns participating in the join condition as keys (for the MR job). A
 * Bloom filter over the join columns is also constructed, which is later
 * used to filter out rows of the bigger table during the map phase. The
 * bigger table is then reduced using the join columns as keys. When rows of
 * the bigger table reach the reducers, rows of the smaller table are read
 * back from HDFS and the join is performed.
*<p>
 * This algorithm relies on a basic property of the Map-Reduce paradigm: all
 * values with the same key go to the same reducer. Since the join columns
 * are used as keys, rows (of both tables) having the same values in the
 * join columns end up at the same reducer, where the join can easily be
 * performed. Although reading rows of the smaller table from HDFS to
 * perform the join may seem inefficient, it is not, for two reasons.
 * First, Hadoop tries to store a copy of the data on the node that
 * produced it, so a reducer will usually read a local copy of the smaller
 * table's rows. Second, a Map-Reduce job sorts its keys, so the rows of
 * both the smaller and the bigger table are sorted on the join columns.
 * This makes it easy to stream through the rows and perform the join.
* <p>
 * In the case of a left outer join, the left table is always reduced first;
 * in the case of a right outer join, the right table is reduced first.
*
 * @see <a href="http://en.wikipedia.org/wiki/Relational_algebra#Semijoin">Semijoin</a>
 * @see <a href="http://en.wikipedia.org/wiki/Bloom_filter">Bloom filter</a>
*/
----------------------------------------------------
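To make the two phases more concrete, here is a minimal sketch of the map phase for the bigger table. This is not the actual code from the project: it assumes Hadoop's built-in org.apache.hadoop.util.bloom.BloomFilter, tab-separated rows with the join column first, and a hypothetical configuration property (join.bloom.filter.path) pointing at the filter serialized by the job that reduced the smaller table.
----------------------
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;

public class BigTableFilterMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    private final BloomFilter filter = new BloomFilter();

    @Override
    protected void setup(Context context) throws IOException {
        // The filter was written to HDFS by the job that reduced the smaller
        // table; the configuration property below is a placeholder.
        Path filterPath = new Path(
                context.getConfiguration().get("join.bloom.filter.path"));
        FileSystem fs = FileSystem.get(context.getConfiguration());
        try (FSDataInputStream in = fs.open(filterPath)) {
            filter.readFields(in); // BloomFilter implements Writable
        }
    }

    @Override
    protected void map(LongWritable offset, Text row, Context context)
            throws IOException, InterruptedException {
        // Assumption: tab-separated rows with the join column first.
        String[] fields = row.toString().split("\t", 2);
        Key joinKey = new Key(fields[0].getBytes(StandardCharsets.UTF_8));

        // A Bloom filter can give false positives but never false negatives,
        // so dropping non-members here cannot lose a matching row.
        if (filter.membershipTest(joinKey)) {
            context.write(new Text(fields[0]), row);
        }
    }
}
----------------------
And a sketch of the reduce side of the bigger table's job, again with hypothetical names and paths. It assumes both jobs ran with the same partitioner and the same number of reducers, so that partition N of the smaller table's sorted output lines up with reducer N here.
----------------------
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SortedMergeJoinReducer extends Reducer<Text, Text, Text, Text> {

    private BufferedReader smallTable; // this reducer's slice of the smaller table
    private String smallRow;           // current "key<TAB>rest" line, null at EOF

    @Override
    protected void setup(Context context) throws IOException {
        // Assumption: both jobs used the same partitioner and reducer count,
        // so partition N of the first job's output matches reducer N here.
        // The path scheme is a placeholder.
        int partition = context.getTaskAttemptID().getTaskID().getId();
        Path part = new Path(String.format(
                "/tmp/small_table_reduced/part-r-%05d", partition));
        FileSystem fs = FileSystem.get(context.getConfiguration());
        smallTable = new BufferedReader(new InputStreamReader(fs.open(part)));
        smallRow = smallTable.readLine();
    }

    @Override
    protected void reduce(Text key, Iterable<Text> bigRows, Context context)
            throws IOException, InterruptedException {
        String k = key.toString();

        // Both inputs are sorted on the join key, so we only move forward.
        // Smaller-table keys sorting before k have no match (inner join).
        while (smallRow != null && smallRow.split("\t", 2)[0].compareTo(k) < 0) {
            smallRow = smallTable.readLine();
        }

        // Collect the smaller table's rows for this key, then cross them
        // with the bigger table's rows.
        List<String> matches = new ArrayList<>();
        while (smallRow != null && smallRow.split("\t", 2)[0].equals(k)) {
            matches.add(smallRow);
            smallRow = smallTable.readLine();
        }
        for (Text bigRow : bigRows) {
            for (String match : matches) {
                context.write(key, new Text(bigRow + "\t" + match));
            }
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        if (smallTable != null) {
            smallTable.close();
        }
    }
}
----------------------
Because both inputs arrive sorted on the join key, the reducer only ever moves forward through the smaller table's file, which is the merge-join behavior the comment describes.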