The join process is explained in the code comments. For your convenience, I am pasting the comment from the code here:
----------------------
/**
* Executes Map/Reduce job to handle join (inner or outer join) between two
* tables/subqueries.
* <p>
 * It uses a semi-join approach to optimize the join. The algorithm for an
 * inner join is as follows: first, the smaller table is reduced using the
 * columns participating in the join condition as keys (for the MR job). A
 * Bloom filter over the join columns is also constructed, which is later
 * used to filter out rows of the bigger table during the map phase. The
 * bigger table is then reduced using the join columns as keys. When rows of
 * the bigger table reach the reducers, rows of the smaller table are read
 * back from HDFS and the join is performed.
*<p>
 * This algorithm relies on a basic property of the Map-Reduce paradigm: all
 * values with the same key go to the same reducer. Since the join columns
 * are used as keys, rows (of both tables) having the same values in the
 * join columns end up at the same reducer, where the join can easily be
 * performed. Although reading rows of the smaller table from HDFS to
 * perform the join may seem inefficient, it is not, for two reasons.
 * First, Hadoop tries to store a copy of the data on the node that
 * produced it, so a reducer will usually read a local copy of the smaller
 * table's rows. Second, a Map-Reduce job sorts its keys, so the rows of
 * both the smaller and the bigger table are sorted on the join columns.
 * This makes it easy to stream through the rows and perform the join.
* <p>
 * In the case of a left outer join, the left table is always reduced first;
 * in the case of a right outer join, the right table is reduced first.
*
 * @see <a href="http://en.wikipedia.org/wiki/Relational_algebra#Semijoin">Semijoin</a>
 * @see <a href="http://en.wikipedia.org/wiki/Bloom_filter">Bloom filter</a>
*/
----------------------------------------------------
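To make the two phases more concrete, here is a minimal sketch of the map phase for the bigger table. This is not the actual code from the project: it assumes Hadoop's built-in org.apache.hadoop.util.bloom.BloomFilter, tab-separated rows with the join column first, and a hypothetical configuration property (join.bloom.filter.path) pointing at the filter serialized by the job that reduced the smaller table.
----------------------
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;

public class BigTableFilterMapper
        extends Mapper<LongWritable, Text, Text, Text> {

    private final BloomFilter filter = new BloomFilter();

    @Override
    protected void setup(Context context) throws IOException {
        // The filter was written to HDFS by the job that reduced the smaller
        // table; the configuration property below is a placeholder.
        Path filterPath = new Path(
                context.getConfiguration().get("join.bloom.filter.path"));
        FileSystem fs = FileSystem.get(context.getConfiguration());
        try (FSDataInputStream in = fs.open(filterPath)) {
            filter.readFields(in); // BloomFilter implements Writable
        }
    }

    @Override
    protected void map(LongWritable offset, Text row, Context context)
            throws IOException, InterruptedException {
        // Assumption: tab-separated rows with the join column first.
        String[] fields = row.toString().split("\t", 2);
        Key joinKey = new Key(fields[0].getBytes(StandardCharsets.UTF_8));

        // A Bloom filter can give false positives but never false negatives,
        // so dropping non-members here cannot lose a matching row.
        if (filter.membershipTest(joinKey)) {
            context.write(new Text(fields[0]), row);
        }
    }
}
----------------------
And a sketch of the reduce side of the bigger table's job, again with hypothetical names and paths. It assumes both jobs ran with the same partitioner and the same number of reducers, so that partition N of the smaller table's sorted output lines up with reducer N here.
----------------------
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SortedMergeJoinReducer extends Reducer<Text, Text, Text, Text> {

    private BufferedReader smallTable; // this reducer's slice of the smaller table
    private String smallRow;           // current "key<TAB>rest" line, null at EOF

    @Override
    protected void setup(Context context) throws IOException {
        // Assumption: both jobs used the same partitioner and reducer count,
        // so partition N of the first job's output matches reducer N here.
        // The path scheme is a placeholder.
        int partition = context.getTaskAttemptID().getTaskID().getId();
        Path part = new Path(String.format(
                "/tmp/small_table_reduced/part-r-%05d", partition));
        FileSystem fs = FileSystem.get(context.getConfiguration());
        smallTable = new BufferedReader(new InputStreamReader(fs.open(part)));
        smallRow = smallTable.readLine();
    }

    @Override
    protected void reduce(Text key, Iterable<Text> bigRows, Context context)
            throws IOException, InterruptedException {
        String k = key.toString();

        // Both inputs are sorted on the join key, so we only move forward.
        // Smaller-table keys sorting before k have no match (inner join).
        while (smallRow != null && smallRow.split("\t", 2)[0].compareTo(k) < 0) {
            smallRow = smallTable.readLine();
        }

        // Collect the smaller table's rows for this key, then cross them
        // with the bigger table's rows.
        List<String> matches = new ArrayList<>();
        while (smallRow != null && smallRow.split("\t", 2)[0].equals(k)) {
            matches.add(smallRow);
            smallRow = smallTable.readLine();
        }
        for (Text bigRow : bigRows) {
            for (String match : matches) {
                context.write(key, new Text(bigRow + "\t" + match));
            }
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        if (smallTable != null) {
            smallTable.close();
        }
    }
}
----------------------
Because both inputs arrive sorted on the join key, the reducer only ever moves forward through the smaller table's file, which is the merge-join behavior the comment describes.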