Architecture and Working of Rhipe

Shekhar

Apr 19, 2011, 1:32:11 AM4/19/11
to Bangalore R Users - BRU
Hi,
In the last post, "Working of R on Hadoop" (http://groups.google.com/group/brumail/browse_thread/thread/a87d708ed060c182), we saw that to exploit the statistical and graphical capabilities of R on Hadoop we use the Rhipe package. That post covered the installation of Rhipe on all the nodes of the Hadoop cluster.

This post presents the architecture and working of the Rhipe package.

To understand the architecture of Rhipe and how it works, we will answer the following questions:

Q1. What is Rhipe?

Answer: Rhipe is an R package, with a Java component, that integrates the R environment with Hadoop, the open-source implementation of Google's MapReduce. Using Rhipe, it is possible to write MapReduce algorithms in R.

Q2. What does Rhipe do?

Answer: A MapReduce job on Hadoop is defined by two functions, Map and Reduce. With Rhipe, the developer writes R expressions that implement the map and reduce functionality. Rhipe encapsulates these expressions, together with other Hadoop-related parameters supplied by the user (such as the combiner and the number of map and reduce tasks), and submits the job to Hadoop.
The encapsulation is done by the Rhipe function "rhmr", and the job is submitted by the "rhex" function.
For more information on these functions, refer to the supporting documentation for Rhipe.
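As a concrete illustration, a word-count job might be set up along the following lines. This is only a sketch: argument names such as "inout", "ifolder", and "ofolder" are taken from the Rhipe documentation of that era and should be checked against your installed version.

```r
library(Rhipe)

# Map expression: Rhipe exposes the current batch of input values
# as the list map.values; rhcollect(key, value) emits a pair.
map <- expression({
  lapply(map.values, function(line) {
    for (w in strsplit(line, " +")[[1]]) rhcollect(w, 1)
  })
})

# Reduce expression: reduce.key holds the current key and
# reduce.values the values accumulated for it; here we sum counts.
reduce <- expression(
  pre    = { total <- 0 },
  reduce = { total <- total + sum(unlist(reduce.values)) },
  post   = { rhcollect(reduce.key, total) }
)

# rhmr() encapsulates the expressions and the Hadoop parameters ...
job <- rhmr(map = map, reduce = reduce,
            inout = c("text", "sequence"),
            ifolder = "/tmp/words-in", ofolder = "/tmp/words-out")

# ... and rhex() submits the job to the Hadoop cluster.
rhex(job)
```

Running this requires a working Hadoop cluster with Rhipe installed on every node, as described in the earlier post.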

Q3. What does Rhipe do behind the scenes?

Answer: When the R user submits a job to Hadoop using "rhex", Rhipe serializes the code and the data and launches the job. The serialization is done using Google's Protocol Buffers, which ship as part of the Rhipe installation.

Since part of Rhipe is written in Java, it acts as a bridge between Hadoop and R. During serialization, Rhipe converts the input data into Java types so that Hadoop can interpret it.

Hadoop sends the key-value pairs to Rhipe's Map function, which passes them on to the Rhipe C engine. The C engine then starts an instance of R and hands the data to the user-provided R map and reduce expressions.

The key-value pairs sent by Hadoop are converted into R lists, because the map and reduce expressions expect the data in the form of lists.

NOTE: Rhipe's Map function is internal to the Rhipe library. It should not be confused with the user-provided map and reduce expressions.
When data is returned from the map and reduce expressions, Rhipe converts it back into a form that Hadoop can understand.
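To make the list conversion concrete, inside the user's map expression the converted pairs are visible roughly as follows (a sketch; the exact representation depends on the Rhipe version and the input format):

```r
# Rhipe has already turned the incoming key-value pairs into two
# parallel R lists before the map expression runs:
#   map.keys   - list of keys   (e.g. byte offsets for text input)
#   map.values - list of values (e.g. the lines of text themselves)
map <- expression({
  mapply(function(k, v) {
    # whatever is emitted here is converted back by Rhipe
    # into Hadoop-readable types on the way out
    rhcollect(k, v)
  }, map.keys, map.values)
})
```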

Q4. How is Rhipe serialization different from R serialization?

Answer: Serialization means the byte representation of objects. R's built-in serialization is slow and consumes more space: a numeric vector, for example, requires 22 + 8n bytes. Moreover, data serialized by R is difficult to read from other languages.
Rhipe instead uses Google's Protocol Buffers for fast and compact serialization: the same numeric vector requires only 4 + 8n bytes, and the serialized data can be read from other languages.
The following table compares R serialization and Rhipe serialization in terms of the space occupied:
                 R serialization (bytes)   Rhipe serialization (bytes)
Numeric vector   22 + 8n                   4 + 8n
Logicals         22 + 4n                   2 + 2n
Integers         22 + 4n                   variable; smaller numbers occupy fewer bytes
Strings          depends on the length     depends on the length
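The R side of the table can be checked directly with base R's serialize(); the exact fixed overhead depends on the R version and serialization format, so the 22-byte constant should be read as approximate:

```r
# Size in bytes of R's native serialization of a numeric vector of
# length n: a fixed header plus 8 bytes per double. Raising n by one
# should grow the result by 8.
n <- 1000
x <- as.numeric(seq_len(n))
length(serialize(x, connection = NULL))
# compare the result against the 22 + 8n figure quoted above
```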
