Hi,
As we are all aware, industries now face a big data problem: data
volumes are growing tremendously (to petabyte scale), and they are
working hard to cope with it.
Because existing standalone systems fail to scale, they are moving
toward cluster-based solutions such as Hadoop.
This post is aimed at beginners who want to leverage the statistical
and graphical capabilities of R in a distributed environment such as
Hadoop.
Before going through the step-by-step installation, here are the
system configuration details.
----------------------------------------
System Configuration | Version
----------------------------------------
OS                   | CentOS 5.5
Hadoop               | 0.20.2
Java                 | 1.6.0_22
R                    | 2.12.0
Rhipe                | 0.63
protobuf.pc          | 2.3.0
----------------------------------------
I have a two-node cluster; each machine has 1 GB of RAM and a
2.66 GHz processor.
1. Installing R.
R needs to be installed on every node where Hadoop is running.
(A) Run the following command to install R:
yum install R
(B) The installation might fail because the key file RPM-GPG-KEY-EPEL
is missing. Download the key and place it under /etc/pki/rpm-gpg
(C) Check that R works by typing "R" in a terminal window; it should
start the R console.
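Since R has to be present on every node, the check in step (C) can also be done without opening an interactive session. The following is a minimal sketch, not part of the original steps; have_cmd is a hypothetical helper name, and the snippet would need to be run on each node in turn.

```shell
# have_cmd is a hypothetical helper: it returns 0 if the named command
# is on PATH and non-zero otherwise.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd R; then
    # "R --version" prints the version banner and exits
    # without starting an interactive session.
    R --version | head -n 1
else
    echo "R not found on PATH; install it with: yum install R"
fi
```

The first line printed should mention the R version installed on that node (2.12.0 in the setup described here).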
2. Installing Rhipe
Pre-Requisites
The following packages are required before installing the RHIPE
package on Hadoop:
I. R
II. protobuf.pc (version 2.3.0)
This is Google's Protocol Buffers, which RHIPE uses to serialize R
objects. A benefit is that data produced by RHIPE can be read from
languages such as Python, C and Java, although RHIPE cannot serialize
every R object.
Installing protobuf.pc
I. Download protobuf.pc (protobuf-2.3.0.tar.gz) from
http://code.google.com/p/protobuf/downloads/list .
II. After unpacking the archive, execute the following commands in
order
A. sh configure
B. make
C. make install
Then create the environment variable
PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
III. Check for the proper installation of protobuf.pc
pkg-config --modversion protobuf
The output should be 2.3.0
IV. Check for the proper installation of the protobuf libraries.
pkg-config --libs protobuf
The output should be -pthread -L/usr/local/lib -lprotobuf -lz
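The PKG_CONFIG_PATH setting and the version check above can be combined into a small script. This is only a sketch under the assumptions of this post (protobuf installed under /usr/local); check_protobuf is a hypothetical helper name and is not part of protobuf or RHIPE.

```shell
# Make pkg-config search the directory where "make install" placed
# protobuf.pc, keeping any value that was already set.
export PKG_CONFIG_PATH="/usr/local/lib/pkgconfig${PKG_CONFIG_PATH:+:$PKG_CONFIG_PATH}"

# check_protobuf (hypothetical helper) compares the version reported
# by pkg-config with the expected version passed as $1.
check_protobuf() {
    expected=$1
    found=$(pkg-config --modversion protobuf 2>/dev/null) || {
        echo "protobuf.pc not found via pkg-config"
        return 1
    }
    if [ "$found" = "$expected" ]; then
        echo "protobuf $found OK"
    else
        echo "expected protobuf $expected, found $found"
        return 1
    fi
}

check_protobuf 2.3.0 || echo "check PKG_CONFIG_PATH and re-run make install"
```

Putting the export line into .bashrc keeps the setting for later shells, which matters because the RHIPE build also looks up protobuf through pkg-config.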
Installing RHIPE
I. Download RHIPE from
http://www.stat.purdue.edu/~sguha/rhipe/
and place the contents in a directory named Rhipe, for example
/home/user/Rhipe
II. Create the following environment variables in .bashrc
HADOOP=<location of the Hadoop installation>
HADOOP_LIB=$HADOOP/lib
HADOOP_CONF_DIR=$HADOOP/conf
III. Execute the following command from the directory
/home/user/Rhipe.
R CMD INSTALL Rhipe
As a check, open the R console and load the library with the
command library(Rhipe); it should not throw an error.
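The environment variables from step II could look like this in .bashrc. This is a sketch only: /usr/local/hadoop is a hypothetical location used for illustration and must be replaced with the directory where Hadoop is actually installed. The variables are exported so that the R process started from the shell can see them.

```shell
# Lines for ~/.bashrc.
# /usr/local/hadoop is a hypothetical path; point HADOOP at the real
# Hadoop installation directory.
export HADOOP="${HADOOP:-/usr/local/hadoop}"
export HADOOP_LIB="$HADOOP/lib"
export HADOOP_CONF_DIR="$HADOOP/conf"

# Warn early if a directory is missing, before running R CMD INSTALL.
for dir in "$HADOOP" "$HADOOP_LIB" "$HADOOP_CONF_DIR"; do
    [ -d "$dir" ] || echo "warning: $dir does not exist"
done
```

After editing .bashrc, open a new shell (or run "source ~/.bashrc") so the variables are set before the install command is run.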
The installation will fail if the libraries are not properly linked.
The error is as shown below:
Error in dyn.load(file, DLLpath = DLLpath, ...) :
unable to load shared object '/usr/lib64/R/library/Rhipe/libs/Rhipe.so':
libprotobuf.so.6: cannot open shared object file: No such file or directory
ERROR: loading failed
* removing '/usr/lib64/R/library/Rhipe'
To let the dynamic linker find the protobuf libraries, create the
file Protobuf-x86.conf under /etc/ld.so.conf.d/ and add the line
/usr/local/lib to the file.
Sometimes, even after creating Protobuf-x86.conf, the library
libprotobuf.so is still not found. This happens because the linker
cache ld.so.cache is stale. To rebuild the cache, execute the command
ldconfig.
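The linker fix described above can be sketched as a small shell helper. register_lib_dir is a hypothetical name chosen for illustration; it only adds the directory to the given file if the line is not already there, so it is safe to run more than once. The commands that touch /etc need root and are shown as comments.

```shell
# register_lib_dir is a hypothetical helper.
# $1 = library directory, $2 = linker configuration file to write.
register_lib_dir() {
    lib_dir=$1
    conf_file=$2
    # grep -x matches whole lines; -F treats the path as a literal string.
    if ! grep -qxF "$lib_dir" "$conf_file" 2>/dev/null; then
        printf '%s\n' "$lib_dir" >> "$conf_file"
    fi
}

# Run as root on each node:
#   register_lib_dir /usr/local/lib /etc/ld.so.conf.d/Protobuf-x86.conf
#   ldconfig                          # rebuild /etc/ld.so.cache
#   ldconfig -p | grep libprotobuf    # the library should now be listed
```

If "ldconfig -p" lists libprotobuf.so afterwards, the dyn.load error shown earlier should no longer appear when R CMD INSTALL Rhipe is repeated.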
Now start the R console and type library(Rhipe); it should not throw
any error. This concludes the Rhipe installation.
Regards,
Som Shekhar