Integrating R and Hive (RHive Package)


Shekhar

Feb 18, 2012, 2:26:54 AM
to Bangalore R Users - BRU
Unlike RHIPE, which lets you write MR (MapReduce) jobs from R, RHive
provides access to both HDFS (the Hadoop Distributed File System) and
Hive (the Hadoop data warehouse).
For this you need the following:

1) Build R from source as a shared library. (Follow this link:
http://groups.google.com/group/brumail/browse_thread/thread/39e32c7df63bc5c1?hl=en_US
)

2) Install Hadoop

3) Build Hive from source

Check out the Hive source code and build it.

svn co http://svn.apache.org/repos/asf/hive/trunk hive
cd hive
ant clean package

Set HIVE_HOME to /home/username/hive/build/dist. In the current setup
it is /home/dev/hive/build/dist.

####### Hive Configuration ##############################
Create hive-site.xml under the $HIVE_HOME/conf directory and add the
following contents:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<!-- Hive configuration can either be stored in this file or in the Hadoop
configuration files implied by the Hadoop setup variables. -->
<!-- Aside from the Hadoop setup variables, this file is provided as a
convenience so that Hive users do not have to edit Hadoop configuration
files (which may be managed as a centralized resource). -->

<!-- Hive Execution Parameters -->

<property>
<name>hive.metastore.local</name>
<value>true</value>
<description>controls whether to connect to a remote metastore server
or open a new metastore server in the Hive client JVM</description>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
<description>location of the default database for the warehouse</description>
</property>

<property>
<name>hive.hwi.listen.host</name>
<value>0.0.0.0</value>
<description>This is the host address the Hive Web Interface will
listen on</description>
</property>

<property>
<name>hive.hwi.listen.port</name>
<value>9999</value>
<description>This is the port the Hive Web Interface will listen on</
description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://namenode:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
<property>
<name>mapred.job.tracker</name>
<value>namenode:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>8</value>
<description>The default number of reduce tasks per job. Typically set
to a prime close to the number of available hosts. Ignored when
mapred.job.tracker is "local". Hadoop sets this to 1 by default,
whereas Hive uses -1 as its default value. By setting this property
to -1, Hive will automatically determine the number of reducers.
</description>
</property>
</configuration>

Create environment variables for HIVE_HOME, HIVE_CONF, and HIVE_LIB:
################.bashrc file#######################
export HIVE_HOME=/home/dev/hive/build/dist
export HIVE_CONF=$HIVE_HOME/conf
export HIVE_LIB=$HIVE_HOME/lib
export CLASSPATH=$CLASSPATH:$HADOOP_LIB:$HIVE_LIB
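
RHive picks up the Hadoop and Hive locations from environment variables,
so it is worth checking that the R session actually sees them. A minimal
sketch, assuming rhive.init() is the initialisation call in your RHive
version (check ?rhive.init), using the paths from this setup:

# Check that the environment R sees matches the .bashrc settings above
Sys.getenv(c("HADOOP_HOME", "HIVE_HOME"))

# If they come back empty (e.g. R was started from a session that did not
# read .bashrc), set them for the current session before loading RHive
Sys.setenv(HADOOP_HOME = "/home/dev/hadoop-0.20.2",
           HIVE_HOME   = "/home/dev/hive/build/dist")

library(RHive)
rhive.init()   # assumption: re-reads HADOOP_HOME/HIVE_HOME; see ?rhive.init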

In addition, create /tmp and /user/hive/warehouse (aka
hive.metastore.warehouse.dir) in HDFS and make them group-writable
(chmod g+w) before a table can be created in Hive.
Commands to perform this setup:
$HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse

############### Running and Testing the Hive Thrift Server ###############
Once Hive is built, the Hive Thrift server can be started with the
following commands:
$ cd $HIVE_HOME/bin
$ HIVE_PORT=10000 ./hive --service hiveserver --verbose

Unlike the Hadoop services, it does not daemonize; you need a wrapper
script if you want to run it as a daemon.
Open a new console and run the following commands to verify that the
Hive Thrift server is working properly.
Before running the tests, make sure that Hadoop's NameNode (head node)
is not in safe mode. Run the following command to check whether the
NameNode is in safe mode:
$HADOOP_HOME/bin/hadoop dfsadmin -report

If the above command shows that the NameNode is in safe mode, you can
either wait for it to leave safe mode on its own or force it out with
the command below. (Hadoop stays in safe mode until all blocks have
been reported to the NameNode and block replication has completed, so
forcing it out is not recommended because some data blocks might be
lost.)
$HADOOP_HOME/bin/hadoop dfsadmin -safemode leave

First Test
(NOTE: Make sure the NameNode is not in safe mode)
ant test -Dtestcase=TestHiveServer -Dstandalone=true

Second Test
Run the following command to test the JDBC driver:
ant test -Dtestcase=TestJdbcDriver -Dstandalone=true

Once everything is ready, open the R console and run the following
command to install RHive:

install.packages("RHive")

This installs RHive along with its dependencies, such as rJava.
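
If the automatic dependency installation fails (for example because rJava
cannot find a JVM), a manual sketch is below; the assumption that RHive
also needs Rserve comes from the RHive versions I have used, so check the
package's DESCRIPTION for yours:

# Install the dependencies first, then RHive itself
install.packages("rJava")
install.packages("Rserve")   # assumption: required by this RHive version
install.packages("RHive")

# rJava needs a properly configured JVM; if library(rJava) fails,
# run "R CMD javareconf" from a shell and reinstall rJava
library(rJava)
library(RHive)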

Now start the Hadoop and Hive services:

$HADOOP_HOME/bin/start-all.sh

## Wait till all blocks have been replicated and the NameNode has left
safe mode. Run the following command to check whether the NameNode is
in safe mode:

$HADOOP_HOME/bin/hadoop dfsadmin -report

You can force the NameNode out of safe mode by executing the following
command:

$HADOOP_HOME/bin/hadoop dfsadmin -safemode leave

Now run the Hive Thrift server

cd $HIVE_HOME/bin
HIVE_PORT=10000 ./hive --service hiveserver --verbose

As before, it does not daemonize, so leave this console running.


Now in the R console:
library(RHive)
rhive.connect()
#the output of the command is as follows:

> rhive.connect()
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/dev/hive/build/dist/lib/slf4j-
log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/dev/hadoop-0.20.2/lib/slf4j-
log4j12-1.4.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
explanation.
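
These SLF4J lines are only warnings about duplicate logging bindings and
can be ignored. Called with no arguments, rhive.connect() connects to a
Thrift server on the local machine; if the server runs on another host or
port, the connection details can be passed explicitly. A minimal sketch,
where the argument names are assumptions based on the RHive version used
here (check ?rhive.connect for yours):

library(RHive)
# Connect to the Thrift server started above; host/port must match wherever
# "HIVE_PORT=10000 ./hive --service hiveserver" is actually running
rhive.connect(host = "namenode", port = 10000)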


The output on the console where the Hive Thrift server is running is
as follows:

Starting Hive Thrift Server
Starting hive server on port 10000 with 100 min worker threads and
2147483647 max worker threads
Hive history file=/tmp/dev/hive_job_log_dev_201202161437_739096211.txt
converting to local hdfs:///rhive/lib/rhive_udf.jar
Added /tmp/dev/hive_resources/rhive_udf.jar to class path
Added resource: /tmp/dev/hive_resources/rhive_udf.jar
OK
OK
OK

Now you can list tables, run queries, and write UDFs, UDAFs, and UDTFs
from within R; a small sketch follows below. I will try to post more
detailed examples soon.
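
A few basic calls, assuming function names from the RHive version
installed above (rhive.list.tables(), rhive.query(), rhive.hdfs.ls(),
rhive.close()); they may differ between versions, so see the package
documentation:

library(RHive)
rhive.connect()

# List Hive tables and describe one (the table name here is hypothetical)
rhive.list.tables()
rhive.desc.table("some_table")

# Run a HiveQL query; the result comes back as an R data.frame
df <- rhive.query("SELECT * FROM some_table LIMIT 10")
head(df)

# Browse HDFS directly from R
rhive.hdfs.ls("/user/hive/warehouse")

rhive.close()
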
Hope you find it useful.

Regards,
Som Shekhar
8197243810