installing R-hadoop in 3-node cluster


Evan Cutler

Dec 13, 2013, 5:56:33 PM
to rha...@googlegroups.com
Hi everyone...
I'm having issues understanding the instruction set.

I have CDH4 on a 3-node cluster.  1 namenode and 2 data nodes.
This is on Ubuntu 12.04 LTS 64-bit.

I have R-server on the name node.  Right now, I'm using rHadoopClient and it works.
I have run HIVE queries and HDFS.READ queries; making plots from there.

I'd like to upgrade to rmr, rhbase, and all of the other stuff that RHadoop offers.
I see Mac instructions, but Ubuntu instructions seem a bit sparse.

Anyone already go down this road?  Has anyone been able to see a clear set of instructions for Ubuntu/Linux?

Thanks.

Antonio Piccolboni

Dec 13, 2013, 6:05:05 PM
to RHadoop Google Group
Just for my education, what is rHadoopClient?
rmr or rhbase are not an upgrade on that.


Antonio



Evan Cutler

Dec 13, 2013, 11:11:54 PM
to rha...@googlegroups.com, ant...@piccolboni.info
Hi Antonio...

To answer your question:

This lets me read/write files and perform Hive Queries....
The cost to this is putting RStudio Server on my namenode.
I wanted to put on rmr, rhdfs, etc...on the cluster so I can do my calculations on a different server.

Any ideas?
thanks

Antonio Piccolboni

Dec 14, 2013, 12:04:01 AM
to RHadoop Google Group
On Fri, Dec 13, 2013 at 8:11 PM, Evan Cutler <arce...@gmail.com> wrote:

OK, good to know. For the record, rHadoopClient and the RHadoop discussed here are two independent projects, despite the similarity in name.


This lets me read/write files and perform Hive Queries....
The cost to this is putting RStudio Server on my namenode.
I wanted to put on rmr, rhdfs, etc...on the cluster so I can do my calculations on a different server.

Any ideas?

I wrote the documentation for rmr2 and it's clear enough for me, but it's left deliberately general so that people can adapt it to their setups. Maybe somebody will come up with specific Ubuntu instructions. Wait a second I have an idea, let me google it. Bingo: http://www.meetup.com/Learning-Machine-Learning-by-Example/pages/Installing_R_and_RHadoop/
Maybe not the most up to date, but it may help.


Antonio

Evan Cutler

Dec 15, 2013, 5:49:42 PM
to rha...@googlegroups.com, ant...@piccolboni.info
Thank you so much.
If I could ask you two more questions...that would be great.

ok, so I have R-studio on my namenode of the cluster.
Your instructions indicate how to use R on the datanodes...so basically I duplicate the installation, correct?

Now, so here's my other question.
I have R-Studio server on my namenode, and I would love to take it off.
If I created a linux box, put R and R-Studio on that box, can I attack the cluster remotely then?
Thanks
Evan

Antonio Piccolboni

Dec 16, 2013, 12:16:09 AM
to RHadoop Google Group
On Sun, Dec 15, 2013 at 2:49 PM, Evan Cutler <arce...@gmail.com> wrote:
Thank you so much.
If I could ask you two more questions...that would be great.

ok, so I have R-studio on my namenode of the cluster.
Your instructions indicate how to use R on the datanodes...so basically I duplicate the installation, correct?

I don't know of any product where a duplicate installation is not an error. I don't think rmr2 is the exception. One installation is complete when several items have been installed on each node. That's one installation, not two or three. RStudio is neither dependent on nor necessary for rmr2. Let's leave it out for clarity's sake.
 

Now, so here's my other question.
I have R-Studio server on my namenode, and I would love to take it off.
If I created a linux box, put R and R-Studio on that box, can I attack the cluster remotely then?

I am more than a little uneasy about your use of the word "attack", so I will assume it is a typo for "attach" and use the more common "connect" instead. The answer is yes: just install Hadoop on the Linux box so that it can connect to the cluster. The Linux box doesn't have to be a name node or a data node, or have any other role in the cluster, but it still needs Hadoop installed and will act only as a client. rmr only communicates with a local Hadoop installation. You can find information on this type of configuration in the general Hadoop literature.
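
To make the client setup a bit more concrete, here is a rough sketch of a check one might run on the Linux box before loading rmr2. rmr2 reads the HADOOP_CMD and HADOOP_STREAMING environment variables to find the local Hadoop installation; the two default paths below and the check_client helper are assumptions for illustration only, so substitute whatever your CDH4 packages actually installed.

```shell
#!/bin/sh
# Assumed default locations (hypothetical); override them to match your install.
export HADOOP_CMD="${HADOOP_CMD:-/usr/bin/hadoop}"
export HADOOP_STREAMING="${HADOOP_STREAMING:-/path/to/hadoop-streaming.jar}"

# check_client is a hypothetical helper, not part of Hadoop or RHadoop.
# It reports whether the local client pieces rmr2 relies on are present.
check_client() {
    status=0
    if [ ! -x "$HADOOP_CMD" ]; then
        echo "missing hadoop command: $HADOOP_CMD"
        status=1
    fi
    if [ ! -f "$HADOOP_STREAMING" ]; then
        echo "missing streaming jar: $HADOOP_STREAMING"
        status=1
    fi
    if [ "$status" -eq 0 ]; then
        echo "client looks ready"
    fi
    return "$status"
}

# Run the check; "|| true" keeps the script going so the messages can be read.
check_client || true
```

If the check passes, running "$HADOOP_CMD" fs -ls / from the same shell is a quick way to confirm the box can actually reach the cluster before starting R.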


Antonio

Evan Cutler

Dec 16, 2013, 2:05:30 PM
to rha...@googlegroups.com, ant...@piccolboni.info
I apologize for making this a kindergarten lesson, but I don't want to screw this up.
I am confused when you say: "That's one installation, not two or three."

I have three computers.  hdfs1.domain.org, hdfs2.domain.org, and hdfs3.domain.org.
These three computers are working as my cluster.  hdfs1 has R and is the Hadoop Name Node.
The installation I used in hdfs1 was per your instructions.

Is that it? Or do I install R and all the packages on hdfs2 and hdfs3?

Thanks.

Antonio Piccolboni

Dec 16, 2013, 2:31:15 PM
to RHadoop Google Group
Well, there is installing packages on each node and there is installing rmr2 as a whole. I wish I had two separate words. To install rmr2 to a properly working state, rmr2 as an overall system, you need to install R and a number of packages on each node, in the sense of running R CMD INSTALL or equivalent. One of those packages is called rmr2 as well. But installing rmr2 the package on a single machine doesn't complete the larger installation procedure, that is, the sequence of steps that allows you to use rmr2. It may appear that there is a package called rmr2 installed on this or that node, but unless every node has R, rmr2 and its dependencies, you won't be in a working state. This requirement is common to rmr2 and plyrmr, but different from rhdfs and rhbase, which can be installed on a single node and work. I wish we had an "install on cluster" command. Until then we need to make do with available tools and concepts.
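
As a rough illustration of "making do with available tools", the per-node step can be approximated with a small loop over ssh. This is only a sketch under several assumptions, not an RHadoop tool: the node names are the ones mentioned earlier in the thread, /tmp/rmr2.tar.gz is a placeholder for a package tarball already copied to each node, R and the other required packages are assumed to be installed already, and install_cmd and install_on_cluster are hypothetical helper names. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Node names taken from the thread; replace with your own hosts.
NODES="hdfs1.domain.org hdfs2.domain.org hdfs3.domain.org"
# Placeholder path to the package tarball, assumed to exist on every node.
PKG="/tmp/rmr2.tar.gz"
# DRY_RUN=1 (the default) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

# Build the command that installs the package on a single node.
install_cmd() {
    printf 'ssh %s R CMD INSTALL %s\n' "$1" "$PKG"
}

# Repeat the same install on every node, stopping at the first failure.
install_on_cluster() {
    for node in $NODES; do
        cmd=$(install_cmd "$node")
        if [ "$DRY_RUN" = "1" ]; then
            echo "$cmd"
        else
            $cmd || { echo "install failed on $node" >&2; return 1; }
        fi
    done
}

install_on_cluster
```

Running it with DRY_RUN=0 would execute the commands for real, which additionally assumes passwordless ssh to each node. The same loop would have to be repeated for each package that rmr2 depends on.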

Antonio

