Executing R (using rmr2) by deploying R and R packages on the fly


Bernardo Hermont

May 19, 2014, 12:38:38 AM
to rha...@googlegroups.com
Hi folks,

I've been looking into a strategy for executing R code on a Hadoop cluster where R (and its packages) is not installed on any of the Hadoop nodes. In this environment there is no way to have R pre-installed.

So I need to ship R and all the R libs/packages every time my R code needs to be executed.

Has anyone ever had a similar requirement? Any ideas on how to implement it?

Bernardo


Antonio Piccolboni

May 19, 2014, 1:20:28 PM
to RHadoop Google Group
Hi Bernardo,
Interesting question. There is a branch, now completely out of date, called 0-install, where we tried to achieve the same thing. It kind of worked, but we judged that getting all the dependencies right for every package users may want to use, and flouting the approved and supported installation procedures for R and its packages, was more than we were willing to support, particularly with the possibility of clusters that are heterogeneous as far as the OS version goes. There is no doubt in my mind that this feature would vastly simplify the deployment of rmr2, and I have received this request from several users.

More generally, there are companies that adopted Hadoop while having a C++ code base they could not toss and needed to redeploy on demand, and thus had a similar problem, albeit not in the rmr2 context. Both companies mentioned it in a public setting, so I can name names: Splunk and Flickr (now Yahoo). In both cases they shipped an executable with the normal Hadoop mechanisms. Splunk eliminated the dependency issue by creating a large static executable, which is placed in HDFS and accessed from the tasks. Flickr used strace to assemble the dependencies and placed them in a deps tree to be distributed and unpacked for the tasks.

For R there is the additional issue that it is aware of where it has been installed (I mean the root of the installation), and if you just move it around, Rscript stops working. In the 0-install branch we worked around this by building R into the tmp directory and then unpacking it to each node's tmp directory. Somebody else found that a little post-processing on an R install makes it possible to unpack it wherever you need; I can dig out some details if you need them (the trick is to google something like "R relocatable", even if that may not be the most accurate use of the word). My take is that the more controlled the cluster environment is, the more likely one is to pull this off.

On the other hand, since it requires quite a big patch and we are still concerned about the support nightma... challenges, I don't see such a patch being merged in the near future, even if it is high quality. So you'd have to maintain it in parallel with the main version, which is not supposed to change radically in the near future. Maybe that way we'll accumulate enough experience in a variety of environments that we can eventually promote the patch to the mainline, but I can't promise anything.

As far as shipping R and libs every time, that's probably too costly. One alternative is to use the distributed cache, which should be smart about not redistributing an unchanged file. You'd have to put your R and libs archive in HDFS, then specify the URLs in the distributed cache options. I hope this helps.
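A minimal sketch of that last step, assuming you already have a relocatable R build and a package library to pack (the file names and the HDFS path below are hypothetical placeholders):

  # Pack a relocatable R build plus the needed packages into one archive and
  # stage it in HDFS, where the distributed cache can reuse it across jobs.
  system("tar czf r-env.tar.gz R-build/ r-libs/")
  system("hadoop fs -put r-env.tar.gz /user/me/r-env.tar.gz")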


Antonio



Bernardo Hermont

May 19, 2014, 9:44:29 PM
to rha...@googlegroups.com, ant...@piccolboni.info
Hi Antonio, thank you for your email.

My first idea was to use the DistributedCache, but I could not figure out how to do it when executing R code, since I'm using rmr. I could only find Java examples of MapReduce jobs accessing the job cache.

Is there any way of using the distributed cache in RHadoop?

Bernardo

Antonio Piccolboni

May 20, 2014, 1:26:01 AM
to RHadoop Google Group
On Mon, May 19, 2014 at 6:44 PM, Bernardo Hermont <bher...@gmail.com> wrote:
Hi Antonio, thank you for your email.

My first idea was to use the DistributedCache, but I could not figure out how to do it when executing R code, since I'm using rmr. I could only find Java examples of MapReduce jobs accessing the job cache.

rmr uses streaming. You can use the distributed cache in streaming, so you can use it in rmr. You can either patch rmr to specify additional streaming arguments or use the backend.parameters option. It's the same mechanism used in the 0-install branch, if you want to see an example.
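For example, something along these lines might work from the R prompt (a sketch only; the input/output paths and the archive name are hypothetical, and the "#Renv" fragment just names the directory the archive is unpacked into inside each task's working directory):

  library(rmr2)
  # Sketch: reference an R/packages archive already staged in HDFS through the
  # streaming -archives option; the paths below are placeholders.
  mapreduce(
    input  = "/user/me/input",
    output = "/user/me/output",
    map    = function(k, v) keyval(k, v),
    backend.parameters = list(
      hadoop = list(archives = "hdfs:///user/me/r-env.tar.gz#Renv")))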

Antonio

Bernardo Hermont

May 21, 2014, 1:48:57 PM
to rha...@googlegroups.com, ant...@piccolboni.info
Got it now.
Thanks for the blazingly fast reply.

Seems it is going to work... I'm having other issues now, not related to rmr.

Thanks

Nishant Nagpal

Jun 4, 2014, 5:39:24 AM
to rha...@googlegroups.com, ant...@piccolboni.info
Hi Antonio,

Referring to this mail thread, I want to know how to change the backend.parameters argument of the rmr.options function in rmr2. My requirement is to get rhdfs, rmr2 and other R libraries installed on each node of the cluster. Can you please share an example of how to do that?

Nishant Nagpal

Jun 11, 2014, 8:03:46 AM
to rha...@googlegroups.com
Hi,

Referring to this mail thread, I have come up with an idea to use rmr2 without installing the rmr2 package on each node of the cluster. I don't know whether this will work or not, but I want to discuss it here.

We can write a driver in Java code that has all the job configurations and code for invoking the R terminal.

We will run the Java code from the Linux terminal as: hadoop jar jarname.jar -archives "R libraries folders". Now these libraries will be present in the Distributed Cache.

As per the Java code, the R terminal will open, and now we install the rmr2 and rhdfs packages from the Distributed Cache using library(rmr2) and library(rhdfs).

Then we will write a mapreduce function to run the MapReduce job.

Can you please comment on whether this process is good to go?

Thanks,
Nishant




Antonio Piccolboni

Jun 11, 2014, 3:37:41 PM
to RHadoop Google Group
Please check the 0-install branch first.

On Wed, Jun 11, 2014 at 5:03 AM, Nishant Nagpal <nishant....@gmail.com> wrote:
Hi,

Referring to this mail thread, I have come up with an idea to use rmr2 without installing the rmr2 package on each node of the cluster. I don't know whether this will work or not, but I want to discuss it here.

We can write a driver in Java code that has all the job configurations and code for invoking the R terminal.

We will run the Java code from the Linux terminal as: hadoop jar jarname.jar -archives "R libraries folders". Now these libraries will be present in the Distributed Cache.

You need to be able to run it from the R prompt or it's out of scope for this project. Of course, you can use the system call and related functions.
 

As per the Java code, the R terminal

R has presumably been unarchived somewhere, but Rscript will break because it's not in the original build directory. R has a number of library dependencies (object libraries) that are not guaranteed to be available on the nodes. Each package can have its own. You need to find them, and find versions that are compatible with everything else that's on the nodes. In one sentence, you are trying to flout the installation instructions of each resource involved in this process. It's a big hack. As I said, we decided not to go down that path. It may work in a controlled environment where there is one platform and a limited list of needed packages.

will open, and now we install the rmr2 and rhdfs packages from the Distributed Cache using library(rmr2) and library(rhdfs)

library() attaches a package; it doesn't install anything.
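(One hedged way to use packages shipped in such an archive without installing anything at task time is to prepend the unpacked library directory to R's search path; "./Renv/r-libs" below is a hypothetical location:)

  # library() only attaches a package that is already installed somewhere on
  # .libPaths(); installation is a separate step.
  .libPaths(c("./Renv/r-libs", .libPaths()))  # hypothetical library tree unpacked from the archive
  library(rmr2)                               # attaches it; nothing gets installed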
 

Then we will write a mapreduce function to run the MapReduce job.

Can you please comment on whether this process is good to go?

You seem to be trying to ignore the 0-install branch. I am not sure that's the best use of your time.

Antonio 

Nishant Nagpal

Jun 12, 2014, 6:51:56 AM
to rha...@googlegroups.com
Hi Antonio,

The problem here is that I don't have admin rights on the cluster, so I don't think 0-install will work in my case. Let me know if I am wrong.

Thanks,
Nishant

Antonio Piccolboni

Jun 12, 2014, 10:28:38 AM
to RHadoop Google Group

0-install doesn't require special privileges. It only writes to the work directory and /tmp, which normally exists and is writable by anyone.

Bernardo Hermont

Jun 17, 2014, 3:49:05 PM
to rha...@googlegroups.com, ant...@piccolboni.info
Hi Antonio,

I'm back to the original issue.
I took a look at the backend.parameters option, but I don't know how it could be used to pass a zipped file as a parameter. To make it possible, I'll need some parameter that gets translated into the -archives option, so the zipped file can be automatically extracted before Hadoop job execution.

Is this possible without a patch?

Bernardo

Antonio Piccolboni

Jun 17, 2014, 3:59:13 PM
to RHadoop Google Group
backend.parameters = list(hadoop = list(archives = "somefile"))

This should work. My attempt required multiple patches because the map and reduce scripts were replaced with shell scripts (instead of R scripts) that moved R from the working directory to a standard location (/tmp, for lack of a better option). R won't work if moved to a different directory than the one where it was installed. Also, a Hadoop core developer recommended that I use the DistributedCache because it's pretty big stuff we are trying to move around. I suspect that with -archives it will be moved every time, but things may have improved since the last time I checked.
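If you want the same setting applied to every job in a session, a possible variant (untested sketch; the archive URI is hypothetical) is to set it once through rmr.options instead of per mapreduce call:

  rmr.options(backend.parameters = list(
    hadoop = list(archives = "hdfs:///user/me/r-env.tar.gz#Renv")))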



Antonio

Bernardo Hermont

Jun 17, 2014, 6:50:19 PM
to rha...@googlegroups.com, ant...@piccolboni.info
Thanks Antonio!

I ended up finding this option just after posting my reply, looking at the mapreduce source file on git...

The files are being extracted as expected to an unzipped folder inside some job folder under {mapreduce.cluster.local.dir}/../../../.
It's still not clear to me how rmr invokes Hadoop streaming:
  1. I suppose the map and reduce script files created by rmr when invoking Hadoop streaming will be executed inside this unzipped folder, right?
  2. When rmr internally passes the mapper and reducer parameters to the Hadoop streaming jar, which path (PATH_TO_RSCRIPT) is defined for the Rscript executable? Example: ... -files "..., ..." -archives "..." -mapper "PATH_TO_RSCRIPT/Rscript.exe mapper.r" -reducer "PATH_TO_RSCRIPT/Rscript.exe reducer.r" ...
  3. In case no Rscript path is passed, would the best option be setting environment variables so that the OS can find the directory of the executable?

Bernardo

Antonio Piccolboni

Jun 17, 2014, 7:01:02 PM
to RHadoop Google Group
On Tue, Jun 17, 2014 at 3:50 PM, Bernardo Hermont <bher...@gmail.com> wrote:
Thanks Antonio!

I ended up finding this option just after posting my reply, looking at the mapreduce source file on git...

The files are being extracted as expected to an unzipped folder inside some job folder under {mapreduce.cluster.local.dir}/../../../.
It's still not clear to me how rmr invokes Hadoop streaming:
  1. I suppose the map and reduce script files created by rmr when invoking Hadoop streaming will be executed inside this unzipped folder, right?
The scripts are never zipped. Their current directory for execution is the task working directory. All archives should be expanded under there.

  2. When rmr internally passes the mapper and reducer parameters to the Hadoop streaming jar, which path (PATH_TO_RSCRIPT) is defined for the Rscript executable? Example: ... -files "..., ..." -archives "..." -mapper "PATH_TO_RSCRIPT/Rscript.exe mapper.r" -reducer "PATH_TO_RSCRIPT/Rscript.exe reducer.r" ...
None. Rscript is expected to be on the search path, as it should be after a complete R or RevoR installation.
  3. In case no Rscript path is passed, would the best option be setting environment variables so that the OS can find the directory of the executable?
That variable exists; it's called PATH, no need to reinvent the wheel. It may or may not be the best option, but it's how it currently works. I am not sure what problem you are trying to solve here.


Antonio

Nishant Nagpal

Aug 1, 2014, 10:28:20 AM
to rha...@googlegroups.com
Hi Antonio,

I want to set the PATH search variable on the cluster using backend.parameters(). What is the argument for passing an environment variable in backend.parameters()?

Thanks,
Nishant Nagpal

Antonio Piccolboni

Aug 1, 2014, 1:08:59 PM
to RHadoop Google Group
Hi
backend.parameters is not a function, so I am a bit confused about the parentheses. Per the streaming documentation, you can set environment variables with the cmdenv option. That would look like list(hadoop = list(cmdenv = "PATH=/some/path/")).
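In context, that could look something like the following (a sketch; the PATH value is a hypothetical location for Rscript, for instance inside an unpacked archive):

  # Export a PATH for the streaming tasks so Rscript can be found.
  mapreduce(
    input = "/user/me/input",
    map   = function(k, v) keyval(k, v),
    backend.parameters = list(
      hadoop = list(cmdenv = "PATH=./Renv/bin:/usr/bin:/bin")))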


Antonio


Nishant Nagpal

Aug 2, 2014, 3:01:10 AM
to rha...@googlegroups.com
Hi Antonio,

Actually, I tried this: backend.parameters = list(hadoop = list(cmdenv = "PATH=/path"))
It says that backend.parameters is deprecated, and I am using rmr2 v3.1.1.
 
Thanks,
Nishant Nagpal



Antonio Piccolboni

Aug 2, 2014, 11:33:08 AM
to RHadoop Google Group

But did it work? Deprecation is just a warning that the feature may go away in future releases.

Antonio

Nishant Nagpal

Aug 3, 2014, 1:50:50 PM
to rha...@googlegroups.com
Antonio,

It did not work, and it showed me a usage message suggesting Hadoop command line options such as -D, -libjars, and -archives.
I think there is some issue with the command I am firing in R:
backend.parameters = list(hadoop = list(cmdenv = "PATH=/path"))


Thanks,
Nishant Nagpal

Antonio Piccolboni

Aug 4, 2014, 5:04:46 PM
to RHadoop Google Group
I can reproduce it. This may be an old problem with streaming wanting its options in a specific order. I thought I had fixed it, but maybe not completely. Let me look into it.



Antonio Piccolboni

Aug 5, 2014, 2:32:57 PM
to RHadoop Google Group
I reintroduced an old bug while eliminating a warning, yikes! A fix is in the works. Ultimately, it's a problem with the Hadoop streaming command line and its odd requirement that options be sorted in a certain way. There are so-called generic options that have to go first, and the rest come later. If you mix them in the backend.parameters option, you need to preserve this order. If you use the backend.parameters argument of both rmr.options and mapreduce, you may end up in trouble, as the former's contents are always put on the command line before the latter's.

I was hoping that over the years streaming would shed this oddity, but it doesn't seem to be happening, and a permanent fix would require embedding knowledge of which options are generic and which aren't into rmr2, which is something I am loath to do since it's a permanent coupling with required maintenance and potential for trouble.
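For instance, a combined setting would need to keep generic options (such as -archives, -files, -D) ahead of streaming command options such as -cmdenv; a hedged sketch with hypothetical values:

  # Generic options first, streaming command options after.
  backend.parameters = list(hadoop = list(
    archives = "hdfs:///user/me/r-env.tar.gz#Renv",   # generic option, must come first
    cmdenv   = "PATH=./Renv/bin:/usr/bin:/bin"))      # command option, after the generic ones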

Antonio

Antonio Piccolboni

Aug 5, 2014, 2:38:10 PM
to RHadoop Google Group
If you know how to build from the dev branch, the fix should be in. 


