rmr on hadoop 2.2.0

Curtis Combs

Nov 7, 2013, 12:17:13 AM
to rha...@googlegroups.com
I can't get rmr2 to work on hadoop. It seems to be doing so many things incorrectly.

1) On install, it searches for some hbase files... through the entire filesystem... even if it's mounted on an NFS share. Why is this necessary, and is there a way to get it not to do that? I found that I can hit Ctrl-C and kill the search (because if I didn't, it'd be searching through 20 TB of backups...).

2) After getting it to install and setting the Renv, it tries to use "hadoop <cmd>". This generates depreciation messages and my client will probably think that there is something wrong. Hadoop has two commands: yarn and hdfs. Is there a place to set this so that this API correctly uses those commands?

3) rmr.options(backend = "hadoop") returns NULL. I really don't know why that doesn't work; maybe it has to do with the above issues, or maybe something is wrong elsewhere.

4) I downloaded the github release and built it in R (R CMD build). That seems to be the version.

About my configuration:
Hadoop 2.2.0 (not the latest, but next to latest)
CentOS 6.4
Hadoop/HBase/Pig/Derby/Hive installed in /opt/hadoop
Java is at jdk1.7.0_25

Please help? Thanks!

Antonio Piccolboni

Nov 7, 2013, 12:51:54 AM
to rha...@googlegroups.com


On Wednesday, November 6, 2013 9:17:13 PM UTC-8, Curtis Combs wrote:
I can't get rmr2 to work on hadoop. It seems to be doing so many things incorrectly.

1) On install, it searches for some hbase files... through the entire filesystem... even if it's mounted on an NFS share. Why is this necessary, and is there a way to get it not to do that? I found that I can hit Ctrl-C and kill the search (because if I didn't, it'd be searching through 20 TB of backups...).

This is not the expected behavior. You can check the Makevars file in the src directory of the package to see which commands are executed to build the hbase-related classes. When the necessary files are not present, it normally fails in a matter of seconds and moves on. So please share the relevant parts of the log (I guess from before the alleged search starts) and we'll try to get to the bottom of this.
 

2) After getting it to install and setting the Renv, it tries to use "hadoop <cmd>". This generates depreciation messages and my client will probably think that there is something wrong. Hadoop has two commands: yarn and hdfs. Is there a place to set this so that this API correctly uses those commands?

It's a deprecation message, if that makes your client feel better; the price has always been 0, so depreciation is unlikely. Hadoop does have a command named hadoop, contrary to your statement. We could replace it with hdfs, but that would leave other users stranded (hdfs is the newer command, contrary to your timeless assertions). We could add another configuration variable, but two are enough to generate a steady stream of support cases. We could add some self-detect logic to guess the right command. If you want to take a stab at it, we recently started to accept community contributions; pull requests are welcome.
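For what it's worth, here is a minimal sketch of the usual setup, assuming the two configuration variables in question are the HADOOP_CMD and HADOOP_STREAMING environment variables; the paths below are placeholders based on your /opt/hadoop layout, not canonical locations.

    # Point rmr2 at the hadoop driver script and the streaming jar before
    # loading the package. Adjust both paths to your own installation.
    Sys.setenv(HADOOP_CMD = "/opt/hadoop/bin/hadoop")
    Sys.setenv(HADOOP_STREAMING = "/opt/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar")
    library(rmr2)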


3) rmr.options(backend = "hadoop") returns NULL. I really don't know why that doesn't work; maybe it has to do with the above issues, or maybe something is wrong elsewhere.

It is supposed to return NULL when you set an option; NULL is not an error code or anything of that sort.
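To illustrate (a sketch, assuming rmr.options follows the usual get/set convention where passing an option name as a string returns its current value):

    library(rmr2)
    rmr.options(backend = "hadoop")  # setting an option returns NULL
    rmr.options("backend")           # querying it should return "hadoop"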
 

4) I downloaded the github release and built it in R (R CMD build). That seems to be the version.

There is no github release. The repo is tagged to mark each release; if you want to build a specific release you need to check out the corresponding tag.

The repo has multiple branches. At any given time the master branch is moving towards the next bugfix release, so it's in between releases, and dev is moving towards the next minor or major release. The number in the DESCRIPTION file, which is updated as soon as reasonably possible after a release, is the number of the next expected release, but it can change at any time (a major release will preempt a minor one, which will preempt a hotfix). What you build from a branch could therefore contain more errors than a release build: we do not test the package for every new commit.

I don't want to discourage people from participating in development by forking the repo and building from it, quite the opposite, but you need to know what you are getting yourself into. Starting from the official release and making sure you've got everything set up right before going into development could be a good intermediate step.
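If it helps, a sketch of building from a tag rather than a branch tip; the repository URL and the pkg/ layout are my assumptions about the current repo, and the tag name is a placeholder for whichever release you want.

    # Clone, list the release tags, check out the one you want, then build.
    git clone https://github.com/RevolutionAnalytics/rmr2.git
    cd rmr2
    git tag                      # list release tags
    git checkout <release-tag>   # placeholder: tag of the release to build
    R CMD build pkg
    R CMD INSTALL rmr2_*.tar.gz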

 
Antonio