problem with HADOOP_HOME, HADOOP_CMD, HADOOP_STREAMING path


Ankit Sangwan

Mar 5, 2015, 3:03:37 AM
to rha...@googlegroups.com

This is the first time I am working with RHadoop. I have installed Hadoop on my machine (single-node cluster).

I have set all the paths as follows:
> system("java -version")
java version "1.7.0_75"
OpenJDK Runtime Environment (IcedTea 2.5.4) (7u75-2.5.4-1~trusty1)
OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)

> Sys.setenv(HADOOP_HOME="/usr/local/hadoop")
> Sys.setenv(HADOOP_CMD="/usr/local/hadoop/bin/hadoop")
> Sys.setenv(HADOOP_STREAMING="/usr/local/hadoop")   # ???? what should this point to?

But I could not find the HADOOP_STREAMING jar file.

Is something missing from my Hadoop installation directory, or do I have to install something else?

> library(rmr2)
Please review your hadoop settings. See help(hadoop.settings)
Warning message:
S3 methods ‘gorder.default’, ‘gorder.factor’, ‘gorder.data.frame’, ‘gorder.matrix’, ‘gorder.raw’ were declared in NAMESPACE but not found

> library(rJava)
> library(rhdfs)

HADOOP_CMD=/usr/local/hadoop/bin/hadoop

Be sure to run hdfs.init()

> input = to.dfs(lapply(1:100, function(i) keyval(NULL, cbind(sample(reviews, 0.2*nrow(reviews), replace = T)))))
Not a valid JAR: /usr/local/hadoop


Please tell me if I am missing anything!

Thanks in advance


Ankit

Ankit Sangwan

Mar 5, 2015, 6:57:53 AM
to rha...@googlegroups.com
I think I have figured that one out. But while loading data into HDFS, I got an error again.

I am pasting the errors and warnings shown in RStudio (I have attached the R code as well):


> system("java -version")
java version "1.7.0_75"
OpenJDK Runtime Environment (IcedTea 2.5.4) (7u75-2.5.4-1~trusty1)
OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)
> Sys.setenv(HADOOP_CMD="/usr/local/hadoop/bin/hadoop")
> Sys.getenv("HADOOP_CMD") # HADOOP_CMD points to the main hadoop command
[1] "/usr/local/hadoop/bin/hadoop"
> Sys.setenv(HADOOP_STREAMING="/usr/local/hadoop/share/hadoop/common/hadoop-common-2.6.0.jar")
> Sys.getenv("HADOOP_STREAMING") # points to the streaming jar, a file called something like hadoop-streaming*.jar that is part of most hadoop distributions
[1] "/usr/local/hadoop/share/hadoop/common/hadoop-common-2.6.0.jar"

> library(rmr2)
Please review your hadoop settings. See help(hadoop.settings)
Warning message:
S3 methods ‘gorder.default’, ‘gorder.factor’, ‘gorder.data.frame’, ‘gorder.matrix’, ‘gorder.raw’ were declared in NAMESPACE but not found
> library(rJava)
> library(rhdfs)

HADOOP_CMD=/usr/local/hadoop/bin/hadoop

Be sure to run hdfs.init()
> hdfs.init()
15/03/05 17:23:44 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> # function to read multiple json files
> library(jsonlite)

Attaching package: ‘jsonlite’

The following object is masked from ‘package:utils’:

    View

> read_json     <- function(x) {
+   # parse one JSON file, flattening nested fields into columns
+   json    <- fromJSON(x, flatten = T)
+   json    <- as.data.frame(json)
+   # shorten the unwieldy auto-generated column name
+   names(json)[names(json)=='Reviews.Ratings.Business.service..e.g...internet.access.'] <- 'Reviews.Ratings.Business.service'
+   return(json)
+ }
> library(plyr)
> reviews       <- ldply(list.files(path = "json", full.names = T), read_json)
> reviews_lines <- to.dfs(reviews$Reviews.Content[1:20])
Exception in thread "main" java.lang.ClassNotFoundException: loadtb
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:274)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:214)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
>

Please suggest how I can resolve these errors.

Thanks in advance
(Attachment: main.R)

Antonio Piccolboni

Mar 5, 2015, 2:47:09 PM
to rha...@googlegroups.com
Your HADOOP_STREAMING variable now points to a jar instead of a directory. That's progress, but there are untold numbers of jar files in the world, and probably even on your own machine, only one of which, possibly, is the correct one. Now the problem is to find the right one. Hint: if it doesn't have "streaming" in its name, it is not the right one. It is self-evident that the string hadoop-common-2.6.0.jar doesn't contain the string "streaming". Most modern operating systems have search features (e.g. the find command in Unix) that will provide invaluable help in your quest for the hadoop streaming jar.
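
For example, from within R itself, something like this should turn the jar up (a minimal sketch, assuming a standard Apache Hadoop 2.x layout under /usr/local/hadoop; adjust the root to your installation):

> # recursively search the Hadoop install tree for the streaming jar
> jars <- list.files("/usr/local/hadoop", pattern = "^hadoop-streaming.*\\.jar$",
+                    recursive = TRUE, full.names = TRUE)
> jars   # with Hadoop 2.6.0 it typically sits under share/hadoop/tools/lib/
> Sys.setenv(HADOOP_STREAMING = jars[1])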


Antonio

Ankit Sangwan

Mar 9, 2015, 4:35:45 AM
to rha...@googlegroups.com
I have set the HADOOP_STREAMING path to a jar that does have "streaming" in its name.
That issue is solved.

But now when I run hdfs.init(), I am getting an error:
> hdfs.init()
15/03/09 13:26:05 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Why is that showing up now?

Any idea/suggestion?

Thanks



Antonio Piccolboni

Mar 9, 2015, 12:52:18 PM
to RHadoop Google Group
It isn't an error, it's a warning, and everything will work, albeit with a performance penalty. This is completely RHadoop-independent. Please consult your Hadoop documentation on how to install the native libraries.


 Antonio


Ankit Sangwan

Mar 16, 2015, 11:09:43 AM
to rha...@googlegroups.com
I am working with a 2-node cluster and have installed R on the master node.

I have installed all the packages on both nodes and tried to run the example R script, but I am getting an error with code 1:

> system("java -version")
java version "1.7.0_75"
OpenJDK Runtime Environment (rhel-2.5.4.0.el6_6-x86_64 u75-b13)

OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)
> Sys.getenv("HADOOP_HOME")
[1] "/usr/lib/hadoop-1.0.4"
> Sys.getenv("JAVA_HOME")
[1] "/usr/lib/jvm/jre-1.7.0-openjdk.x86_64"
> Sys.getenv("HADOOP_CMD")
[1] "/usr/lib/hadoop-1.0.4/bin/hadoop"
> Sys.getenv("HADOOP_STREAMING")
[1] "/usr/lib/hadoop-1.0.4/contrib/streaming/hadoop-streaming-1.0.4.jar"
> library(rmr2)
Please review your hadoop settings. See help(hadoop.settings)
Warning message:
S3 methods ‘gorder.default’, ‘gorder.factor’, ‘gorder.data.frame’, ‘gorder.matrix’, ‘gorder.raw’ were declared in NAMESPACE but not found 
> library(rhdfs)
Loading required package: rJava

HADOOP_CMD=/usr/lib/hadoop-1.0.4/bin/hadoop

Be sure to run hdfs.init()
> # hdfs.init()
> ints = to.dfs(1:100)
Warning: $HADOOP_HOME is deprecated.

15/03/16 20:32:48 INFO util.NativeCodeLoader: Loaded the native-hadoop library
15/03/16 20:32:48 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
15/03/16 20:32:48 INFO compress.CodecPool: Got brand-new compressor
> calc = mapreduce(input = ints, map = function(k, v) cbind(v, 2*v))
Warning: $HADOOP_HOME is deprecated.

packageJobJar: [/app/hadoop/tmp/hadoop-unjar3512781116409688683/] [] /tmp/streamjob139171783279071976.jar tmpDir=null
15/03/16 20:32:50 INFO mapred.FileInputFormat: Total input paths to process : 1
15/03/16 20:32:50 INFO streaming.StreamJob: getLocalDirs(): [/app/hadoop/tmp/mapred/local]
15/03/16 20:32:50 INFO streaming.StreamJob: Running job: job_201503131507_0011
15/03/16 20:32:50 INFO streaming.StreamJob: To kill this job, run:
15/03/16 20:32:50 INFO streaming.StreamJob: /usr/lib/hadoop-1.0.4/libexec/../bin/hadoop job  -Dmapred.job.tracker=IMPETUS-1466.impetus.co.in:54311 -kill job_201503131507_0011
15/03/16 20:32:50 INFO streaming.StreamJob: Tracking URL: http://IMPETUS-1466.impetus.co.in:50030/jobdetails.jsp?jobid=job_201503131507_0011
15/03/16 20:32:51 INFO streaming.StreamJob:  map 0%  reduce 0%
15/03/16 20:33:28 INFO streaming.StreamJob:  map 100%  reduce 100%
15/03/16 20:33:28 INFO streaming.StreamJob: To kill this job, run:
15/03/16 20:33:28 INFO streaming.StreamJob: /usr/lib/hadoop-1.0.4/libexec/../bin/hadoop job  -Dmapred.job.tracker=IMPETUS-1466.impetus.co.in:54311 -kill job_201503131507_0011
15/03/16 20:33:28 INFO streaming.StreamJob: Tracking URL: http://1466:50030/jobdetails.jsp?jobid=job_201503131507_0011
15/03/16 20:33:28 ERROR streaming.StreamJob: Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201503131507_0011_m_000000
15/03/16 20:33:28 INFO streaming.StreamJob: killJob...
Streaming Command Failed!
Error in mr(map = map, reduce = reduce, combine = combine, vectorized.reduce,  : 
  hadoop streaming failed with error code 1

What am I doing wrong here, and what is this error code 1? (I have read about error code 1 in different posts but could not make out why it occurs.)

Please help me out

Thanks

Ankit



Antonio Piccolboni

Mar 16, 2015, 2:17:10 PM
to RHadoop Google Group
What you are not doing is reporting the problem in a way that allows other people to help. There is a debugging guide pointed to in the very intro message to this group, which I believe is mandatory reading if you want to program with rmr, particularly if you need help. Error code 1 just says that the R program that implements the map phase failed multiple times. To troubleshoot that, we need more information about what went wrong, which is normally contained in the stderr log of a failing task. Where exactly that is depends on the specific Hadoop distribution, but there should be a job web UI with a list of failed jobs; from there, find a failing task, then look for its logs. Thanks
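
As an aside, rmr2's local backend can make this kind of failure reproducible inside a plain R session, where a failing map surfaces as an ordinary R error (a minimal sketch; rmr.options and its backend option are documented in rmr2):

> library(rmr2)
> rmr.options(backend = "local")    # run map and reduce in-process, no Hadoop
> calc = mapreduce(input = to.dfs(1:100),
+                  map = function(k, v) cbind(v, 2*v))
> from.dfs(calc)                    # inspect the result of the map phase
> rmr.options(backend = "hadoop")   # switch back once the map function works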


anurag agrahari

Mar 17, 2015, 5:21:16 AM
to rha...@googlegroups.com
Check your HADOOP_STREAMING path... I had the same problem. I solved it by exporting the value in .bashrc or the profile file and then rebooting the system. After that, run R CMD javareconf to check that the rJava package is picking up the right path to Java.

I think this may solve the issue.
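
If you would rather keep the settings inside R, the same values can go in your ~/.Rprofile so that every session, including RStudio, picks them up. A sketch, assuming a Hadoop 2.6.0 layout; the streaming jar path below is just the typical location, so substitute whatever path your own search actually reports:

Sys.setenv(HADOOP_CMD = "/usr/local/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING =
  "/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar")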

Ankit Sangwan

Mar 18, 2015, 10:06:10 AM
to rha...@googlegroups.com
Now I am able to run MapReduce jobs successfully. I have tried some small examples like word_count, cbind, and sort on small datasets.
Thanks to all

But when I tried to load a large dataset (around 2 GB), it did not get loaded into HDFS.

> reviews     <- to.dfs(reviews)
Warning: $HADOOP_HOME is deprecated.

Warning: $HADOOP_HOME is deprecated.

Deleted hdfs://172.26.00.1:54310/tmp/file4a781e74abc3
Error in writeBin(.Call("typedbytes_writer", objects, native, PACKAGE = "rmr2"),  : 
  long vectors not supported yet: connections.c:4083

Is this a memory limitation in RHadoop, or is something else wrong?

Thanks



Antonio Piccolboni

Mar 18, 2015, 11:53:24 AM
to rha...@googlegroups.com
The problem with using the same thread for a new, albeit related, issue is that this thread is marked complete because you selected the best answer to your previous problem -- thanks for that; it helps me know what to work on and helps others find answers. But once a thread is marked complete, you should create a new one (next time; we can finish this one here if you want).
Now to your point: to.dfs is supposed to chunk the data so that this doesn't have a chance to happen, but thinking about it, I may not have considered the case of very large records. Could you please remind me what is in the variable reviews? Thanks

Ankit

Mar 18, 2015, 1:01:43 PM
to rha...@googlegroups.com
About your point on creating a new thread, I will keep that in mind.

The reviews variable in my case contains review details from the TripAdvisor web site: author, ratings, and hotel details (a data.frame with 21 variables and 10 million records).


Antonio Piccolboni

Mar 18, 2015, 5:13:40 PM
to RHadoop Google Group
Works for me with randomly chosen columns. What are the column classes? What are your rmr2 and R versions?
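
By "randomly chosen columns" I mean a synthetic data frame of roughly the shape you described, along these lines (a sketch; the column classes are placeholders, since yours are still unknown to me):

> n <- 1e7   # the row count you reported, with fewer columns for brevity
> fake <- data.frame(id = seq_len(n),
+                    rating = runif(n),
+                    author = sample(letters, n, replace = TRUE),
+                    stringsAsFactors = FALSE)
> out <- to.dfs(fake)   # completes here without the long-vector error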


Ankit

Mar 18, 2015, 10:24:49 PM
to rha...@googlegroups.com
I am working with the latest versions of rmr2 and R:

rmr2 - 3.3.1
R - 3.1.1


anurag agrahari

Mar 19, 2015, 12:44:49 AM
to rha...@googlegroups.com
# Search for where it is located via:
find / -name 'hadoop-streaming*.jar'

Run the above command to find your streaming jar path.

Ankit Sangwan

Mar 19, 2015, 3:53:00 AM
to rha...@googlegroups.com
And I forgot to mention the column classes: they vary among integer, character, factor, and numeric.
I am working with the latest versions of rmr2 and R:

rmr2 - 3.3.1
R version 3.1.2

Thanks




Antonio Piccolboni

Mar 23, 2015, 1:47:00 PM
to RHadoop Google Group
I can't think of anything else. Can you give me remote access to a test system for debugging? Otherwise you need to debug to.dfs, then keyval.writer, then format$format. Therein, take a look at the sizes of ks and vs and figure out why they are so big. Unless you are familiar with this code, it can be a little daunting at the beginning, but not for long.
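
Roughly, with base R's debugger (a sketch; debugonce comes with the standard utils package, and the internals named are the ones above):

> library(rmr2)
> debugonce(to.dfs)        # break on the next call to to.dfs
> out <- to.dfs(reviews)   # single-step with 'n'; when you reach the keyval
>                          # writer and then format$format, check the sizes
>                          # with object.size(ks) and object.size(vs)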
