Problem with my first RHadoop mapreduce job


Antonio Casella

Jul 2, 2013, 4:45:02 AM
to rha...@googlegroups.com
Hi all, I've installed RHadoop on my cluster (Hadoop 2.0.0-cdh4.1.2 on Ubuntu 12.04).
R seems to work fine, but when I try to run the example "My first mapreduce job" ( https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/tutorial.md )
I get the following output:

> mapreduce(
+     input = small.ints, 
+     map = function(k, v) cbind(v, v^2))
13/07/02 10:17:23 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [/tmp/RtmpssbXAf/rmr-local-env522a1cfbbd75, /tmp/RtmpssbXAf/rmr-global-env522a7d3aba9a, /tmp/RtmpssbXAf/rmr-streaming-map522a11c5342f] [] /tmp/streamjob5763747888714038276.jar tmpDir=null
13/07/02 10:17:25 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/07/02 10:17:25 INFO mapred.FileInputFormat: Total input paths to process : 1
13/07/02 10:17:26 INFO mapred.JobClient: Running job: job_201305290904_0826
13/07/02 10:17:27 INFO mapred.JobClient:  map 0% reduce 0%
13/07/02 10:17:36 INFO mapred.JobClient: Task Id : attempt_201305290904_0826_m_000001_0, Status : FAILED
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.streaming.AutoInputFormat not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1603)
at org.apache.hadoop.mapred.JobConf.getInputFormat(JobConf.java:620)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:373)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:327)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.streaming.AutoInputFormat not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1571)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1595)
13/07/02 10:17:36 INFO mapred.JobClient: Task Id : attempt_201305290904_0826_m_000000_0, Status : FAILED
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.streaming.AutoInputFormat not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1603)
at org.apache.hadoop.mapred.JobConf.getInputFormat(JobConf.java:620)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:373)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:327)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.streaming.AutoInputFormat not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1571)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1595)
13/07/02 10:17:42 INFO mapred.JobClient: Task Id : attempt_201305290904_0826_m_000000_1, Status : FAILED
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.streaming.AutoInputFormat not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1603)
at org.apache.hadoop.mapred.JobConf.getInputFormat(JobConf.java:620)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:373)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:327)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
at org.apache.hadoop.mapred.Child.main(Child.java:262)

It seems it cannot find the hadoop-streaming.jar library,
but the environment variable HADOOP_STREAMING is set correctly.

Thanks in advance
Antonio

Antonio Piccolboni

Jul 2, 2013, 12:00:00 PM
to RHadoop Google Group

On Tue, Jul 2, 2013 at 1:45 AM, Antonio Casella <antonino...@gmail.com> wrote:
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.streaming.AutoInputFormat not found

This is the error. But how can it possibly get to this point if it can't find the Hadoop streaming jar? This class is passed as the inputformat argument to that jar. I think it's a CLASSPATH problem. In any case, the best thing would be to reproduce this outside R by launching a streaming job such as

hadoop jar <streaming-jar> -input <some-input>  -output <some-output> -mapper cat -inputformat org.apache.hadoop.streaming.AutoInputFormat 

That would tell us for certain whether this is purely a Java problem. Thanks
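
Before launching the streaming job, a cheap extra check is whether the jar that HADOOP_STREAMING points to actually contains the class named in the stack trace. This is only a minimal sketch, assuming `unzip` is installed; `jar_lists_class` is a hypothetical helper written for this thread, not part of Hadoop or rmr2.

```shell
# Hypothetical helper: succeed if a jar listing read from stdin contains
# the given fully qualified class name.
jar_lists_class() {
    # Inside a jar, classes are stored with slashes:
    # org.apache.hadoop.streaming.AutoInputFormat
    #   -> org/apache/hadoop/streaming/AutoInputFormat.class
    entry="$(printf '%s' "$1" | tr '.' '/').class"
    grep -q -F "$entry"
}

# Usage (needs the real jar, so it is left commented out):
# unzip -l "$HADOOP_STREAMING" \
#     | jar_lists_class org.apache.hadoop.streaming.AutoInputFormat \
#     && echo "class is in the jar" || echo "class is missing from the jar"
```

The exception is raised inside the map task, so a positive result on the machine that submits the job does not rule out a problem on a worker node; the same check is probably worth repeating there.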


Antonio

Antonio Casella

Jul 3, 2013, 6:25:05 AM
to rha...@googlegroups.com, ant...@piccolboni.info
Hi Antonio,
thank you for the prompt reply.
I ran the command:
hadoop jar $HADOOP_STREAMING -input /user/data/input -output /user/data/prova -mapper cat -inputformat org.apache.hadoop.streaming.AutoInputFormat
and it works!
All the input was copied to the prova directory without errors.

So I think that R doesn't see hadoop-streaming.jar.
The environment variables are the following:
HADOOP_CMD=/usr/bin/hadoop
HADOOP_HOME=/usr/lib/hadoop
HADOOP_CONF=/etc/hadoop/conf
HADOOP_LIBEXEC_DIR=/usr/lib/hadoop/libexec
HADOOP_STREAMING=/usr/lib/hadoop/hadoop-streaming.jar

and ls -l /usr/lib/hadoop/hadoop-streaming.jar gives:
lrwxrwxrwx 1 root root 61 Jun 25 14:42 /usr/lib/hadoop/hadoop-streaming.jar -> /usr/lib/hadoop-mapreduce/hadoop-streaming-2.0.0-cdh4.1.2.jar

ls -l /usr/lib/hadoop-mapreduce/hadoop-streaming-2.0.0-cdh4.1.2.jar
-rw-r--r-- 1 root root 102365 Nov  2  2012 /usr/lib/hadoop-mapreduce/hadoop-streaming-2.0.0-cdh4.1.2.jar

What can I do?
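
Since HADOOP_STREAMING is a symlink here, it may help to confirm in one step where it finally points and whether the current user can read the target. A minimal sketch, assuming GNU `readlink` (as on Ubuntu); `resolve_jar` is a hypothetical helper name, not an rmr2 or Hadoop command.

```shell
# Hypothetical helper: print the file a jar path finally resolves to,
# or fail with a message if the path is empty or the target is unreadable.
resolve_jar() {
    if [ -z "$1" ]; then
        echo "no jar path given (is HADOOP_STREAMING set?)" >&2
        return 1
    fi
    # Follow every symlink in the chain (GNU readlink -f).
    target=$(readlink -f "$1") || return 1
    if [ ! -r "$target" ]; then
        echo "not readable: $target" >&2
        return 1
    fi
    printf '%s\n' "$target"
}

# Usage (left commented out because it depends on the local setup):
# resolve_jar "$HADOOP_STREAMING"
```

Running the same call from the shell and from R, via system(), would show whether both resolve to the same file.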

Antonio Piccolboni

Jul 3, 2013, 1:49:28 PM
to rha...@googlegroups.com, ant...@piccolboni.info
This is puzzling. What rmr2 does is call system with a command not unlike the one you tested separately and found working. So there must be some difference between the environment of your shell prompt and the one created by a system call in R. Let's try to repeat the command that worked from inside R, that is

system("hadoop jar $HADOOP_STREAMING -input /user/data/input -output /user/data/prova-R -mapper cat -inputformat org.apache.hadoop.streaming.AutoInputFormat")

and see if that succeeds. If it does, we are stuck. If it doesn't, let's save the shell env in two ways

$ env |sort > /tmp/shellenv

and from R

system("env | sort  > /tmp/Renv")


and then compare the two with 

$ diff -y /tmp/shellenv /tmp/Renv 

There will be many additional R_-prefixed variables that shouldn't affect Java, and TERM-related variables that are not exported. The question is: are there any Java-related differences, for example in the JAVA_HOME and CLASSPATH variables? I have to say that they are different in my environment and everything works anyway, so I am not sure exactly what I am looking for, but it seems worthwhile to take a look.
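
To keep the R_ and TERM noise out of the comparison, the two dumps can be filtered down to the variables most likely to matter before diffing. A minimal sketch; the helper name `java_env_diff` and the list of variables in `pattern` are assumptions chosen for illustration, so extend the list as needed.

```shell
# Hypothetical helper: diff only the Java/Hadoop-related variables of two
# env dumps (for example /tmp/shellenv and /tmp/Renv).
java_env_diff() {
    # Assumed list of relevant variables; adjust to taste.
    pattern='^(JAVA_HOME|CLASSPATH|PATH|LD_LIBRARY_PATH|HADOOP_[A-Z_]*)='
    a=$(mktemp)
    b=$(mktemp)
    # "|| true" keeps going when a file has no matching variable at all.
    { grep -E "$pattern" "$1" || true; } | sort > "$a"
    { grep -E "$pattern" "$2" || true; } | sort > "$b"
    # diff exits 0 when the filtered dumps agree, 1 when they differ.
    diff "$a" "$b" && status=0 || status=$?
    rm -f "$a" "$b"
    return "$status"
}

# Usage (left commented out; the files come from the two env dumps):
# java_env_diff /tmp/shellenv /tmp/Renv
```

An empty result means the two environments agree on every variable in the list, which would point away from the environment as the cause.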

Antonio

Antonio Casella

Jul 4, 2013, 5:43:19 AM
to rha...@googlegroups.com, ant...@piccolboni.info
Very strange situation!


On Wednesday, July 3, 2013 at 19:49:28 UTC+2, Antonio Piccolboni wrote:
This is puzzling. What rmr2 does is call system with a command not unlike the one you tested separately and found working. So there must be some difference between the environment of your shell prompt and the one created by a system call in R. Let's try to repeat the command that worked from inside R, that is

system("hadoop jar $HADOOP_STREAMING -input /user/data/input -output /user/data/prova-R -mapper cat -inputformat org.apache.hadoop.streaming.AutoInputFormat")

It works correctly.


and see if that succeeds. If it does, we are stuck. If it doesn't, let's save the shell env in two ways

$ env |sort > /tmp/shellenv

and from R

system("env | sort  > /tmp/Renv")

You can see the attached files and compare them with your environment variables.
Renv.txt
shellenv.txt

Antonio Piccolboni

Jul 5, 2013, 1:38:36 PM
to rha...@googlegroups.com, ant...@piccolboni.info
I am sorry to say I am left with no additional hypotheses. There is no substantial difference between the system call that succeeds and the one that fails from inside rmr2, at least from the point of view of Java finding or not finding classes. If I could reproduce the error, I would debug the function rmr.stream up to the system call and then execute that call manually, simplifying it one step at a time until it succeeds; we know from your last experiment that it eventually must. The change that makes it succeed may point us to the problem. Doing this one group message at a time may take us until next year, so you need to take the lead. What you need to know, besides the R function debug, is the ins and outs of the streaming command line, so that you can simplify it in a meaningful way. If you step through to the system call and post the value of the final.command variable here, I will try to explain the different parts and how to remove them. I am afraid this is one of the toughest installation issues we have faced so far.


Antonio

Antonio Casella

Jul 10, 2013, 10:48:08 AM
to rha...@googlegroups.com, ant...@piccolboni.info
I'm very lucky!!
Before resorting to this drastic solution, I will try a fresh installation.
In the attached file I describe the steps I'll follow, paying attention to any error messages.
Maybe I did something wrong!
rhadoop.sh

Antonio Casella

Jul 11, 2013, 9:44:26 AM
to rha...@googlegroups.com, ant...@piccolboni.info
I've performed a new installation and the situation seems better, but not totally resolved!
On my cluster, which has 5 nodes, when I run:
R CMD check rmr2_2.2.1.tar.gz
I see that it only works on 4 nodes, where all the examples are executed.
But on the first node it doesn't work, and I see our usual error:

* checking examples ... ERROR
Running examples in ‘rmr2-Ex.R’ failed
The error most likely occurred in:

> ### Name: equijoin
> ### Title: Equijoins using map reduce
> ### Aliases: equijoin
> ### ** Examples
> ##---- Should be DIRECTLY executable !! ----
> ##-- ==>  Define data, use random,
> ##-- or do  help(data=index)  for the standard data sets.
>  from.dfs(equijoin(left.input = to.dfs(keyval(1:10, 1:10^2)), right.input = to.dfs(keyval(1:10, 1:10^3))))
13/07/11 15:09:59 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/07/11 15:09:59 INFO compress.CodecPool: Got brand-new compressor [.deflate]
13/07/11 15:10:02 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/07/11 15:10:02 INFO compress.CodecPool: Got brand-new compressor [.deflate]
13/07/11 15:10:03 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [/tmp/Rtmpqv9EQl/rmr-local-env119251c93ba3, /tmp/Rtmpqv9EQl/rmr-global-env11925a6441e5, /tmp/Rtmpqv9EQl/rmr-streaming-map11927e6b23cb, /tmp/Rtmpqv9EQl/rmr-streaming-reduce119235d9923d] [] /tmp/streamjob4414613183423438889.jar tmpDir=null
13/07/11 15:10:05 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/07/11 15:10:05 INFO mapred.FileInputFormat: Total input paths to process : 2
13/07/11 15:10:06 INFO mapred.JobClient: Running job: job_201307111500_0004
13/07/11 15:10:07 INFO mapred.JobClient:  map 0% reduce 0%
13/07/11 15:10:17 INFO mapred.JobClient:  map 33% reduce 0%
13/07/11 15:10:18 INFO mapred.JobClient: Task Id : attempt_201307111500_0004_m_000001_0, Status : FAILED
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.streaming.AutoInputFormat not found

Do you have any idea?
If I run:
  small.ints = to.dfs(1:1000)
  mapreduce(
    input = small.ints,
    map = function(k, v) cbind(v, v^2))
from node2 it works, but no tasks are assigned to node1. It seems that node1 is not recognized, because I see tasks assigned to node2, node3, node4 and node5 but not to node1.

Antonio Piccolboni

Jul 11, 2013, 2:43:13 PM
to RHadoop Google Group
Reinstall Hadoop on the problem node. That class should be available; your other 4 tests show it. I am not sure about this, but I think once a class is specified as the input format it is packaged in the job jar, so it is possible that a task could run on node1 if the job is launched from another node where that class is available. It could also be that after multiple failures on node1 the node has been marked as bad and Hadoop doesn't send tasks to it anymore.
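
Before reinstalling, one way to see whether the problem node really carries a different streaming jar is to compare a checksum of the jar across all nodes. A minimal sketch, assuming passwordless ssh to hosts named node1 to node5 and the jar path shown earlier in the thread; `odd_nodes` is a hypothetical helper written for this example.

```shell
# Hypothetical helper: read "host checksum" lines on stdin and print the
# hosts whose checksum differs from the most common one.
odd_nodes() {
    awk '{ host[NR] = $1; sum[NR] = $2; count[$2]++ }
         END {
             # Pick the checksum shared by the largest number of hosts.
             for (s in count) if (count[s] > best) { best = count[s]; common = s }
             # Report every host that does not have it (a host with a
             # missing checksum, e.g. ssh failed, is reported too).
             for (i = 1; i <= NR; i++) if (sum[i] != common) print host[i]
         }'
}

# Usage sketch (assumes passwordless ssh; left commented out):
# for n in node1 node2 node3 node4 node5; do
#     sum=$(ssh "$n" "md5sum < /usr/lib/hadoop-mapreduce/hadoop-streaming-2.0.0-cdh4.1.2.jar" | cut -d' ' -f1)
#     printf '%s %s\n' "$n" "$sum"
# done | odd_nodes
```

If nothing is printed, the jars are identical everywhere and the difference has to be looked for elsewhere, for example in the configuration of the node.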


Antonio



Antonio Casella

Jul 12, 2013, 4:39:20 AM
to rha...@googlegroups.com, ant...@piccolboni.info
And finally I won!!!
The configuration is:
Ubuntu 12.04.1 LTS
Hadoop 2.0.0-cdh4.1.2
R version 3.0.1
rmr2 2.2.1
rhdfs 1.0.6

The installation instructions are in the attached file.
The problem on node1 was due to the environment variable HADOOP_STREAMING. I don't know why, but sometimes hadoop-streaming.jar is not found through HADOOP_STREAMING.
For this reason I added the library to my CLASSPATH, and now everything works fine.
I hope the installation guide is useful.
Thank you, Antonio, for your support.
R installation.tar.gz

Antonio Piccolboni

Jul 12, 2013, 1:27:27 PM
to RHadoop Google Group
I am happy it worked for you, but I am highly skeptical that it has anything to do with the setting of CLASSPATH. The setting you used doesn't even seem to exist under CDH4, and anyway things were working on 4 out of 5 nodes of your cluster without this additional configuration. I can believe you observed a correlation, but I doubt there is causation. Thanks for sharing, but I'd advise users to take this with a grain of salt.


Antonio
