Scalding - how to use with hadoop?


Sujit Pal

Dec 28, 2012, 5:09:40 PM
to cascadi...@googlegroups.com
Hi,

Apologies if this question has been answered before; I did search and found some similar threads, but none of them answered my question, so here goes...

Versions: Scala 2.9.2, Scalding 0.7.3, Hadoop 1.0.3

I am building Scalding jobs in an sbt-enabled project, and I am able to run in local mode using "sbt run". Here is how I am set up. I have a generic caller where I drop in calls of the form:

object Caller extends App {
  (new FreqDist(Args(List(
    "--local", "",
    "--input", "/path/to/input/file",
    "--output", "/path/to/output/file"
  )))).run
}

and here is an example of an actual job:

class FreqDist(args: Args) extends Job(args) {
  val input = Tsv(args("input"), ('term, 'docID, 'freq))
  val output = Tsv(args("output"))
  input.read.
    // more operations here
    write(output)
}
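For concreteness, the elided operations could be something like the sketch below (the groupBy and the 'docFreq field are illustrative only, not my actual pipeline):

import com.twitter.scalding._

// Illustrative only: count how many documents each term appears in,
// using the fields-based API; 'docFreq is a made-up output field name.
class FreqDistSketch(args: Args) extends Job(args) {
  Tsv(args("input"), ('term, 'docID, 'freq))
    .read
    .groupBy('term) { _.size('docFreq) }
    .write(Tsv(args("output")))
}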

I now want to run this under Hadoop. I build a jar (using sbt package), call it myproject.jar, copy it to /tmp, and run the jar using hadoop jar, like this:

hadoop jar /tmp/myproject.jar mypackage.Caller  # replace "--local" with "--hdfs" in Caller.scala
or
hadoop jar /tmp/myproject.jar mypackage.FreqDist --hdfs --input /path/to/input/dir --output /path/to/output/dir

I have also copied the following JAR files from my application classpath (local ivy2 repo) to $HADOOP_HOME/lib:

-rw-r--r--@  1 root    hadoop  8857794 Aug 30 02:09 scala-library.jar
-rw-r--r--   1 hduser  hadoop  1411603 Dec 27 15:19 scalding_2.9.2.jar
-rw-r--r--   1 hduser  hadoop   510633 Dec 27 15:19 cascading-core-2.0.2.jar
-rw-r--r--   1 hduser  hadoop   212463 Dec 27 15:19 cascading-hadoop-2.0.2.jar
-rw-r--r--   1 hduser  hadoop    77807 Dec 27 15:47 maple-0.2.2.jar
-rw-r--r--   1 hduser  hadoop    36649 Dec 27 15:48 cascading-local-2.0.2.jar
-rw-r--r--   1 hduser  hadoop   234759 Dec 27 15:49 jgrapht-jdk1.6-0.8.1.jar
-rw-r--r--   1 hduser  hadoop  1501575 Dec 27 15:51 guava-10.0.1.jar

I have copied the input file to HDFS and I have verified that the input and output directories exist. When I start the job, the logs tell me that the flow started in "local" mode and it cannot find the input file (FileNotFoundException).

Additional Info:
1) I have seen threads saying that "scald.rb --hdfs" works, but I would prefer to keep my jobs outside the Scalding project.
2) I have tried giving the paths as absolute, relative, and with an hdfs:// prefix when calling from the command line with hadoop jar; none of them worked.

Is there something I am missing in these steps above?

Thanks in advance for any help you can provide,

Sujit


Christopher Severs

Dec 28, 2012, 9:19:17 PM
to cascadi...@googlegroups.com
I normally build fat jars with Maven and run them like:
hadoop jar myjar.jar com.twitter.scalding.Tool my.class --hdfs otheroptions

If you build the jar as runnable, then you can make com.twitter.scalding.Tool the main class and skip that part. If you have a decent sbt-assembly setup, it should also work fine.
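If you go the sbt route, something roughly like this in build.sbt should do it, assuming the sbt-assembly plugin (0.8.x era) is already added in project/plugins.sbt; the merge rules below are only examples and depend on your dependency set:

import AssemblyKeys._

assemblySettings

// Make com.twitter.scalding.Tool the jar's main class so you can run
// `hadoop jar myproject-assembly.jar MyJobClass --hdfs ...` directly.
mainClass in assembly := Some("com.twitter.scalding.Tool")

// Example merge rules for duplicate files that sbt assembly complains about;
// the actual cases depend on what conflicts your dependencies produce.
mergeStrategy in assembly <<= (mergeStrategy in assembly) { old =>
  {
    case "META-INF/NOTICE.txt"  => MergeStrategy.discard
    case "META-INF/LICENSE.txt" => MergeStrategy.discard
    case x => old(x)
  }
}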

-----
Chris

Sujit Pal

Dec 30, 2012, 9:36:24 PM
to cascadi...@googlegroups.com
Thank you. I guess the fat jar approach is preferable to adding individual JARs to $HADOOP_HOME/lib. I finally got "sbt assembly" to work with all the merge conflicts resolved, but now I am getting a NoSuchMethodError on slf4j because of a version mismatch... I guess I need to do some more trial and error before this will work.
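A common fix for this kind of clash is to keep Hadoop (and the slf4j it drags in) out of the fat jar by marking it "provided"; a rough build.sbt sketch, where the exact coordinates and versions are examples only:

libraryDependencies ++= Seq(
  // Scalding published for Scala 2.9.2 (version number is an example)
  "com.twitter" % "scalding_2.9.2" % "0.8.1",
  // compile against Hadoop but let the cluster supply it (and its slf4j) at runtime,
  // so the assembly does not ship a second, conflicting copy
  "org.apache.hadoop" % "hadoop-core" % "1.0.3" % "provided"
)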

-sujit

Paco Nathan

Dec 30, 2012, 10:20:25 PM
to cascadi...@googlegroups.com
FWIW, here is a Gradle build script for building Scalding fat jars to run on Hadoop.
I'm sure there is more idiomatic Gradle that could be used, but this works.



Sujit Pal

Jan 1, 2013, 4:04:54 PM
to cascadi...@googlegroups.com
Thank you. I got further this time than before, but still no luck...

For testing I just used part8 of the Impatient project and tried to run Example3 on Hadoop. Using scald.rb to run it locally succeeds, but trying to run it on HDFS using hadoop does not.

Here is what I did for running on HDFS: I built the Scalding fat jar and copied it to part8/lib, then used "gradle clean jar" to build part8.jar and copied it to /tmp so the Hadoop user can get to it. I then copied rain.txt to HDFS and ran the command as the hadoop user like so:

hduser@cyclone:~$ hadoop jar /tmp/part8.jar Example3 --hdfs --doc rain-input/rain.txt --wc rain-output
Warning: $HADOOP_HOME is deprecated.

13/01/01 12:17:50 INFO util.HadoopUtil: resolving application jar from found main method on: com.twitter.scalding.Tool$
13/01/01 12:17:50 INFO planner.HadoopPlanner: using application jar: /prod/hadoop/tmp/hadoop-unjar5174315492679253612/lib/scalding-assembly-0.8.2-SNAPSHOT.jar
13/01/01 12:17:50 INFO property.AppProps: using app.id: 7D72589BB8A9643824813D2996AD04FF
13/01/01 12:17:50 INFO flow.Flow: [Example3] starting
13/01/01 12:17:50 INFO flow.Flow: [Example3]  source: Hfs["TextDelimited[['doc_id', 'text']]"]["rain-input/rain.txt"]"]
13/01/01 12:17:50 INFO flow.Flow: [Example3]  sink: Hfs["TextDelimited[[UNKNOWN]->['token', 'count']]"]["rain-output"]"]
13/01/01 12:17:50 INFO flow.Flow: [Example3]  parallel execution is enabled: true
13/01/01 12:17:50 INFO flow.Flow: [Example3]  starting jobs: 1
13/01/01 12:17:50 INFO flow.Flow: [Example3]  allocating threads: 1
13/01/01 12:17:50 INFO flow.FlowStep: [Example3] starting step: (1/1) rain-output
13/01/01 12:17:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
13/01/01 12:17:52 WARN snappy.LoadSnappy: Snappy native library not loaded
13/01/01 12:17:52 INFO mapred.FileInputFormat: Total input paths to process : 1
13/01/01 12:17:52 INFO flow.FlowStep: [Example3] submitted hadoop job: job_201301011158_0002
13/01/01 12:21:43 WARN flow.FlowStep: [Example3] task completion events identify failed tasks
13/01/01 12:21:43 WARN flow.FlowStep: [Example3] task completion events count: 10
13/01/01 12:21:43 WARN flow.FlowStep: [Example3] event = Task Id : attempt_201301011158_0002_m_000003_0, Status : SUCCEEDED
13/01/01 12:21:43 WARN flow.FlowStep: [Example3] event = Task Id : attempt_201301011158_0002_m_000000_0, Status : FAILED
13/01/01 12:21:43 WARN flow.FlowStep: [Example3] event = Task Id : attempt_201301011158_0002_m_000001_0, Status : FAILED
13/01/01 12:21:43 WARN flow.FlowStep: [Example3] event = Task Id : attempt_201301011158_0002_m_000000_1, Status : FAILED
13/01/01 12:21:43 WARN flow.FlowStep: [Example3] event = Task Id : attempt_201301011158_0002_m_000001_1, Status : FAILED
13/01/01 12:21:43 WARN flow.FlowStep: [Example3] event = Task Id : attempt_201301011158_0002_m_000000_2, Status : FAILED
13/01/01 12:21:43 WARN flow.FlowStep: [Example3] event = Task Id : attempt_201301011158_0002_m_000001_2, Status : FAILED
13/01/01 12:21:43 WARN flow.FlowStep: [Example3] event = Task Id : attempt_201301011158_0002_m_000000_3, Status : TIPFAILED
13/01/01 12:21:43 WARN flow.FlowStep: [Example3] event = Task Id : attempt_201301011158_0002_m_000001_3, Status : TIPFAILED
13/01/01 12:21:43 WARN flow.FlowStep: [Example3] event = Task Id : attempt_201301011158_0002_m_000002_0, Status : SUCCEEDED
13/01/01 12:21:43 INFO flow.Flow: [Example3] stopping all jobs
13/01/01 12:21:43 INFO flow.FlowStep: [Example3] stopping: (1/1) rain-output
13/01/01 12:21:43 INFO flow.Flow: [Example3] stopped all jobs
13/01/01 12:21:43 INFO util.Hadoop18TapUtil: deleting temp path rain-output/_temporary
Exception in thread "main" cascading.flow.FlowException: step failed: (1/1) rain-output, with job id: job_201301011158_0002, please see cluster logs for failure messages
at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:193)
at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:137)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:122)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:42)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:680)

The logs (rain-output/_logs/) show this error...

Task TASKID="task_201301011158_0002_m_000001" TASK_TYPE="MAP" START_TIME="1357071656935" SPLITS="/default-rack/192\.168\.1\.13" .
MapAttempt TASK_TYPE="MAP" TASKID="task_201301011158_0002_m_000000" TASK_ATTEMPT_ID="attempt_201301011158_0002_m_000000_0" START_TIME="1357071656944" TRACKER_NAME="tracker_192\.168\.1\.13:localhost/127\.0\.0\.1:49907" HTTP_PORT="50060" .
MapAttempt TASK_TYPE="MAP" TASKID="task_201301011158_0002_m_000000" TASK_ATTEMPT_ID="attempt_201301011158_0002_m_000000_0" TASK_STATUS="FAILED" FINISH_TIME="1357071665878" HOSTNAME="192\.168\.1\.13" ERROR="java\.lang\.RuntimeException: Error in configuring object
at org\.apache\.hadoop\.util\.ReflectionUtils\.setJobConf(ReflectionUtils\.java:93)
at org\.apache\.hadoop\.util\.ReflectionUtils\.setConf(ReflectionUtils\.java:64)
at org\.apache\.hadoop\.util\.ReflectionUtils\.newInstance(ReflectionUtils\.java:117)
at org\.apache\.hadoop\.mapred\.MapTask\.runOldMapper(MapTask\.java:432)
at org\.apache\.hadoop\.mapred\.MapTask\.run(MapTask\.java:372)
at org\.apache\.hadoop\.mapred\.Child$4\.run(Child\.java:255)
at java\.security\.AccessController\.doPrivileged(Native Method)
at javax\.security\.auth\.Subject\.doAs(Subject\.java:396)
at org\.apache\.hadoop\.security\.UserGroupInformation\.doAs(UserGroupInformation\.java:1121)
at org\.apache\.hadoop\.mapred\.Child\.main(Child\.java:249)
Caused by: java\.lang\.reflect\.InvocationTargetException
at sun\.reflect\.NativeMethodAccessorImpl\.invoke0(Native Method)
at sun\.reflect\.NativeMethodAccessorImpl\.invoke(NativeMethodAccessorImpl\.java:39)
at sun\.reflect\.DelegatingMethodAccessorImpl\.invoke(DelegatingMethodAccessorImpl\.java:25)
at java\.lang\.reflect\.Method\.invoke(Method\.java:597)
at org\.apache\.hadoop\.util\.ReflectionUtils\.setJobConf(ReflectionUtils\.java:88)
\.\.\. 9 more
Caused by: cascading\.flow\.FlowException: internal error during mapper configuration
at cascading\.flow\.hadoop\.FlowMapper\.configure(FlowMapper\.java:96)
\.\.\. 14 more
Caused by: java\.io\.InvalidClassException: scala\.collection\.immutable\.Map$Map4; local class incompatible: stream classdesc serialVersionUID \= 7313668479060291035, local class serialVersionUID \= 1209906499091153147
at java\.io\.ObjectStreamClass\.initNonProxy(ObjectStreamClass\.java:560)
at java\.io\.ObjectInputStream\.readNonProxyDesc(ObjectInputStream\.java:1582)
at java\.io\.ObjectInputStream\.readClassDesc(ObjectInputStream\.java:1495)
at java\.io\.ObjectInputStream\.readOrdinaryObject(ObjectInputStream\.java:1731)
at java\.io\.ObjectInputStream\.readObject0(ObjectInputStream\.java:1328)
at java\.io\.ObjectInputStream\.defaultReadFields(ObjectInputStream\.java:1946)
at java\.io\.ObjectInputStream\.readSerialData(ObjectInputStream\.java:1870)
at java\.io\.ObjectInputStream\.readOrdinaryObject(ObjectInputStream\.java:1752)
at java\.io\.ObjectInputStream\.readObject0(ObjectInputStream\.java:1328)
at java\.io\.ObjectInputStream\.defaultReadFields(ObjectInputStream\.java:1946)
at java\.io\.ObjectInputStream\.readSerialData(ObjectInputStream\.java:1870)
at java\.io\.ObjectInputStream\.readOrdinaryObject(ObjectInputStream\.java:1752)
at java\.io\.ObjectInputStream\.readObject0(ObjectInputStream\.java:1328)
at java\.io\.ObjectInputStream\.defaultReadFields(ObjectInputStream\.java:1946)
at java\.io\.ObjectInputStream\.readSerialData(ObjectInputStream\.java:1870)
at java\.io\.ObjectInputStream\.readOrdinaryObject(ObjectInputStream\.java:1752)
at java\.io\.ObjectInputStream\.readObject0(ObjectInputStream\.java:1328)
at java\.io\.ObjectInputStream\.defaultReadFields(ObjectInputStream\.java:1946)
at java\.io\.ObjectInputStream\.readSerialData(ObjectInputStream\.java:1870)
at java\.io\.ObjectInputStream\.readOrdinaryObject(ObjectInputStream\.java:1752)
at java\.io\.ObjectInputStream\.readObject0(ObjectInputStream\.java:1328)
at java\.io\.ObjectInputStream\.defaultReadFields(ObjectInputStream\.java:1946)
at java\.io\.ObjectInputStream\.readSerialData(ObjectInputStream\.java:1870)
at java\.io\.ObjectInputStream\.readOrdinaryObject(ObjectInputStream\.java:1752)
at java\.io\.ObjectInputStream\.readObject0(ObjectInputStream\.java:1328)
at java\.io\.ObjectInputStream\.defaultReadFields(ObjectInputStream\.java:1946)
at java\.io\.ObjectInputStream\.readSerialData(ObjectInputStream\.java:1870)
at java\.io\.ObjectInputStream\.readOrdinaryObject(ObjectInputStream\.java:1752)
at java\.io\.ObjectInputStream\.readObject0(ObjectInputStream\.java:1328)
at java\.io\.ObjectInputStream\.skipCustomData(ObjectInputStream\.java:1911)
at java\.io\.ObjectInputStream\.readSerialData(ObjectInputStream\.java:1873)
at java\.io\.ObjectInputStream\.readOrdinaryObject(ObjectInputStream\.java:1752)
at java\.io\.ObjectInputStream\.readObject0(ObjectInputStream\.java:1328)
at java\.io\.ObjectInputStream\.defaultReadFields(ObjectInputStream\.java:1946)
at java\.io\.ObjectInputStream\.readSerialData(ObjectInputStream\.java:1870)
at java\.io\.ObjectInputStream\.readOrdinaryObject(ObjectInputStream\.java:1752)
at java\.io\.ObjectInputStream\.readObject0(ObjectInputStream\.java:1328)
at java\.io\.ObjectInputStream\.defaultReadFields(ObjectInputStream\.java:1946)
at java\.io\.ObjectInputStream\.readSerialData(ObjectInputStream\.java:1870)
at java\.io\.ObjectInputStream\.readOrdinaryObject(ObjectInputStream\.java:1752)
at java\.io\.ObjectInputStream\.readObject0(ObjectInputStream\.java:1328)
at java\.io\.ObjectInputStream\.readObject(ObjectInputStream\.java:350)
at cascading\.flow\.hadoop\.util\.HadoopUtil\.deserializeBase64(HadoopUtil\.java:370)
at cascading\.flow\.hadoop\.util\.HadoopUtil\.deserializeBase64(HadoopUtil\.java:340)
at cascading\.flow\.hadoop\.FlowMapper\.configure(FlowMapper\.java:77)
\.\.\. 14 more
" .

It looks like I may have some mismatched components somewhere (Hadoop 1.0.3, Scalding 0.8.2-SNAPSHOT, and Scala 2.9.2), but I have no idea how to go forward from here... any help is appreciated.

Thanks
Sujit

Oscar Boykin

Jan 1, 2013, 5:49:01 PM
to cascadi...@googlegroups.com
So the problem is a serialization version mismatch on the Map4 class (which is the Map implementation Scala uses for maps of size four; small maps are handled specially rather than building an immutable hash table for something that small).

My first question would be: are you using the same Scala jar on the submitter as you are on the workers? Everything is serialized when you run the hadoop jar command, and then the workers deserialize the job from the data in the jobConf or the distributed cache (this is how Cascading works). If there is a different version of Scala in those two places (submitter vs. workers), you could be in trouble.

Secondly, I noticed the Gradle build used Scala 2.9.1, but we use Scala 2.9.2 (I think) when we publish the Scalding jar. It *should* be compatible, but who knows, there could be some minor issue there.
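One quick sanity check is to print the Scala library version and the jar it was loaded from, once on the submitter and once from a task JVM; a throwaway sketch (the class name is made up):

// Throwaway sketch: report which scala-library the current JVM is using.
object ScalaVersionCheck {
  def main(args: Array[String]) {
    // e.g. "version 2.9.2"
    println("Scala library: " + scala.util.Properties.versionString)
    // Which jar the Map trait was loaded from (code source can be null for bootstrap classes)
    val src = Option(classOf[scala.collection.immutable.Map[_, _]]
      .getProtectionDomain.getCodeSource)
    println("Loaded from: " + src.map(_.getLocation.toString).getOrElse("unknown"))
  }
}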

Hope this helps.



Sujit Pal

Jan 3, 2013, 7:31:41 PM
to cascadi...@googlegroups.com
Thank you for your help. Answers to your questions inline:

> My first question would be: are you using the same scala jar on the submitter as you are on the workers? 
I am using pseudo-distributed mode, so the same jar is being used for both the submitter and the workers. Also, the scala-library jar is being packaged inside part8.jar by Gradle.

I did find that the dependency in build.gradle in Impatient/part8 was Scala 2.9.1, so I changed it locally to 2.9.2 and rebuilt part8.jar.

dependencies {
  // Scala compiler, related tools, standard library
  scalaTools 'org.scala-lang:scala-compiler:2.9.2'
  compile 'org.scala-lang:scala-library:2.9.2'

  // Scalding build; better local, Twitter is slow to update ConJars.org
  // see README.md
  compile fileTree( dir: 'lib', includes: ['*.jar'] )
}

I also verified that the part8.jar file contains the following jar files in the lib/ directory:

34066568 Tue Jan 01 11:55:40 PST 2013 lib/scalding-assembly-0.8.2-SNAPSHOT.jar
8857794 Fri Jun 29 15:26:28 PDT 2012 lib/scala-library-2.9.2.jar

and Scalding (downloaded from git) also points to 2.9.2 in build.sbt (scalaVersion).

I still get the same error on the console when I run it with "hadoop jar".

hduser@cyclone:~$ hadoop jar /tmp/part8.jar Example3 --hdfs --doc rain-input/rain.txt --wc rain-output
Warning: $HADOOP_HOME is deprecated.

13/01/03 15:21:48 INFO util.HadoopUtil: resolving application jar from found main method on: com.twitter.scalding.Tool$
13/01/03 15:21:48 INFO planner.HadoopPlanner: using application jar: /prod/hadoop/tmp/hadoop-unjar1118736656589291865/lib/scalding-assembly-0.8.2-SNAPSHOT.jar
13/01/03 15:21:48 INFO property.AppProps: using app.id: 5EBA9160ADF7C293BD55BD0728257D21
13/01/03 15:21:49 INFO flow.Flow: [Example3] starting
13/01/03 15:21:49 INFO flow.Flow: [Example3]  source: Hfs["TextDelimited[['doc_id', 'text']]"]["rain-input/rain.txt"]"]
13/01/03 15:21:49 INFO flow.Flow: [Example3]  sink: Hfs["TextDelimited[[UNKNOWN]->['token', 'count']]"]["rain-output"]"]
13/01/03 15:21:49 INFO flow.Flow: [Example3]  parallel execution is enabled: true
13/01/03 15:21:49 INFO flow.Flow: [Example3]  starting jobs: 1
13/01/03 15:21:49 INFO flow.Flow: [Example3]  allocating threads: 1
13/01/03 15:21:49 INFO flow.FlowStep: [Example3] starting step: (1/1) rain-output
13/01/03 15:21:50 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
13/01/03 15:21:50 WARN snappy.LoadSnappy: Snappy native library not loaded
13/01/03 15:21:50 INFO mapred.FileInputFormat: Total input paths to process : 1
13/01/03 15:21:50 INFO flow.FlowStep: [Example3] submitted hadoop job: job_201301031513_0002
13/01/03 15:25:10 WARN flow.FlowStep: [Example3] task completion events identify failed tasks
13/01/03 15:25:10 WARN flow.FlowStep: [Example3] task completion events count: 10
13/01/03 15:25:10 WARN flow.FlowStep: [Example3] event = Task Id : attempt_201301031513_0002_m_000003_0, Status : SUCCEEDED
13/01/03 15:25:10 WARN flow.FlowStep: [Example3] event = Task Id : attempt_201301031513_0002_m_000000_0, Status : FAILED
13/01/03 15:25:10 WARN flow.FlowStep: [Example3] event = Task Id : attempt_201301031513_0002_m_000001_0, Status : FAILED
13/01/03 15:25:10 WARN flow.FlowStep: [Example3] event = Task Id : attempt_201301031513_0002_m_000000_1, Status : FAILED
13/01/03 15:25:10 WARN flow.FlowStep: [Example3] event = Task Id : attempt_201301031513_0002_m_000001_1, Status : FAILED
13/01/03 15:25:10 WARN flow.FlowStep: [Example3] event = Task Id : attempt_201301031513_0002_m_000000_2, Status : FAILED
13/01/03 15:25:10 WARN flow.FlowStep: [Example3] event = Task Id : attempt_201301031513_0002_m_000001_2, Status : FAILED
13/01/03 15:25:10 WARN flow.FlowStep: [Example3] event = Task Id : attempt_201301031513_0002_m_000000_3, Status : TIPFAILED
13/01/03 15:25:10 WARN flow.FlowStep: [Example3] event = Task Id : attempt_201301031513_0002_m_000001_3, Status : TIPFAILED
13/01/03 15:25:10 WARN flow.FlowStep: [Example3] event = Task Id : attempt_201301031513_0002_m_000002_0, Status : SUCCEEDED
13/01/03 15:25:10 INFO flow.Flow: [Example3] stopping all jobs
13/01/03 15:25:10 INFO flow.FlowStep: [Example3] stopping: (1/1) rain-output
13/01/03 15:25:10 INFO flow.Flow: [Example3] stopped all jobs
13/01/03 15:25:10 INFO util.Hadoop18TapUtil: deleting temp path rain-output/_temporary
Exception in thread "main" cascading.flow.FlowException: step failed: (1/1) rain-output, with job id: job_201301031513_0002, please see cluster logs for failure messages
at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:193)
at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:137)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:122)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:42)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:680)

But the error in the log file is a little different this time... notice there is no mention of Map4 like before; this time it seems to be complaining about Example3$$anonfun$3.

Task TASKID="task_201301031513_0002_m_000001" TASK_TYPE="MAP" START_TIME="1357255463802" SPLITS="/default-rack/192\.168\.1\.13" .
MapAttempt TASK_TYPE="MAP" TASKID="task_201301031513_0002_m_000000" TASK_ATTEMPT_ID="attempt_201301031513_0002_m_000000_0" START_TIME="1357255463812" TRACKER_NAME="tracker_192\.168\.1\.13:localhost/127\.0\.0\.1:50535" HTTP_PORT="50060" .
MapAttempt TASK_TYPE="MAP" TASKID="task_201301031513_0002_m_000000" TASK_ATTEMPT_ID="attempt_201301031513_0002_m_000000_0" TASK_STATUS="FAILED" FINISH_TIME="1357255472706" HOSTNAME="192\.168\.1\.13" ERROR="java\.lang\.RuntimeException: Error in configuring object
at org\.apache\.hadoop\.util\.ReflectionUtils\.setJobConf(ReflectionUtils\.java:93)
at org\.apache\.hadoop\.util\.ReflectionUtils\.setConf(ReflectionUtils\.java:64)
at org\.apache\.hadoop\.util\.ReflectionUtils\.newInstance(ReflectionUtils\.java:117)
at org\.apache\.hadoop\.mapred\.MapTask\.runOldMapper(MapTask\.java:432)
at org\.apache\.hadoop\.mapred\.MapTask\.run(MapTask\.java:372)
at org\.apache\.hadoop\.mapred\.Child$4\.run(Child\.java:255)
at java\.security\.AccessController\.doPrivileged(Native Method)
at javax\.security\.auth\.Subject\.doAs(Subject\.java:396)
at org\.apache\.hadoop\.security\.UserGroupInformation\.doAs(UserGroupInformation\.java:1121)
at org\.apache\.hadoop\.mapred\.Child\.main(Child\.java:249)
Caused by: java\.lang\.reflect\.InvocationTargetException
at sun\.reflect\.NativeMethodAccessorImpl\.invoke0(Native Method)
at sun\.reflect\.NativeMethodAccessorImpl\.invoke(NativeMethodAccessorImpl\.java:39)
at sun\.reflect\.DelegatingMethodAccessorImpl\.invoke(DelegatingMethodAccessorImpl\.java:25)
at java\.lang\.reflect\.Method\.invoke(Method\.java:597)
at org\.apache\.hadoop\.util\.ReflectionUtils\.setJobConf(ReflectionUtils\.java:88)
\.\.\. 9 more
Caused by: cascading\.flow\.FlowException: unable to deserialize data
at cascading\.flow\.hadoop\.util\.HadoopUtil\.deserializeBase64(HadoopUtil\.java:374)
at cascading\.flow\.hadoop\.util\.HadoopUtil\.deserializeBase64(HadoopUtil\.java:340)
at cascading\.flow\.hadoop\.FlowMapper\.configure(FlowMapper\.java:77)
\.\.\. 14 more
Caused by: java\.lang\.ClassNotFoundException: Example3$$anonfun$3
at java\.net\.URLClassLoader$1\.run(URLClassLoader\.java:202)
at java\.security\.AccessController\.doPrivileged(Native Method)
at java\.net\.URLClassLoader\.findClass(URLClassLoader\.java:190)
at java\.lang\.ClassLoader\.loadClass(ClassLoader\.java:306)
at sun\.misc\.Launcher$AppClassLoader\.loadClass(Launcher\.java:301)
at java\.lang\.ClassLoader\.loadClass(ClassLoader\.java:247)
at java\.lang\.Class\.forName0(Native Method)
at java\.lang\.Class\.forName(Class\.java:247)
at java\.io\.ObjectInputStream\.resolveClass(ObjectInputStream\.java:603)
at cascading\.flow\.hadoop\.util\.HadoopUtil$1\.resolveClass(HadoopUtil\.java:365)
at java\.io\.ObjectInputStream\.readNonProxyDesc(ObjectInputStream\.java:1574)
at java\.io\.ObjectInputStream\.readClassDesc(ObjectInputStream\.java:1495)
at java\.io\.ObjectInputStream\.readOrdinaryObject(ObjectInputStream\.java:1731)
at java\.io\.ObjectInputStream\.readObject0(ObjectInputStream\.java:1328)
at java\.io\.ObjectInputStream\.defaultReadFields(ObjectInputStream\.java:1946)
at java\.io\.ObjectInputStream\.readSerialData(ObjectInputStream\.java:1870)
at java\.io\.ObjectInputStream\.readOrdinaryObject(ObjectInputStream\.java:1752)
at java\.io\.ObjectInputStream\.readObject0(ObjectInputStream\.java:1328)
at java\.io\.ObjectInputStream\.defaultReadFields(ObjectInputStream\.java:1946)
at java\.io\.ObjectInputStream\.readSerialData(ObjectInputStream\.java:1870)
at java\.io\.ObjectInputStream\.readOrdinaryObject(ObjectInputStream\.java:1752)
at java\.io\.ObjectInputStream\.readObject0(ObjectInputStream\.java:1328)
at java\.io\.ObjectInputStream\.defaultReadFields(ObjectInputStream\.java:1946)
at java\.io\.ObjectInputStream\.readSerialData(ObjectInputStream\.java:1870)
at java\.io\.ObjectInputStream\.readOrdinaryObject(ObjectInputStream\.java:1752)
at java\.io\.ObjectInputStream\.readObject0(ObjectInputStream\.java:1328)
at java\.io\.ObjectInputStream\.readObject(ObjectInputStream\.java:350)
at java\.util\.HashMap\.readObject(HashMap\.java:1030)
at sun\.reflect\.NativeMethodAccessorImpl\.invoke0(Native Method)
at sun\.reflect\.NativeMethodAccessorImpl\.invoke(NativeMethodAccessorImpl\.java:39)
at sun\.reflect\.DelegatingMethodAccessorImpl\.invoke(DelegatingMethodAccessorImpl\.java:25)
at java\.lang\.reflect\.Method\.invoke(Method\.java:597)
at java\.io\.ObjectStreamClass\.invokeReadObject(ObjectStreamClass\.java:969)
at java\.io\.ObjectInputStream\.readSerialData(ObjectInputStream\.java:1848)
at java\.io\.ObjectInputStream\.readOrdinaryObject(ObjectInputStream\.java:1752)
at java\.io\.ObjectInputStream\.readObject0(ObjectInputStream\.java:1328)
at java\.io\.ObjectInputStream\.defaultReadFields(ObjectInputStream\.java:1946)
at java\.io\.ObjectInputStream\.readSerialData(ObjectInputStream\.java:1870)
at java\.io\.ObjectInputStream\.readOrdinaryObject(ObjectInputStream\.java:1752)
at java\.io\.ObjectInputStream\.readObject0(ObjectInputStream\.java:1328)
at java\.io\.ObjectInputStream\.defaultReadFields(ObjectInputStream\.java:1946)
at java\.io\.ObjectInputStream\.readSerialData(ObjectInputStream\.java:1870)
at java\.io\.ObjectInputStream\.readOrdinaryObject(ObjectInputStream\.java:1752)
at java\.io\.ObjectInputStream\.readObject0(ObjectInputStream\.java:1328)
at java\.io\.ObjectInputStream\.readObject(ObjectInputStream\.java:350)
at cascading\.flow\.hadoop\.util\.HadoopUtil\.deserializeBase64(HadoopUtil\.java:370)
\.\.\. 16 more
" .

I verified that the Example3$$anonfun$3 class is available in the part8.jar:

sujit@cyclone:libs$ jar tvf part8.jar | grep "Example3\$\$anonfun\$3"
  1275 Thu Jan 03 15:12:58 PST 2013 Example3$$anonfun$3.class

In my $HADOOP_HOME/lib I have no Scala dependencies (or any jars apart from the ones that come with the Hadoop distribution).

Given that Paco could run this without problems on Hadoop in non-local mode, I suspected something was broken, misconfigured, or different with my Hadoop installation. To rule that out, I tried running Impatient/part1 (Cascading Java) and it ran without problems.

Let me investigate a bit and get back. I want to see if I can use the Cascading API directly from Scala and if I can run the resulting jar under Hadoop (non-local mode). If so, that would work for me as well.
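For reference, a rough, untested sketch of what that experiment might look like (imports, field names, paths, and the class name are all placeholders):

import java.util.Properties

import cascading.flow.hadoop.HadoopFlowConnector
import cascading.pipe.Pipe
import cascading.property.AppProps
import cascading.scheme.hadoop.TextDelimited
import cascading.tap.hadoop.Hfs
import cascading.tuple.Fields

// Trivial copy flow built against the Cascading 2.0 Java API from Scala.
object CopyFlow {
  def main(args: Array[String]) {
    // args(0) = HDFS input path, args(1) = HDFS output path
    val source = new Hfs(new TextDelimited(new Fields("doc_id", "text"), "\t"), args(0))
    val sink   = new Hfs(new TextDelimited(new Fields("doc_id", "text"), "\t"), args(1))

    val props = new Properties()
    // Tell Cascading which jar (the one containing this class) to ship to the cluster.
    AppProps.setApplicationJarClass(props, this.getClass)

    new HadoopFlowConnector(props)
      .connect("copy", source, sink, new Pipe("copy"))
      .complete()
  }
}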

-sujit

Paco Nathan

Jan 5, 2013, 1:33:58 AM
to cascadi...@googlegroups.com
Hi Sujit,

I tried running my Scalding build on AWS and ran into a different problem -- the one that Chris Severs and Philippe Laflamme had both reported in: https://groups.google.com/forum/#!msg/cascading-user/80o-_b2Bmf4/Pl1RSQJHRiUJ

The Gradle build script is revised now to fix the version issue that Oscar pointed out, and also to pull Scalding from the correct Maven repo. I've added a fix from Chris Severs showing how to work around the ClassNotFoundException issue. Also, there's a script to run on EMR:


Kudos to Chris Severs for that workaround, cool stuff.

Paco



Sujit Pal

Jan 6, 2013, 9:50:03 PM
to cascadi...@googlegroups.com
This worked! Yoohoo! Thanks very much Paco.

The only thing worth mentioning is that with the new build.gradle script we no longer need the Scalding fat jar in the lib/ directory (in fact, my first try with it in there gave me an IncompatibleClassChangeError: Class com.twitter.algebird.LongRing$ does not implement the requested interface com.twitter.algebird.Semigroup). I then noticed that Gradle pulls in Scalding and its dependencies as a flat list, so I removed the fat jar and it worked.

Thanks again,
Sujit