Thank you. I got further this time than before, but still no luck...
For testing I used part8 of the Impatient project and tried to run Example3 on Hadoop. Running locally with scald.rb succeeds, but running on HDFS via hadoop does not.
Here is what I did to run on HDFS: I built the Scalding fat jar and copied it into part8/lib, ran "gradle clean jar" to build part8.jar, and copied that to /tmp so the Hadoop user can reach it. I then copied rain.txt to HDFS and ran the job as the hadoop user like so:
hduser@cyclone:~$ hadoop jar /tmp/part8.jar Example3 --hdfs --doc rain-input/rain.txt --wc rain-output
Warning: $HADOOP_HOME is deprecated.
13/01/01 12:17:50 INFO util.HadoopUtil: resolving application jar from found main method on: com.twitter.scalding.Tool$
13/01/01 12:17:50 INFO planner.HadoopPlanner: using application jar: /prod/hadoop/tmp/hadoop-unjar5174315492679253612/lib/scalding-assembly-0.8.2-SNAPSHOT.jar
13/01/01 12:17:50 INFO property.AppProps: using
app.id: 7D72589BB8A9643824813D2996AD04FF
13/01/01 12:17:50 INFO flow.Flow: [Example3] starting
13/01/01 12:17:50 INFO flow.Flow: [Example3] source: Hfs["TextDelimited[['doc_id', 'text']]"]["rain-input/rain.txt"]"]
13/01/01 12:17:50 INFO flow.Flow: [Example3] sink: Hfs["TextDelimited[[UNKNOWN]->['token', 'count']]"]["rain-output"]"]
13/01/01 12:17:50 INFO flow.Flow: [Example3] parallel execution is enabled: true
13/01/01 12:17:50 INFO flow.Flow: [Example3] starting jobs: 1
13/01/01 12:17:50 INFO flow.Flow: [Example3] allocating threads: 1
13/01/01 12:17:50 INFO flow.FlowStep: [Example3] starting step: (1/1) rain-output
13/01/01 12:17:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
13/01/01 12:17:52 WARN snappy.LoadSnappy: Snappy native library not loaded
13/01/01 12:17:52 INFO mapred.FileInputFormat: Total input paths to process : 1
13/01/01 12:17:52 INFO flow.FlowStep: [Example3] submitted hadoop job: job_201301011158_0002
13/01/01 12:21:43 WARN flow.FlowStep: [Example3] task completion events identify failed tasks
13/01/01 12:21:43 WARN flow.FlowStep: [Example3] task completion events count: 10
13/01/01 12:21:43 WARN flow.FlowStep: [Example3] event = Task Id : attempt_201301011158_0002_m_000003_0, Status : SUCCEEDED
13/01/01 12:21:43 WARN flow.FlowStep: [Example3] event = Task Id : attempt_201301011158_0002_m_000000_0, Status : FAILED
13/01/01 12:21:43 WARN flow.FlowStep: [Example3] event = Task Id : attempt_201301011158_0002_m_000001_0, Status : FAILED
13/01/01 12:21:43 WARN flow.FlowStep: [Example3] event = Task Id : attempt_201301011158_0002_m_000000_1, Status : FAILED
13/01/01 12:21:43 WARN flow.FlowStep: [Example3] event = Task Id : attempt_201301011158_0002_m_000001_1, Status : FAILED
13/01/01 12:21:43 WARN flow.FlowStep: [Example3] event = Task Id : attempt_201301011158_0002_m_000000_2, Status : FAILED
13/01/01 12:21:43 WARN flow.FlowStep: [Example3] event = Task Id : attempt_201301011158_0002_m_000001_2, Status : FAILED
13/01/01 12:21:43 WARN flow.FlowStep: [Example3] event = Task Id : attempt_201301011158_0002_m_000000_3, Status : TIPFAILED
13/01/01 12:21:43 WARN flow.FlowStep: [Example3] event = Task Id : attempt_201301011158_0002_m_000001_3, Status : TIPFAILED
13/01/01 12:21:43 WARN flow.FlowStep: [Example3] event = Task Id : attempt_201301011158_0002_m_000002_0, Status : SUCCEEDED
13/01/01 12:21:43 INFO flow.Flow: [Example3] stopping all jobs
13/01/01 12:21:43 INFO flow.FlowStep: [Example3] stopping: (1/1) rain-output
13/01/01 12:21:43 INFO flow.Flow: [Example3] stopped all jobs
13/01/01 12:21:43 INFO util.Hadoop18TapUtil: deleting temp path rain-output/_temporary
Exception in thread "main" cascading.flow.FlowException: step failed: (1/1) rain-output, with job id: job_201301011158_0002, please see cluster logs for failure messages
at cascading.flow.planner.FlowStepJob.blockOnJob(FlowStepJob.java:193)
at cascading.flow.planner.FlowStepJob.start(FlowStepJob.java:137)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:122)
at cascading.flow.planner.FlowStepJob.call(FlowStepJob.java:42)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:680)
The logs (rain-output/_logs/) show this error:
Task TASKID="task_201301011158_0002_m_000001" TASK_TYPE="MAP" START_TIME="1357071656935" SPLITS="/default-rack/192\.168\.1\.13" .
MapAttempt TASK_TYPE="MAP" TASKID="task_201301011158_0002_m_000000" TASK_ATTEMPT_ID="attempt_201301011158_0002_m_000000_0" START_TIME="1357071656944" TRACKER_NAME="tracker_192\.168\.1\.13:localhost/127\.0\.0\.1:49907" HTTP_PORT="50060" .
MapAttempt TASK_TYPE="MAP" TASKID="task_201301011158_0002_m_000000" TASK_ATTEMPT_ID="attempt_201301011158_0002_m_000000_0" TASK_STATUS="FAILED" FINISH_TIME="1357071665878" HOSTNAME="192\.168\.1\.13" ERROR="java\.lang\.RuntimeException: Error in configuring object
at org\.apache\.hadoop\.util\.ReflectionUtils\.setJobConf(ReflectionUtils\.java:93)
at org\.apache\.hadoop\.util\.ReflectionUtils\.setConf(ReflectionUtils\.java:64)
at org\.apache\.hadoop\.util\.ReflectionUtils\.newInstance(ReflectionUtils\.java:117)
at org\.apache\.hadoop\.mapred\.MapTask\.runOldMapper(MapTask\.java:432)
at org\.apache\.hadoop\.mapred\.MapTask\.run(MapTask\.java:372)
at org\.apache\.hadoop\.mapred\.Child$4\.run(Child\.java:255)
at java\.security\.AccessController\.doPrivileged(Native Method)
at javax\.security\.auth\.Subject\.doAs(Subject\.java:396)
at org\.apache\.hadoop\.security\.UserGroupInformation\.doAs(UserGroupInformation\.java:1121)
at org\.apache\.hadoop\.mapred\.Child\.main(Child\.java:249)
Caused by: java\.lang\.reflect\.InvocationTargetException
at sun\.reflect\.NativeMethodAccessorImpl\.invoke0(Native Method)
at sun\.reflect\.NativeMethodAccessorImpl\.invoke(NativeMethodAccessorImpl\.java:39)
at sun\.reflect\.DelegatingMethodAccessorImpl\.invoke(DelegatingMethodAccessorImpl\.java:25)
at java\.lang\.reflect\.Method\.invoke(Method\.java:597)
at org\.apache\.hadoop\.util\.ReflectionUtils\.setJobConf(ReflectionUtils\.java:88)
\.\.\. 9 more
Caused by: cascading\.flow\.FlowException: internal error during mapper configuration
at cascading\.flow\.hadoop\.FlowMapper\.configure(FlowMapper\.java:96)
\.\.\. 14 more
Caused by: java\.io\.InvalidClassException: scala\.collection\.immutable\.Map$Map4; local class incompatible: stream classdesc serialVersionUID \= 7313668479060291035, local class serialVersionUID \= 1209906499091153147
at java\.io\.ObjectStreamClass\.initNonProxy(ObjectStreamClass\.java:560)
at java\.io\.ObjectInputStream\.readNonProxyDesc(ObjectInputStream\.java:1582)
at java\.io\.ObjectInputStream\.readClassDesc(ObjectInputStream\.java:1495)
at java\.io\.ObjectInputStream\.readOrdinaryObject(ObjectInputStream\.java:1731)
at java\.io\.ObjectInputStream\.readObject0(ObjectInputStream\.java:1328)
at java\.io\.ObjectInputStream\.defaultReadFields(ObjectInputStream\.java:1946)
at java\.io\.ObjectInputStream\.readSerialData(ObjectInputStream\.java:1870)
at java\.io\.ObjectInputStream\.readOrdinaryObject(ObjectInputStream\.java:1752)
at java\.io\.ObjectInputStream\.readObject0(ObjectInputStream\.java:1328)
at java\.io\.ObjectInputStream\.defaultReadFields(ObjectInputStream\.java:1946)
at java\.io\.ObjectInputStream\.readSerialData(ObjectInputStream\.java:1870)
at java\.io\.ObjectInputStream\.readOrdinaryObject(ObjectInputStream\.java:1752)
at java\.io\.ObjectInputStream\.readObject0(ObjectInputStream\.java:1328)
at java\.io\.ObjectInputStream\.defaultReadFields(ObjectInputStream\.java:1946)
at java\.io\.ObjectInputStream\.readSerialData(ObjectInputStream\.java:1870)
at java\.io\.ObjectInputStream\.readOrdinaryObject(ObjectInputStream\.java:1752)
at java\.io\.ObjectInputStream\.readObject0(ObjectInputStream\.java:1328)
at java\.io\.ObjectInputStream\.defaultReadFields(ObjectInputStream\.java:1946)
at java\.io\.ObjectInputStream\.readSerialData(ObjectInputStream\.java:1870)
at java\.io\.ObjectInputStream\.readOrdinaryObject(ObjectInputStream\.java:1752)
at java\.io\.ObjectInputStream\.readObject0(ObjectInputStream\.java:1328)
at java\.io\.ObjectInputStream\.defaultReadFields(ObjectInputStream\.java:1946)
at java\.io\.ObjectInputStream\.readSerialData(ObjectInputStream\.java:1870)
at java\.io\.ObjectInputStream\.readOrdinaryObject(ObjectInputStream\.java:1752)
at java\.io\.ObjectInputStream\.readObject0(ObjectInputStream\.java:1328)
at java\.io\.ObjectInputStream\.defaultReadFields(ObjectInputStream\.java:1946)
at java\.io\.ObjectInputStream\.readSerialData(ObjectInputStream\.java:1870)
at java\.io\.ObjectInputStream\.readOrdinaryObject(ObjectInputStream\.java:1752)
at java\.io\.ObjectInputStream\.readObject0(ObjectInputStream\.java:1328)
at java\.io\.ObjectInputStream\.skipCustomData(ObjectInputStream\.java:1911)
at java\.io\.ObjectInputStream\.readSerialData(ObjectInputStream\.java:1873)
at java\.io\.ObjectInputStream\.readOrdinaryObject(ObjectInputStream\.java:1752)
at java\.io\.ObjectInputStream\.readObject0(ObjectInputStream\.java:1328)
at java\.io\.ObjectInputStream\.defaultReadFields(ObjectInputStream\.java:1946)
at java\.io\.ObjectInputStream\.readSerialData(ObjectInputStream\.java:1870)
at java\.io\.ObjectInputStream\.readOrdinaryObject(ObjectInputStream\.java:1752)
at java\.io\.ObjectInputStream\.readObject0(ObjectInputStream\.java:1328)
at java\.io\.ObjectInputStream\.defaultReadFields(ObjectInputStream\.java:1946)
at java\.io\.ObjectInputStream\.readSerialData(ObjectInputStream\.java:1870)
at java\.io\.ObjectInputStream\.readOrdinaryObject(ObjectInputStream\.java:1752)
at java\.io\.ObjectInputStream\.readObject0(ObjectInputStream\.java:1328)
at java\.io\.ObjectInputStream\.readObject(ObjectInputStream\.java:350)
at cascading\.flow\.hadoop\.util\.HadoopUtil\.deserializeBase64(HadoopUtil\.java:370)
at cascading\.flow\.hadoop\.util\.HadoopUtil\.deserializeBase64(HadoopUtil\.java:340)
at cascading\.flow\.hadoop\.FlowMapper\.configure(FlowMapper\.java:77)
\.\.\. 14 more
" .
It looks like I may have some mismatched components somewhere (Hadoop 1.0.3, Scalding 0.8.2-SNAPSHOT, Scala 2.9.2). The InvalidClassException on scala.collection.immutable.Map$Map4 (serialVersionUID mismatch) makes me suspect the Scala library the jar was built against differs from the one on the cluster's task classpath, but I have no idea how to go forward from here... any help is appreciated.