Hey folks,
Running
scripts/scald.rb --local WordCountJob.scala --input WordCountJob.scala --output DELETEME
works beautifully, but now I am trying to run it with --hdfs and get actual MapReduce jobs running.
In the Ruby script I replaced the host entry

  { "host" => "my.host.here", # where the job is rsynced to and run

with

  { "host" => "<my dev vm address here>", # where the job is rsynced to and run
I then run the same command, but with --hdfs instead of --local, pointing --input and --output at HDFS paths (written like regular paths, but valid on HDFS):
scripts/scald.rb --hdfs WordCountJob.scala --input <hdfs path here - no hdfs:// prefix> --output <hdfs path - no hdfs:// prefix>
I get this:
Exception in thread "main" java.lang.Throwable: If you know what exactly caused this error, please consider contributing to GitHub via following link.
at com.twitter.scalding.Tool$.main(Tool.scala:147)
at com.twitter.scalding.Tool.main(Tool.scala)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at com.twitter.scalding.Job$.apply(Job.scala:35)
at com.twitter.scalding.Tool.getJob(Tool.scala:51)
at com.twitter.scalding.Tool.run(Tool.scala:72)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at com.twitter.scalding.Tool$.main(Tool.scala:133)
... 1 more
Caused by: java.lang.RuntimeException: Please provide a value for --input
at scala.sys.package$.error(package.scala:27)
at com.twitter.scalding.Args.required(Args.scala:115)
at com.twitter.scalding.Args.apply(Args.scala:90)
at WordCountJob.<init>(WordCountJob.scala:4)
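In case it helps diagnose: the "Please provide a value for --input" is thrown from Args.required at WordCountJob.scala:4, so it looks like the --input flag never makes it through scald.rb to the job when running in --hdfs mode. My rough mental model of the flag parsing (just a sketch, not Scalding's actual implementation):

```scala
// Sketch of "--key value" flag parsing, to illustrate why a flag that
// isn't forwarded to the job fails at construction time.
object ArgsSketch {
  // e.g. Seq("--input", "a", "--output", "b")
  //   -> Map("input" -> List("a"), "output" -> List("b"))
  def parse(argv: Seq[String]): Map[String, List[String]] =
    argv.foldLeft((Map.empty[String, List[String]], Option.empty[String])) {
      case ((acc, _), flag) if flag.startsWith("--") =>
        val key = flag.drop(2)
        (acc + (key -> acc.getOrElse(key, Nil)), Some(key))
      case ((acc, Some(key)), value) =>
        (acc + (key -> (acc(key) :+ value)), Some(key))
      case ((acc, None), _) =>
        (acc, None) // positional argument before any flag; ignored here
    }._1

  // Mimics the failure mode in the stack trace: a required key that was
  // never passed blows up with a RuntimeException.
  def required(parsed: Map[String, List[String]], key: String): String =
    parsed.get(key).flatMap(_.headOption)
      .getOrElse(sys.error(s"Please provide a value for --$key"))
}
```

So my guess is the arguments are being consumed (or dropped) somewhere between scald.rb and the remote invocation, rather than anything being wrong in the job itself.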
I'm also wary of running it this way, as the build file looks like it pulls in its own Hadoop dependencies (0.20), whereas ideally I'd link against the Hadoop jars that ship with CDH4. Perhaps packaging everything up in a fat jar and running a "regular" hadoop command is the better option? I could use some help here, as I can't seem to find anything about this online.
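For the fat-jar route, what I had in mind is roughly this (jar and project names are placeholders; this assumes an sbt-assembly setup with Hadoop marked "provided", and that com.twitter.scalding.Tool is the right main class — I haven't verified any of this against CDH4):

```shell
# Build a fat jar containing the job plus Scalding/Cascading, but with
# Hadoop as a "provided" dependency so the cluster's CDH4 jars are used.
sbt assembly

# Run with the cluster's own hadoop binary; Tool takes the job class
# name as its first argument, followed by the job's own flags.
hadoop jar target/myjob-assembly.jar com.twitter.scalding.Tool \
  WordCountJob --hdfs \
  --input  /path/on/hdfs/input \
  --output /path/on/hdfs/output
```

Does that look like the sane way to do it, or is scald.rb still the intended path for --hdfs runs?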
Thanks!