Scalding 0.8.6 on CDH4?

ach...@box.com

Jul 20, 2013, 2:11:24 AM
to cascadi...@googlegroups.com
Hey folks,

I'm trying to run Scalding 0.8.6 on a VM that has CDH4 installed, specifically the WordCountJob found here: https://github.com/twitter/scalding/wiki/Getting-Started
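
For reference, the job on that page looks roughly like this (reproduced from memory, so details may differ slightly from the wiki):

import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => tokenize(line) }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))

  // Split a line into lowercase words, dropping punctuation.
  def tokenize(text: String): Array[String] =
    text.toLowerCase.replaceAll("[^a-z0-9\\s]", "").split("\\s+")
}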

Running

scripts/scald.rb --local WordCountJob.scala --input WordCountJob.scala --output DELETEME

works beautifully, but now I am trying to run it with --hdfs and get actual MapReduce jobs running.

In the Ruby script I replaced the host entry

{ "host" => "my.host.here", # where the job is rsynced to and run

with

{ "host" => "<my dev vm address here>", # where the job is rsynced to and run

I then run the same command, but with --hdfs instead of --local, and with HDFS paths in place of WordCountJob.scala and DELETEME (written like regular paths, without the hdfs:// prefix, but valid on HDFS):

scripts/scald.rb --hdfs WordCountJob.scala --input <hdfs path here - no hdfs:// prefix> --output <hdfs path - no hdfs:// prefix>

I get this:

Exception in thread "main" java.lang.Throwable: If you know what exactly caused this error, please consider contributing to GitHub via following link.
        at com.twitter.scalding.Tool$.main(Tool.scala:147)
        at com.twitter.scalding.Tool.main(Tool.scala)
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at com.twitter.scalding.Job$.apply(Job.scala:35)
        at com.twitter.scalding.Tool.getJob(Tool.scala:51)
        at com.twitter.scalding.Tool.run(Tool.scala:72)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at com.twitter.scalding.Tool$.main(Tool.scala:133)
        ... 1 more
Caused by: java.lang.RuntimeException: Please provide a value for --input
        at scala.sys.package$.error(package.scala:27)
        at com.twitter.scalding.Args.required(Args.scala:115)
        at com.twitter.scalding.Args.apply(Args.scala:90)
        at WordCountJob.<init>(WordCountJob.scala:4)

I'm also wary of running it this way, as it looks like the build file is pulling in its own Hadoop dependencies (0.20), whereas ideally I'd like to link against the Hadoop jars that come with CDH4. Perhaps packaging everything up in a fat jar and running a "regular" hadoop command is the better option? Could I get some assistance with this? I can't seem to find anything online.

Thanks!

Haidar Hadi

Jul 21, 2013, 6:20:17 PM
to cascadi...@googlegroups.com
Consider using this project to run the tutorials with hadoop jar: https://github.com/Cascading/scalding-tutorial
Haidar.

Andre Kelpe

Jul 22, 2013, 5:11:01 AM
to cascadi...@googlegroups.com
Please be aware that Cascading, and thus Scalding, is only compatible
with the Hadoop distributions listed here:
http://www.cascading.org/support/compatibility/

We run a comprehensive test suite against the Apache versions
and we encourage vendors to run the same:
https://github.com/Cascading/cascading.compatibility

André

--
André Kelpe
an...@concurrentinc.com
http://concurrentinc.com

ach...@box.com

Jul 22, 2013, 12:05:32 PM
to cascadi...@googlegroups.com
I just tried running the scalding-tutorial Tutorial0 and it works. Then I dropped in my WordCountJob from the Scalding tutorial wiki, compiled it, and tried to run this:

hadoop jar target/scalding-tutorial-0.8.6.jar WordCountJob --hdfs --input /valid/hdfs/file/path --output /valid/hdfs/path/unexistingFilename

And get greeted with this:

Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at com.twitter.scalding.Job$.apply(Job.scala:35)
at com.twitter.scalding.Tool.getJob(Tool.scala:51)
at com.twitter.scalding.Tool.run(Tool.scala:72)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at JobRunner$.main(JobRunner.scala:27)
at JobRunner.main(JobRunner.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Caused by: java.lang.RuntimeException: Please provide a value for --input
at scala.sys.package$.error(package.scala:27)
at com.twitter.scalding.Args.required(Args.scala:115)
at com.twitter.scalding.Args.apply(Args.scala:90)
at WordCountJob.<init>(WordCountJob.scala:6)

@Andre: I'm not sure what that means in terms of CDH4. I believe they provide both Hadoop v1 and v2, and I'm working with v1, I believe, so it should be fine?

Right now my goal is to get a WordCount job running with reading/writing from/to HDFS on an existing CDH4 cluster.

Oscar Boykin

Jul 22, 2013, 12:32:21 PM
to cascadi...@googlegroups.com
I wonder if what is going on is that the later version of hadoop is stripping the --input arg in the parsing here:


Perhaps this later version removes the --input.

Clearly the error is that the input arg is never making it into the job.

It is strange that it is finding it when you run in local mode.

Can you post a gist to the actual file you are running?
Oscar Boykin :: @posco :: http://twitter.com/posco

ach...@box.com

Jul 22, 2013, 12:43:27 PM
to cascadi...@googlegroups.com
It works in local mode, things break down when I want to run it in HDFS and get an actual job running on the cluster. Here's the code I'm trying to run:


If the argument stripping is indeed the case, are folks on CDH4 at a dead end? The cluster is CDH4 but we run Hadoop MRv1.

Another issue I ran into with the scalding-tutorial in the Cascading repo is that "hadoop.util.ToolRunner" does not exist when I replace the Hadoop dependencies with the ones from the CDH4 repo: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.0/CDH4-Installation-Guide/cdh4ig_topic_31.html

Oscar Boykin

Jul 22, 2013, 12:56:49 PM
to cascadi...@googlegroups.com
The stack trace says:
Caused by: java.lang.RuntimeException: Please provide a value for --input
at scala.sys.package$.error(package.scala:27)
at com.twitter.scalding.Args.required(Args.scala:115)
at com.twitter.scalding.Args.apply(Args.scala:90)
at WordCountJob.<init>(WordCountJob.scala:6)

but your paste does not access the args at that line:

has line 6:

  .groupBy('word) { _.size }

ach...@box.com

Jul 22, 2013, 12:59:57 PM
to cascadi...@googlegroups.com
Sorry about the confusion; there was a package declaration at the top followed by a newline, so line 4 in that paste should be line 6: http://pastie.org/private/kkdniuoy4geckbmjorvezw

Oscar Boykin

Jul 22, 2013, 1:03:26 PM
to cascadi...@googlegroups.com
change the args("input") to args("achang-input") and pass the arg: --achang-input

That will test if input is being stripped specially.

Also, I'm confused as to why you don't need to give the whole class name in your command, e.g. com.adelbertc.WordCountJob.

ach...@box.com

Jul 22, 2013, 1:15:14 PM
to cascadi...@googlegroups.com
Changed the flag, same error.

I do have to type the full class name; I'm not sure why I stripped off the package in my earlier message. My mistake, sorry for any confusion.

Oscar Boykin

Jul 22, 2013, 1:16:57 PM
to cascadi...@googlegroups.com
Change the job to hardwire the path and not use the args.

Also, put println(args.toString)

and let's see what is making it to your job.
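
A minimal sketch of that change (the hardwired paths below are just the placeholders used in this thread):

import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {
  // Print whatever actually arrived in Args.
  println(args.toString)

  // Hardwired paths instead of args("input") / args("output").
  TextLine("/valid/hdfs/path")
    .flatMap('line -> 'word) { line: String => line.split("\\s+") }
    .groupBy('word) { _.size }
    .write(Tsv("/valid/hdfs/path2"))
}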

ach...@box.com

Jul 22, 2013, 1:37:12 PM
to cascadi...@googlegroups.com
Aha. The plot thickens.. (using the Cascading tutorial version)

Command run: hadoop jar target/scalding-tutorial-0.8.6.jar com.adelbertc.WordCountJob --hdfs

New code:


args.toString gives: --hdfs

New error:

Caused by: java.lang.RuntimeException: Please provide a value for --/valid/hdfs/path

This is also interesting: running with the new hardwired paths plus the original flags (not sure why I tried this, but it turned up something) gives:

$ hadoop jar target/scalding-tutorial-0.8.6.jar com.adelbertc.WordCountJob --hdfs --achang-input /hdfs/path/ds=2013-06-10/part-00000 --output /hdfs/path/ds=2013-06-10/DELETEME

args.toString:
--achang-input --/box/performance/ds 2013-06-10/part-00000 --output --hdfs


It seems to have stripped the '=' in the path and is getting confused by the rest of the args. How do I adjust for this?

Perhaps this is also related to the error above?

ach...@box.com

Jul 22, 2013, 1:57:05 PM
to cascadi...@googlegroups.com
I should probably clarify: the paths I've been providing have that ds=2013-06-10 in them; I just omitted it earlier for simplicity (and am now questioning those results). Is the equals sign messing up the parsing?

Oscar Boykin

Jul 22, 2013, 2:27:07 PM
to cascadi...@googlegroups.com
In 0.8.5, = had special meaning:

a=b

is the same as --a b.

That is being removed in 0.9.0.

ach...@box.com

Jul 22, 2013, 2:29:10 PM
to cascadi...@googlegroups.com
So for 0.8.x the best way to work around it is to use a different directory structure that doesn't have an equals?

Oscar Boykin

Jul 22, 2013, 2:32:48 PM
to cascadi...@googlegroups.com
Or write a function that decodes the arg before you use it: something like substituting __ for = as a hack.
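
A minimal sketch of that workaround (the __ placeholder and the decodePath helper are illustrative, not anything built into Scalding):

import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {
  // Invoke with e.g. --input /box/performance/ds__2013-06-10/part-00000
  // and translate the placeholder back into '=' before handing it to the tap.
  def decodePath(p: String): String = p.replace("__", "=")

  TextLine(decodePath(args("input")))
    .flatMap('line -> 'word) { line: String => line.split("\\s+") }
    .groupBy('word) { _.size }
    .write(Tsv(decodePath(args("output"))))
}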

ach...@box.com

Jul 22, 2013, 2:35:50 PM
to cascadi...@googlegroups.com
So I just created a new text file with a "clean" path on HDFS in my home directory: /home/achang/someText.txt

I modified the WordCountJob code: http://pastie.org/private/syow3khkskk9cfx58smcig

I get this: 

$ hadoop jar target/scalding-tutorial-0.8.6.jar com.adelbertc.WordCountJob --hdfs
--hdfs
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at com.twitter.scalding.Job$.apply(Job.scala:35)
at com.twitter.scalding.Tool.getJob(Tool.scala:51)
at com.twitter.scalding.Tool.run(Tool.scala:72)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at JobRunner$.main(JobRunner.scala:27)
at JobRunner.main(JobRunner.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Caused by: java.lang.RuntimeException: Please provide a value for --/home/achang/foo.txt
at scala.sys.package$.error(package.scala:27)
at com.twitter.scalding.Args.required(Args.scala:115)
at com.twitter.scalding.Args.apply(Args.scala:90)
at com.adelbertc.WordCountJob.<init>(WordCountJob.scala:10)

Of interest is probably this: Caused by: java.lang.RuntimeException: Please provide a value for --/home/achang/foo.txt

???

Oscar Boykin

Jul 22, 2013, 2:36:34 PM
to cascadi...@googlegroups.com
Why did you write:


 .write( Tsv( args("/valid/hdfs/path2") ) )


rather than:

 .write( Tsv( "/valid/hdfs/path2" ) )


ach...@box.com

Jul 22, 2013, 2:37:20 PM
to cascadi...@googlegroups.com
Ignore the previous post; I realized I had args() surrounding the output path.

ach...@box.com

Jul 22, 2013, 3:49:46 PM
to cascadi...@googlegroups.com
Fixed code: http://pastie.org/private/m9qncwajsnkkbzaqw7qtkg

Command: hadoop jar target/scalding-tutorial-0.8.6.jar com.adelbertc.WordCountJob --hdfs

Exception in thread "main" com.twitter.scalding.InvalidSourceException: [TextLine(/home/achang/someText.txt)] Data is missing from one or more paths in: List(/home/achang/someText.txt)
at com.twitter.scalding.FileSource.validateTaps(FileSource.scala:102)
at com.twitter.scalding.Job$$anonfun$validateSources$1.apply(Job.scala:158)
at com.twitter.scalding.Job$$anonfun$validateSources$1.apply(Job.scala:153)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1156)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at com.twitter.scalding.Job.validateSources(Job.scala:153)
at com.twitter.scalding.Job.buildFlow(Job.scala:91)
at com.twitter.scalding.Job.run(Job.scala:126)
at com.twitter.scalding.Tool.start$1(Tool.scala:109)
at com.twitter.scalding.Tool.run(Tool.scala:125)
at com.twitter.scalding.Tool.run(Tool.scala:72)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at JobRunner$.main(JobRunner.scala:27)
at JobRunner.main(JobRunner.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)

It seems to be saying the path or the text file is invalid. If I do

$ hadoop fs -cat ~/someText.txt 

I get a few words, which is what I want.

I created the text file on HDFS using copyFromLocal..

Any ideas?

Oscar Boykin

Jul 22, 2013, 4:00:10 PM
to cascadi...@googlegroups.com
Hadoop loads directories (not files) at a time.

Pass the directory name.
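
A minimal sketch of that fix (the output location below is a placeholder):

import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {
  // Point the source at the directory that holds someText.txt,
  // not at the file itself.
  TextLine("/home/achang")
    .flatMap('line -> 'word) { line: String => line.split("\\s+") }
    .groupBy('word) { _.size }
    .write(Tsv("/some/output/path"))  // placeholder output directory
}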