Running Cascading.JRuby on a cluster

Christopher Lin

Aug 12, 2011, 4:50:44 AM
to cascading-user
Hi, I'm trying to run cascading.jruby on a cluster. I've set
CASCADING_HOME, HADOOP_HOME, and HADOOP_CONF_DIR appropriately, but
when I run the samples/logwordcount.rb example, it appears to run
entirely in process and picks up the config from cascading/src/test/
hadoop-site.xml instead of the configs from my HADOOP_CONF_DIR. What
am I missing to make a job actually run on the cluster?

Thank you in advance for any help!

Chris K Wensel

Aug 12, 2011, 10:51:30 AM
to cascadi...@googlegroups.com
I've not used the Ruby stuff, but I can say the env vars are only used by the hadoop bash scripts.

you probably need to put the conf_dir at the top of your classpath (which is what the bash scripts do).
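
A minimal Ruby sketch of that ordering, for illustration only. It is not taken from cascading.jruby; the helper name `classpath_with_conf_first` and the example paths are hypothetical. The point is just that the Hadoop conf directory goes ahead of every other entry, so its config files are found before any copy bundled elsewhere on the classpath.

```ruby
# Hypothetical helper: build a classpath string with the Hadoop conf
# directory as the first entry, dropping any later duplicate of it.
def classpath_with_conf_first(conf_dir, entries, separator = File::PATH_SEPARATOR)
  rest = entries.reject { |entry| entry == conf_dir }
  ([conf_dir] + rest).join(separator)
end

# Example with made-up paths:
#   classpath_with_conf_first(ENV['HADOOP_CONF_DIR'], Dir['lib/*.jar'])
```

The resulting string could then be handed to the JVM via `-cp` or `CLASSPATH`; how cascading.jruby itself assembles its classpath is not covered in this thread.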

ckw

--
Chris K Wensel
ch...@concurrentinc.com
http://www.concurrentinc.com

-- Concurrent, Inc. offers mentoring, support for Cascading

Ted Dunning

Aug 12, 2011, 1:03:35 PM
to cascadi...@googlegroups.com
I have worked with a company that used Ruby heavily to run Cascading, but I think that they built their own DSL for the most part.  They were quite happy with the results.

Matt

Sep 6, 2011, 10:14:56 PM
to cascadi...@googlegroups.com
Hey Christopher,

Sorry for the late reply; I don't frequently check this list.

The samples in cascading.jruby are hard-wired to run locally just to demonstrate the functionality of cascades built in cascading.jruby.  Note on line 22 of the logwordcount sample, sample_properties are passed into the Cascade#complete method:
https://github.com/mrwalker/cascading.jruby/blob/master/samples/logwordcount.rb#L22

The sample_properties method is defined in samples/cascading.rb:
https://github.com/mrwalker/cascading.jruby/blob/master/samples/cascading.rb

The "correct" way to execute a cascading.jruby script on a cluster is to build a jar containing your script and the libraries it depends upon, and bake the "runner" (https://github.com/mrwalker/cascading.jruby/blob/master/src/cascading/jruby/Main.java) into it so you can use the hadoop jar command:
http://hadoop.apache.org/common/docs/current/commands_manual.html#jar

The gem has a script called make_job that was intended to help you build the required jar:
https://github.com/mrwalker/cascading.jruby/blob/master/bin/make_job

Note that you're getting into a pretty crufty part of the gem at this point; at Etsy, we have our own internal "runner" and build process which we've been meaning to release but is still too tightly integrated to our own custom stack.  This is the part of the gem that I get the most questions about and needs the most work to make it truly useful out of the box.

Matt

Jun 4, 2012, 11:37:16 PM
to cascadi...@googlegroups.com
I have a v0 of a build tool for cascading.jruby jars up on github: https://github.com/etsy/jading

Note that this isn't exactly what Etsy uses internally, but I'm hoping to migrate us as soon as I'm able.  There's no rocket science, here, just a straightforward build tool that allows you to combine your c.j scripts, Ruby gems, and jars into a single job jar for execution using hadoop jar.  This replaces make_job and a good chunk of the old rake tasks in cascading.jruby which I plan to clean out in the near future.
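
As a rough sketch of what launching such a job jar might look like from Ruby: the helper below only assembles a `hadoop jar` command line. It assumes the job script name is passed as the first argument after the jar, which the thread suggests (the runner logs "Requiring" the script) but does not spell out. The helper name and the file names in the example are hypothetical.

```ruby
require 'shellwords'

# Hypothetical helper: argv for running a script packaged in a job jar.
# Assumption: the runner inside the jar takes the script name first,
# followed by the script's own arguments.
def hadoop_jar_command(job_jar, script, *script_args)
  ['hadoop', 'jar', job_jar, script, *script_args]
end

# Render the argv as a single, safely quoted shell string.
def to_shell(argv)
  Shellwords.join(argv)
end

# Example with made-up names:
#   system(*hadoop_jar_command('jade.jar', 'logwordcount.rb', 'input', 'output'))
```

Passing the argv array to `system` directly avoids shell quoting problems; `to_shell` is only there for logging or copy-pasting the command.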

Ben Linsay

Aug 1, 2012, 2:34:07 AM
to cascadi...@googlegroups.com
Hi Matt,

Sorry to resurrect the dead: I've been screwing around with jading a bunch after (unsuccessfully) screwing with 'make_job' a bunch. It's definitely an improvement (good job!), but I still have a few questions.

* I'm screwing around on my laptop to run this, and having no problem running scripts that exist outside the jar. Scripts from inside the jar fail with a message about not being able to find 'cascading' (see below). Did I somehow mess up the paths in the jar?

Jul 31, 2012 11:28:23 PM com.etsy.jading.Main run
INFO: Requiring 'fuckery.rb'
Exception in thread "main" org.jruby.exceptions.RaiseException: (LoadError) no such file to load -- cascading
at org.jruby.RubyKernel.require(org/jruby/RubyKernel.java:1038)
at #<Class:0x7f1fc4b8>.(root)(/tmp/hadoop-benl/hadoop-unjar4747316224820723370/fuckery.rb:4)
at org.jruby.RubyKernel.require(org/jruby/RubyKernel.java:1038)
at #<Class:0x7f1fc4b8>.(root)(/tmp/hadoop-benl/hadoop-unjar4747316224820723370/fuckery.rb:1)


* I ran the logwordcount script from the c.j samples using jade.jar. If I change :mode from :local to :hadoop the Cascade runs twice. The following logging shows up just before things are run a second time:

INFO: Requiring 'com/etsy/jading/runner'
Found 1 Cascades in global registry
Jading is running the 'logwordcount' Cascade
Connecting flow 'logwordcount' with properties:

* Given a properly set up jade.jar and jruby scripts packaged with it, it seems like I get to live the dream of never having to install JRuby on a Hadoop cluster, since everything is running through the JRuby code packaged in the uber-jar. Is that off base?

I'm sure I'll have more questions soon (and sorry for the dumb one on twitter yesterday :P)

Matt

Aug 1, 2012, 7:20:51 AM
to cascadi...@googlegroups.com
Hey Ben,

I'm traveling at the moment, but I'll do my best to get you unstuck and then give some more detailed answers when I get a chance.

You don't necessarily need the jar to run scripts on your laptop -- it's as simple as building a cascade and then calling the complete method on it.  That being said, to get the script to run on a cluster, a thing called the "runner" comes into play:

This is a pretty lame hack that's been with us since we first started with c.j, but we haven't found a need to fix it (yet).  The idea is simple: build your cascade, then let the runner pick up the last cascade in the global registry and complete it.
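
A toy model of that idea, for illustration only. `DemoRegistry`, `DemoCascade`, and `run_last_cascade` are hypothetical stand-ins, not the cascading.jruby or jading classes; the sketch only shows the "register on definition, runner completes the last one" pattern.

```ruby
# Hypothetical global registry: cascades are stored in definition order.
module DemoRegistry
  @cascades = []

  class << self
    attr_reader :cascades

    def register(cascade)
      @cascades << cascade
      cascade
    end
  end
end

# Hypothetical cascade: registers itself when built and counts its runs.
class DemoCascade
  attr_reader :name, :runs

  def initialize(name)
    @name = name
    @runs = 0
    DemoRegistry.register(self)
  end

  def complete(properties = {})
    @runs += 1
    self
  end
end

# Hypothetical runner: completes only the most recently defined cascade.
def run_last_cascade(properties = {})
  cascade = DemoRegistry.cascades.last
  raise 'no cascades registered' if cascade.nil?
  cascade.complete(properties)
end
```

Under this model, a script that also calls `complete` itself would execute its cascade twice once the runner gets to it.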

The particular message about 'cascading' being missing is just that the cascading.jruby gem can't be found (it's called 'cascading').  This is something I haven't fully played with, yet, and something that I'd like to shield the developer from, but haven't gotten around to.  You might try including this at the top of your job script:
$: << 'vendor/gems'
require 'cascading'

Note that the samples put 'lib' on the load path before they require cascading.  That's because I want them to run based on the code in your local checkout of c.j, but they still have to require 'cascading' to get everything loaded.  In the jade.jar, all gems are stuffed under vendor/gems.
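
For background on why the exact directory matters: `require 'x'` only succeeds if some load path entry directly contains `x.rb`. The self-contained sketch below builds a throwaway tree and checks both cases. The gem and file names (`demo_gem`, `demo_gem_feature`) are made up for the demonstration and have nothing to do with the real layout of jade.jar.

```ruby
require 'fileutils'
require 'tmpdir'

# Returns true if `feature` can be required once `dir` is on the load path.
def requirable_from?(feature, dir)
  $LOAD_PATH.unshift(dir)
  require feature
  true
rescue LoadError
  false
ensure
  $LOAD_PATH.delete(dir)
end

# Builds vendor/gems/demo_gem-0.0.1/lib/demo_gem_feature.rb in a temp dir,
# then tries the parent directory and the gem's own lib directory in turn.
def load_path_demo
  Dir.mktmpdir do |root|
    gems_dir = File.join(root, 'vendor', 'gems')
    lib_dir  = File.join(gems_dir, 'demo_gem-0.0.1', 'lib')
    FileUtils.mkdir_p(lib_dir)
    File.write(File.join(lib_dir, 'demo_gem_feature.rb'), "$demo_gem_loaded = true\n")

    [requirable_from?('demo_gem_feature', gems_dir),
     requirable_from?('demo_gem_feature', lib_dir)]
  end
end
```

Calling `load_path_demo` returns `[false, true]`: the parent directory alone is not enough, while the gem's `lib` directory is.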

Your 2x run thing seems like a bug.  Can you file a bug on github?  Note that the :local mode stuff (anything past 0.0.8) is "unreleased" as of yet and I haven't really integrated it into jading.  At the very least, you'll need to tick up the version numbers in jading's ivy.xml file to "2.0.0" for all the cascading components.

I'm not entirely sure about your dream =(  The reason: at Etsy, we chef (bootstrap in EMR) the installation of JRuby (and currently all gems) to our cluster.  I had intended to play around with jading early next week to try to wean us off this approach so that I can roll out the 0.0.9 upgrade completely independently for each job rather than having these global dependencies.  However, I haven't yet done the testing.  Here's the point in the Java "runner" code where we depend upon a JRuby install:

Given that jading throws jruby-1.6.5.jar into the jade.jar, it seems like it could be _possible_ to load the JRuby runtime from there, and not the system.  But that Java "runner" will likely have to change to avoid depending upon JRUBY_HOME, or to set up the JRuby env differently.

Hope this helps!  More to come as I have time to look and time to figure it out myself.

Ben Linsay

Aug 1, 2012, 9:13:07 PM
to cascadi...@googlegroups.com
In poking things again, I noticed that I was calling 'complete' on the Cascades I was defining in my scripts which explains the 2X thing. I was staring at it a little too long last night to notice that. Any reason the runner only calls the last cascade?

Adding the vendor/gems path to $: didn't quite work. I got around it for now with "require 'rubygems'" and adding "Gem.path << File.join(File.dirname(__FILE__), 'vendor')" before I load anything. Seems like the right way to go if you're going to package gems with the file. I'm new to Ruby, so I have no idea if that's the ideal way to do things. :P

I get you on installing JRuby via chef/bootstrap actions. Does it only need to be installed on the master node, or does JRuby code get shipped/executed on the workers too?

I didn't notice the little bits of the code finding the local JRuby install. Thanks for pointing that out. I'd be curious about how easy it would be to package your entire runtime into the jar. I found a blogpost (http://spin.atomicobject.com/2010/02/01/running-a-ruby-application-with-jruby-complete/) that made it seem pretty straightforward, but I guess you've gotta give people a way to configure the runtime when packing up the jar.

Ben Linsay

Aug 1, 2012, 9:37:23 PM
to cascadi...@googlegroups.com
I forgot to mention the other bit of funny business I've been running into. Somehow, I'm now getting the following error when supplying :mode to cascade: https://gist.github.com/3232319

What'd I break? As a sanity check to make sure the right version of c.j was getting pulled in, I printed it before doing anything, which is the "c.j version: 0.0.8" right before the stack trace.

Ben Linsay

Aug 1, 2012, 9:56:12 PM
to cascadi...@googlegroups.com
Ah! Figured it out. Looks like both the code checked in on github that includes local mode and the gem you get with 'gem install cascading.jruby' are 0.0.8, even though they're not the same code. Problem solved.

Matt

Aug 2, 2012, 12:14:14 AM
to cascadi...@googlegroups.com
Okay, good catch -- the runner calls complete so you don't have to, like you found.  The only reason the runner calls only the last cascade is that we only run single cascade jobs at Etsy.  Previously, the runner operated at the flow level and only called the last flow, so we stepped it up from there.  Having it complete all cascades seems like a reasonable idea.  The only things to make clear if we do that are that all the cascades share the same properties and args and that they would be run in the order they're stored in the global registry, which I believe is the order of definition.

Thanks for digging up the Gem.path hack.  I haven't played around with this aspect of gems, but it seems like you have something working, so we can go with it.  When I push the next release of c.j, I'll try to give jading some attention in this area.

JRuby only needs to be installed on the node that's running your job jar.  That might not even be a node in your cluster, but at Etsy we have a custom Oozie action that trips off the job and Oozie farms actions out via map tasks (to prevent you from DOSing the master node by running everything there).  Once the runner is actually executing, Cascading will ship tasks to the cluster that have no JRuby dependencies, so you don't otherwise need it installed in the cluster.

Matt

Aug 5, 2012, 2:00:59 PM
to cascadi...@googlegroups.com
Hey Ben,

I played around with the Gem.path extension this morning, but couldn't get it to work.  I eventually got things to work locally without the gem installed by directly overwriting the load path to contain the full path up to cascading.jruby/lib, but this did not work on EMR.  The problem seems to be that extending the Gem.path doesn't make the c.j gem visible, in particular the lib/cascading.rb file which is the main require point for the rest of the code.

For the moment, I've pushed my work up to github (there's a nice change that keeps you from having to sit behind ivy:resolve every time you build), but it is currently incomplete until I get time to revisit.

Take care,
Matt

Ben Linsay

Aug 5, 2012, 6:40:40 PM
to cascadi...@googlegroups.com
Bah! I didn't notice that 'jade' re-installed the gem every time! Yeah, the Gem.path thing definitely doesn't work. Screwing around to try and figure out how the hell TO make it work.

Nice update to jading. :) 

Ben Linsay

Aug 6, 2012, 2:24:03 AM
to cascadi...@googlegroups.com
I haven't figured out how to modify the rubygems gem path at all (which has been maddening!) but I did find this: http://blog.nicksieger.com/articles/2009/01/10/jruby-1-1-6-gems-in-a-jar

Matt

Aug 6, 2012, 7:52:24 AM
to cascadi...@googlegroups.com
Thanks for the reference; I'd actually seen this before but hadn't dug it up this round of struggling with the job jar.

From that, I'll either try direct installation into the jar or go with the GEM_PATH (set in the runner itself for you), which was my plan last night.

I should also point out that Hadoop does unjar the jar into /tmp, so this really shouldn't be that hard, just a matter of getting the load path set correctly. It would be harder if the jar remained intact, which I think is what this article deals with.

Matt

Aug 7, 2012, 1:24:45 AM
to cascadi...@googlegroups.com
Ben, this is fixed: https://github.com/etsy/jading/commit/1168a9bff2432e070efc43fd8343625a5009c82c

I had to install the gems directly into the jar and then manipulate GEM_PATH based on the location Hadoop unjars the jade.jar.  This still feels a bit hackish, but it seemed better than the alternative of unpacking the gems and somehow managing my own manipulation of the Ruby $LOAD_PATH.
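
The general shape of that approach can be sketched in Ruby as follows. This is not the code from the linked commit (earlier in the thread the plan was to set GEM_PATH in the runner itself); the helper name `use_bundled_gems` is hypothetical, and the `vendor/gems` location is taken from the discussion above. It derives the gem directory from wherever the script ended up after unjarring and points RubyGems at it.

```ruby
require 'rubygems'

# Hypothetical helper: point GEM_PATH at the vendor/gems directory that sits
# next to the given script (for example, in Hadoop's unjar directory).
# `env` defaults to the real environment; pass a Hash to experiment safely.
def use_bundled_gems(script_path, env = ENV)
  unjar_dir = File.dirname(File.expand_path(script_path))
  gem_dir = File.join(unjar_dir, 'vendor', 'gems')
  env['GEM_PATH'] = gem_dir
  # Make RubyGems drop its cached paths so it re-reads the environment.
  Gem.clear_paths
  gem_dir
end

# Typical use at the very top of a job script, before any gem is required:
#   use_bundled_gems(__FILE__)
```

This only helps when the gems under that directory were installed by RubyGems rather than copied in by hand.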

The next step is to switch to jruby-complete, which means jade.jar will be completely stand-alone.


Ben Linsay

Aug 7, 2012, 1:20:05 PM
to cascadi...@googlegroups.com
Call me crazy, but I can't get that GEM_PATH change to work in my own scripts before updating. I keep forcefully uninstalling the cascading.jruby gem after packaging things up with Jade just to make sure. I borrowed from gem_path.rb just so we're consistent, and it looks like the gem path is being set correctly to ${unjar_dir}/vendor/gems.


I'll give the updated Jading a shot soon. It's kind of insanity-inducing that this won't work (and FWIW I can't get it to work from irb, either.).

Matt Walker

Aug 7, 2012, 3:14:41 PM
to cascadi...@googlegroups.com
You're trying to get that patch alone to work?  Did you also include the installation part of the patch?  This will _not_ work if you simply unpack the gems into vendor/gems.  You must officially "install" them via RubyGems, which brings along some metadata that is required for making GEM_PATH function.
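
One way to check for that difference, sketched under the assumption that the relevant metadata is the per-gem `.gemspec` file RubyGems writes into a `specifications/` directory next to `gems/`. The helper name `installed_gem_home?` is hypothetical.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical check: does `dir` look like a directory RubyGems installed
# into (gems/ plus specifications/*.gemspec), rather than a directory that
# only holds unpacked gem sources?
def installed_gem_home?(dir)
  Dir.exist?(File.join(dir, 'gems')) &&
    !Dir.glob(File.join(dir, 'specifications', '*.gemspec')).empty?
end

# Example:
#   installed_gem_home?('vendor/gems')
```

A directory that fails this check is unlikely to work as a GEM_PATH entry, whatever the load path looks like.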

Ben Linsay

Aug 7, 2012, 3:48:55 PM
to cascadi...@googlegroups.com
Ah! That makes so much more sense. I thought RubyGems was a little simpler than it is.

I'll try the patch shortly.

Ben Linsay

Aug 8, 2012, 7:21:22 PM
to cascadi...@googlegroups.com
Matt,

I had a funny problem with Jading. The wip-255 version of the cascading jars was hanging out in /tmp/jading/build so it kept getting included in all of my jade.jars. Apparently Scope moved around during the lead-up to 2.0.0, so it kept throwing exceptions like this one my way:

Exception in thread "main" org.jruby.exceptions.RaiseException: (CascadingException) Exception computing outgoing scope
Cause 1: NativeException: java.lang.ClassCastException: cascading.flow.planner.Scope cannot be cast to cascading.flow.Scope
  cascading/tap/Tap.java:258:in `outgoingScopeFor'
  /tmp/hadoop-benl/hadoop-unjar6663432531169577789/vendor/gems/gems/cascading.jruby-0.0.9/lib/cascading/scope.rb:82:in `outgoing_scope_for'
  /tmp/hadoop-benl/hadoop-unjar6663432531169577789/vendor/gems/gems/cascading.jruby-0.0.9/lib/cascading/scope.rb:24:in `source_scope'
  /tmp/hadoop-benl/hadoop-unjar6663432531169577789/vendor/gems/gems/cascading.jruby-0.0.9/lib/cascading/flow.rb:38:in `source'
  ./seq_to_text.rb:10:in `(root)'
  org/jruby/RubyKernel.java:2062:in `instance_eval'
  /tmp/hadoop-benl/hadoop-unjar6663432531169577789/vendor/gems/gems/cascading.jruby-0.0.9/lib/cascading/cascading.rb:30:in `flow'
  ./seq_to_text.rb:9:in `(root)'
  org/jruby/RubyKernel.java:1038:in `require'
  ./seq_to_text.rb:1:in `(root)'
Cause 2: java.lang.ClassCastException: cascading.flow.planner.Scope cannot be cast to cascading.flow.Scope
  at cascading.tap.Tap.outgoingScopeFor(Tap.java:258)

Matt

Aug 8, 2012, 7:45:20 PM
to cascadi...@googlegroups.com
Interesting. I'm not sure exactly how to expire jars in /tmp other than to clean occasionally. With no compilation required and no tests or anything run to build the jar, it's annoying that you have to hit it at runtime.

On the other hand, once stable, the dependencies inside jading should change rarely, and it's up to you to manage the ones outside it.

If you think of a clever way to detect stale jars, shoot me a pull request.
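
One possible starting point, offered as a sketch rather than anything jading does: flag artifacts that appear in the build directory under more than one version, since that is how the stale wip jars slipped in. The helper name, the `name-version.jar` naming assumption, and the file names in the example are all hypothetical.

```ruby
# Hypothetical helper: given jar file names, return a Hash mapping each
# artifact name to its versions, but only for artifacts that appear with
# more than one distinct version.
# Assumption: jars are named "<artifact>-<version>.jar" and the version
# starts with a digit.
def conflicting_jar_versions(jar_names)
  versions = Hash.new { |hash, key| hash[key] = [] }

  jar_names.each do |jar|
    match = File.basename(jar).match(/\A(.+?)-(\d[\w.\-]*)\.jar\z/)
    next unless match
    versions[match[1]] << match[2]
  end

  versions.each_with_object({}) do |(name, found), conflicts|
    distinct = found.uniq.sort
    conflicts[name] = distinct if distinct.size > 1
  end
end

# Example with made-up file names:
#   conflicting_jar_versions(Dir['/tmp/jading/build/**/*.jar'])
```

A build could refuse to package, or at least warn, whenever the returned Hash is non-empty.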
