ANN: Lemur for running your hadoop jobs on EMR

165 views
Skip to first unread message

Marc Limotte

unread,
May 17, 2012, 9:39:57 AM5/17/12
to cascal...@googlegroups.com
Hi Cascalog Users,

If you're using Cascalog and running your jobs on Amazon Elastic-Mapreduce, you may find this project interesting: Lemur.  This is a project open-sourced by The Climate Corporation (TCC).  Lemur processes a DSL which describes a cluster and job to be launched. 

Features

  • Launch EMR cluster and submit step(s); or run against local hadoop (usually hadoop standalone for dev and testing)
  • Basic configuration options include: Bootstrap actions, Hadoop config, Uploads (files to transfer to S3, or local), Cluster details (num instances, master instance type, etc), Output paths to use for data, logs, main jar, etc., Support for spot market instances
  • Profile support provides packages of options and functionality that can be enabled or disabled as one switch.  (e.g. you can have a :test profile or a :live profile)
  • Validation for your command line options and environment before launching EMR
  • Override configured options via command line
  • Hooks for actions that should be triggered before or after job launch.  For example:
    • One hook in use at TCC does a diff on the results of a local run (i.e. an integration test)
    • Another hook posts a detailed message to IRC (hipchat) when a new job is started
  • Optionally wait for an EMR job to complete
  • A dry-run feature, so you can check the final cluster configuration, hooks that will be executed, hadoop job arguments, etc.
  • All the details from dry-run (cluster/step config, etc) are persisted with each job run (to STDOUT and saved to a YAML file alongside your output)
  • All settings can be literal values, interpolated strings (e.g. set the S3 bucket to "com.your-co.${env}.hadoop"), or functions
  • Import ("inherit") common options, functionality and behavior to avoid duplication
  • Pass-through command-line options, allows you to specify extra args on the command line that are meaningful to your hadoop main function, but are unknown to lemur or your jobdef
  • Most of TCC's actual hadoop jobs are written with Cascalog.  But Lemur is agnostic to this detail.  They could be Cascading, Java, Hive, Pig, Scalding, Streaming, etc.



Marc Limotte


Andrew Xue

unread,
May 17, 2012, 5:34:21 PM5/17/12
to cascalog-user
this is amazing! is there a hook for when the job finishes? (like,
detect failure and dl and print logs or detect completed and like
download and format results from s3)

On May 17, 9:39 am, Marc Limotte <mslimo...@gmail.com> wrote:
> Hi Cascalog Users,
>
> If you're using Cascalog and running your jobs on Amazon Elastic-Mapreduce,
> you may find this project interesting:
> Lemur<http://entxtech.blogspot.com/2012/05/lemur-declarative-launching-of-h...>.
>  This is a project open-sourced by The Climate Corporation (TCC).  Lemur
> processes a DSL which describes a cluster and job to be launched.
> Features
>
>    - Launch EMR cluster and submit step(s); or run against local hadoop
>    (usually hadoop standalone for dev and testing)
>    - Basic configuration options include: Bootstrap actions, Hadoop
>    config, Uploads (files to transfer to S3, or local), Cluster details (num
>    instances, master instance type, etc), Output paths to use for data, logs,
>    main jar, etc., Support for spot market instances
>    - Profile support provides packages of options and functionality that
>    can be enabled or disabled as one switch.  (e.g. you can have a :test
>    profile or a :live profile)
>    - Validation for your command line options and environment before
>    launching EMR
>    - Override configured options via command line
>    - Hooks for actions that should be triggered before or after job
>    launch.  For example:
>    - One hook in use at TCC does a diff on the results of a local run (i.e.
>       an integration test)
>       - Another hook posts a detailed message to IRC (hipchat) when a new
>       job is started
>     - Optionally wait for an EMR job to complete
>    - A dry-run feature, so you can check the final cluster configuration,
>    hooks that will be executed, hadoop job arguments, etc.
>    - All the details from dry-run (cluster/step config, etc) are persisted
>    with each job run (to STDOUT and saved to a YAML file alongside your output)
>    - All settings can be literal values, interpolated strings (e.g. set the
>    S3 bucket to "com.your-co.${env}.hadoop"), or *functions*
>    - Import ("inherit") common options, functionality and behavior to avoid
>    duplication
>    - Pass-through command-line options, allows you to specify extra args on
>    the command line that are meaningful to your hadoop main function, but are
>    unknown to lemur or your jobdef
>    - Most of TCC's actual hadoop jobs are written with Cascalog.  But Lemur
>    is agnostic to this detail.  They could be Cascading, Java, Hive, Pig,
>    Scalding, Streaming, etc.
>
> *blog entry:*http://entxtech.blogspot.com/2012/05/lemur-declarative-launching-of-h...
>
> *project:*https://github.com/TheClimateCorporation/lemur
>
> Marc Limotte

Mayank Agarwal

unread,
May 17, 2012, 5:45:47 PM5/17/12
to cascal...@googlegroups.com
I have been using lemur for a while now and its really great! 

thanks marc!

mayank

Marc Limotte

unread,
May 18, 2012, 7:17:01 AM5/18/12
to cascal...@googlegroups.com
Hey Andrew.

Thanks for that.  If you don't mind lemur blocking while it waits for the job to finish...

Look for "wait-on-step" in examples/sample-jobdef.clj.  It will return a map like this:
{:success false
 :step-status-detail "FAILED"
 :job-status-detail "FAILED"}

After that, you can then get the jobflow-id like this:
(context-get :jobflow-id)

To do this, lemur polls EMR for your job status every 60 seconds.  It's not super-efficient.  For long running jobs, or if you find the blocking undesirable, you might consider using a shutdown-hook (which is a standard EMR feature).  The difference, of course, is that the shutdown hook will execute on the cluster.. so it can post process or send a notification somewhere, but wouldn't actually download data back to the client that launched the job.

Let me know if you try these out and if you need some help.

Marc 

Mike Stanley

unread,
May 18, 2012, 4:33:07 PM5/18/12
to cascal...@googlegroups.com
this looks pretty cool and handy.  i'll definitely check it out.  we currently use ruby/rake for a lot of the build/launch/automation stuff, but this definitely looks like a step in a better direction.
thanks!
... Mike

Andy Xue

unread,
May 22, 2012, 12:58:50 PM5/22/12
to cascalog-user
yea makes sense ... does each job firing action happen on its own
thread so that when its blocking other jobs can be managed in
parallel? if thats the case, blocking isn't really that bad

Marc Limotte

unread,
May 22, 2012, 1:29:16 PM5/22/12
to cascal...@googlegroups.com
Not threaded at the moment, although that would be an interesting direction.  The current model triggers jobs from the command line, so running multiple jobs would be multiple command line executions.

I can imagine more of an always-running, service model.  Where I plug in something like Quartz to trigger jobs and then it can manage more of the process (polling or listening for life-cycle events). 


Marc
Reply all
Reply to author
Forward
0 new messages