This is amazing! Is there a hook for when the job finishes? For example,
one that detects a failure and downloads and prints the logs, or detects
completion and downloads and formats the results from S3?
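
To make the question concrete, here is a rough Python sketch of the kind of finish hook I mean. It is not Lemur's API: `wait_and_dispatch`, the status strings, and the two handler callbacks are hypothetical names made up for illustration, and the status source is whatever reports the job's state.

```python
import time


def wait_and_dispatch(get_status, on_completed, on_failed,
                      poll_seconds=30, sleep=time.sleep):
    """Poll get_status() until the job reaches a terminal state,
    then call the matching handler and return its result.

    All names and status strings here are hypothetical, not Lemur's.
    """
    while True:
        status = get_status()
        if status == "COMPLETED":
            # e.g. download and format the results from S3
            return on_completed()
        if status == "FAILED":
            # e.g. download and print the logs
            return on_failed()
        # Job still running: wait before polling again.
        sleep(poll_seconds)
```

The handlers are passed in as plain callables, so the "fetch logs" and "fetch results" steps can be swapped per job without touching the polling loop.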
On May 17, 9:39 am, Marc Limotte <mslimo...@gmail.com> wrote:
> Hi Cascalog Users,
>
> If you're using Cascalog and running your jobs on Amazon Elastic MapReduce
> (EMR), you may find this project interesting: Lemur
> <http://entxtech.blogspot.com/2012/05/lemur-declarative-launching-of-h...>.
> This is a project open-sourced by The Climate Corporation (TCC). Lemur
> processes a DSL that describes a cluster and the job to be launched.
> Features
>
> - Launch an EMR cluster and submit step(s), or run against local Hadoop
> (usually Hadoop standalone for dev and testing)
> - Basic configuration options include: bootstrap actions, Hadoop
> config, uploads (files to transfer to S3, or local), cluster details
> (number of instances, master instance type, etc.), output paths to use
> for data, logs, main jar, etc., and support for spot market instances
> - Profile support provides packages of options and functionality that
> can be enabled or disabled as one switch. (e.g. you can have a :test
> profile or a :live profile)
> - Validation for your command line options and environment before
> launching EMR
> - Override configured options via command line
> - Hooks for actions that should be triggered before or after job
> launch. For example:
>   - One hook in use at TCC does a diff on the results of a local run (i.e.
>     an integration test)
>   - Another hook posts a detailed message to IRC (HipChat) when a new
>     job is started
> - Optionally wait for an EMR job to complete
> - A dry-run feature, so you can check the final cluster configuration,
> hooks that will be executed, Hadoop job arguments, etc.
> - All the details from dry-run (cluster/step config, etc.) are persisted
> with each job run (to STDOUT and saved to a YAML file alongside your output)
> - All settings can be literal values, interpolated strings (e.g. set the
> S3 bucket to "com.your-co.${env}.hadoop"), or *functions*
> - Import ("inherit") common options, functionality and behavior to avoid
> duplication
> - Pass-through command-line options: you can specify extra args on the
> command line that are meaningful to your Hadoop main function but are
> unknown to Lemur or your jobdef
> - Most of TCC's actual Hadoop jobs are written with Cascalog, but Lemur
> is agnostic to this detail. They could be Cascading, Java, Hive, Pig,
> Scalding, Streaming, etc.
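
As a side note on the "literal values, interpolated strings, or functions" item above, the idea can be sketched in Python roughly as follows. This is only an illustration of the concept, not Lemur's implementation: `resolve_setting` and the `context` dict are hypothetical, and Python's standard `string.Template` stands in for whatever interpolation Lemur actually uses.

```python
from string import Template


def resolve_setting(setting, context):
    """Resolve one setting against a context of known values.

    Hypothetical helper, not part of Lemur:
    - a function is called with the context,
    - a string has its ${name} placeholders substituted,
    - anything else is treated as a literal and returned unchanged.
    """
    if callable(setting):
        return setting(context)
    if isinstance(setting, str):
        return Template(setting).substitute(context)
    return setting


context = {"env": "test"}
bucket = resolve_setting("com.your-co.${env}.hadoop", context)
print(bucket)  # com.your-co.test.hadoop
```

A function-valued setting would be passed the same way, e.g. `resolve_setting(lambda ctx: 10 if ctx["env"] == "live" else 2, context)` for an instance count that depends on the environment.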
>
> *blog entry:* http://entxtech.blogspot.com/2012/05/lemur-declarative-launching-of-h...
>
> *project:* https://github.com/TheClimateCorporation/lemur
>
> Marc Limotte