Hi
I've a couple of questions about the different uses of :step-jar, which is step-specific, and :runtime-jar and :jar-src-path, which are cluster-specific.
I have two use cases that I'd like to manage through profiles. I think I know how to do it, but I am a little muddled.
First, the common bit. I define a cluster where I'd like to run several steps, although at the moment several means two :) The first is an s3distcp step to stage my data in HDFS and the second is my Cascalog job. The s3distcp step specifies a :step-jar of /home/hadoop/lib/emr-s3distcp-1.0.jar, which is part of the EMR node definition.
Now the different modes of running. I'd like to have two variants: the first is a development mode, where my Cascalog step uses the latest version of my jar, uploaded from the local development build each time; the second is the live mode, where the step always uses the version of my jar hosted in S3, so no upload is required.
I think for the cluster definition I need:
(def cluster...
:dev {:jar-src-path "./my-latest-compile.jar"}
:live {:runtime-jar "s3://jar/path/my-versioned-compile.jar"}...)
and then specify :dev or :live at the command line, e.g. lemur run :dev my-jobdef.clj, or use '(add-profiles [:dev])' in the jobdef itself with no command-line argument.
Looking at the output from a dry-run, if I don't specify a :step-jar for my Cascalog job then things work as desired, as it is taken from either :runtime-jar or the path the :jar-src-path is uploaded to, for the :live and :dev profiles respectively. However, I'm seeing a warning in the :dev case:
"2014-05-02 15:09:43,529 WARN evaluating-map:? - No value for :step-jar. But this key does exist in step(s): ("log-aggregation"). Maybe you need to define the key in defcluster instead?"
and not in the :live case. Note that "log-aggregation" is the s3distcp step I referred to earlier. I prefer to remove warnings and would like to be explicit about the behaviour I'm getting rather than relying on the default. You might change it one day to something better ;)
How do I specify my defstep such that the :step-jar is specified correctly in my :dev and :live cases, and there are no warnings?
From experiments in the live case, I can specify :step-jar "${runtime-jar}" and that looks like it will work. In that case I'd prefer not to specify :runtime-jar on the cluster, as it's step-specific rather than cluster-specific; however, lemur doesn't like that. I'm not sure what to specify for :step-jar in the :dev profile, as it's a path constructed by lemur. I haven't tried specifying an upload in the :dev case, but I assume lemur will get upset if I leave both :jar-src-path and :runtime-jar off the cluster, so it would need a dummy value for :runtime-jar, as lemur doesn't do anything with that.
This brings me to my final question: how would I specify another jar to use in a third step, when that step is another Cascalog job I've written? I've checked and I can't use a list for :jar-src-path.
Sorry, that's a long one! I think I can do what I want, but I would like to know what the best practice is.
Thanks
Gareth