:step-jar, :runtime-jar, :jar-src-path and profiles


Gareth Rogers

2 May 2014, 12:18:50
To: lemur...@googlegroups.com
Hi

I have a couple of questions about the different uses of :step-jar, which is step specific, and :runtime-jar and :jar-src-path, which are cluster specific.

I have two use cases which I'd like to manage through profiles. I think I know how to do this, but I am a little muddled.

First, the common bit. I define a cluster where I'd like to run several steps, although at the moment several means two :) The first is an s3distcp step to stage my data in HDFS and the second is my Cascalog job. The s3distcp step specifies a :step-jar of /home/hadoop/lib/emr-s3distcp-1.0.jar, which is part of the EMR node definition.

Now the different modes of running. I'd like to have two variants. The first is a development mode, where my Cascalog step uses the latest version of my jar, uploaded from the local development copy each time. The second is the live mode, where the step always uses the version of my jar hosted in S3, so no upload is required.

I think for the cluster definition I need
(def cluster...
  :dev {:jar-src-path "./my-latest-compile.jar"}
  :live {:runtime-jar "s3://jar/path/my-versioned-compile.jar"}...)

and then specify :dev or :live at the command line e.g. lemur run :dev my-jobdef.clj or '(add-profiles [:dev])' in the jobdef itself with no command line argument.

Looking at the output from a dry run, if I don't specify a :step-jar for my Cascalog job then things work as desired: the jar is taken from :runtime-jar in the :live profile, and from the path that :jar-src-path is uploaded to in the :dev profile. However, I'm seeing a warning in the :dev case:

"2014-05-02 15:09:43,529  WARN evaluating-map:? - No value for :step-jar. But this key does exist in step(s): ("log-aggregation"). Maybe you need to define the key in defcluster instead?"

and not the live case. Note "log-aggregation" is the s3distcp step I referred to earlier. I prefer to remove warnings and would like to be explicit about the behaviour I'm getting rather than relying on the default. You might change it one day to something better ;)

How do I specify my defstep such that the :step-jar is specified correctly in my :dev and :live cases, and there are no warnings?

From experiments in the live case, I can specify :step-jar "${runtime-jar}" and that looks like it will work. In that case I'd prefer not to specify :runtime-jar on the cluster at all, since the jar is step specific rather than cluster specific, but lemur doesn't like that. I'm not sure what to specify for :step-jar in the :dev profile, as the path is constructed by lemur. I haven't tried specifying an upload in the :dev case, but I assume lemur will get upset if I leave both :jar-src-path and :runtime-jar off the cluster, so it would need a dummy value for :runtime-jar, as lemur doesn't do anything with that.

This brings me to my final question: how would I specify another jar to use in a third step, when I'm running another Cascalog job I've written? I've checked and I can't use a list for :jar-src-path.

Sorry that's a long one! I think I can do what I want but would like to know what the best practice is.
Thanks
Gareth

Marc Limotte

4 May 2014, 09:27:43
To: lemur...@googlegroups.com
Hi Gareth,

It sounds like you've pretty much got it figured out.  To recap:

:jar-src-path will be uploaded and used as your run jar
:runtime-jar will be used directly as your run jar (without being uploaded/copied)

One of these two is required and acts as the cluster-wide default, but a specific step can override it with :step-jar.
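
A minimal sketch of the precedence in the recap above, written as plain Clojure over option maps rather than as Lemur code. The function name resolve-run-jar and its return shape are hypothetical, and checking :jar-src-path before :runtime-jar when both are set is an assumption, since the thread does not say which wins.

```clojure
;; Illustrative model only; this is not Lemur's implementation.
(defn resolve-run-jar
  "Returns [source path] for the jar a step would run.
  cluster and step are plain maps of the options discussed in this thread."
  [cluster step]
  (cond
    ;; A step-level :step-jar overrides the cluster default.
    (:step-jar step)        [:step-jar (:step-jar step)]
    ;; :jar-src-path is a local jar that is uploaded, then used as the run jar.
    ;; Checking it before :runtime-jar is an assumption.
    (:jar-src-path cluster) [:uploaded (:jar-src-path cluster)]
    ;; :runtime-jar is used directly, with no upload or copy.
    (:runtime-jar cluster)  [:direct (:runtime-jar cluster)]
    :else (throw (ex-info "cluster needs :jar-src-path or :runtime-jar"
                          {:cluster cluster}))))

;; The s3distcp step keeps its own jar whichever cluster default is set.
(resolve-run-jar {:jar-src-path "./my-latest-compile.jar"}
                 {:step-jar "/home/hadoop/lib/emr-s3distcp-1.0.jar"})
;; => [:step-jar "/home/hadoop/lib/emr-s3distcp-1.0.jar"]

;; A step without :step-jar falls back to the cluster default.
(resolve-run-jar {:runtime-jar "s3://jar/path/my-versioned-compile.jar"} {})
;; => [:direct "s3://jar/path/my-versioned-compile.jar"]
```

Under this model the Cascalog step needs no :step-jar at all, which matches the dry-run behaviour described in the first post.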

The step-jar feature was added after the fact for a specific use case, so it's not all completely consistent. The warning can be ignored; I probably need some special-case logic to avoid printing it in this scenario.

Regarding s3distcp: it sounds like your source data is in S3 and you want to copy it to HDFS on the EMR cluster. Do you have a special need that requires you to do this explicitly? Otherwise, this is basically the default behavior of EMR when you give it an S3 path: it will copy the data to HDFS before it starts the job. And just to be clear, this is Amazon EMR behavior, nothing to do with lemur.

marc




Gareth Rogers

5 May 2014, 08:26:34
To: lemur...@googlegroups.com
Hi Marc

If I wanted to add further steps to the cluster, and keep the same ability to switch between running from an uploaded local copy and an S3-hosted copy, what would be the recommended way of doing this in Lemur?

As we're running over really small files we're using s3distcp to aggregate and recompress them into more Hadoop friendly blocks. It made a significant difference to the run time :)

Thanks
Gareth

Marc Limotte

5 May 2014, 09:05:01
To: lemur...@googlegroups.com
I haven't tried this myself, but I think you would want to do it like this:

You can specify a :step-jar in each step, and that will take precedence over :jar-src-path/:runtime-jar. Only :jar-src-path will be automatically uploaded, so the :step-jar settings should be s3:// paths for the :live profile, and local paths for the :dev profile. If you want the step-jar to be uploaded to S3, you will need to explicitly add an entry in uploads (probably as part of your :live profile).

marc



[Deleted post]

Gareth Rogers

5 May 2014, 17:25:56
To: lemur...@googlegroups.com
Thanks, I'll give it a try.

I did just delete a post, as I realised I was asking a question whose answer was 1) probably obvious, and 2) easy enough to test myself tomorrow ;) However, I then realised you probably got it in an email! Anyway, I'm going to experiment with the ${data-uri} variable and the :step-jar to see if I can suppress the warning.

Thanks, this has cleared things up.

Marc Limotte

5 May 2014, 18:15:46
To: lemur...@googlegroups.com
:jar-uri might also be relevant for you


Gareth Rogers

8 May 2014, 12:59:53
To: lemur...@googlegroups.com
I now have a cluster definition of the form (where I'm using ... to represent the rest of the code):

(def cluster ...
  :dev {:jar-name "my-latest-compile.jar"
        :jar-src-path "./target/${jar-name}"}
  :live {:runtime-jar "${base-uri}/jar/path/my-versioned-compile.jar"}...)

and step:

(def step ...
  :dev {:step-jar "${jar-uri}/${jar-name}"}
  :live {:step-jar "${runtime-jar}"} ...)

which works as I'd like :)
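
To make the ${...} references above concrete, here is a minimal sketch of single-pass variable expansion in plain Clojure. The expand function is hypothetical and is not Lemur's API; Lemur's own interpolation may behave differently (for example, by resolving nested references). The :jar-uri value is a made-up placeholder, on the assumption that Lemur supplies the real one at run time.

```clojure
(require '[clojure.string :as str])

;; Illustrative model only; this is not Lemur's implementation.
(defn expand
  "Replaces each ${key} in s with the value of :key in opts.
  References to unknown keys are left untouched."
  [opts s]
  (str/replace s #"\$\{([^}]+)\}"
               (fn [[whole k]] (str (get opts (keyword k) whole)))))

(def dev-opts
  {:jar-name "my-latest-compile.jar"
   ;; Hypothetical placeholder; assumed to be supplied by Lemur in a real run.
   :jar-uri  "s3://example-bucket/runs/demo/jar"})

(expand dev-opts "${jar-uri}/${jar-name}")
;; => "s3://example-bucket/runs/demo/jar/my-latest-compile.jar"
```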

However, I'm still seeing the warning:
2014-05-08 16:52:24,484  WARN evaluating-map:? - No value for :step-jar. But this key does exist in step(s): ("log-aggregation"). Maybe you need to define the key in defcluster instead?

The warning appears whenever I specify :jar-src-path, but not when I specify :runtime-jar, with or without profiles. That is, the code above generates the warning when using the :dev profile, as does the code below:

(def cluster ...
  :jar-name "my-latest-compile.jar"
  :jar-src-path "./target/${jar-name}" ...)

(def step ...
  :step-jar "${jar-uri}/${jar-name}" ...)

but
(def cluster ...
  :runtime-jar "${base-uri}/jar/path/my-versioned-compile.jar"...)

(def step ...
  :step-jar "${runtime-jar}" ...)

does not.

Does this sound like a bug or have I missed something?

Thanks
Gareth

Marc Limotte

8 May 2014, 17:38:25
To: lemur...@googlegroups.com
It looks like a non-fatal bug in Lemur. You can ignore the warning, although ideally Lemur would catch this case and not print it. Feel free to create an issue on GitHub for this (... and submit a pull request with a fix if you're in the mood).

Marc



Gareth Rogers

12 May 2014, 04:45:47
To: lemur...@googlegroups.com
Just for reference I've created a new issue: https://github.com/TheClimateCorporation/lemur/issues/26.

I've forked the code at least; we'll see if I can get the rest of the way to a fix...