Ganglia and other applications of EMR-4.x with MRJob 0.5.0


sir Gollum

Mar 29, 2016, 3:12:11 PM
to mrjob
First of all, congratulations on the 0.5.0 release!
I am now trying to use the new release together with emr-4.3.0 or emr-4.4.0. A lot changed in the EMR 4.x series, and things that could previously be achieved with bootstrap actions no longer work. In particular, right now I am trying to set up Ganglia, but I guess this is a broader question of how to specify preinstalled applications and their configurations.

The only way I have found to do this is through the RunJobFlow API call params, which can be specified via the emr_api_params config option (or the --emr-api-params CLI option), e.g. like this:
runners:
    emr:
        emr_api_params:
            Applications.member.1: 'Hadoop'
            Applications.member.2: 'Ganglia'

So far I've tried every config format I could think of, and none of them seems to work (I get a "MalformedInput" error with different messages).
From the code of mrjob 0.5.0 I can see that it calls run_jobflow with boto args representing complex objects, e.g. boto.emr.bootstrap_action.BootstrapAction. For these application params to work, it would probably also have to create boto.emr.emrobject.Application objects somewhere in _cluster_args(), but I don't see the Application class used anywhere in mrjob.

Am I missing something? Maybe there is another way of installing EMR 4.x apps with MRJob?

Thanks for the good work and congrats again on the release!

David Marin

Mar 30, 2016, 12:16:23 PM
to mr...@googlegroups.com
Yeah, looks like we're just going to have to add an "emr_applications" option. I'll try to get that into 0.5.1, along with "emr_configurations" (this is one reason I picked a 3.x AMI as the default, even though AWS encourages everyone to be on 4.4.0).

I'm sad that emr_api_params doesn't offer a useful workaround for list-like parameters. Maybe the API is expecting JSON or something? In any case, being able to pass a list as one of the values to emr_api_params seems like a useful feature.

-Dave

sir Gollum

Mar 31, 2016, 9:28:41 AM
to mrjob
Thanks for the answer, looking forward to using 0.5.1 with the latest EMR then :)

Regarding the API call, it seems that boto constructs the XML for the request internally and expects objects with startElement / endElement methods to do that. So probably the best way to make use of the full power of this API would be to allow dynamic imports, e.g. as is done with logging.dictConfig in the standard library:
handlers:
  console:
    class : logging.StreamHandler
    formatter: brief
    level   : INFO
    filters: [allow_foo]
    stream  : ext://sys.stdout     
  file:
    class : logging.handlers.RotatingFileHandler
    formatter: precise
    filename: logconfig.log
    maxBytes: 1024
    backupCount: 3
but this would probably require implementing a DSL and I'm not sure if it's worth the effort.
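
To make the dynamic-import idea concrete, here is a minimal sketch of how a "class" key in a config mapping could be resolved to a Python object, loosely in the spirit of logging.dictConfig. The names resolve and build are hypothetical helpers for illustration; nothing like this exists in mrjob, and a real version would need validation of what may be imported.

```python
import importlib


def resolve(dotted_path):
    """Import and return the object named by a dotted path like 'pkg.mod.Class'."""
    module_name, _, attr = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError("expected 'module.attr', got %r" % dotted_path)
    module = importlib.import_module(module_name)
    return getattr(module, attr)


def build(spec):
    """Instantiate the class named by spec['class'].

    All other keys in the mapping are passed to the constructor
    as keyword arguments.
    """
    kwargs = dict(spec)  # copy so the caller's mapping is not modified
    cls = resolve(kwargs.pop("class"))
    return cls(**kwargs)
```

For example, build({"class": "datetime.timedelta", "days": 2}) would return a timedelta object; a config entry naming a boto class could be handled the same way, assuming its constructor accepts the given keyword arguments.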



David Marin

Apr 23, 2016, 7:32:47 PM
to mr...@googlegroups.com
Poking at this now. Looks like if you used:

runners:
    emr:
        emr_api_params:
            Applications.member.1.Name: 'Hadoop'
            Applications.member.2.Name: 'Ganglia'

it’d work!
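
Writing these numbered keys by hand gets tedious for longer lists. As a rough sketch (not part of mrjob; flatten_params is a hypothetical helper), nested lists and dicts could be expanded into the same "member.N" key style like this, assuming every list follows that convention:

```python
def flatten_params(value, prefix=""):
    """Flatten nested dicts and lists into dotted 'X.member.N.Y' keys.

    Dict keys are joined with dots; list items are numbered from 1
    under a 'member.N' segment. Scalars are kept as-is.
    """
    flat = {}
    if isinstance(value, dict):
        for key, item in value.items():
            child = "%s.%s" % (prefix, key) if prefix else key
            flat.update(flatten_params(item, child))
    elif isinstance(value, (list, tuple)):
        for index, item in enumerate(value, start=1):
            flat.update(flatten_params(item, "%s.member.%d" % (prefix, index)))
    else:
        flat[prefix] = value
    return flat
```

Calling flatten_params({"Applications": [{"Name": "Hadoop"}, {"Name": "Ganglia"}]}) produces the two Applications.member.N.Name keys shown in the config above, which could then be passed as emr_api_params.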

-Dave


sir Gollum

Apr 26, 2016, 3:23:24 PM
to mrjob
Thanks for the update,

I also found this approach after I had upgraded my pipeline to mrjob 0.5.0/Hadoop 2; I just forgot to post the solution here.
While this works, it would be good to have the ability to specify these parameters in a more convenient way.
Also, as I understand it, the question of EMR 4.x application configurations (http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-configure-apps.html) is still open.
