The driving force behind these features was actually Jim Blomo,
another Yelp developer I work with. He's going to send out a more
detailed email about these features in a moment. :)
-Dave
P.S. I'm still plugging away at v0.3.0, which will include combiners,
among other things. If you're interested, feel free to check out the
development branch and/or the v0.3.0 Issues list on github
(https://github.com/Yelp/mrjob/issues?milestone=1&state=open).
--
Yelp is looking to hire great engineers! See http://www.yelp.com/careers.
Hadoop Streaming has options for specifying plugins for jobs. We've
integrated two of these options in the latest release:
hadoop_input_format and hadoop_output_format (which correspond to
Hadoop Streaming's -inputformat and -outputformat options). By
default, Hadoop Streaming outputs a text file, with lines formatted by
the protocol you specify with the various mrjob --protocol arguments.
But what happens when you want to use advanced Hadoop file formats
like SequenceFiles, or demultiplex your output into different
subdirectories? That's where input and output formats come in.
Hadoop SequenceFiles encode key/value pairs in a binary format,
optionally compressed in a way that still allows the files to be split
between tasks. Because it is a binary format, not delimited by
newlines, we can't directly read or write these files with a streaming
job. But by passing
`--hadoop_output_format org.apache.hadoop.mapred.SequenceFileOutputFormat`
to your MRJob, the Python output is converted into the SequenceFile
format. You can use this if you have other jobs that expect SequenceFile
input, or if you find that TextOutputFormat is expensive to parse.
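For instance, a downstream job can read those SequenceFiles back through
streaming by asking Hadoop to convert them to text on the way in. The
script names and paths below are made up for illustration:

```shell
# First job writes SequenceFiles instead of plain text.
python mr_extract.py -r hadoop \
  --hadoop_output_format org.apache.hadoop.mapred.SequenceFileOutputFormat \
  --output-dir hdfs:///tmp/extracted \
  hdfs:///logs/raw

# Second job reads them back; SequenceFileAsTextInputFormat converts the
# binary keys/values to text so the streaming mapper can consume them.
python mr_aggregate.py -r hadoop \
  --hadoop_input_format org.apache.hadoop.mapred.SequenceFileAsTextInputFormat \
  hdfs:///tmp/extracted
```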
At Yelp, we have Hadoop jobs that extract data from log files.
Ideally, we'd only need to read through the logs once and save the
extracted data to different files: for example, data about search to
one directory, and data about business pages to another.
With streaming alone this is not possible: your output directory
contains all the data produced by your job. But with a custom
--hadoop_output_format implementing MultipleOutputFormat, your data
can be split into subdirectories depending on the key or value. Yelp
is testing this feature with 'oddjob,' an unsupported library which
can use a JSON key to determine which subdirectory to place your row
into.
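As a sketch of what that looks like from the Python side, a mapper
might emit a JSON key whose first element names the target
subdirectory. The "first element picks the directory" convention below
is an assumption for illustration; check oddjob's documentation for
the actual key format it expects:

```python
import json

def demux_key(log_line):
    """Build a JSON key whose first element names the output subdirectory.

    The convention that the first element picks the directory is
    hypothetical; a MultipleOutputFormat implementation like oddjob
    decides the actual routing rule.
    """
    if '/search?' in log_line:
        section = 'search'   # search traffic -> e.g. output/search/
    elif '/biz/' in log_line:
        section = 'biz'      # business pages -> e.g. output/biz/
    else:
        section = 'other'
    ip = log_line.split()[0]
    return json.dumps([section, ip])
```

A mapper would then yield this key alongside whatever value it
extracted, and the output format would route each row into the matching
subdirectory.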
Since custom input and output formats need access to their JVM
classes, you must bundle your custom classes together with the Hadoop
Streaming implementation into a single jar (at least for Hadoop
0.18.3; these features have not been thoroughly tested on newer
versions). You can point mrjob at this jar with --hadoop_streaming_jar
for local or S3 jars. --hadoop-streaming-jar-on-emr can be used if
your jar is already on the EMR instance, for example if you deploy it
during bootstrapping.
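For example (the jar names, bucket, and paths here are hypothetical):

```shell
# Jar stored locally or on S3: mrjob points Hadoop at it for you.
python mr_extract.py -r emr \
  --hadoop_streaming_jar s3://my-bucket/jars/streaming-plus-formats.jar \
  s3://my-bucket/logs/raw/

# Jar already on the EMR instances (e.g. deployed by a bootstrap action):
python mr_extract.py -r emr \
  --hadoop-streaming-jar-on-emr /home/hadoop/streaming-plus-formats.jar \
  s3://my-bucket/logs/raw/
```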
More information:
http://hadoop.apache.org/common/docs/r0.18.3/streaming.html#Specifying+Other+Plugins+for+Jobs
http://wiki.apache.org/hadoop/SequenceFile
http://hadoop.apache.org/common/docs/r0.18.3/api/org/apache/hadoop/mapred/lib/MultipleTextOutputFormat.html
https://github.com/jblomo/oddjob
We hope these new options are useful; please send us any questions or comments.
Jim