Individual Step Size Limit


Shivkumar Shivaji

Jun 23, 2011, 10:47:12 PM
to mr...@googlegroups.com
I ran a job with a huge number of globs, i.e. patterns like:

s3://machine_name/log_type/filename.2011.06.01.*

Does anyone know the limit allowed for an individual step within EMR? I could not find anything in the EMR docs. I hope there is a way to raise it.

Thanks, Shiv

<ErrorResponse xmlns="http://elasticmapreduce.amazonaws.com/doc/2009-03-31">
  <Error>
    <Type>Sender</Type>
    <Code>ValidationError</Code>
    <Message>Size of an individual step exceeded the maximum allowed</Message>
  </Error>
  <RequestId></RequestId>
</ErrorResponse>

Dave Marin

Jun 24, 2011, 2:24:50 PM
to mr...@googlegroups.com
I can't find anything either, but if I had to guess based on
experience, I'd say the limit is probably 10000 characters.

-Dave

--

Yelp is looking to hire great engineers! See http://www.yelp.com/careers.

Steve Johnson

Jun 24, 2011, 2:55:13 PM
to mr...@googlegroups.com
If your log files are named by date and you're trying to process logs from consecutive days, you might find this useful for cutting down the number of characters it takes to express a range of log files:
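The general idea of such a date-range helper (a hypothetical sketch, not the actual utility Steve had in mind) is to emit one pattern per day, but collapse a fully covered month into a single wildcard:

```python
from calendar import monthrange
from datetime import date, timedelta

def date_range_globs(start, end):
    """Cover the inclusive date range [start, end] with as few date
    patterns as possible: a whole-month wildcard like '2011-06-*' when
    the month is fully covered, otherwise one '%Y-%m-%d' entry per day.
    """
    globs = []
    d = start
    while d <= end:
        last_day = monthrange(d.year, d.month)[1]
        month_end = date(d.year, d.month, last_day)
        if d.day == 1 and month_end <= end:
            # the whole month fits inside the range: one wildcard
            globs.append(d.strftime('%Y-%m-*'))
            d = month_end + timedelta(days=1)
        else:
            globs.append(d.strftime('%Y-%m-%d'))
            d += timedelta(days=1)
    return globs
```

For June 1 through July 2, 2011 this yields ['2011-06-*', '2011-07-01', '2011-07-02'] instead of 32 per-day patterns, which is exactly the kind of character savings that matters against a step-size limit.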

-Steve Johnson

Shivkumar Shivaji

Jun 24, 2011, 3:04:42 PM
to mr...@googlegroups.com
The dateglob utility grew out of a patch I sent to Dave Marin for handling date patterns.

The issue I have is that, using my original patch code, someone at my workplace hit the exception after trying it with 18 days of logs. Our S3 input sources are also disorganized, in that we pull the data from 8 machines:

s3://machine2/traffic-log/2011-06/2011-06-01.01.txt
s3://machine3/traffic-log/2011-06/2011-06-01.01.txt
..

One option is to move all the files for a log type to one place on S3, e.g.


However, I will also try to find out whether the character limit can be raised, though I suspect it cannot. There might be an alternative too.

We have data split into hours, days, months (by folder), and machines (by folder). There might be a glob trick to reduce this length.

Shiv

Mat Kelcey

Jun 24, 2011, 3:28:10 PM
to mrjob
The parameter limit is 10240 characters; I'll make sure it's added to the docs.

Another approach to globbing is to use a manifest, i.e. pass one (or more) files that contain the paths inside them (one path per line). This approach is common, and though it requires an extra processing step, it scales to as many paths as you need.
Mat (SDE on EMR)

Shivkumar Shivaji

Jun 24, 2011, 4:45:09 PM
to mr...@googlegroups.com
Thanks Mat.

I will try the manifest. This will likely solve the problem well.

Shiv

Shivkumar Shivaji

Jul 28, 2011, 5:44:42 AM
to mr...@googlegroups.com
Got back to trying this out. Is there any documentation for the manifest?

I tried creating a file with S3 paths and it did not work. How exactly do I tell EMR to process the S3 paths listed in a given file?

Thanks, Shiv

Jim Blomo

Aug 3, 2011, 5:43:36 PM
to mr...@googlegroups.com
Mat may correct me if I'm wrong, but what I think he's suggesting is a mapper that takes a filename, downloads the file, and yields lines from that file, e.g.:

    def manifest_mapper(self, _, url):
        # read_url is a stand-in for whatever fetches the S3 object
        filestream = read_url(url)
        for line in filestream:
            yield line

    def normal_mapper(self, _, line):
        ... do log processing ...

Jim

Shivkumar Shivaji

Aug 3, 2011, 9:01:18 PM
to mr...@googlegroups.com
I entirely missed this possibility, as I was thinking a manifest meant a manifest.mf file.

I do have questions about the performance of this approach. If we download files from S3 ourselves and map over their lines, we are not using the EMR API to process S3 file locations, so we lose out on both storage and performance optimizations. Is there a solution that continues to leverage EMR's efficient handling of S3 inputs?

Thanks, Shiv

Jim Blomo

Aug 8, 2011, 3:48:51 PM
to mr...@googlegroups.com
On Wed, Aug 3, 2011 at 6:01 PM, Shivkumar Shivaji <sshi...@gmail.com> wrote:
> I do have questions about the performance of this approach. If one downloads a file from s3 and maps the line, we are not using emr api to process s3 file locations. Thus, we lose out on both storage and performance optimizations. Is there a solution that can continue to leverage EMR's efficient handling of s3 inputs?
>

None that I'm aware of, sorry.

Jim

Shivkumar Shivaji

Aug 19, 2011, 2:59:49 AM
to mr...@googlegroups.com
I have a generic version that works with the manifest mapper. That is, I have extended mrjob to check whether the total length of the input paths exceeds about 10K characters; if so, it writes out a manifest file and extends the steps with a manifest mapper. I changed the protocols for the manifest step via pick_protocols. I have also written code to decompress .gz files on the fly, saving disk space.

I ran into an ugly code issue. From within EMR, you need to create an S3 connection from scratch, so I had to pass the credentials over the wire. Reusing the existing make_s3_conn() from emr.py was not easy and led to some odd boto exceptions. I was able to make it work by opening a new boto connection from within EMR. It would be great if emr.py could be changed to allow easier S3 connections from EMR.

I am happy to share this patch if others are interested. I would like to clean up the S3 code reuse and agree on a clean way to send AWS credentials to EMR so that S3 files can be downloaded and processed properly.

Shiv

Dave Marin

Aug 19, 2011, 12:02:13 PM
to mr...@googlegroups.com
The S3 credentials are actually already stored on EMR, because Hadoop needs them.

The master bootstrap script in the current version of mrjob reads
these credentials so that it can download bootstrap files. See:

https://github.com/Yelp/mrjob/blob/master/mrjob/emr.py#L1490

(This is code that prints out code; hopefully it's not too confusing.
Just ignore the calls to writeln() and the quotes.)

Would something similar work for your task?

-Dave

Shivkumar Shivaji

Aug 19, 2011, 1:54:10 PM
to mr...@googlegroups.com
I solved the emr.py S3 reuse to some extent. It still creates one more S3 connection than necessary, but that is easy to fix if emr.py is patched.

I am using an older version of mrjob, but I think the lines you point out will solve the credentials issue. Looking forward to trying it.

Thanks, Shiv

Shivkumar Shivaji

Aug 31, 2011, 8:41:34 PM
to mr...@googlegroups.com
Got the engineering side of it to work. There were a few more twists:
1. I needed to increase the buffer size for S3 streaming transfers to get acceptable performance.
2. I needed to assign many more mappers to the manifest step, because EMR wrongly guessed that one or two mappers would suffice, given that the input (one line per S3 file to download) is only a few KB.
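For the second twist, one plausible knob (my assumption, not necessarily what was used here) is Hadoop's NLineInputFormat together with jobconf hints, sketched below as a Python dict using the old-API property names:

```python
# Hypothetical jobconf for the manifest step: since the manifest itself is
# tiny, ask Hadoop to hand each mapper only a few lines (i.e. a few S3
# files) so the downloads run in parallel. Property names are the old
# (pre-0.21) Hadoop ones and vary across versions.
MANIFEST_STEP_JOBCONF = {
    'mapred.map.tasks': '64',                     # a hint, not a guarantee
    'mapred.line.input.format.linespermap': '4',  # used by NLineInputFormat
}
```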

I am still troubled by the fact that Amazon is likely throttling my S3 downloads; the job seems to be taking much longer than usual. I wonder whether there is a way around this.

Thanks, Shiv

Shivkumar Shivaji

Sep 7, 2011, 6:47:01 PM
to mr...@googlegroups.com
It turns out there is a simpler solution if one is willing to write about 10 lines of Java code.

I created a new input format that extends TextInputFormat and overrides listStatus. I pass in the manifest file containing one S3 path per line, get the filesystem of each S3 path, and return the corresponding FileStatus objects from listStatus.

The nice thing about this solution is that it needs no extra step in the MR job and automatically scales to the desired number of mappers. In addition, its performance is quite close to passing in all the input paths directly. The downside is that you need a modified streaming jar containing the new input format and have to ship it when you run a job. However, the new input format is only about ten lines of code.

I also tested the solution indicated earlier in this thread; unfortunately, its first step was too slow.

Shiv

Dave Marin

Sep 7, 2011, 7:13:49 PM
to mr...@googlegroups.com
Neat, glad you found such a simple solution. :)

-Dave

Shivkumar Shivaji

Sep 7, 2011, 7:52:34 PM
to mr...@googlegroups.com
Thanks a lot for exposing the input format option; without it, this solution would not have been possible. In fact, I was using a version of mrjob prior to that change. Exposing the input and output formats lets one solve non-standard problems in Hadoop's native language, Java!

Shiv

Jim Blomo

Nov 19, 2011, 3:45:02 PM
to mr...@googlegroups.com
Hi Shiv, we're now running into the same problem you were :) Are you able to open-source your InputFormat solution? With your permission, I may include it in oddjob, our JVM interop library.

Jim

Shivkumar Shivaji

Nov 19, 2011, 4:17:05 PM
to mr...@googlegroups.com
Should not be a problem. My main worry is that the solution could be written more cleanly and with more test cases. However, it does work.

I will email it later today and gather your feedback.

Shiv

yi...@adience.com

Aug 25, 2015, 11:57:47 AM
to mrjob
Hi,

I've encountered the same need and so far couldn't find a better way than extending TextInputFormat. Could you by any chance publish your solution?

Thanks,
Yigal