<ErrorResponse xmlns="http://elasticmapreduce.amazonaws.com/doc/2009-03-31">
<Error>
<Type>Sender</Type>
<Code>ValidationError</Code>
<Message>Size of an individual step exceeded the maximum allowed</Message>
</Error>
<RequestId></RequestId>
</ErrorResponse>
-Dave
--
Yelp is looking to hire great engineers! See http://www.yelp.com/careers.
I will try the manifest; that will likely solve the problem.
Shiv
I tried creating a file with s3 paths and it did not work. How exactly do I tell EMR to process the s3 paths in a given file?
Thanks, Shiv
On Jun 24, 2011, at 12:28 PM, Mat Kelcey wrote:
def manifest_mapper(self, _, url):
    # First step: treat each input line as an S3 URL from the manifest
    # and emit the lines of the file it points to. read_url() is a
    # placeholder for whatever download helper you use.
    filestream = read_url(url)
    for line in filestream:
        yield line

def normal_mapper(self, _, line):
    ... do log processing ...
Jim
I do have questions about the performance of this approach. If we download a file from S3 ourselves and map over its lines, we are not using the EMR API to process S3 file locations, so we lose out on both storage and performance optimizations. Is there a solution that can continue to leverage EMR's efficient handling of S3 inputs?
Thanks, Shiv
None that I'm aware of, sorry.
Jim
I ran into an ugly code issue. From EMR, you need to create an S3 connection from scratch, and I had to pass the credentials over the wire. Reusing the existing make_s3_conn() from emr.py was not easy and led to some odd boto exceptions. I was able to make it work by opening a new boto connection from within EMR. It would be great if emr.py could be changed to allow for easier S3 connections from EMR.
I am happy to share this patch if others are interested. I would like to clean up the S3 code reuse and agree on a clean way to send AWS credentials to EMR so that S3 files can be downloaded and processed properly.
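For what it's worth, one way to sidestep sending raw credentials to the cluster at all is to have the job launcher pre-sign each S3 URL before it goes into the manifest, so the mappers can fetch files over plain HTTPS with no boto connection on the task nodes. This is a standard-library-only sketch of AWS Signature Version 2 query-string signing (the scheme S3 supported in the boto era of this thread), not anything mrjob does itself:

```python
import base64
import hashlib
import hmac
import time
from urllib.parse import quote

def signed_s3_url(bucket, key, access_key, secret_key, expires_in=3600):
    # Build a pre-signed S3 GET URL using Signature Version 2:
    # sign "GET\n\n\n{expires}\n/{bucket}/{key}" with HMAC-SHA1 of the
    # secret key, then pass the access key, expiry, and signature as
    # query parameters.
    expires = int(time.time()) + expires_in
    string_to_sign = "GET\n\n\n%d\n/%s/%s" % (expires, bucket, key)
    digest = hmac.new(secret_key.encode(), string_to_sign.encode(),
                      hashlib.sha1).digest()
    signature = base64.b64encode(digest).decode()
    return ("https://%s.s3.amazonaws.com/%s"
            "?AWSAccessKeyId=%s&Expires=%d&Signature=%s"
            % (bucket, quote(key), access_key, expires,
               quote(signature, safe="")))
```

The manifest would then contain pre-signed URLs instead of s3:// paths, at the cost of the links expiring after `expires_in` seconds.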
Shiv
The master bootstrap script in the current version of mrjob reads
these credentials so that it can download bootstrap files. See:
https://github.com/Yelp/mrjob/blob/master/mrjob/emr.py#L1490
(This is code that prints out code; hopefully it's not too confusing.
Just ignore the calls to writeln() and the quotes.)
Would something similar work for your task?
-Dave
--
I am using an older version of mrjob, but I think the lines you point out will solve the credentials issue; I'm looking forward to trying that.
Thanks, Shiv
I am still troubled by the fact that Amazon is likely throttling my S3 downloads and the job seems to be taking much longer than usual. Wondering if there is a way around this.
Thanks, Shiv
I created a new input format which extends TextInputFormat and overrides listStatus. I pass in the manifest file containing, say, one S3 path per line. I then get the filesystem of each S3 path and return the FileStatus objects from the listStatus method.
The nice thing about this solution is that one does not need an extra step in the MR job and it auto scales to the number of desired mappers. In addition, the performance of this solution is quite close to passing in all the input paths directly. The downside is that you need to use a modified streaming jar with a new input format and send it across when you run a job. However, the new code for the input format is just about ten lines.
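The InputFormat itself is Java, but the expansion the overridden listStatus performs is, in spirit, just this (illustrative Python only; the real code is the roughly ten lines of Java described above, returning a FileStatus per path):

```python
def expand_manifest(manifest_path):
    # Read a manifest file with one S3 path per line and return the
    # non-blank paths. listStatus then hands each path to Hadoop as an
    # input split source, so the job scales to the usual number of
    # mappers with no extra MR step.
    with open(manifest_path) as f:
        return [line.strip() for line in f if line.strip()]
```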
I also tested the solution suggested earlier in this thread; unfortunately, the first step was too slow in that solution.
Shiv
On Aug 8, 2011, at 12:48 PM, Jim Blomo wrote: