[Errno 104] Connection reset by peer


Ножкин Андрей

Dec 16, 2013, 10:31:53 AM
to mr...@googlegroups.com
Oh guys, all these map-reduce tasks make me sad... It's not hard to code the algorithm - it's hard to set up a tool to execute it! Really sad =(

In short: a "socket.error: [Errno 104] Connection reset by peer" exception. The script does have access to S3, because it creates buckets and uploads some small files (I've checked manually via the AWS console). But the largest file - INPUT - is not uploaded. Hey, it's just 7GB of test data!

I've tried 4 times and always got errors.

mrjob==0.4.2

CONFIG
# cat /etc/mrjob.conf 
runners:
  inline:
    base_tmp_dir: /home/tmp
  emr:
    base_tmp_dir: /home/tmp

    aws_access_key_id: [VALID KEY HERE]
    aws_secret_access_key: [VALID SECRET HERE]
    aws_region: us-east-1
    ec2_instance_type: m1.medium
    num_ec2_instances: 7

# python /home/bigdata/mr_job_1.py -r emr  /home/filesystem/INPUT > /home/filesystem/OUTPUT
using configs in /etc/mrjob.conf
creating new scratch bucket mrjob-f02b7cd37b2bfffd
using s3://mrjob-f02b7cd37b2bfffd/tmp/ as our scratch dir on S3
creating tmp directory /home/tmp/mr_job_1.root.20131216.152251.298419
writing master bootstrap script to /home/tmp/mr_job_1.root.20131216.152251.298419/b.py
creating S3 bucket 'mrjob-f02b7cd37b2bfffd' to use as scratch space
Copying non-input files into s3://mrjob-f02b7cd37b2bfffd/tmp/mr_job_1.root.20131216.152251.298419/files/
Traceback (most recent call last):
  File "/home/bigdata/workers/process_data/swap_keys_and_sites.py", line 178, in <module>
    MRSwapData().run()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/job.py", line 494, in run
    mr_job.execute()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/job.py", line 512, in execute
    super(MRJob, self).execute()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/launch.py", line 147, in execute
    self.run_job()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/launch.py", line 208, in run_job
    runner.run()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/runner.py", line 458, in run
    self._run()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/emr.py", line 806, in _run
    self._prepare_for_launch()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/emr.py", line 817, in _prepare_for_launch
    self._upload_local_files_to_s3()
  File "/usr/local/lib/python2.7/dist-packages/mrjob/emr.py", line 905, in _upload_local_files_to_s3
    s3_key.set_contents_from_filename(path)
  File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 1290, in set_contents_from_filename
    encrypt_key=encrypt_key)
  File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 1221, in set_contents_from_file
    chunked_transfer=chunked_transfer, size=size)
  File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 713, in send_file
    chunked_transfer=chunked_transfer, size=size)
  File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 889, in _send_file_internal
    query_args=query_args
  File "/usr/local/lib/python2.7/dist-packages/boto/s3/connection.py", line 547, in make_request
    retry_handler=retry_handler
  File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 947, in make_request
    retry_handler=retry_handler)
  File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 908, in _mexe
    raise e
socket.error: [Errno 104] Connection reset by peer

Steve Johnson

Dec 16, 2013, 10:56:50 AM
to mr...@googlegroups.com
You should upload the file to S3 before running mrjob on it. You don't want to have to wait for 7 GB of data to upload every time you run your script anyway.
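For example (a sketch; `mybucket` is a placeholder bucket name, and this assumes the AWS CLI is installed and configured): upload the input once, then pass the s3:// path to the job so mrjob skips the local upload step entirely.

```shell
# one-time upload of the big input file
aws s3 cp /home/filesystem/INPUT s3://mybucket/INPUT

# subsequent runs read straight from S3
python /home/bigdata/mr_job_1.py -r emr s3://mybucket/INPUT > /home/filesystem/OUTPUT
```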
--
You received this message because you are subscribed to the Google Groups "mrjob" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mrjob+un...@googlegroups.com.
 

Ножкин Андрей

Dec 17, 2013, 10:18:32 AM
to mr...@googlegroups.com
Hey Steve - thanks for the reply. It helped: the job runs now, but I've hit another issue, probably more complicated than the previous one.

I'll try to explain briefly, with simplified code:
def mapper(self, _, line):
  j = json.loads(line)
  j['somekey'] = 'somevalue'
  yield j['anotherkey'], j

def combiner(self, key, values):
  for v in values:
    somevar = v['somekey']

...
So, the mapper updates a dict and yields it, and the combiner reads the key that the mapper was supposed to have added. Everything works well on the test data when running in local mode. But on Amazon EMR, the 'v' variable in the combiner doesn't have that key! It has all the others, but not the one I set in the code. I've re-checked and re-uploaded all the data to the S3 bucket, so everything looks really strange. I also tried running the map-reduce task on a smaller subset, around 300MB, and it worked as expected.

Thanks guys for your help.

Ножкин Андрей

Dec 17, 2013, 10:20:13 AM
to mr...@googlegroups.com
Forgot to mention - I use PickleProtocol as the internal protocol, for various reasons. Maybe that helps.
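For reference, setting the internal protocol in mrjob is a class-attribute config like this (a sketch; `MRSwapData` is the class name from the traceback above):

```python
from mrjob.job import MRJob
from mrjob.protocol import PickleProtocol

class MRSwapData(MRJob):
    # serialize values between mapper/combiner/reducer steps with
    # pickle instead of the default JSON protocol
    INTERNAL_PROTOCOL = PickleProtocol
```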

Jimmy Retzlaff

Dec 17, 2013, 12:22:35 PM
to mr...@googlegroups.com
Keep in mind that Hadoop may call your combiner 0, 1, or more times on a given stream of data. This means your combiner output data must be compatible with its input data (and your reducer must be OK with the combiner never being called). Does your combiner yield (key, value) pairs where the values always contain 'somekey'? Typically you only want to add a combiner as an optimization (early versions of mrjob didn't even support combiners), so one approach is to get everything working without the combiner first.
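To make that concrete, here is a pure-Python sketch (no Hadoop required; the field names mirror the simplified code above) of a combiner whose output has the same shape as its input, so running it a second time on its own output is harmless:

```python
import json

def mapper(line):
    # tag each JSON record and key it, as in the thread's simplified code
    j = json.loads(line)
    j['somekey'] = 'somevalue'
    yield j['anotherkey'], j

def combiner(key, values):
    # yield values shaped exactly like the ones consumed, so Hadoop can
    # run this combiner zero, one, or many times without losing fields
    merged = {}
    for v in values:
        merged.update(v)
    yield key, merged

# simulate Hadoop running the combiner twice on the mapper's output
key, value = next(mapper('{"anotherkey": "k1", "other": 1}'))
once = list(combiner(key, [value]))
twice = list(combiner(once[0][0], [v for _, v in once]))
print(twice[0][1]['somekey'])  # prints "somevalue": the tag survives repeated combining
```

If the combiner instead yielded a different shape (say, only a count), the second pass would crash or silently drop 'somekey', which is exactly the symptom seen on EMR.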

Jimmy

Ножкин Андрей

Dec 17, 2013, 10:31:26 PM
to mr...@googlegroups.com
Didn't know that. Thank you. It seems everything is OK now.