Possible segfault: difficulty retrieving coredumps

Kai Ju Liu

Sep 21, 2011, 2:00:04 PM
to mr...@googlegroups.com
Hi. I've been having the following issue recently with one of my MRJob jobs. This job interacts with a MySQL database in the reducer steps via gevent and a trusted MySQL client library. Reducers will fail with the following error message, and given enough failures, the job will of course fail.

2011-09-21 07:14:21,750 WARN org.apache.hadoop.mapred.TaskTracker (main): Error running child
java.io.IOException: subprocess exited with error code 139
R/W/S=7468/0/0 in:414=7468/18 [rec/s] out:0=0/18 [rec/s]
minRecWrittenToEnableSkip_=9223372036854775807 LOGNAME=null
HOST=null
USER=hadoop
HADOOP_USER=null
last Hadoop input: |null|
last tool output: |null|
Date: Wed Sep 21 07:14:21 UTC 2011
Broken pipe
    at org.apache.hadoop.streaming.PipeReducer.reduce(PipeReducer.java:131)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:467)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:415)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)
2011-09-21 07:14:21,753 INFO org.apache.hadoop.mapred.TaskRunner (main): Runnning cleanup for the task

Does this error message indicate an actual segfault in the reducer tasks?

I've been working under the assumption that the reducer tasks are indeed segfaulting, but so far I haven't been able to retrieve any coredumps. I've set the core limit on all nodes and configured a fixed coredump location so that dumps won't be automatically cleaned up by the TaskTracker. I've also run simple streaming jobs with Python sleeps and manually sent "kill -11" to the Python processes; in all of those tests I see the same error message as above, and the coredumps show up in the expected location. With the real job, though, no coredumps ever appear.
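
One thing I've been wondering about is whether the streaming child process itself inherits a soft core limit of 0 regardless of the node-level ulimit. If so, something like the following at the top of the reducer script might help; this is just a sketch of my own, not anything mrjob does for you:

    # Sketch: raise the core-dump limit from inside the streaming task itself,
    # in case the TaskTracker's child inherits a soft limit of 0 even though
    # the node-level ulimit is set. If the hard limit is also 0, this won't help.
    import resource

    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))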

Is there anything else that must be configured in order to retrieve coredumps from streaming jobs?

Thanks!
Kai Ju

Steve Johnson

Sep 21, 2011, 2:24:58 PM
to mr...@googlegroups.com
This may be a silly question, but you are logging into the machines over SSH to look for core dumps, right? mrjob doesn't know how to fetch them. How are you looking for error messages?

I'm guessing your problems are quite a bit beyond that, though, so I don't think I'll be of much help.

Kai Ju Liu

Sep 21, 2011, 2:44:26 PM
to mr...@googlegroups.com
Hi Steve. I've found that there are no silly questions when it comes to Hadoop debugging, heh.

The chain of debugging is to look at the job details in the web UI, drill down to failed reduce tasks, check the error column, and then drill down to failed attempts and their corresponding hosts. I then check these hosts for coredumps.
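
On each host I basically just look for recent core files, roughly like this; the search paths here are only placeholders for wherever your core_pattern and mapred.local.dir actually point:

    # Rough sketch: walk a few likely directories on a host and list any
    # recent core files. The paths and the age cutoff are placeholders.
    import os
    import time

    SEARCH_ROOTS = ['/var/coredumps', '/mnt/hadoop/mapred/local']
    MAX_AGE_SECONDS = 6 * 3600  # only report cores from the last six hours

    now = time.time()
    for root in SEARCH_ROOTS:
        for dirpath, dirnames, filenames in os.walk(root):
            for name in filenames:
                if name.startswith('core'):
                    path = os.path.join(dirpath, name)
                    if now - os.path.getmtime(path) < MAX_AGE_SECONDS:
                        print('%s (%d bytes)' % (path, os.path.getsize(path)))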

Let me know if you think of anything else. Thanks!

Kai Ju

Shivkumar Shivaji

Sep 21, 2011, 2:45:42 PM
to mr...@googlegroups.com
There is probably a good way to debug through the whole process, and others can probably comment better on debugging coredumps.

I have perhaps a simpler, higher-level question:

I normally have the reducer write to, say, S3 and later load that data into MySQL in a separate, non-MapReduce process. If you suspect the MySQL client library is at fault, you can probably test it independently outside MapReduce, and it may be worth moving the MySQL interaction out of MapReduce entirely. The reason is not only that it complicates debugging, but also that typical SQL queries are slow. Having each component run independently makes testing easier as well.
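
For example, something like the script below, run on its own outside Hadoop, would tell you whether the gevent + MySQL combination is a problem by itself. It's purely schematic: I don't know which client library you're using, so PyMySQL and the connection parameters are just stand-ins.

    # Schematic standalone test of the gevent + MySQL interaction, run outside
    # MapReduce. PyMySQL and the connection parameters are placeholders; use
    # whatever client library and credentials the reducer actually uses.
    from gevent import monkey
    monkey.patch_all()

    import gevent
    import pymysql  # placeholder for the actual client library

    def worker(i):
        conn = pymysql.connect(host='localhost', user='test', passwd='test', db='test')
        cur = conn.cursor()
        cur.execute('SELECT 1')
        cur.fetchall()
        cur.close()
        conn.close()

    # Spawn roughly the same level of concurrency the reducer uses.
    jobs = [gevent.spawn(worker, i) for i in range(50)]
    gevent.joinall(jobs)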

When I've seen error code 139 before, it was typically a memory issue. You can find the exact Python exception by looking at the stderr/stdout logs in S3.
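
If you're running on EMR, the task-attempt logs end up in the log bucket on S3. A rough sketch with boto follows; the bucket name and job flow id are made up, and the exact prefix layout can vary by Hadoop/EMR version:

    # Rough sketch: pull task-attempt stderr files out of an EMR log bucket.
    # Bucket name, job flow id, and prefix layout are placeholders.
    import boto

    conn = boto.connect_s3()
    bucket = conn.get_bucket('my-emr-log-bucket')
    prefix = 'logs/j-XXXXXXXXXXXX/task-attempts/'

    for key in bucket.list(prefix=prefix):
        if key.name.endswith('stderr'):
            print(key.name)
            print(key.get_contents_as_string())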

Shiv