scoobi's mapper completing to 100% and then waiting there until killed


Alex Cozzi

Jul 16, 2012, 1:46:07 PM
to scoobi...@googlegroups.com
I observe a strange pattern in my scoobi program: the mappers all run fairly quickly and reach 100%, but they never complete and just sit there. I see counters being incremented, but only a few map tasks eventually complete, while most of them are killed by hadoop.

I observe this behavior with 0.4.0 and 0.5.0-SNAPSHOT

The errors reported in the JobTracker look like:

Task attempt_201207041245_62068_m_000000_0 failed to report status for 600 seconds. Killing!


My program starts with this output:

12/07/16 10:26:40 INFO scoobi.Job: Running job: scoobi-20120716-102640-e4766d46-7238-41ce-ba3c-5c2ef89f5202
12/07/16 10:26:40 INFO scoobi.Job: Number of steps: 4
12/07/16 10:26:40 INFO scoobi.Job: Running step: 1 of 4
12/07/16 10:26:40 INFO scoobi.Job: Number of input channels: 1
12/07/16 10:26:40 INFO scoobi.Job: Number of output channels: 1
12/07/16 10:26:40 INFO scoobi.Job: 0: Combiner10(GroupByKey9(GbkMapper8(Op7[Op6[Return1(()),Return1(())],Return1(())],Load5)))


This is what it looks like in the JobTracker:

Kind    % Complete  Num Tasks  Pending  Running  Complete  Killed  Failed/Killed Task Attempts
map     55.61%      2700       3        2503     194       0       1232 / 0
reduce  0.00%       1279       1279     0        0         0       0 / 0


                      Counter                 Map              Reduce  Total
Job Counters          SLOTS_MILLIS_MAPS       0                0       91,642,334
                      Rack-local map tasks    0                0       1,029
                      Launched map tasks      0                0       3,932
                      Data-local map tasks    0                0       1,792
FileSystemCounters    HDFS_BYTES_READ         651,170,884,621  0       651,170,884,621
                      FILE_BYTES_WRITTEN      1,649,776,629    0       1,649,776,629
Map-Reduce Framework  Combine output records  196,669,398      0       196,669,398
                      Map input records       338,383,645      0       338,383,645
                      Spilled Records         196,659,253      0       196,659,253
                      Map output bytes        23,752,869,776   0       23,752,869,776
                      SPLIT_RAW_BYTES         480,312          0       480,312
                      Map output records      1,484,554,487    0       1,484,554,487
                      Combine input records   205,027,427      0       205,027,427





Does anybody have an idea?
Alex

Alex Cozzi

Jul 16, 2012, 1:48:20 PM
to scoobi...@googlegroups.com
Also, I noticed that step 2 fails to detect the failure of the first step and starts without reporting an error, but the end result is an empty file, since it was supposed to join the result of step 1.

Alex Cozzi

Jul 16, 2012, 7:11:58 PM
to scoobi...@googlegroups.com
Some more information: I tried the SAME job on our other cluster, which runs hadoop 22, using Ben's CDH4 branch (thanks Ben!) and it works without problems, while on our main cluster (which runs hadoop 20) it hangs.
So I think this is either an issue with hadoop-20 or with the way scoobi reports job completion.

Eric Springer

Jul 16, 2012, 7:17:29 PM
to scoobi...@googlegroups.com


On Tue, Jul 17, 2012 at 3:46 AM, Alex Cozzi <alex...@gmail.com> wrote:
Does anybody have an idea?

I guess it would help to know a bit more about what you're doing, but I have seen this before. The case where I've seen it is when using a combine operation in scoobi. This translates to a combiner in hadoop, which has a bug: [no matter what you do] if the combiner doesn't complete within $TIMEOUT_TIME, the task will time out.

Looking at that summary, it might be the same problem. An easy hack is to just remove the combine and replace it with a logically equivalent (but less efficient) map. Or if you're feeling particularly masochistic, you can rearrange the code to rely less on combiners (along the lines of this: https://github.com/NICTA/scoobi/blob/master/src/main/scala/com/nicta/scoobi/lib/Matrix.scala#L269 )

But maybe these are different issues, I'm not sure :D
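To illustrate why dropping the combine is logically equivalent but less efficient, here is a minimal sketch in plain Scala collections. It is not Scoobi or Hadoop code: the object `CombinerSketch`, its method names, and the idea of modelling input splits as nested `Seq`s are all assumptions made for illustration. It only shows that pre-aggregating per split (what a combiner does) and aggregating everything after the shuffle give the same result when the function is associative, while the number of shuffled records differs.

```scala
// Illustration only: plain Scala collections stand in for map tasks and the
// shuffle. None of these names come from Scoobi or Hadoop.
object CombinerSketch {
  type KV = (String, Int)

  // Reduce all values of each key with f (f is assumed to be associative).
  private def reduceByKey(records: Seq[KV], f: (Int, Int) => Int): Map[String, Int] =
    records.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

  // With a combiner: each split pre-aggregates its own records per key, so
  // only the partial results are shuffled to the reduce side.
  // Returns the final result and the number of shuffled records.
  def withCombiner(splits: Seq[Seq[KV]], f: (Int, Int) => Int): (Map[String, Int], Int) = {
    val partials: Seq[KV] = splits.flatMap(split => reduceByKey(split, f).toSeq)
    (reduceByKey(partials, f), partials.size)
  }

  // Without a combiner: every map output record is shuffled, and all of the
  // aggregation happens afterwards.
  def withoutCombiner(splits: Seq[Seq[KV]], f: (Int, Int) => Int): (Map[String, Int], Int) = {
    val shuffled: Seq[KV] = splits.flatten
    (reduceByKey(shuffled, f), shuffled.size)
  }
}
```

For example, with the two splits `Seq("a" -> 1, "a" -> 2, "b" -> 5)` and `Seq("a" -> 3, "b" -> 7)` and `_ + _` as the function, both variants return the same sums per key, but the combiner variant shuffles four records instead of five.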

Eric Springer

Jul 16, 2012, 7:20:53 PM
to scoobi...@googlegroups.com
It's probably overkill, but since you're maintaining your own fork you
could try changing this line:

https://github.com/NICTA/scoobi/blob/master/src/main/scala/com/nicta/scoobi/impl/exec/MapReduceJob.scala#L154

to:

"if (false)"

That way it doesn't use a hadoop combiner at all.

Alex Cozzi

Jul 16, 2012, 7:34:48 PM
to scoobi...@googlegroups.com
Thanks Eric!
I will try it tomorrow and report back.
Alex

Alex Cozzi

Jul 17, 2012, 1:28:44 AM
to scoobi...@googlegroups.com
I did the experiment, and with your suggested change the job actually terminates correctly, so your hypothesis of a combiner bug seems confirmed.
Do you have a bug # for this problem in hadoop that I can pass over to our hadoop team?
Thank you so much!

Eric Springer

Jul 17, 2012, 1:51:21 AM
to scoobi...@googlegroups.com
Not a problem.

The best reference I can give you is:
http://www.mail-archive.com/commo...@hadoop.apache.org/msg15941.html

There are a couple of bug numbers linked from that, but I'm not sure if they
match exactly. What I did observe from my earlier experiments was that
there was absolutely nothing I could do to stop a combiner timing out
if it couldn't finish within its $ALLOWED_TIMEOUT window.
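The kill rule behind the "failed to report status for 600 seconds" message from the start of the thread can be modelled with a small sketch. This is a simplified model, not Hadoop's implementation: the object `TaskTimeoutSketch`, the function `killedAt`, and its parameters are hypothetical names, and it assumes, in line with the hypothesis in this thread, that a long combiner phase reports no progress at all.

```scala
// Simplified model of a "no status report within the timeout" rule.
// All names here are hypothetical; this is not Hadoop code.
object TaskTimeoutSketch {
  // reports:     times (seconds since task start) at which status was reported
  // finishedAt:  time at which the task would finish if left alone
  // timeoutSecs: maximum allowed silence between two status reports
  // Returns the time at which the task would be killed, or None if it survives.
  def killedAt(reports: Seq[Long], finishedAt: Long, timeoutSecs: Long): Option[Long] = {
    // Moments at which the task is known to be alive: start, each report, finish.
    val checkpoints = (0L +: reports.filter(_ < finishedAt).sorted) :+ finishedAt
    checkpoints.sliding(2).collectFirst {
      // A silent gap longer than the timeout means a kill at prev + timeout.
      case Seq(prev, next) if next - prev > timeoutSecs => prev + timeoutSecs
    }
  }
}
```

Under this model, a task that last reports at 100 seconds and then spends 800 seconds in a silent phase is killed at 700 seconds when the timeout is 600, no matter how close it was to finishing.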