scoobi's mapper completing to 100% and then waiting there until killed

Alex Cozzi

Jul 16, 2012, 1:46:07 PM

to scoobi...@googlegroups.com

I observe a strange pattern in my scoobi program: the mappers all run fairly quickly and reach 100%, but they never complete and just sit there. I see counters being incremented, but only a few map tasks eventually complete, while most of them are killed by hadoop.

I observe this behavior with both 0.4.0 and 0.5.0-SNAPSHOT.

The errors reported in the JobTracker look like:

Task attempt_201207041245_62068_m_000000_0 failed to report status for 600 seconds. Killing!

My program starts with this output:

12/07/16 10:26:40 INFO scoobi.Job: Running job: scoobi-20120716-102640-e4766d46-7238-41ce-ba3c-5c2ef89f5202
12/07/16 10:26:40 INFO scoobi.Job: Number of steps: 4
12/07/16 10:26:40 INFO scoobi.Job: Running step: 1 of 4
12/07/16 10:26:40 INFO scoobi.Job: Number of input channels: 1
12/07/16 10:26:40 INFO scoobi.Job: Number of output channels: 1
12/07/16 10:26:40 INFO scoobi.Job: 0: Combiner10(GroupByKey9(GbkMapper8(Op7[Op6[Return1(()),Return1(())],Return1(())],Load5)))

This is what it looks like in the JobTracker:

Kind    % Complete  Num Tasks  Pending  Running  Complete  Killed  Failed/Killed Task Attempts
map     55.61%      2700       3        2503     194       0       1232 / 0
reduce  0.00%       1279       1279     0        0         0       0 / 0

	Counter	Map	Total
Job Counters	SLOTS_MILLIS_MAPS	0	91,642,334
	Rack-local map tasks	0	1,029
	Launched map tasks	0	3,932
	Data-local map tasks	0	1,792
FileSystemCounters	HDFS_BYTES_READ	651,170,884,621	651,170,884,621
FileSystemCounters	FILE_BYTES_WRITTEN	1,649,776,629	1,649,776,629
Map-Reduce Framework	Combine output records	196,669,398	196,669,398
	Map input records	338,383,645	338,383,645
	Spilled Records	196,659,253	196,659,253
	Map output bytes	23,752,869,776	23,752,869,776
	SPLIT_RAW_BYTES	480,312	480,312
	Map output records	1,484,554,487	1,484,554,487
	Combine input records	205,027,427	205,027,427

Does anybody have any idea?

Alex
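A few ratios can be read off the JobTracker numbers above. The sketch below is plain Python arithmetic on values copied from those tables; the variable names are made up for illustration, and the interpretation in the comments is an assumption, not something the counters prove.

```python
# Values copied from the JobTracker tables above.
num_map_tasks = 2_700
failed_map_attempts = 1_232
launched_map_tasks = 3_932

map_output_records = 1_484_554_487
combine_input_records = 205_027_427
combine_output_records = 196_669_398

# Assumption: each task is launched once, plus once more per failed attempt,
# so launched attempts should equal tasks plus failed attempts.
relaunches_account_for_all = (
    num_map_tasks + failed_map_attempts == launched_map_tasks
)

# Fraction of combiner input records that survive as combiner output.
combine_ratio = combine_output_records / combine_input_records

# Fraction of the map output records that had been fed to the combiner
# at the time of this snapshot.
combined_fraction = combine_input_records / map_output_records

print(relaunches_account_for_all)
print(f"{combine_ratio:.3f} {combined_fraction:.3f}")
```

If these counters are representative, the combiner removes only about 4% of the records it sees, so running without it would probably add little shuffle volume for this particular job.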

Alex Cozzi

Jul 16, 2012, 1:48:20 PM

to scoobi...@googlegroups.com

Also, I noticed that step 2 fails to detect the failure of the first step and starts without reporting an error, but the end result is an empty file, since it was supposed to join the result of step 1.

Alex Cozzi

Jul 16, 2012, 7:11:58 PM

to scoobi...@googlegroups.com

Some more information: I tried the SAME job on our other cluster, which runs hadoop 22, using Ben's CDH4 branch (thanks Ben!), and it works without problems, while on our main cluster (which runs hadoop 20) it hangs.

So I think this is either an issue with hadoop-20 or with the way scoobi reports job completion.

Eric Springer

Jul 16, 2012, 7:17:29 PM

to scoobi...@googlegroups.com

On Tue, Jul 17, 2012 at 3:46 AM, Alex Cozzi <alex...@gmail.com> wrote:

Does anybody have any idea?

I guess it would help to know a bit more about what you're doing, but I have seen this before. The case where I've seen it is when using a combine operation in scoobi. This translates to a combiner in hadoop, which has a bug: no matter what you do, if the combiner doesn't complete within $TIMEOUT_TIME, the task times out.

Looking at that summary, it might be the same problem. An easy hack is to just remove the combine and replace it with a logically equivalent (but less efficient) map. Or, if you're feeling particularly masochistic, you can rearrange the code to rely less on combiners (along the lines of this: https://github.com/NICTA/scoobi/blob/master/src/main/scala/com/nicta/scoobi/lib/Matrix.scala#L269 )

But maybe these are different issues, I'm not sure :D
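The first workaround, replacing the combine with a logically equivalent map, can be illustrated outside scoobi. The sketch below is plain Python over in-memory lists, not scoobi or hadoop API; the function names `with_combiner` and `without_combiner` and the sample data are hypothetical. It only shows why the two forms give the same result for an associative operation such as addition: the combiner version pre-aggregates each mapper's output before the shuffle, while the map version ships every record and folds the grouped values afterwards.

```python
from collections import defaultdict
from functools import reduce
from operator import add


def group_by_key(pairs):
    """Group (key, value) pairs into {key: [values]}, like a shuffle."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def with_combiner(mapper_outputs, op):
    """Pre-aggregate each mapper's output, then reduce the partial results."""
    partials = []
    for output in mapper_outputs:
        # Map-side combine: one partial value per key per mapper.
        for key, values in group_by_key(output).items():
            partials.append((key, reduce(op, values)))
    return {k: reduce(op, vs) for k, vs in group_by_key(partials).items()}


def without_combiner(mapper_outputs, op):
    """Ship every record through the shuffle and fold the values afterwards."""
    all_pairs = [pair for output in mapper_outputs for pair in output]
    return {k: reduce(op, vs) for k, vs in group_by_key(all_pairs).items()}


# Hypothetical output of two mappers.
mapper_outputs = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("a", 4), ("b", 5)],
]

combined = with_combiner(mapper_outputs, add)
plain = without_combiner(mapper_outputs, add)
print(combined == plain)
```

The second form moves more records between map and reduce, which is the efficiency cost mentioned above.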

Eric Springer

Jul 16, 2012, 7:20:53 PM

to scoobi...@googlegroups.com

It's probably overkill, but since you're maintaining your own fork you
could try changing this line:

https://github.com/NICTA/scoobi/blob/master/src/main/scala/com/nicta/scoobi/impl/exec/MapReduceJob.scala#L154

to:

"if (false)"

That way it doesn't use a hadoop combiner at all.

Alex Cozzi

Jul 16, 2012, 7:34:48 PM

to scoobi...@googlegroups.com

Thanks Eric!
I will try it tomorrow and report back.
Alex

Alex Cozzi

Jul 17, 2012, 1:28:44 AM

to scoobi...@googlegroups.com

I ran the experiment, and with your suggested change the job terminates correctly, so your hypothesis of a combiner bug seems confirmed.

Do you have a bug # for this problem in hadoop that I can pass over to our hadoop team?

Thank you so much!

Eric Springer

Jul 17, 2012, 1:51:21 AM

to scoobi...@googlegroups.com

Not a problem.

The best reference I can give you is:
http://www.mail-archive.com/commo...@hadoop.apache.org/msg15941.html

There are a couple of bug numbers linked from that, but I'm not sure if they
match exactly. What I did observe from my earlier experiments was that
there was absolutely nothing I could do to stop a combiner timing out
if it couldn't finish within its $ALLOWED_TIMEOUT window.
