job failures from NotReplicatedYetException

Oscar Gothberg

May 10, 2010, 2:05:04 AM
to cascading-user
Hi,

I have a setup where I execute several Cascades in sequence, and I keep seeing a significant percentage of them fail with "NotReplicatedYetException" on files in a _temporary subdirectory of the output dir.

Maybe 10-20% of the Cascades fail this way; not all of them, but enough to make the setup unusable.

Any input on what is happening here is appreciated.

/ Oscar

cascading.flow.FlowException: internal error during reducer execution
at cascading.flow.FlowReducer.reduce(FlowReducer.java:82)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:463)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.NotReplicatedYetException: Not replicated yet:/test/out/dayperiod=14731/_temporary/_attempt_201005052338_0194_r_000001_0/part-00001
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1253)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422)
at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
at org.apache.hadoop.ipc.Client.call(Client.java:739)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
at $Proxy1.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy1.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2904)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2786)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2076)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2262)


--
You received this message because you are subscribed to the Google Groups "cascading-user" group.
To post to this group, send email to cascadi...@googlegroups.com.
To unsubscribe from this group, send email to cascading-use...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/cascading-user?hl=en.

Chris K Wensel

May 10, 2010, 2:52:11 AM
to cascadi...@googlegroups.com
This strikes me as a Hadoop issue; you might bring it up on the Hadoop list.

cheers,
chris
--
Chris K Wensel
ch...@concurrentinc.com
http://www.concurrentinc.com

Nick Dimiduk

May 10, 2010, 2:56:45 PM
to cascadi...@googlegroups.com
Sounds like Cascading is processing faster than your HDFS can replicate blocks! Great job with those perf improvements, Chris :D

What's your replication rate set to? How fast is your network between nodes? Is your configuration/deployment resource-starving the cluster in some way?

Oscar Gothberg

May 10, 2010, 3:12:01 PM
to cascadi...@googlegroups.com
Thanks Nick,

the replication is set to 3, network speed is 1000Base-T according to
dmesg. Cluster is 11 machines (10 workers, 1 dedicated namenode +
jobtracker), 4x4 core CPUs, 128 GB RAM, so fairly beefy machines.

I don't know if there's resource starvation going on, is there a good
way to check? What are critical configuration points?

I do see nodes intermittently being blacklisted, then dropping off the
blacklist again while others take their place. I'm not sure what is
causing that.

Any input appreciated.

/ Oscar

Nick Dimiduk

May 10, 2010, 3:24:35 PM
to cascadi...@googlegroups.com
I've used Ganglia for cluster monitoring in the past. It can be very helpful for observing the whole cluster in aggregate as well as tracking down issues on specific machines; it gives you pretty pictures of resource utilization.

This blacklist behavior is "a bad thing." If nodes randomly fall off, the blocks they host become under-replicated. AFAIK, HDFS doesn't auto-rebalance in the background, only on new filesystem activity.

I'd dig into your node logs around the timeout/blacklist events and take this information to the hadoop-user list or #hadoop on irc. As Chris says, they'll be better equipped to help you on cluster configuration issues.
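A couple of the stock Hadoop commands are handy for this kind of check (a sketch, not a full diagnosis; the /test/out path is taken from the stack trace above, so substitute your own output directory):

```shell
# Summarize datanode status: capacity, last contact, dead/live nodes
hadoop dfsadmin -report

# Scan for under-replicated or corrupt blocks; restricting the scan
# to the job's output tree keeps it quick
hadoop fsck /test/out -blocks -locations
```

If fsck reports under-replicated blocks right after a node drops off the blacklist, that lines up with the theory above.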

Best of luck!

-Nick

Ken Krugler

May 10, 2010, 3:34:37 PM
to cascadi...@googlegroups.com
Hi Oscar,

On May 10, 2010, at 12:12pm, Oscar Gothberg wrote:

> Thanks Nick,
>
> the replication is set to 3, network speed is 1000Base-T according to
> dmesg. Cluster is 11 machines (10 workers, 1 dedicated namenode +
> jobtracker), 4x4 core CPUs, 128 GB RAM, so fairly beefy machines.
>
> I don't know if there's resource starvation going on, is there a good
> way to check? What are critical configuration points?

I'd seen something similar when I was running 50 slaves on EMR. The
problem was that the NameNode wasn't configured with enough listening
threads, so when the slaves got really busy creating files, things
would back up.

I made two changes: one was to use a beefier box for the master, and
the other was (IIRC) to increase dfs.namenode.handler.count, e.g.
maybe 128 in your case.
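If memory serves, that property goes in hdfs-site.xml on the master, followed by a NameNode restart. Roughly (128 here is just the ballpark above, not a tested value):

```xml
<!-- hdfs-site.xml on the namenode -->
<property>
  <name>dfs.namenode.handler.count</name>
  <!-- default is 10; raise when many tasks create files concurrently -->
  <value>128</value>
</property>
```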

-- Ken
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g

Oscar Gothberg

May 11, 2010, 5:01:55 PM
to cascadi...@googlegroups.com
Thank you Ken,

just upping dfs.namenode.handler.count on its own didn't alleviate
the problem; it actually increased the failure rate to more like
90-100%. But it got me on the right track.

I then tried also increasing dfs.datanode.handler.count, and the
problem seems to have gone away completely.

I went with a dfs.namenode.handler.count of 40 (default was 10) and a
dfs.datanode.handler.count of 8 (default was 3), FWIW to anyone else
out there seeing this problem.
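In hdfs-site.xml form, that's roughly:

```xml
<!-- NameNode RPC handler threads, up from the default of 10 -->
<property>
  <name>dfs.namenode.handler.count</name>
  <value>40</value>
</property>
<!-- DataNode handler threads, up from the default of 3 -->
<property>
  <name>dfs.datanode.handler.count</name>
  <value>8</value>
</property>
```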

/ Oscar