zookeeper connection problems

anahap

Jan 19, 2012, 8:53:12 AM
to storm-user
Hi,

After our cluster runs for about two hours under heavy load, I get the
following message and the workers restart.

org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /taskbeats/td_topology-1-1326978367/512
        at org.apache.zookeeper.KeeperException.create(KeeperException.ja

What is happening? I would understand a timeout, but why is the node missing?

Are there any timeouts I can set to resolve this problem?


TIA

nathanmarz

Jan 19, 2012, 7:20:11 PM
to storm-user
The just-released Storm 0.6.2 overhauls the ZK connection management
and will now properly reconnect instead of throwing an error.

-Nathan

Dwayne Pryce

Feb 17, 2012, 5:29:38 PM
to storm...@googlegroups.com
Hi Nathan!

I'm actually running the 0.7.0 Snapshot, and I'm getting this as well.  Full stacktrace is:

org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /taskbeats/storm-gi-2-1329514658/276
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
        at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
        at com.netflix.curator.framework.imps.SetDataBuilderImpl$2.call(SetDataBuilderImpl.java:139)
        at com.netflix.curator.framework.imps.SetDataBuilderImpl$2.call(SetDataBuilderImpl.java:135)
        at com.netflix.curator.RetryLoop.callWithRetry(RetryLoop.java:85)
        at com.netflix.curator.framework.imps.SetDataBuilderImpl.pathInForeground(SetDataBuilderImpl.java:131)
        at com.netflix.curator.framework.imps.SetDataBuilderImpl.forPath(SetDataBuilderImpl.java:123)
        at com.netflix.curator.framework.imps.SetDataBuilderImpl.forPath(SetDataBuilderImpl.java:33)
        at backtype.storm.zookeeper$set_data.invoke(zookeeper.clj:114)
        at backtype.storm.cluster$mk_distributed_cluster_state$reify__1767.set_ephemeral_node(cluster.clj:52)
        at backtype.storm.cluster$mk_storm_cluster_state$reify__2239.task_heartbeat_BANG_(cluster.clj:271)
        at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:90)
        at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28)
        at backtype.storm.daemon.task$mk_task$fn__3319.invoke(task.clj:157)
        at clojure.lang.AFn.applyToHelper(AFn.java:159)
        at clojure.lang.AFn.applyTo(AFn.java:151)
        at clojure.core$apply.invoke(core.clj:540)
        at backtype.storm.util$async_loop$fn__469.invoke(util.clj:243)
        at clojure.lang.AFn.run(AFn.java:24)
        at java.lang.Thread.run(Thread.java:662)

Is this the same error that anahap was having, or does it just superficially look that way?  My entire topology seems to come to a halt after this happens (at least in the sense that I know I have 3m tuples moving through, but only a smaller subset of them end up acked from my last bolt).

It may be interesting to note that my last bolt writes to Accumulo, which also uses ZooKeeper.  Could this be a configuration issue, where ZooKeeper needs some tweaking to be able to keep up with the demands of Storm *and* Accumulo?  The only things I've found in the ZK logs are some exceptions thrown when a client "may have closed the socket".

Ted Dunning

Feb 17, 2012, 8:38:19 PM
to storm...@googlegroups.com
Accumulo shouldn't add any significant ZK load, since its ZK state changes only rarely.

This may be due to GC pauses in the bolts, which could cause an inordinate delay in telling ZK that the clients are still alive.

What is your ZK configuration?  If you extend your session expiration limit, does this problem go away?  If so, it is likely that the client is not meeting the promises implicit in the ZK timeout.

Nathan Marz

Feb 20, 2012, 5:19:08 AM
to storm...@googlegroups.com
How often does this happen, and how big is your ZK cluster? Also, what version of ZK are you using? The NoNode exception is actually quite strange, I've never seen this exception on my clusters.

--
Twitter: @nathanmarz
http://nathanmarz.com

Dwayne Pryce

Feb 20, 2012, 1:18:39 PM
to storm-user
Ted:

Our ZK configuration is almost straight out of the box. The only
thing we tweaked was the maximum number of clients per IP, since we
had approximately 96 threads per physical node beating up on it all at
once. We increased that number to 5000.

Are there any settings optimizations you would recommend for common
Storm use cases? Most of our experience with ZK has been in
configuring it for Accumulo.
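
For reference, a hedged sketch of the kind of zoo.cfg settings being discussed here; all values are illustrative examples, not recommendations:

    # zoo.cfg (illustrative values only)
    tickTime=2000
    initLimit=10
    syncLimit=5
    # allow many connections from a single worker host
    maxClientCnxns=5000
    # upper bound on negotiated client session timeouts, in ms
    maxSessionTimeout=120000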

I won't be able to tweak anything right away; we finally generated
enough excitement with the numbers we were getting that another half
dozen R610s (on top of the half dozen delivered on Thursday) were
dropped off in our office today. They're going to be configured, and
we're going to expand our ZK and Accumulo clusters to also reside on
those nodes.

Nathan:
This was happening consistently. In our test cluster, we had 9 nodes
(a few physical, a few VMs - we have since decided VMs were pointless,
since we're starting a new JVM for each worker - we're just adding
overhead. Regardless, at the time this was happening, we had 9
"nodes"). We had ZK running on 3 physical servers (not in VMs).
Accumulo was also running on 3 servers (the same 3 servers,
actually). My topology configuration (since it was the only topology
running) was 4 workers per node, 12 threads per worker. Of the 3m
tuples we were trying to move through, this specific error would occur
anywhere from around the 500k mark to around the 2m mark (counted from
the spout).

Since we got the new servers, we're laying out the rack differently,
so the whole thing is down; I can't try it again right now and
probably won't get to until we add more nodes (which, of course, adds
more variables). If it continues happening after I've extended the
session expiration limit like Ted suggested, I'll report back here
with as much information as I can. If it doesn't continue happening,
I'll report back with that too (and get to work with a profiler!)

------------

Just to elaborate a bit on all of this; we have been running our
ingestion process for dozens/hundreds of data feeds on monolithic
systems for the last few years. We had real-time ingestion
requirements that left M/R out of the running, but in terms of the
scope of the data we were dealing with, we really needed a distributed
system to make our ingestion process truly viable for the future.
Basically, what we had was good enough for right now - but only
barely. And time-to-query requirements were only getting smaller with
an architecture that was completely untenable anyway. What I've been
trying to do as a proof of concept is take our existing code, fit it
into bolts and spouts (which it fit into pretty well, though it still
had some "features" that turned out to be less than helpful in a Storm
setting) and make it work and show some gloriously fabulous
performance statistics. Then, hopefully, it would receive much more
developer support from management, which would allow us to move
forward with a much more Storm-centric design and implementation.
Thus, the reason we *may* be having the GC overhead and timeout issues
- if indeed that is the problem. So - I'll let you guys know how it
goes, as well as maybe hop onto the IRC channel to talk about it
later. Thanks for your responses!

Ted Dunning

Feb 20, 2012, 2:12:04 PM
to storm...@googlegroups.com
The standard diagnostic path from here is to start looking at your GC logs (that is, turn them on and then look at them).  It used to be relatively common for HBase to have problems with GC causing ZK session expiration.  I don't have any experience with Accumulo.

This may also be an odd corner case for Curator.  I have seen it use a few suspect idioms in the past relative to retry.

It is also possible that ZK was getting frozen out somehow due to sharing the machines with Accumulo.  That seems like a low-likelihood explanation.

I look forward to hearing what your GC logs say.  This system should work quite well.
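
For anyone following along, a hedged example of the kind of JVM flags one might add via worker.childopts in storm.yaml to turn GC logging on for the workers; the heap size and log path are illustrative:

    # storm.yaml on the worker nodes (values are illustrative)
    worker.childopts: "-Xmx768m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/storm/worker-gc.log"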

Joseph Schliffer

Feb 22, 2012, 9:23:53 AM
to storm...@googlegroups.com
I saw the same issue after running about 200 million messages through my topology.  The only part of mine that uses ZK is Storm itself, so I doubt that Accumulo is the problem.  We will try tuning various ZK settings to see if we can get it to go away.  Interestingly enough, the error happened on a bolt (a dead-letter-queue insert) that was not even in use, so I'm thinking it probably is a timeout issue.  I am going to set up a heartbeat spout, have the idle bolt subscribe to it, and see what happens.  Will update this thread with my findings.
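
For anyone curious what such a heartbeat spout might look like, a hedged sketch (the class name, emission interval, and field name are all made up for illustration):

    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;
    import backtype.storm.utils.Utils;
    import java.util.Map;

    public class HeartbeatSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        public void nextTuple() {
            // Sleep, then emit a single unanchored heartbeat tuple, so any bolt
            // subscribed to this stream sees regular traffic even when otherwise idle.
            Utils.sleep(5000);
            collector.emit(new Values(System.currentTimeMillis()));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("heartbeat"));
        }
    }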

Joseph Schliffer

Feb 22, 2012, 9:34:40 AM
to storm...@googlegroups.com
I'm also on 0.7.0-rc

Ted Dunning

Feb 22, 2012, 9:37:34 AM
to storm...@googlegroups.com, storm...@googlegroups.com
If you have ZK logs, can you post them somewhere?

Also, it would help your debugging if you could correlate GC delays with the errors.

Sent from my iPhone

Dwayne Pryce

Feb 22, 2012, 4:13:28 PM
to storm-user
I still haven't turned GC logging on (another "high priority" task
just got thrown at me), but we did mitigate the problem by setting the
ZK session timeout to 120s instead of the default 20s. I haven't seen
this error again.
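
For reference, that setting lives in storm.yaml and is expressed in milliseconds; a hedged sketch of the change being described:

    # storm.yaml (120 seconds instead of the 20-second default)
    storm.zookeeper.session.timeout: 120000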

I should note that the process we were having this problem with is a
pretty intensive ETL process that unfortunately has an inordinate
amount of setup overhead (which was fine in the old system, since the
lifespan of the object was considerably longer there than it is here).
I could mitigate this by moving that setup code into the prepare
method instead of inside execute - I just haven't gotten around to it yet.

However, this does bring up an important point; is it reasonable
behavior for a grossly inefficient bolt to bring down an entire
topology?

Ted Dunning

Feb 22, 2012, 4:29:34 PM
to storm...@googlegroups.com, storm-user
If it grossly violates the promises that it makes, yes. Otherwise how can storm detect and repair failures?

On the other hand, it is reasonable to be able to make weak promises.

Sent from my iPhone

Dwayne Pryce

Feb 22, 2012, 5:28:42 PM
to storm-user
Is "repairing" the same as the Topology coming to a screeching halt,
though? The reason I ask is because after this happens, the extra
2.5m Tuples that I'm expecting to come blowing down the pipeline ...
don't. Ever.

(Also, I'm just posing questions - if it comes off like I'm being a
jerk, that is certainly not my intent! The truth is, I've had more
fun in the last 2 weeks with Storm than I have for an *extremely* long
time!)

Nathan Marz

Feb 22, 2012, 5:36:46 PM
to storm...@googlegroups.com
Hey Dwayne,

So I think I understand the cause of this error now, and have a potential fix implemented. Basically the task heartbeats are ephemeral nodes in Zookeeper, which means they can disappear if the client times out to ZK. When setting the value for the heartbeat, the code currently checks if the node exists, and if so does a "set-data" command. Otherwise, it does a "create-node" command. If the timeout happens after the check but before the set-data or create-node command, this error can occur.
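
To make the race concrete, here is a hedged sketch of the check-then-set pattern described above, written against the Curator API that appears in the stack trace earlier in this thread; it is illustrative only, not Storm's actual heartbeat code, and the class and method names are made up:

    import com.netflix.curator.framework.CuratorFramework;
    import org.apache.zookeeper.CreateMode;

    public class HeartbeatWriteSketch {
        public static void writeHeartbeat(CuratorFramework client, String path, byte[] data) throws Exception {
            if (client.checkExists().forPath(path) != null) {
                // If the ZK session expires right here, the ephemeral node vanishes
                // and the setData below fails with KeeperException.NoNodeException.
                client.setData().forPath(path, data);
            } else {
                client.create().withMode(CreateMode.EPHEMERAL).forPath(path, data);
            }
        }
    }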

My fix changes task heartbeats to be regular nodes, since they don't really need to be ephemeral. One question I have for you though is what you mean by "2.5m Tuples that I'm expecting to come blowing down the pipeline ... don't.  Ever.". Even with the current bug, the process should just crash, restart, and keep on working. Once this issue pops up does it continuously crash the workers, preventing any sort of recovery?

-Nathan

Dwayne Pryce

Feb 22, 2012, 6:11:55 PM
to storm-user
Nathan,

After testing this again to answer your question, it turns out I was
completely incorrect - it DOES continue processing Tuples. So - the
good news is we can chalk that part up to me having a faulty memory.

Thanks for looking into this! I deeply appreciate it :)

- Dwayne

Ted Dunning

Feb 22, 2012, 6:17:34 PM
to storm...@googlegroups.com
On Wed, Feb 22, 2012 at 10:28 PM, Dwayne Pryce <dwayne...@gmail.com> wrote:
> Is "repairing" the same as the Topology coming to a screeching halt, though?

No.  But handling all reasonable failure modes well always comes after getting the core functionality right.  Storm is new.

Nathan Marz

Feb 22, 2012, 6:55:27 PM
to storm...@googlegroups.com
Great news. I'm confident in my fix then (which is now pushed to master).

Joseph Schliffer

Feb 22, 2012, 7:29:02 PM
to storm...@googlegroups.com
Adding the heartbeat spout fixed my issue, thanks.

Ted Dunning

Feb 22, 2012, 10:19:59 PM
to storm...@googlegroups.com, storm...@googlegroups.com
Nathan

What happens if a node goes away?  Who cleans up the heartbeat file?

And why not use an ephemeral node?  That gives the heartbeat semantics you want.

Sent from my iPhone

Nathan Marz

Feb 22, 2012, 10:28:01 PM
to storm...@googlegroups.com
The next instance of the task to start up (whether on the same machine or otherwise) will overwrite that file. An ephemeral node isn't necessary because Nimbus checks how often the heartbeat file is updated to determine timeouts.
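
A hedged sketch of what such a freshness check could look like; this is purely illustrative and not Nimbus's actual implementation (Storm could equally well store a timestamp inside the node's data rather than rely on the node's modification time):

    import com.netflix.curator.framework.CuratorFramework;
    import org.apache.zookeeper.data.Stat;

    public class HeartbeatTimeoutSketch {
        public static boolean timedOut(CuratorFramework client, String path, long timeoutMs) throws Exception {
            Stat stat = client.checkExists().forPath(path);
            // No node at all, or a node not modified recently enough, counts as a timeout.
            return stat == null || System.currentTimeMillis() - stat.getMtime() > timeoutMs;
        }
    }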

Ted Dunning

Feb 22, 2012, 11:56:51 PM
to storm...@googlegroups.com, storm...@googlegroups.com
That is my point. Aren't you reimplementing what ZK already does rather well?

Sent from my iPhone

Nathan Marz

Feb 23, 2012, 8:34:50 PM
to storm...@googlegroups.com
You're right, in retrospect I should have implemented timeouts in terms of ephemeral nodes.

Ted Dunning

Feb 23, 2012, 8:40:43 PM
to storm...@googlegroups.com
Retrospective changes are, however, generally at the bottom of the priority list.

Colin Fergus

Mar 23, 2012, 10:15:39 AM
to storm...@googlegroups.com
Latecomer, but I'm seeing the same problem. I'm using 0.7.0-rc, trying to push millions of events per minute.

The exception I see usually occurs within a minute, sometimes in a spout, sometimes in a bolt: 
java.lang.OutOfMemoryError: GC overhead limit exceeded 
The trace from there varies each time depending on whatever that particular task happened to be doing last.

From what I've gathered, there's a patch in the codebase. 
Has this been released as part of 0.7.0? If not, can someone point me to the code change/issue?

The other option appears to be changing the storm.yaml to say:
storm.zookeeper.session.timeout=1jillion

Is that correct?

Lastly, if I need to alter the JVM properties of a worker (or all workers), is there a preferred way to do this? Such as -Xmx.

The discussions here have been quite helpful for me as a noob in storm land.

Nathan Marz

Mar 26, 2012, 5:27:07 AM
to storm...@googlegroups.com
Can you show me the exact stack trace(s) you're seeing? Based on your message it sounds like you're seeing many different issues. And yes, that fix is in 0.7.0. I'm not sure if it's in 0.7.0-rc.

If you're getting an OOM error, it's almost certainly due to a problem in your application code. Possible issues:

1. You're not acking tuples in one of your bolts
2. Your topology can't consume tuples at the rate the spouts are emitting tuples (fix is to throttle the spout with TOPOLOGY_MAX_SPOUT_PENDING)

To change the JVM properties of workers, override "worker.childopts" in your storm.yaml files on the worker nodes.
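
As a concrete illustration of point 2, a hedged sketch of throttling the spout from topology code; the class name and the pending value are made up for illustration:

    import backtype.storm.Config;

    public class ThrottledTopologyConfig {
        public static Config make() {
            Config conf = new Config();
            // Cap the number of un-acked spout tuples in flight at once
            // (TOPOLOGY_MAX_SPOUT_PENDING); the value here is just illustrative.
            conf.setMaxSpoutPending(10000);
            return conf;
        }
    }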

Colin Fergus

Mar 27, 2012, 9:32:42 AM
to storm...@googlegroups.com
> 1. You're not acking tuples in one of your bolts

I will tentatively rule this out in my case; my spouts (Java IRichSpout implementations) emit tuples with no message id, and my topology is configured with "conf.setNumAckers( 0 )", which should disable all acking. If I've followed the documentation correctly, acking should not be an issue.

> 2. Your topology can't consume tuples at the rate the spouts are emitting tuples (fix is to throttle the spout with TOPOLOGY_MAX_SPOUT_PENDING)

Sounds very likely. Before I beg for more help, I will try upgrading to 0.7.0 and then try throttling with max spout pending. Thanks for your help.

> ... override "worker.childopts" in your storm.yaml files ...

The worker.childopts yaml entry is great for the memory case.
If I wanted to set a custom option (such as -XX:-HeapDumpOnOutOfMemoryError), is there a way to do this? It might be helpful if more out-of-memory errors creep up.

Nathan Marz

Mar 28, 2012, 5:28:26 AM
to storm...@googlegroups.com
On Tue, Mar 27, 2012 at 6:32 AM, Colin Fergus <cfe...@gmail.com> wrote:
> If I wanted to set a custom option (such as -XX:-HeapDumpOnOutOfMemoryError), is there a way to do this?

You can set this in worker.childopts as well. You can also use "topology.worker.childopts" to set additional childopts on a topology-specific basis.
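
For example, a hedged sketch of what the worker-level setting might look like; the heap size, flag choice, and dump path are illustrative:

    # storm.yaml on each worker node (values are illustrative)
    worker.childopts: "-Xmx1024m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp"

The per-topology variant takes the same kind of string under the "topology.worker.childopts" key in the topology's own Config.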

Nathan Marz

Mar 28, 2012, 5:29:26 AM
to storm...@googlegroups.com
One more thing - you still need to ack tuples in your bolts, regardless of whether your spout tuples have message ids. The bolts track state for each tuple which doesn't get thrown out until the bolt tuple is acked.
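
For readers newer to Storm, a minimal sketch of a bolt that acks every tuple it processes; the class and field names are made up for illustration:

    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Tuple;
    import java.util.Map;

    public class AckingBolt extends BaseRichBolt {
        private OutputCollector collector;

        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        public void execute(Tuple tuple) {
            // ... do the actual work here ...
            // Ack every tuple so Storm can release the state it tracks for it.
            collector.ack(tuple);
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // This bolt emits nothing, so there is nothing to declare.
        }
    }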

cmonkey

Apr 6, 2012, 2:50:13 PM
to storm...@googlegroups.com
Long delay, but I have info now.

I upgraded to 0.7.0 (from 0.7.0-rc). Put in a maxSpoutPending of 255 (wild guess). Added an ack for every tuple. Adjusted memory allocations.
None of these made a marked difference (increasing the memory allocation just extended the time it took to see the problem).

However (Nathan was right), there is indeed a problem in my application code. I did a heap dump on the running worker to find out what was going on. I have since removed the maxSpoutPending config.
My spout reads data off of a Java Queue, converts the data to tuples, and emits them.

My memory error is because my Java Queue is filling up faster than the spout can read off of it, which is a problem with my code. This causes OOM exceptions, and the supervisor bounces the spout task.
Is there any way to speed up, multi-thread, or otherwise get the "nextTuple" method to operate faster?
I'll see if I can't profile my code to find any slow spots, but otherwise I may just need to slow the input mechanism to my Queue (not preferred).
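
For what it's worth, a hedged sketch of a spout that drains a bounded queue in small batches from nextTuple; the class name, queue capacity, batch size, and field name are all illustrative:

    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;
    import java.util.Map;
    import java.util.concurrent.LinkedBlockingQueue;

    public class BoundedQueueSpout extends BaseRichSpout {
        // Bounded, so a too-fast producer gets pushed back on instead of
        // growing the queue until the worker runs out of heap.
        private final LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<String>(100000);
        private SpoutOutputCollector collector;

        // Called by whatever feeds this spout (not shown); returns false when full.
        public boolean offer(String msg) {
            return queue.offer(msg);
        }

        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        public void nextTuple() {
            // Drain a small batch per invocation rather than a single item.
            for (int i = 0; i < 100; i++) {
                String msg = queue.poll();
                if (msg == null) {
                    return;
                }
                collector.emit(new Values(msg));
            }
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("msg"));
        }
    }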

Nathan Marz

Apr 9, 2012, 4:05:15 AM
to storm...@googlegroups.com
I don't know where you're getting your data from, but you'll probably want to parallelize it. Have each spout task handle emitting a subset of the stream. KestrelSpout, for example, reads from a cluster of Kestrel servers. Each spout task will read messages from a single Kestrel server.

Also, 255 for maxSpoutPending is pretty low. That's likely to artificially slow your consumption rate. I generally put in a value there between 10K and 20K.
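
A hedged sketch of the partitioning idea described above, using the task information available from TopologyContext so each spout task claims a disjoint subset of the input (the helper name and modulo scheme are made up for illustration):

    import backtype.storm.task.TopologyContext;
    import java.util.ArrayList;
    import java.util.List;

    public class SpoutPartitioning {
        // Returns the subset of inputs (e.g. folders or servers) this particular
        // spout task should consume, so parallel tasks don't overlap.
        public static List<String> partitionForThisTask(TopologyContext context, List<String> inputs) {
            List<Integer> taskIds = context.getComponentTasks(context.getThisComponentId());
            int myIndex = taskIds.indexOf(context.getThisTaskId());
            List<String> mine = new ArrayList<String>();
            for (int i = 0; i < inputs.size(); i++) {
                if (i % taskIds.size() == myIndex) {
                    mine.add(inputs.get(i));
                }
            }
            return mine;
        }
    }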

cmonkey

Apr 9, 2012, 8:35:41 PM
to storm...@googlegroups.com
Thanks for the maxSpoutPending info. I have made it 16k, coupled with parallelized spouts (each reading from a different file-system folder). Also, each spout is a little bit dumber now, so an individual spout produces less, but the total of spouts produces more. Things are working alright now!

I'm going to go crush some CPUs and network cards now. I'll let you know when things get weird. Thanks for the help.

Nathan Marz

Apr 9, 2012, 10:48:47 PM
to storm...@googlegroups.com
Great to hear.