Ted:
Our ZK configuration is almost straight out of the box. The only
thing we tweaked was the maximum number of clients per IP, since we
had approximately 96 threads per physical node beating up on it all at
once. We increased that number to 5000.
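For reference, I believe the zoo.cfg setting we bumped is the per-IP
connection cap (I'm going from memory on the exact property name):

# zoo.cfg - maximum concurrent connections from a single client IP
maxClientCnxns=5000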
Are there any settings optimizations you would recommend for common
Storm use cases? Most of our experience with ZK has been in
configuring it for Accumulo.
I won't be able to tweak anything for a bit; the numbers we were
getting finally created enough excitement that another half dozen
R610s were dropped off at our office today (on top of the half dozen
that arrived on Thursday). Once they're configured, we're going to
expand our ZK and Accumulo clusters to also reside on those nodes.
Nathan:
This was happening consistently. In our test cluster we had 9 nodes
(a few physical, a few VMs - we have since decided the VMs were
pointless, since Storm starts a new JVM for each worker anyway, so
they just add overhead; regardless, at the time this was happening we
had 9 "nodes"). ZK was running on 3 physical servers (not in VMs),
and Accumulo was running on 3 servers as well - the same 3, actually.
My topology configuration (it was the only topology running) was 4
workers per node and 12 threads per worker. Of the 3m tuples we were
trying to push through, this specific error would show up anywhere
from around the 500k mark to around the 2m mark, counted from the
spout.
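In code, that sizing looks roughly like the sketch below - the names
(FeedSpout, IngestBolt, "ingest-poc") are placeholders rather than our
real components, and I'm writing it against the string-id style of the
topology API rather than the exact 0.7.0 calls:

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

public class IngestTopology {
    public static void main(String[] args) throws Exception {
        Config conf = new Config();
        conf.setNumWorkers(36);  // 9 nodes * 4 worker JVMs per node

        TopologyBuilder builder = new TopologyBuilder();
        // FeedSpout / IngestBolt stand in for our real components.
        builder.setSpout("feed-spout", new FeedSpout(), 4);
        builder.setBolt("ingest-bolt", new IngestBolt(), 432)  // ~12 threads per worker
               .shuffleGrouping("feed-spout");

        StormSubmitter.submitTopology("ingest-poc", conf, builder.createTopology());
    }
}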
Since we got the new servers we're laying out the rack differently,
so the whole thing is down; I can't try it again right now and
probably won't get to until we add more nodes (which, of course, adds
more variables). If it keeps happening after I've extended the
session expiration limit like Ted suggested, I'll report back here
with as much information as I can. If it stops happening, I'll report
back with that too (and get to work with a profiler!).
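For anyone else hitting this, the setting I'm planning to raise is (I
believe) the ZK session timeout that Storm hands to its client; the
value below is just an illustrative bump, not a recommendation:

# storm.yaml - ZooKeeper session timeout, in milliseconds
storm.zookeeper.session.timeout: 30000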
------------
Just to elaborate a bit on all of this: we have been running our
ingestion process for dozens/hundreds of data feeds on monolithic
systems for the last few years. We had real-time ingestion
requirements that left M/R out of the running, but given the scope of
the data we were dealing with, we really needed a distributed system
to make our ingestion process viable for the future. Basically, what
we had was good enough for right now - but only barely - and the
time-to-query requirements were only getting tighter on an
architecture that was completely untenable anyway. What I've been
trying to do as a proof of concept is take our existing code, fit it
into spouts and bolts (which it fit into pretty well, though it has
some "features" that turn out to be less than helpful in a Storm
setting), make it work, and show some gloriously fabulous performance
numbers. Then, hopefully, the project would get much more developer
support from management, which would let us move forward with a much
more Storm-centric design and implementation. That history is why we
*may* be having the GC overhead and timeout issues - if indeed that
is the problem.
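If GC really is the culprit, the first thing I'll try is turning on GC
logging for the worker JVMs before breaking out a profiler - assuming
worker.childopts is still the right hook for that, something like:

# storm.yaml on the supervisors - add GC logging to each worker JVM
worker.childopts: "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"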
So I'll let you guys know how it goes, and maybe hop onto the IRC
channel to talk about it later. Thanks for your responses!
On Feb 20, 3:19 am, Nathan Marz <nathan.m...@gmail.com> wrote:
> How often does this happen, and how big is your ZK cluster? Also,
> what version of ZK are you using? The NoNode exception is actually
> quite strange; I've never seen this exception on my clusters.
>
> On Fri, Feb 17, 2012 at 2:29 PM, Dwayne Pryce <dwayne.pr...@gmail.com> wrote:
>
> > Hi Nathan!
>
> > I'm actually running the 0.7.0 Snapshot, and I'm getting this as well.
> > Full stacktrace is:
>
> > org.apache.zookeeper.KeeperException$NoNodeException:
> > KeeperErrorCode = NoNode for /taskbeats/storm-gi-2-1329514658/276
> >   at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
> >   at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> >   at org.apache.zookeeper.ZooKeeper.setData(ZooKeeper.java:1038)
> >   at com.netflix.curator.framework.imps.SetDataBuilderImpl$2.call(SetDataBuilderImpl.java:139)
> >   at com.netflix.curator.framework.imps.SetDataBuilderImpl$2.call(SetDataBuilderImpl.java:135)
> >   at com.netflix.curator.RetryLoop.callWithRetry(RetryLoop.java:85)
> >   at com.netflix.curator.framework.imps.SetDataBuilderImpl.pathInForeground(SetDataBuilderImpl.java:131)
> >   at com.netflix.curator.framework.imps.SetDataBuilderImpl.forPath(SetDataBuilderImpl.java:123)
> >   at com.netflix.curator.framework.imps.SetDataBuilderImpl.forPath(SetDataBuilderImpl.java:33)
> >   at backtype.storm.zookeeper$set_data.invoke(zookeeper.clj:114)
> >   at backtype.storm.cluster$mk_distributed_cluster_state$reify__1767.set_ephemeral_node(cluster.clj:52)
> >   at backtype.storm.cluster$mk_storm_cluster_state$reify__2239.task_heartbeat_BANG_(cluster.clj:271)
> >   at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
> >   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)