Actually, what's probably going on is that the spout is hitting the max
spout pending limit and is sitting in a busy loop checking whether
it's back under the limit (and never calling nextTuple). I just committed a
new feature called "spout wait strategies" that lets you plug in code
to handle the case where a spout has nothing to do in its execution
loop (either because it hit the max spout pending limit or because
nextTuple emitted nothing). The default will be for the spout to sleep
for one millisecond. This will be in 0.8.1, which will be released
very soon.
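
[Editor's note: a minimal sketch of what a pluggable "spout wait strategy" could look like, based only on the description above. The interface, method names, and the config key are illustrative stand-ins, not necessarily Storm 0.8.1's exact API.]

```java
import java.util.Map;

// Illustrative sketch of a pluggable spout wait strategy: the executor
// loop calls emptyEmit() whenever the spout had nothing to do, instead
// of spinning on nextTuple(). Names here are assumptions, not Storm's API.
interface SpoutWaitStrategy {
    void prepare(Map<String, Object> conf);

    // Called when nextTuple emitted nothing or max spout pending was hit;
    // streak counts consecutive idle passes through the loop.
    void emptyEmit(long streak) throws InterruptedException;
}

// The default behavior described above: sleep for one millisecond.
class SleepSpoutWaitStrategy implements SpoutWaitStrategy {
    private long sleepMillis = 1;

    @Override
    public void prepare(Map<String, Object> conf) {
        // Hypothetical config key for overriding the sleep duration.
        Object ms = conf.get("topology.sleep.spout.wait.strategy.time.ms");
        if (ms instanceof Number) {
            sleepMillis = ((Number) ms).longValue();
        }
    }

    @Override
    public void emptyEmit(long streak) throws InterruptedException {
        Thread.sleep(sleepMillis); // yield the CPU instead of busy-looping
    }
}
```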
On Sep 5, 3:06 pm, Nathan Marz <nathan.m...@gmail.com> wrote:
> In that case, you'll need to send me some sample code that reproduces the
> issue.
>
> > On Wed, Sep 5, 2012 at 3:04 PM, Andrew Olson <noslower...@gmail.com> wrote:
> > Thanks for the quick reply. We do already have this sleep in place,
> > actually. Here's example jstack output for what we see for most of the
> > workers:
>
> > "Thread-61" prio=10 tid=0x00007f67c87cd800 nid=0x51c4 waiting on condition
> > [0x00007f67825e4000]
> > java.lang.Thread.State: TIMED_WAITING (sleeping)
> > at java.lang.Thread.sleep(Native Method)
> > at backtype.storm.utils.Time.sleepUntil(Time.java:52)
> > at backtype.storm.utils.Time.sleep(Time.java:57)
> > at backtype.storm.utils.Utils.sleep(Utils.java:79)
> > at com.cerner.kepler.storm.AbstractEntityStoreSpout.nextTuple(AbstractEntityStoreSpout.java:771)
> > - locked <0x0000000773bf99f8> (a com.cerner.flux.millennium.storm.MillenniumSpout)
> > at backtype.storm.daemon.executor$fn__3984$fn__4025$fn__4026.invoke(executor.clj:433)
> > at backtype.storm.util$async_loop$fn__465.invoke(util.clj:378)
> > at clojure.lang.AFn.run(AFn.java:24)
> > at java.lang.Thread.run(Thread.java:662)
>
> > It's only a small handful of worker threads that seem to start spinning
> > out of control, still doing some productive work but not very much.
>
> > On Wednesday, September 5, 2012 4:48:43 PM UTC-5, nathanmarz wrote:
>
> >> Storm calls the spout nextTuple in a busy loop, so most likely that's
> >> what's using the CPU. You can fix this by introducing a short sleep in your
> >> spout when it has no tuples to emit. We also have the following issue open
> >> to add a hook for what to do when a spout doesn't emit anything (e.g. sleep
> >> for 1ms):
> >> https://github.com/nathanmarz/storm/issues/254
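
[Editor's note: the suggested fix — sleeping briefly in nextTuple when there's nothing to emit — can be sketched as below. The queue and tuple types are stand-ins, not Storm's spout API.]

```java
import java.util.Queue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative spout-style nextTuple() that sleeps briefly when idle,
// so Storm's busy loop does not pin a CPU core. LinkedBlockingQueue
// stands in for whatever source the spout actually reads from.
class SleepySpout {
    private final Queue<String> pending = new LinkedBlockingQueue<>();

    void offer(String tuple) {
        pending.add(tuple);
    }

    // Returns the emitted tuple, or null after a short sleep if idle.
    String nextTuple() throws InterruptedException {
        String tuple = pending.poll();
        if (tuple == null) {
            Thread.sleep(1); // yield the CPU instead of spinning
            return null;
        }
        return tuple;
    }
}
```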
>
> >> On Wed, Sep 5, 2012 at 2:43 PM, Andrew Olson <noslo...@gmail.com> wrote:
>
> >>> After upgrading to Storm 0.8.0, we noticed significant degradation in
> >>> the performance of topologies in our development environment. Further
> >>> investigation revealed that some of the machines in the cluster were nearly
> >>> 100% maxed out on CPU. This issue persisted after upgrading to 0.8.0-wip9
> >>> to see if that might help.
>
> >>> We've spent several days trying to isolate the cause. It appears that
> >>> some worker processes can get into a state where they're consuming a
> >>> disproportionately high amount of processor resources. In particular, it
> >>> looks like the high CPU consumption is happening in the async_loop for
> >>> spout threads. This only seems to affect a few threads for the entire
> >>> topology; most components seem to be performing very efficiently.
>
> >>> Interestingly, we noticed similar behavior in another cluster still
> >>> running 0.7.0. This had previously gone undetected due to much more CPU
> >>> capacity present in that cluster.
>
> >>> I've collected quite a bit of profiling data with jvisualvm for worker
> >>> processes that had a single thread exhibiting this behavior, and attached
> >>> some screenshots to this post for both 0.7.0 and 0.8.0-wip9. For 0.8.0-wip9
> >>> I also collected a method trace using jswat. Here's a quick summary of what
> >>> we're seeing -- note that the thread in question does not seem to be doing
> >>> much productive work, in terms of nextTuple() and ack() calls.
>
> >>> for 0.8.0-wip9,
>
> >>> *Total time: 411s*
> >>> DisruptorQueue.consumeBatch() - 123s
> >>> RotatingMap.size() - 102s
> >>> async_loop$fn__465.invoke() - 184s
> >>> nextTuple() + ack() - 2s
>
> >>> for 0.7.0,
>
> >>> *Total time: 219s*
> >>> async_loop$fn__487.invoke() - 92s
> >> daemon.task$fn__3462$fn__3511.invoke() - 53s