Re: [storm-user] High CPU usage by worker processes


Nathan Marz

Sep 5, 2012, 5:48:41 PM
to storm...@googlegroups.com
Storm calls the spout's nextTuple method in a busy loop, so that's most likely what's using the CPU. You can fix this by introducing a short sleep in your spout when it has no tuples to emit. We also have the following issue open to add a hook for what to do when a spout doesn't emit anything (e.g. sleep for 1ms): https://github.com/nathanmarz/storm/issues/254
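[Editor's note: a minimal sketch of the suggested fix, kept independent of the Storm API so it compiles standalone; the queue, tuple type, and class name here are illustrative, not part of Storm.]

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch of the pattern: sleep briefly inside nextTuple() when there is
// nothing to emit, so the busy loop Storm runs around nextTuple() doesn't
// spin at 100% CPU.
class SleepingSpoutSketch {
    private final Queue<String> pending = new ConcurrentLinkedQueue<String>();

    // In a real spout this would be IRichSpout.nextTuple() and would call
    // collector.emit(...) instead of returning the value.
    public String nextTuple() {
        String tuple = pending.poll();
        if (tuple == null) {
            try {
                Thread.sleep(1); // yield the CPU when there is no work
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return tuple;
    }

    public void offer(String t) { pending.add(t); }
}
```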

On Wed, Sep 5, 2012 at 2:43 PM, Andrew Olson <noslo...@gmail.com> wrote:
After upgrading to Storm 0.8.0, we noticed significant degradation in the performance of topologies in our development environment. Further investigation revealed that some of the machines in the cluster were nearly 100% maxed out on CPU. This issue persisted after upgrading to 0.8.0-wip9 to see if that might help.

We've spent several days trying to isolate the cause. It appears that some worker processes can get into a state where they're consuming a disproportionately high amount of processor resources. In particular, it looks like the high CPU consumption is happening in the async_loop for spout threads.  This only seems to affect a few threads for the entire topology; most components seem to be performing very efficiently.

Interestingly, we noticed similar behavior in another cluster still running 0.7.0. It had previously gone undetected due to the much greater CPU capacity in that cluster.

I've collected quite a bit of profiling data with jvisualvm for worker processes that had a single thread exhibiting this behavior, and attached some screenshots to this post for both 0.7.0 and 0.8.0-wip9. For 0.8.0-wip9 I also collected a method trace using jswat. Here's a quick summary of what we're seeing -- note that the thread in question does not seem to be doing much productive work, in terms of nextTuple() and ack() calls.

for 0.8.0-wip9,

Total time: 411s
DisruptorQueue.consumeBatch()  -  123s
RotatingMap.size()  -  102s
async_loop$fn__465.invoke()  -  184s
nextTuple() + ack()  -  2s

for 0.7.0,

Total time: 219s
async_loop$fn__487.invoke()  -  92s
daemon.task$fn__3462$fn__3511.invoke()  -  53s
RT.seq()  -  20s
Numbers.lt()  -  9s
Numbers.multiply()  -  42s
nextTuple() + ack()  - 1s

Any thoughts on what might be happening here, or what our next troubleshooting steps should be?  Please let me know if I can provide any additional information.

thanks,
Andrew



--
Twitter: @nathanmarz
http://nathanmarz.com

Andrew Olson

Sep 5, 2012, 6:04:16 PM
to storm...@googlegroups.com
Thanks for the quick reply. We do already have this sleep in place, actually. Here's example jstack output for what we see for most of the workers:

"Thread-61" prio=10 tid=0x00007f67c87cd800 nid=0x51c4 waiting on condition [0x00007f67825e4000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at backtype.storm.utils.Time.sleepUntil(Time.java:52)
    at backtype.storm.utils.Time.sleep(Time.java:57)
    at backtype.storm.utils.Utils.sleep(Utils.java:79)
    at com.cerner.kepler.storm.AbstractEntityStoreSpout.nextTuple(AbstractEntityStoreSpout.java:771)
    - locked <0x0000000773bf99f8> (a com.cerner.flux.millennium.storm.MillenniumSpout)
    at backtype.storm.daemon.executor$fn__3984$fn__4025$fn__4026.invoke(executor.clj:433)
    at backtype.storm.util$async_loop$fn__465.invoke(util.clj:378)
    at clojure.lang.AFn.run(AFn.java:24)
    at java.lang.Thread.run(Thread.java:662)

It's only a small handful of worker threads that seem to start spinning out of control, still doing some productive work but not very much.

Nathan Marz

Sep 5, 2012, 6:06:25 PM
to storm...@googlegroups.com
In that case, you'll need to send me some sample code that reproduces the issue.

nathanmarz

Sep 6, 2012, 2:25:17 AM
to storm-user
Actually, what's probably going on is that the spout is hitting the max spout pending limit and sitting in a busy loop checking whether it's back under the limit (and never calling nextTuple). I just committed a new feature called "spout wait strategies" that lets you plug in code to handle the case where a spout has nothing to do in its execution loop (either because it hit the max spout pending limit or because nextTuple emitted nothing). The default will be for the spout to sleep for one millisecond. This will be in 0.8.1, which will be released very soon.
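[Editor's note: a hedged sketch of what such a pluggable wait strategy can look like. The interface is declared locally so the example compiles standalone; the method names are modeled on, but not guaranteed to match, the hook that shipped in 0.8.1, and the config key is invented for illustration.]

```java
import java.util.Map;

// Sketch of a pluggable spout wait strategy. In Storm 0.8.1 the real hook
// lives in the backtype.storm.spout package; this local interface just
// mirrors the idea so the example is self-contained.
interface WaitStrategySketch {
    void prepare(Map conf);
    // Called once per iteration in which the spout emitted nothing, either
    // because nextTuple() produced no tuple or because the spout is at its
    // max spout pending limit. 'streak' counts consecutive idle iterations.
    void emptyEmit(long streak);
}

class SleepWaitStrategySketch implements WaitStrategySketch {
    private long sleepMs = 1; // the default described above: sleep 1 ms

    public void prepare(Map conf) {
        Object ms = conf.get("sleep.ms"); // illustrative config key
        if (ms != null) sleepMs = ((Number) ms).longValue();
    }

    public void emptyEmit(long streak) {
        try {
            Thread.sleep(sleepMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```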


Andrew Olson

Sep 6, 2012, 10:15:36 AM
to storm...@googlegroups.com
Yes, that would make perfect sense. Our topology has many different spouts, so I suspect a chain reaction was triggered when one of them hit its max spout pending limit and started looping, reducing the CPU available to the rest of the components on that machine. Other, unrelated bolts would then slow down, eventually causing more and more spouts to reach the pending limit and start looping themselves, making it increasingly difficult to recover. We'll give 0.8.1 a try as soon as it's available.

thanks,
Andrew

Andrew Olson

Sep 8, 2012, 10:10:38 AM
to storm...@googlegroups.com
Performance is looking much better with 0.8.1.