Topology is getting stuck


KIRAN

Oct 29, 2012, 10:52:42 AM
to storm...@googlegroups.com
Hi all,

    I've developed a simple topology with a spout at a parallelism of 5. Each spout instance pushes 800 tuples into the system, one by one, for 4,000 in total. This works perfectly and I can see 4,000 acks. But this is not the scale I'm looking for.

I modified the spout so that each instance pushes 8,000 tuples, i.e. 40k tuples pushed by 5 threads simultaneously. When I run this topology, it gets stuck after each spout instance has emitted approximately 5,000 tuples. I tried playing with the VM args and with max spout pending, and I also tried submitting the topology to a cluster, but I'm still facing the problem. I would really appreciate it if someone could advise me on tuning this.

Thanks,
Kiran

Douglas Moore

Oct 29, 2012, 5:22:07 PM
to storm...@googlegroups.com
What latency values is the UI showing for your components?

- Douglas

Tom Brown

Oct 29, 2012, 6:05:28 PM
to storm...@googlegroups.com
It can be very difficult to diagnose a "stuck" cluster. Aside from adding additional logging (and checking the logs), there's not much you can do.

Speaking of which... have you added additional logging and checked the logs to see if you can conclusively find a crash or bottleneck?

--Tom

KIRAN

Oct 29, 2012, 7:00:36 PM
to storm...@googlegroups.com
Thanks Douglas and Tom for responding. Please find the UI screenshot attached.

I have just two bolts, of which only one (named ApplicationLevel) does the actual work of aggregation (counting). The other (named split) just splits the tuples emitted by the spout.

I have logging enabled and I've also set debug to true. Moreover, I've added System.out's wherever possible to help me debug. But all of a sudden the topology just gets stuck, and I don't see any error messages in the nimbus/supervisor/worker logs.

I would like to try allocating more resources. Any suggestions on that?

I'm also observing another weird behavior: tuples belonging to the same fieldsGrouping go to 2 different bolt instances (2 different threads). Based on the contract, this shouldn't happen, right? I'm running just a single worker.
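For reference, a fields grouping is meant to be deterministic: conceptually, the grouping-field values are hashed and the result is taken modulo the number of target tasks, so identical values always land on the same task. A rough sketch of that idea (hypothetical names, not Storm's actual code):

```java
import java.util.Arrays;

public class FieldsGroupingSketch {
    // Deterministic routing: the same field values always hash to the
    // same task index, which is the contract a fields grouping promises.
    static int targetTask(int numTasks, Object... fields) {
        return Math.abs(Arrays.hashCode(fields)) % numTasks;
    }

    public static void main(String[] args) {
        int t1 = targetTask(5, "kpi1", "2012-11-12", "app");
        int t2 = targetTask(5, "kpi1", "2012-11-12", "app");
        System.out.println(t1 == t2); // identical grouping fields -> identical task
    }
}
```

If identically grouped tuples really do land on two bolt instances, that usually means the grouping is being declared on different fields than intended.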

Thanks,
Kiran
screenshot.png

Tom Brown

Oct 29, 2012, 7:07:40 PM
to storm...@googlegroups.com
Can you post the details of the cluster you're trying this on? And perhaps the code that sets up the topology (so we can see the streaming and grouping concretely)?

--Tom

KIRAN

Oct 29, 2012, 7:26:28 PM
to storm...@googlegroups.com
It's a 2-machine cluster:

1) RHEL - runs ZooKeeper and Nimbus
2) Mac mini - runs the supervisors/workers. As per storm.yaml, 10 workers are possible, but I'm running this topology on just a single worker.

Source code attached.

Thanks,
Kiran
AckTopology.java
AckSpout.java

anahap

Oct 30, 2012, 4:37:51 AM
to storm...@googlegroups.com
Hi Kiran N.
You might be running into problems because of the System.out calls.
If they are not redirected by Storm (I think they aren't), the stdout buffer will fill up and effectively block the threads they are called from.
Please try removing the System.outs and use a logging framework like log4j instead.
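The idea behind that advice is simply to route bolt output through a logger rather than stdout. A minimal stand-in sketch using java.util.logging (Storm itself ships with log4j, which is what you'd normally use in a bolt; the class and method names here are hypothetical):

```java
import java.util.logging.Logger;

public class BoltLoggingSketch {
    // A logger (unlike a possibly-unconsumed stdout pipe) is managed by the
    // logging framework and won't wedge the calling thread on a full buffer.
    private static final Logger LOG = Logger.getLogger(BoltLoggingSketch.class.getName());

    static String format(String kpi, long count) {
        return "processed " + kpi + " count=" + count;
    }

    public static void main(String[] args) {
        String msg = format("kpi1", 42);
        LOG.info(msg);            // goes through the logging framework
        System.out.println(msg);  // printed here only so this demo shows the message
    }
}
```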

Nathan Marz

Oct 30, 2012, 3:20:08 PM
to storm...@googlegroups.com
Hi Kiran,

Thanks for reporting this and providing a reproducible test case. The issue was a bug in Storm where a topology could get deadlocked when all internal buffers filled up. Topologies are cyclic: spouts talk to bolts, bolts talk to ackers, and ackers talk back to spouts. This was actually a variant of the dining philosophers problem.

I just committed a bug fix that will be available in the next release of Storm (0.8.2). The fix prevents spouts from getting blocked on emitting to the outgoing queue and lets them process incoming acks/fails. Storm will use an unbounded overflow buffer on the spout if it would have been blocked on an emit, and will then stop calling nextTuple until it's able to transfer the overflow buffer onto the outgoing queue. This limits the size of the overflow buffer to however many tuples are emitted in one call of nextTuple.

Note that this issue was exceedingly unlikely to happen if spouts emit only one tuple in nextTuple (as most spouts do). 
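The mechanism Nathan describes can be sketched roughly as follows, using a plain bounded queue and an unbounded overflow list (hypothetical names; not Storm's actual implementation):

```java
import java.util.ArrayDeque;
import java.util.concurrent.ArrayBlockingQueue;

public class SpoutOverflowSketch {
    // Bounded outgoing queue (capacity 2 here for the demo) plus an
    // unbounded overflow buffer so emit never blocks the spout thread.
    final ArrayBlockingQueue<String> outgoing = new ArrayBlockingQueue<>(2);
    final ArrayDeque<String> overflow = new ArrayDeque<>();

    void emit(String tuple) {
        // preserve order: once anything is in overflow, keep appending there
        if (!overflow.isEmpty() || !outgoing.offer(tuple)) {
            overflow.add(tuple);
        }
    }

    boolean mayCallNextTuple() {
        // drain overflow onto the outgoing queue first; only poll the
        // spout for new tuples once the overflow buffer is empty again
        while (!overflow.isEmpty() && outgoing.offer(overflow.peek())) {
            overflow.poll();
        }
        return overflow.isEmpty();
    }

    public static void main(String[] args) {
        SpoutOverflowSketch s = new SpoutOverflowSketch();
        s.emit("t1"); s.emit("t2"); s.emit("t3");  // queue holds 2, t3 overflows
        System.out.println(s.mayCallNextTuple());  // overflow not yet drained
        s.outgoing.poll();                         // downstream consumes one tuple
        System.out.println(s.mayCallNextTuple());  // t3 moved onto the queue
    }
}
```

The spout thread stays responsive to acks/fails because it never blocks on a full queue, and the overflow can only grow by whatever one nextTuple call emits.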

For now, the fix is available in 0.8.2-wip16.

-Nathan
--
Twitter: @nathanmarz
http://nathanmarz.com

Kiran N

Nov 13, 2012, 12:11:06 PM
to storm...@googlegroups.com

Sure thing, Nathan. Thanks a lot for clarifying my query. I have one more concern. I'm using the same application except for one small change, as described below:

My spout iterates over a string array containing the following four elements, emitting 400 tuples in a round-robin fashion:
    "app!Instance!5!kpi1!"+now+"!0.62357!0.03700!2.87700!10!0!6978.89460!0.03700!2.87700!275.00000!508.00000!1000.00000!1000.00000!1000.00000!1000.00000!Instance!app!d1!d1p1!d1p1h1!d1p1h1i1!"
    "app!Instance!5!kpi2!"+now+"!0.62357!0.03700!2.87700!10!0!6978.89460!0.03700!2.87700!275.00000!508.00000!1000.00000!1000.00000!1000.00000!1000.00000!Instance!app!d1!d1p1!d1p1h1!d1p1h1i1!"
    "app!Instance!5!kpi3!"+now+"!0.62357!0.03700!2.87700!10!0!6978.89460!0.03700!2.87700!275.00000!508.00000!1000.00000!1000.00000!1000.00000!1000.00000!Instance!app!d1!d1p1!d1p1h1!d1p1h1i1!"
    "app!Instance!5!kpi4!"+now+"!0.62357!0.03700!2.87700!10!0!6978.89460!0.03700!2.87700!275.00000!508.00000!1000.00000!1000.00000!1000.00000!1000.00000!Instance!app!d2!d2p1!d2p1h1!d2p1h1i10!"
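The round-robin iteration described above can be sketched like this (class and method names are hypothetical stand-ins for the real spout code):

```java
public class RoundRobinSketch {
    // Cycle through a fixed message array; the real messages above
    // differ only in the kpi1..kpi4 field and the timestamp.
    static final String[] MESSAGES = {"kpi1", "kpi2", "kpi3", "kpi4"};
    static int index = 0;

    static String nextMessage() {
        String msg = MESSAGES[index];
        index = (index + 1) % MESSAGES.length; // wrap around the array
        return msg;
    }

    public static void main(String[] args) {
        // 400 emits would cycle kpi1..kpi4 a hundred times each
        for (int i = 0; i < 6; i++) {
            System.out.print(nextMessage() + " ");
        }
    }
}
```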

I have a cluster set up with 5 machines, with fifty workers, and the parallelism I've specified for the bolt is 5. I'm doing fieldsGrouping on three parameters: kpi, timeStamp, and Application (for testing purposes I'm varying only the kpi value). I'm logging the thread IDs in the bolts (using Thread.currentThread().getId()), and I'm observing a weird behavior: the thread ID remains the same across all five worker machines. Is this expected, or is there an issue with our setup? I can see the thread IDs being different when I run locally. I would really appreciate it if you could shed some light on this.

Pasting the logs from just two machines..

Machine 1:

worker-6700.log:2012-11-12 17:00:50 STDIO [INFO] ~~~~~~~~~~~~~~~~~ Thread Id: 23 app&&&&&&&&&&&&&&&&&&&&&&&&kpi2---2012-11-12 17:00:50---340---0.6235700000000002---0.037---2.877
worker-6700.log:2012-11-12 17:00:50 STDIO [INFO] ~~~~~~~~~~~~~~~~~ Thread Id: 23 app&&&&&&&&&&&&&&&&&&&&&&&&kpi2---2012-11-12 17:00:50---350---0.6235700000000002---0.037---2.877
worker-6700.log:2012-11-12 17:00:50 STDIO [INFO] ~~~~~~~~~~~~~~~~~ Thread Id: 23 app&&&&&&&&&&&&&&&&&&&&&&&&kpi2---2012-11-12 17:00:50---360---0.6235700000000002---0.037---2.877
worker-6700.log:2012-11-12 17:00:50 STDIO [INFO] ~~~~~~~~~~~~~~~~~ Thread Id: 23 app&&&&&&&&&&&&&&&&&&&&&&&&kpi2---2012-11-12 17:00:50---370---0.6235700000000002---0.037---2.877
worker-6700.log:2012-11-12 17:00:50 STDIO [INFO] ~~~~~~~~~~~~~~~~~ Thread Id: 23 app&&&&&&&&&&&&&&&&&&&&&&&&kpi2---2012-11-12 17:00:50---380---0.6235700000000002---0.037---2.877
worker-6700.log:2012-11-12 17:00:50 STDIO [INFO] ~~~~~~~~~~~~~~~~~ Thread Id: 23 app&&&&&&&&&&&&&&&&&&&&&&&&kpi2---2012-11-12 17:00:50---390---0.6235700000000002---0.037---2.877

Machine 2:

worker-6700.log:2012-11-12 17:00:50 STDIO [INFO] ~~~~~~~~~~~~~~~~~ Thread Id: 23 app&&&&&&&&&&&&&&&&&&&&&&&&kpi3---2012-11-12 17:00:50---290---0.6235700000000001---0.037---2.877
worker-6700.log:2012-11-12 17:00:50 STDIO [INFO] ~~~~~~~~~~~~~~~~~ Thread Id: 23 app&&&&&&&&&&&&&&&&&&&&&&&&kpi3---2012-11-12 17:00:50---300---0.6235700000000001---0.037---2.877
worker-6700.log:2012-11-12 17:00:50 STDIO [INFO] ~~~~~~~~~~~~~~~~~ Thread Id: 23 app&&&&&&&&&&&&&&&&&&&&&&&&kpi3---2012-11-12 17:00:50---310---0.6235700000000001---0.037---2.877
worker-6700.log:2012-11-12 17:00:50 STDIO [INFO] ~~~~~~~~~~~~~~~~~ Thread Id: 23 app&&&&&&&&&&&&&&&&&&&&&&&&kpi3---2012-11-12 17:00:50---320---0.6235700000000002---0.037---2.877
worker-6700.log:2012-11-12 17:00:50 STDIO [INFO] ~~~~~~~~~~~~~~~~~ Thread Id: 23 app&&&&&&&&&&&&&&&&&&&&&&&&kpi3---2012-11-12 17:00:50---330---0.6235700000000002---0.037---2.877
worker-6700.log:2012-11-12 17:00:50 STDIO [INFO] ~~~~~~~~~~~~~~~~~ Thread Id: 23 app&&&&&&&&&&&&&&&&&&&&&&&&kpi3---2012-11-12 17:00:50---340---0.6235700000000002---0.037---2.877

Thanks,
Kiran

Nathan Marz

Nov 13, 2012, 2:37:39 PM
to storm-user
I don't know why that would be, but it doesn't seem like any sort of problem to me.
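A plausible reason the IDs match: Thread.currentThread().getId() returns an ID that is unique only within a single JVM, so separately started worker JVMs that create their threads in the same order will typically report the same small numbers. A tiny illustration (hypothetical class name):

```java
public class ThreadIdSketch {
    // Thread IDs are a per-JVM counter, so "Thread Id: 23" on two
    // different machines does not mean the same thread, just that both
    // JVMs happened to create their 23rd thread for this bolt executor.
    public static void main(String[] args) throws Exception {
        Thread t = new Thread(() -> {});
        t.start();
        t.join();
        // Threads created later in this JVM get later IDs than main
        System.out.println(t.getId() > Thread.currentThread().getId());
    }
}
```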