How to Monitor a Storm Cluster

Ji Zhang

Sep 2, 2012, 9:06:33 AM
to storm...@googlegroups.com
Hi,

The wiki says the best way to monitor a Storm cluster is to check the Storm UI. But it only shows the current state. What if I want to see how the state changes over time? Specifically, when a serious problem occurs and then disappears, I need the state for that period.

I'm also wondering what indicates the general health of a Storm cluster and of the running topologies. I can't find any explanation of the various statistics shown in the Storm UI. Could anyone walk me through them?

Thanks in advance.

Ji Zhang

Sep 2, 2012, 9:19:42 AM
to storm...@googlegroups.com
To further illustrate the situation: I have a spout receiving messages from RabbitMQ, but occasionally RabbitMQ shows unconsumed messages piling up (exceeding our warning threshold). I've checked the logs, but they show no clues.

1. Does the 'latency' (0.045 ms) in the Storm UI indicate that the topology is healthy enough?
2. Will Storm sometimes stop calling nextTuple()?

Nathan Marz

Sep 4, 2012, 2:33:58 AM
to storm...@googlegroups.com
Storm only stops calling nextTuple when the number of unacked messages on a spout exceeds TOPOLOGY_MAX_SPOUT_PENDING (which defaults to infinity). Even then, unacked messages will eventually time out and the spout will continue emitting messages.

It sounds like you most likely need more parallelism to handle your throughput.
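
To make the gating behaviour concrete, here is a toy model in Python. It is only a sketch of the rule described above, not Storm's actual implementation: the class name `SpoutModel`, its methods, and the 30-second timeout are assumptions made up for illustration, and `max_pending=None` stands in for "no limit".

```python
from collections import OrderedDict


class SpoutModel:
    """Toy model of the max-pending gate; hypothetical, not Storm code."""

    def __init__(self, max_pending=None, timeout_secs=30):
        self.max_pending = max_pending    # None models "no limit"
        self.timeout_secs = timeout_secs  # assumed value, for illustration
        self.pending = OrderedDict()      # msg_id -> time it was emitted
        self.next_id = 0

    def expire(self, now):
        """Drop pending tuples that are at least timeout_secs old."""
        expired = [m for m, t in self.pending.items()
                   if now - t >= self.timeout_secs]
        for m in expired:
            del self.pending[m]
        return expired

    def can_emit(self):
        """True when the pending count is below the cap (or no cap is set)."""
        return self.max_pending is None or len(self.pending) < self.max_pending

    def tick(self, now):
        """One loop iteration: expire old tuples, then emit if allowed."""
        self.expire(now)
        if not self.can_emit():
            return None                   # models nextTuple not being called
        msg_id = self.next_id
        self.next_id += 1
        self.pending[msg_id] = now
        return msg_id

    def ack(self, msg_id):
        """Mark a tuple as fully processed, freeing a pending slot."""
        self.pending.pop(msg_id, None)


if __name__ == "__main__":
    spout = SpoutModel(max_pending=2, timeout_secs=30)
    # The third call is blocked because two tuples are still unacked.
    print(spout.tick(0), spout.tick(1), spout.tick(2))
    # By t=31 both pending tuples have timed out, so emitting resumes.
    print(spout.tick(31))
```

In this model the spout stalls only while the cap is reached, and either an ack or a timeout frees it again, which is why a stalled spout usually points at slow downstream processing.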

As for monitoring, all the stats are available via the Nimbus Thrift interface. So you can use that to export state to a richer monitoring system (perhaps one that you have in-house).
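
One way to get a time sequence of states out of this is to poll periodically and keep timestamped snapshots. The sketch below shows only that bookkeeping, using the Python standard library. `StatsHistory` and `fake_fetch` are hypothetical names; `fetch` is a placeholder for whatever function you write around a Nimbus Thrift client, and the dictionary it returns here is made-up sample data, not real Storm output.

```python
import time
from collections import deque


class StatsHistory:
    """Keeps a bounded history of (timestamp, snapshot) pairs."""

    def __init__(self, fetch, max_samples=1440, clock=time.time):
        self.fetch = fetch                  # callable returning current stats
        self.clock = clock                  # injectable for testing
        self.samples = deque(maxlen=max_samples)  # oldest samples fall off

    def poll_once(self):
        """Take one snapshot and append it to the history."""
        snapshot = (self.clock(), self.fetch())
        self.samples.append(snapshot)
        return snapshot

    def between(self, start, end):
        """Return the snapshots taken in the closed interval [start, end]."""
        return [(ts, s) for ts, s in self.samples if start <= ts <= end]


def fake_fetch():
    # Hypothetical stand-in: a real version would query Nimbus over Thrift
    # and convert the result into a plain dict.
    return {"topologies": 1, "supervisors": 3}


if __name__ == "__main__":
    history = StatsHistory(fake_fetch, max_samples=100)
    for _ in range(3):
        history.poll_once()
        time.sleep(0.1)
    # Look back over the last minute of recorded state.
    now = time.time()
    print(len(history.between(now - 60, now)))
```

In practice you would probably ship each snapshot to an existing monitoring system rather than keep it in memory; the bounded deque is just the simplest thing that answers "what did the cluster look like during the incident?".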

That said, the most important thing to monitor is your spout source: a backlog building up there is the ultimate indicator that something is wrong.
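
As a rough sketch of what a check on the spout source could look like, the function below flags a backlog only when the queue depth stays above a threshold for several samples in a row, so a brief spike does not raise an alert. The function name, the threshold, and the sample values are all assumptions for illustration; how you obtain the depth readings depends on your queue.

```python
def sustained_backlog(depths, threshold, min_consecutive=3):
    """Return True if depth exceeded threshold min_consecutive times in a row.

    depths is an iterable of queue-depth readings in time order.
    """
    run = 0
    for depth in depths:
        # Extend the current run of breaches, or reset it on a healthy sample.
        run = run + 1 if depth > threshold else 0
        if run >= min_consecutive:
            return True
    return False


if __name__ == "__main__":
    # Made-up readings: a short spike followed by a sustained backlog.
    readings = [10, 250, 20, 300, 400, 500]
    print(sustained_backlog(readings, threshold=100))
```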
--
Twitter: @nathanmarz
http://nathanmarz.com

Ji Zhang

Sep 6, 2012, 3:19:28 AM
to storm...@googlegroups.com
Hi,

Thanks for your accurate reply.

It seems that the Nimbus Thrift interface is exactly what I need. I have no experience with Thrift, so I need to look into that first. Also, I found this article quite useful:

As for my problem, it turned out that RabbitMQ had hit a bottleneck.