cluster deployment / nimbus / nimbi?

Mike Stanley

Dec 13, 2011, 2:05:34 PM
to storm...@googlegroups.com
We are in the process of moving from prototype/beta to production and are looking for as much HA as possible.

Is it possible/recommended to run multiple nimbus servers?   
We have multiple zookeeper servers, and multiple servers for supervisors/workers, but are not sure what the best practices are with regards to nimbus.  

thanks in advance for your help.

... Mike

Ashley Brown

Dec 13, 2011, 2:16:00 PM
to storm...@googlegroups.com
> Is it possible/recommended to run multiple nimbus servers?
> We have multiple zookeeper servers, and multiple servers for supervisors/workers, but are not sure what the best practices are with regards to nimbus.


AIUI, Nimbus is only required for job submission and monitoring - it needn't be up for the topologies to run, and you can't have more than one.
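(For context, submission and the other control operations are what go through Nimbus via the storm CLI; the jar, class, and topology names below are made-up placeholders:)

```shell
# Submitting a topology talks to Nimbus (hypothetical jar/class/topology names):
storm jar my-topology.jar com.example.MyTopology my-topology-name

# Killing a running topology also goes through Nimbus:
storm kill my-topology-name
```

Once a topology is running, the workers keep processing even if Nimbus is down; you just can't submit, kill, or rebalance until it's back.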

We've had a Storm cluster running in production for 4 weeks (and staging for a few weeks before that) and haven't encountered any problems with just having a single one.

Mike Stanley

Dec 13, 2011, 2:22:22 PM
to storm...@googlegroups.com
Thanks, Ashley. That's my understanding as well.

I've had Storm running on a single server for 60+ days with topologies processing production stream data without any issues, but now we're about to roll out product features that depend on the Storm analysis data, so I'm just looking to protect my a$$ a bit ;-)

Ashley Brown

Dec 13, 2011, 2:32:13 PM
to storm...@googlegroups.com
> I've had storm running on a single server for 60+ days with topologies processing production stream data without any issues, but now we are about to roll out product features that depend on the storm analysis data so just looking to protect my a$$ a bit ;-)


I fully appreciate the need to do that ;)

Make sure you've got enough extra capacity to cope with a machine being taken out, otherwise the topologies get stuck in a weird limbo until you kick it back into life.

We overloaded the system a bit during testing (so the OOM killer would randomly take out nimbus, supervisors and workers) and it all seemed happy enough; in our case we use monit to bring the nimbus/supervisor processes back up.

Of course this all relies on having a spout which supports the reliability API, and correctly building the tuple tree.
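For reference, a minimal monit stanza for supervising nimbus might look like the following (the pidfile location and init scripts are assumptions about a typical install, not necessarily what our setup uses):

```
# Illustrative monit stanza -- paths are assumptions, adjust for your install.
check process nimbus with pidfile /var/run/storm/nimbus.pid
    start program = "/etc/init.d/storm-nimbus start"
    stop program  = "/etc/init.d/storm-nimbus stop"
```

A similar stanza per supervisor process covers the worker nodes.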

A

Nathan Marz

Dec 14, 2011, 3:51:53 AM
to storm...@googlegroups.com
Just to clarify what Ashley said: if Storm doesn't have enough worker slots to run a topology, it will "pack" all the tasks for that topology into whatever slots there are on the cluster. Then, when there are more worker slots available (like when you add another machine), it will redistribute the tasks to the proper number of workers. Of course, running a topology with fewer workers than intended will probably lead to performance problems, so it's best practice to have extra capacity available to handle any failures.
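To illustrate, here's a toy sketch of that packing behavior (this is not Storm's actual scheduler code, and the names are made up; it just shows the round-robin "pack into whatever slots exist" idea):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the packing behavior described above: executors are dealt out
// round-robin across however many worker slots the cluster currently has,
// capped at the number of workers the topology requested.
public class PackingSketch {
    static List<List<Integer>> assign(int numExecutors, int requestedWorkers, int availableSlots) {
        int workers = Math.min(requestedWorkers, availableSlots);
        List<List<Integer>> slots = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            slots.add(new ArrayList<>());
        }
        for (int e = 0; e < numExecutors; e++) {
            slots.get(e % workers).add(e); // round-robin placement
        }
        return slots;
    }

    public static void main(String[] args) {
        // Topology asks for 12 workers but only 3 slots are free:
        // all 26 executors get packed into 3 workers (9 + 9 + 8).
        List<List<Integer>> packed = assign(26, 12, 3);
        System.out.println(packed.size());        // 3
        System.out.println(packed.get(0).size()); // 9

        // After adding machines (12 free slots), the same topology
        // spreads back out to the 12 workers it asked for.
        System.out.println(assign(26, 12, 12).size()); // 12
    }
}
```

Under-provisioned like this, every worker JVM carries several times its intended load, which is why the performance problems Nathan mentions show up.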

--
Twitter: @nathanmarz
http://nathanmarz.com

Ashley Brown

Dec 14, 2011, 4:38:37 AM
to storm...@googlegroups.com

> Just to clarify what Ashley said: if Storm doesn't have enough worker slots to run a topology, it will "pack" all the tasks for that topology into whatever slots there are on the cluster. Then, when there are more worker slots available (like you add another machine), it will redistribute the tasks to the proper number of workers. Of course, running a topology with less workers than intended will probably lead to performance problems. So it's best practice to have extra capacity available to handle any failures.


Aha, that's useful to know. That looked like what it was doing, but as you say we had performance problems, so it manifested as never-ending replays of tuples, as if new bolts weren't being created.

One possible pitfall I can see with losing a Nimbus node that is co-located with supervisors/workers: presumably the cluster has no way to re-allocate the work in that case, so affected topologies will be broken/stalled until the node comes back up?

Nathan Marz

Dec 15, 2011, 4:41:08 AM
to storm...@googlegroups.com
Yea, that's a good point. Probably a best practice to not co-locate Nimbus with any worker nodes. 

Nathan Marz

Feb 20, 2013, 9:47:40 PM
to storm...@googlegroups.com
It's configured to use 3 workers. A worker is a Java process, and each worker takes up one slot.
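(For reference, slots are defined per supervisor node in storm.yaml; each port listed is one worker slot, i.e. one JVM bound to that port. The ports below are the defaults, shown for illustration:)

```yaml
# storm.yaml on each supervisor node -- 4 ports = 4 worker slots on that machine.
supervisor.slots.ports:
    - 6700
    - 6701
    - 6702
    - 6703
```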

On Wed, Feb 20, 2013 at 5:36 PM, <rui....@openx.com> wrote:
Hi Nathan,


On Wednesday, December 14, 2011 12:51:53 AM UTC-8, nathanmarz wrote:
> Just to clarify what Ashley said: if Storm doesn't have enough worker slots to run a topology, it will "pack" all the tasks for that topology into whatever slots there are on the cluster. Then, when there are more worker slots available (like you add another machine), it will redistribute the tasks to the proper number of workers. Of course, running a topology with less workers than intended will probably lead to performance problems. So it's best practice to have extra capacity available to handle any failures.


I'm new to storm and we will use this for our production pipeline. I set up a small cluster using 5 machines like this

nimbus --- zk --- 3 worker nodes, each with 4 default slots

then when I tried the word-count topology, I noticed that there are actually 26 executors: 5 spouts + 8 split + 12 count + 1 acker.
However, it only uses 3 slots, and it looks like all 26 tasks are executed on those 3 slots in a kind of round-robin fashion.

Could you please elaborate a little more on this behavior? Why didn't it use more slots? Is it because I use 1 spout and 2 bolts, so 1 + 2 = 3? How did Storm schedule those 26 tasks on those 3 slots? Do the spouts pause sometimes to let split/count run, and resume later?

Your opinion will be greatly appreciated!

thanks,
Rui
 




Rui Wang

Feb 21, 2013, 1:14:56 PM
to storm...@googlegroups.com
Thanks for the reply, Nathan. I understand the worker definition now. My questions are more about the executors: how do they share a worker slot? Are they just threads in the worker process?

Another related question: when you use

conf.setNumWorkers(3);

let's say you only have two available worker slots, and you mentioned that Storm will pack the tasks into the available slots. Could you please elaborate on how that would behave?

Thanks,
Rui