[Shinken-devel] Broker memory blowout


David Good

Apr 22, 2015, 6:22:43 PM
to shinke...@lists.sourceforge.net

We have a Shinken installation using version 2.2. We have 5 servers.
Each server has 40 CPU cores and 64GB of RAM. One server (shinken1) is
running all of the daemons (arbiter, broker, scheduler, poller,
receiver, reactionner). Three servers (shinken2, shinken3, shinken4)
run only a scheduler and poller. The last server (shinken5) is a spare
for all daemons.

Our current configuration (which is still being set up) has 2118 hosts
and 19374 services, and everything is running smoothly except that the
Thruk interface seems a bit sluggish.

However, when I change the normal_check_interval from 5 to 1 and the
retry_check_interval from 2 to 1 (to match the current Nagios
implementation) we start to have problems. The broker daemon starts to
use more and more memory and CPU and then I start to see a lot of
timeouts and configuration reassignments in the arbiter log. I've seen
a broker process get up to 30GB of memory and 120% CPU at which point
Shinken is pretty much unusable.

Any idea what would cause this? Is it just that I need more capacity to
run the extra checks? Would adding another poller and scheduler to
shinken2-4 help? They seem to have plenty of CPU and RAM to spare still.
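
If it would, I assume the extra daemons get declared next to the existing ones
in shinken-specific.cfg, something like the sketch below (the name and port are
made up; each daemon on a host needs its own free port):

define poller {
    poller_name     poller-shinken2-b    ; hypothetical name for a second poller
    address         shinken2
    port            17771                ; any free port other than the first poller's
    spare           0
}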


nap

Apr 24, 2015, 3:12:22 AM
to shinke...@lists.sourceforge.net
Hi,


On Thu, Apr 23, 2015 at 12:05 AM, David Good <dg...@willingminds.com> wrote:

We have a Shinken installation using version 2.2.  We have 5 servers.
Each server has 40 CPU cores and 64GB of RAM.  One server (shinken1) is
running all of the daemons (arbiter, broker, scheduler, poller,
receiver, reactionner).  Three servers (shinken2, shinken3, shinken4)
run only a scheduler and poller.  The last server (shinken5) is a spare
for all daemons.

Our current configuration (which is still being set up) has 2118 hosts
and 19374 services
So a mid-sized environment.
 
and everything is running smoothly except that the
Thruk interface seems a bit sluggish.
I think it's not Thruk itself that is slow; it's the Livestatus module that Thruk queries.

However, when I change the normal_check_interval from 5 to 1 and the
retry_check_interval from 2 to 1 (to match the current Nagios
implementation) we start to have problems.  The broker daemon starts to
use more and more memory and CPU and then I start to see a lot of
timeouts and configuration reassignments in the arbiter log.  I've seen
a broker process get up to 30GB of memory and 120% CPU at which point
Shinken is pretty much unusable.
You can check in the broker logs which module this process belongs to. I think it will be the Livestatus one, which is known to be CPU-hungry.

But 30GB is just too much for a 20K environment; mine takes 8GB for 120K services. If you have logging enabled in the Livestatus module, turn it off. The Livestatus logs are known to require a LOT of memory. We have some POCs about CPU tuning, but for all my customers I set logging to off, because it's just too sensitive to bad user queries.
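
For illustration, in a stock 2.x install that usually means dropping the log
store sub-module from the Livestatus broker module definition in
shinken-specific.cfg; the module names below are assumptions from a default
setup and may differ in yours:

define module {
    module_name     Livestatus
    module_type     livestatus
    host            *
    port            50000
    #modules        logstore-sqlite    ; removing/commenting this line disables Livestatus logging
}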

There may be ways to reduce this, but currently no one is working on that part.
 

Any idea what would cause this?  Is it just that I need more capacity to
run the extra checks?  Would adding another poller and scheduler to
shinken2-4 help?  They seem to have plenty of CPU and RAM to spare still.

I don't see the link between reduced intervals and the broker's consumption. The objects created for a check or a check result are very small; only increasing the number of services has a real impact on the broker's memory/CPU consumption (of course the schedulers and pollers are directly impacted, but here you point only at the broker).

So first look at the logs, and then see if it's better. Your setup is not so big that it should need so much RAM; there's something wrong :)


Jean