[Shinken-devel] Problem with timeouts


David Good

May 12, 2015, 12:31:20 AM
to shinke...@lists.sourceforge.net

We're still running 2.2 currently (we're on a tight schedule to get
Shinken into production). We just revamped our host and service
configuration, adding about 1000 hosts (currently 3351 hosts, 1839
hostgroups, 30404 services). Since loading that configuration we've had
trouble with connections from the arbiter timing out, so it's constantly
redispatching the configuration, only to have the timed-out daemon
recover. I've increased ping timeouts to 6 seconds and data timeouts to
120 seconds, but it still times out. I thought maybe adding more
schedulers and pollers would help, as it would cut the config into
smaller pieces that could be processed more quickly, but it's still
constantly dispatching and re-dispatching configurations.

We're running on 5 servers, each with 40 CPUs and 64 GB of RAM. One of
the servers is the master and runs all daemons plus an extra poller and
scheduler. Three other servers are running two schedulers and two
pollers each. The last is a spare, set up the same as the master. None
of the servers are showing significant CPU, I/O, memory or network usage.

Any ideas? Would upgrading to 2.4 help?


Felipe openglx

May 12, 2015, 5:36:08 AM
to shinke...@lists.sourceforge.net
Those timeout issues happened to me on a similarly sized setup.
One of my realms was never getting stable, and I discovered it wasn't actually a timeout but the number of CherryPy threads being too low for the number of other Shinken components talking to each daemon.

Try allowing more threads on your scheduler (as it is the most central component):

daemon_thread_pool_size=12


Twelve threads was good enough for me but your number will be higher. So you have:

* 2 "master": 1 arbiter, 1 broker, 1 reactionner, 1 receiver, 2 schedulers, 2 pollers
* 3 nodes: 2 schedulers, 2 pollers


That means that you have 2*8 + 3*4 = 28 daemons. Maybe bring your thread pool size to 30 ?
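For reference, that setting lives in the [daemon] section of each daemon's ini file -- the same section shown in the poller.ini later in this thread. A minimal sketch, assuming the scheduler's ini follows the same naming and layout as /etc/shinken/daemons/pollerd.ini:

# /etc/shinken/daemons/schedulerd.ini (same idea for pollerd.ini, brokerd.ini, ...)
[daemon]
# CherryPy HTTP thread pool; the stock ini comment suggests the default is 8
daemon_thread_pool_size=30

The daemon then needs to be restarted to pick up the new pool size, since the ini is only read at daemon startup.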


I'd suspect the root cause is your setup having way too many schedulers for no apparent reason. If you want them distributed, please do some reading on setting up realms.
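For anyone new to realms, here is a rough sketch of what a split looks like; the object names are made up, and the syntax follows the Shinken docs plus the poller.cfg quoted later in this thread:

define realm {
    realm_name      All     ; top-level realm
    realm_members   DC1     ; hypothetical sub-realm
    default         1
}

define realm {
    realm_name      DC1
}

Each scheduler/poller/broker definition then carries a "realm" line (as the poller.cfg below does with "realm All"), so the arbiter only dispatches that realm's slice of the configuration to it.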


Regards


David Good

May 12, 2015, 1:29:30 PM
to shinke...@lists.sourceforge.net
Does it hurt to have more than I need?  All of the Shinken servers have plenty of resources to spare.  I'd rather they have more threads than they need than too few.  Unless that creates other problems.

I've removed the extra schedulers and pollers I had added thinking it would help (now a total of 4 each plus a spare) and raised the thread pool size on all daemons to 50 and I'm still having issues with timeouts.

Felipe openglx

May 12, 2015, 5:10:14 PM
to shinke...@lists.sourceforge.net
What is the latency between your nodes?
Have you restarted the scheduler after changing that setting?
Are you using CherryPy ?

I don't think it's beneficial to have too many schedulers unless you have pretty good retention set up between them. I'd recommend two plus one spare for a setup your size.
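For completeness, a sketch of that "two schedulers plus one spare" layout; hostnames are placeholders, and 7768 is the usual scheduler port, so check your own daemon ini files:

define scheduler {
    scheduler_name  scheduler-1
    address         shinken1.dc1.example.com    ; placeholder
    port            7768
    spare           0
    realm           All
}

define scheduler {
    scheduler_name  scheduler-spare
    address         shinken5.dc1.example.com    ; placeholder
    port            7768
    spare           1   ; idle until the arbiter declares an active scheduler dead
    realm           All
}

scheduler-2 would look like scheduler-1 on another host; the spare only receives a configuration when the arbiter decides an active scheduler is gone.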

David Good

May 12, 2015, 5:27:36 PM
to shinke...@lists.sourceforge.net

On 5/12/15 2:08 PM, Felipe openglx wrote:
What is the latency between your nodes?

The ones that are having trouble are all on the same switch -- latency is 0.5ms or less between them.  I have 4 other remote servers for checking access to our website from various internet locations.  They each run a scheduler and poller and each is in its own realm.  I'm not having any difficulty with them, though.


Have you restarted the scheduler after changing that setting?

Yes.  I restarted everything on all servers.


Are you using CherryPy ?

I think so:
[1431461784] INFO: [Shinken] Initializing a CherryPy backend with 50 threads
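
Side note for anyone checking the same thing: each daemon logs an "Initializing a CherryPy backend with N threads" line at startup, so a quick grep over the daemon logs (log directory taken from the ini files quoted later in the thread) confirms the backend everywhere:

grep -i "backend" /var/log/shinken/*.log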


I don't think it's beneficial to have too many schedulers unless you have pretty good retention set up between them. I'd recommend two plus one spare for a setup your size.

OK.  I was unsure how much capacity would be needed and was offered 5 servers so I went with what they gave me :-)

I can make the master server run only arbiter/broker/receiver/reactionner and then have two others run pollers and schedulers, one be a spare poller and scheduler and the last be a spare arbiter/broker/receiver/reactionner.

What I don't understand though, is that my current setup worked fine when I had around 2200 hosts being monitored and only started having major issues after I increased it to 3300.  It seems odd that in order to handle the extra load I need to take away some of the servers.


Felipe openglx

May 12, 2015, 5:48:12 PM
to shinke...@lists.sourceforge.net
The devs will be able to give more specifics (maybe even confirm whether 2.4 performs better for your case?), but I faced similar timeout issues because of the time it took to "slice and dice" that many objects.
If you can enable debug mode on all nodes and provide some captures, that would be great.

How interdependent are your hosts? Is there a lot of parenting between hosts? And what is the ratio of services per host?

As far as I know from the internals, the number of objects isn't the only factor: how they relate to each other increases the number of "broks" that the daemons have to handle.
In one of my production environments we have nearly 30k hosts with ping-only checks, which keeps the number of broks smaller than in an environment with 3k hosts and 30k services (one of my other prod environments).


What I want to say is: your environment is considered mid-sized for Shinken and you have some very beefy servers, so it looks like there is some tuning to be done.
If the thread pool didn't help, we need to check what the daemons are doing when they "forget" to reply to the ping.

Could you confirm OS, Python version, and anything else you think is relevant?

Have you checked that a firewall isn't blocking the return path? I remember one case where a node received the ping request but the pong reply wasn't getting back because of more rigorous firewalling...
You mentioned that with 2k hosts it worked, just wanted to double check that.
Maybe bring the system up with 0 hosts just to see if everyone is happy?


(Sorry that my e-mail is a bit confusing; it's bedtime, but I felt the need to assist with the troubleshooting... tomorrow I will try to make it clearer).

Regards



David Good

May 12, 2015, 7:17:55 PM
to shinke...@lists.sourceforge.net
On 5/12/15 2:46 PM, Felipe openglx wrote:
> The devs will be able to give more specifics (maybe even confirm if
> 2.4 performs better for your case?) but I faced similar issues with
> timeout because of the time it took to "slice and dice" the amount of
> objects.
> If you can enable debug mode on all nodes and provide some captures it
> would be great.

OK -- I'll see about setting that up.
>
> How is the interdependency of your hosts? Is there a lot of parenting
> relation between hosts? And the ratio of services/host?

There is some parenting, but not that much -- about 1000 of the hosts
have parents (VM guests on hypervisors). When I started having trouble,
there were a total of 3300 hosts and 30000 services.
>
> As far as I know from the internals, the number of objects isn't the
> only factor: how they relate to eachother will increase the number of
> "broks" that the daemons have to handle.
> In one of my production environments we have near 30k hosts with
> ping-only, what brings the number of broks to be smaller than in a
> environment with 3k hosts with 30k services (one of my other prod env).
>
>
> What I want to say is: your environment is considered mid-sized for
> Shinken and you have some very beefy servers, so it looks like some
> tuning to be done.
> If thread pool didn't help we need to check what the daemons are doing
> that they "forget" to reply ping.
>
> Could you confirm OS, Python version, and anything else you think is
> relevant?

Scientific Linux 6.5 (pretty much the same as RHEL or CentOS 6.5).
Python 2.6.6 (installed from system-provided RPM -
python-2.6.6-51.el6.x86_64).

We do have a *lot* of hostgroups included in host definitions. Probably
averaging around 10 per host. This would've probably at least doubled
with the new host config generation scheme.
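
A rough way to check that average, assuming the generated host definitions
live somewhere under /etc/shinken/ and use the standard "hostgroups"
attribute (the path is a guess -- point it at wherever the generated files
land):

grep -rhE '^[[:space:]]*hostgroups[[:space:]]' /etc/shinken/ \
    | awk '{ n = gsub(/,/, ",") + 1; total += n; hosts++ }
           END { if (hosts) printf "%.1f hostgroups per host over %d definitions\n", total/hosts, hosts }'

It just counts the comma-separated entries on each "hostgroups" line, so
templates get counted too; good enough for a ballpark figure.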

>
> Have you checked that firewall isn't blocking a return path? I
> remember one case where a node received the ping request but the pong
> reply wasn't being received due more rigorous firewalling...
> You mentioned that with 2k hosts it worked, just wanted to double
> check that.
> Maybe bring the system up with 0 hosts just to see if everyone is happy?

No firewalls and the servers are all on the same switch and are even on
adjacent ports. None of them are running iptables.

It was happy before I made some major changes in how the host/service
configuration is generated (host definitions are now generated from an
internal inventory database), which resulted in adding another 1000 or so
hosts. It has since been decided that about 1000 servers don't need to
be monitored, so now I'm down to about 2200 servers. We'll see if that
helps any.
>
>
> (Sorry that my e-mail is a bit confusing, it's bed time but felt the
> need to assist on the troubleshooting.... tomorrow I will try to make
> it clearer).
>

No problem. I really appreciate the help.

Andy Xie

May 12, 2015, 10:13:28 PM
to shinken-devel
I know it is unrelated to the problem, but can you tell me how long the reload operation takes when you do "shinken reload"? We have about 2k hosts and 20k services, and the reload takes about 2 min 30 sec -- and that is only the arbiter's reload time. By the time the configuration has been sent to all the daemons and all the satellites are working properly again, it takes about another 3 min, which adds up to almost 5 min in total. That is too long.

Another question: how often do you reload your configuration -- every 5 minutes, every hour, or even once a day?


BTW, I see some redispatching in the arbiter too. We only have one master broker; the other one is a slave. Almost every time I reload the configuration, the broker's configuration needs to be redispatched. I can tell that even when the dispatch to the broker has taken more than 80 seconds it still returns True for the transmission, yet when the first dispatch is checked in "check_bad_dispatch" it says the broker configuration may be lost, and a redispatch is issued. Sometimes the same happens for the reactionner.

What puzzles me is that the configuration for the scheduler is much bigger than the configuration for the reactionner and the broker, yet the dispatch to the scheduler is fine while the dispatch to the reactionner and broker is not.

++++++
Ning Xie

David Good

May 13, 2015, 2:13:00 PM
to shinke...@lists.sourceforge.net
On 5/12/15 4:16 PM, David Good wrote:
On 5/12/15 2:46 PM, Felipe openglx wrote:
The devs will be able to give more specifics (maybe even confirm if 
2.4 performs better for your case?) but I faced similar issues with 
timeout because of the time it took to "slice and dice" the amount of 
objects.
If you can enable debug mode on all nodes and provide some captures it 
would be great.
OK -- I'll see about setting that up.


Here's one set of captures.  I'm choosing timeouts between daemons running on the same server as the arbiter so there's no issue of network interference or clock skew.

Here's the arbiter:

[1431539957] INFO: [Shinken] [All] Trying to send configuration to poller poller-1
[1431539960] ERROR: [Shinken] Failed sending configuration for poller-1: Connexion error to http://shinken1.dc1.example.com:7771/ : Operation timed out after 3001 milliseconds with 0 bytes received
[1431539960] INFO: [Shinken] [All] Trying to send configuration to poller poller-4

Here's the corresponding entries from poller-1.  Note that we have a lot of servers we're checking via NRPE that don't yet have the NRPE daemon set up to allow access from the Shinken servers; looking at the code where these error messages are generated, that appears to be the cause of these errors:

[1431539922] DEBUG: [Shinken] Error on SSL shutdown : library=missing reason=missing : [] ; Traceback (most recent call last):
  File "/var/lib/shinken/modules/booster-nrpe/module.py", line 220, in close
    break
Error: []

[1431539957] DEBUG: [Shinken] Error on SSL shutdown : library=missing reason=missing : [] ; Traceback (most recent call last):
  File "/var/lib/shinken/modules/booster-nrpe/module.py", line 220, in close
    break
Error: []

[1431539957] DEBUG: [Shinken] Error on SSL shutdown : library=missing reason=missing : [] ; Traceback (most recent call last):
  File "/var/lib/shinken/modules/booster-nrpe/module.py", line 220, in close
    break
Error: []

[1431540097] DEBUG: [Shinken] socket.shutdown failed: [Errno 107] Transport endpoint is not connected
[1431540097] DEBUG: [Shinken] socket.shutdown failed: [Errno 9] Bad file descriptor
[1431540097] DEBUG: [Shinken] Error on SSL shutdown : library=missing reason=missing : [] ; Traceback (most recent call last):
  File "/var/lib/shinken/modules/booster-nrpe/module.py", line 220, in close
    break
Error: []

There doesn't seem to be anything going on with the poller at the time that the arbiter is complaining about it not responding.
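
For what it's worth, a quick way to see which of those NRPE targets still refuse connections from the poller hosts -- the plugin path is a guess for an EL6 box, adjust it to wherever check_nrpe is installed:

/usr/lib64/nagios/plugins/check_nrpe -H some-target-host -t 10

A target whose nrpe.cfg doesn't yet allow the Shinken servers typically answers with a connection-refused or SSL handshake error, which looks like what booster-nrpe is tripping over in the tracebacks above.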



  

Olivier Hanesse

May 13, 2015, 2:24:05 PM
to shinke...@lists.sourceforge.net
Did you try without the nrpe-booster module on poller ?

David Good

May 13, 2015, 2:24:32 PM
to shinke...@lists.sourceforge.net
According to the logs, the time from when the arbiter logs its version and copyright until it starts dispatching configurations is a bit over 2 minutes (145 seconds):

[1431476926] INFO: [Shinken] Shinken 2.2
[1431476926] INFO: [Shinken] Copyright (c) 2009-2014:
...
[1431476927] INFO: [Shinken] Begin to dispatch configurations to satellites
...
[1431477071] INFO: [Shinken] Dispatching Realm All

I haven't set up the automatic reload interval yet.  I'm thinking probably once an hour, but only if the configuration has changed.  No point in reloading if the configuration is the same.
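
For what it's worth, here is a minimal sketch of that "reload only if something changed" idea: an hourly cron job that checksums the config tree and reloads only on a change. The paths and the reload command are assumptions, not part of the stock install:

#!/bin/sh
# hypothetical /usr/local/bin/shinken-reload-if-changed, run hourly from cron
STAMP=/var/run/shinken/config.md5
NEW=$(find /etc/shinken -type f \( -name '*.cfg' -o -name '*.ini' \) -print0 \
        | sort -z | xargs -0 md5sum | md5sum | cut -d' ' -f1)
OLD=$(cat "$STAMP" 2>/dev/null)
if [ "$NEW" != "$OLD" ]; then
    echo "$NEW" > "$STAMP"
    /etc/init.d/shinken reload    # assumption: replace with whatever reload command you already use
fi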

David Good

May 13, 2015, 2:33:52 PM
to shinke...@lists.sourceforge.net
No, because it's not just the pollers having trouble.  I may try that today though just to eliminate the possibility.

David Good

May 13, 2015, 2:37:40 PM
to shinke...@lists.sourceforge.net
Here's another set of captures.  Again, both the arbiter and broker are on the same server:


arbiter:

[1431540048] INFO: [Shinken] [All] Dispatching broker satellite with order: broker-master (spare:False), broker-spare (spare:True),
[1431540048] INFO: [Shinken] [All] Trying to send configuration to broker broker-master
[1431540048] DEBUG: [Shinken] Posting to http://shinken1.dc1.example.com:7772/put_conf: 2019B
[1431540168] ERROR: [Shinken] Failed sending configuration for broker-master: Connexion error to http://shinken1.dc1.example.com:7772/ : Operation timed out after 120001 milliseconds with 0 bytes received
broker:

[1431540000] DEBUG: [broker-master] [Livestatus] Request duration 0.0890s
[1431540089] INFO: [broker-master] [Logstore Null] Open LiveStatusLogStoreNull ok
[1431540091] DEBUG: [broker-master] [Livestatus Query Metainfo] I cannot cache this table services
columns ['host_has_been_checked', 'host_name', 'host_state', 'host_scheduled_downtime_depth', 'host_acknowledged', 'has_been_checked', 'state', 'scheduled_downtime_depth', 'acknowledged']
stats_columns []
filter_columns ['host_name']
is_stats False
is_cacheable False

[1431540091] DEBUG: [broker-master] [Livestatus Query Metainfo] ge_contains_filters: []
[1431540091] DEBUG: [broker-master] [Livestatus Query Metainfo] unique_ge_contains_filters: []
[1431540091] DEBUG: [broker-master] [Livestatus Regenerator] Hint is 3
[1431540091] DEBUG: [broker-master] [Livestatus] Request duration 0.0033s
[1431540091] INFO: [broker-master] [Logstore Null] Open LiveStatusLogStoreNull ok
[1431540092] DEBUG: [broker-master] [Livestatus Query Metainfo] I cannot cache this table comments
columns ['author', 'comment', 'entry_time', 'entry_type', 'expires', 'expire_time', 'host_name', 'id', 'persistent', 'service_description', 'source', 'type']
stats_columns []
filter_columns []
is_stats False
is_cacheable False

[1431540092] DEBUG: [broker-master] [Livestatus Query Metainfo] ge_contains_filters: []
[1431540092] DEBUG: [broker-master] [Livestatus Query Metainfo] unique_ge_contains_filters: []
[1431540092] DEBUG: [broker-master] [Livestatus] Request duration 0.0203s
[1431540092] INFO: [broker-master] [Logstore Null] Open LiveStatusLogStoreNull ok
[1431540093] DEBUG: [broker-master] [Livestatus Query Metainfo] I cannot cache this table downtimes
columns ['author', 'comment', 'end_time', 'entry_time', 'fixed', 'host_name', 'id', 'start_time', 'service_description', 'triggered_by', 'duration']
stats_columns []
filter_columns []
is_stats False
is_cacheable False

[1431540093] DEBUG: [broker-master] [Livestatus Query Metainfo] ge_contains_filters: []
[1431540093] DEBUG: [broker-master] [Livestatus Query Metainfo] unique_ge_contains_filters: []
[1431540093] DEBUG: [broker-master] [Livestatus] Request duration 0.0158s
[1431540093] INFO: [broker-master] [Logstore Null] Open LiveStatusLogStoreNull ok
[1431540094] DEBUG: [broker-master] [Livestatus Query Metainfo] I cannot cache this table services
columns []
stats_columns ['description']
filter_columns ['host_name']
is_stats True
is_cacheable False

[1431540094] DEBUG: [broker-master] [Livestatus Query Metainfo] ge_contains_filters: []
[1431540094] DEBUG: [broker-master] [Livestatus Query Metainfo] unique_ge_contains_filters: []
[1431540094] DEBUG: [broker-master] [Livestatus Regenerator] Hint is 3
[1431540094] DEBUG: [broker-master] [Livestatus] Request duration 0.0026s
[1431540094] INFO: [broker-master] [Logstore Null] Open LiveStatusLogStoreNull ok
[1431540095] DEBUG: [broker-master] [Livestatus Query Metainfo] I cannot cache this table services
columns ['accept_passive_checks', 'acknowledged', 'action_url', 'action_url_expanded', 'active_checks_enabled', 'check_command', 'check_interval', 'check_options', 'check_period', 'check_type', 'checks_enabled', 'comments', 'current_attempt', 'current_notification_number', 'description', 'event_handler', 'event_handler_enabled', 'custom_variable_names', 'custom_variable_values', 'execution_time', 'first_notification_delay', 'flap_detection_enabled', 'groups', 'has_been_checked', 'high_flap_threshold', 'host_acknowledged', 'host_action_url_expanded', 'host_active_checks_enabled', 'host_address', 'host_alias', 'host_checks_enabled', 'host_check_type', 'host_latency', 'host_plugin_output', 'host_perf_data', 'host_current_attempt', 'host_check_command', 'host_comments', 'host_groups', 'host_has_been_checked', 'host_icon_image_expanded', 'host_icon_image_alt', 'host_is_executing', 'host_is_flapping', 'host_name', 'host_notes_url_expanded', 'host_notifications_enabled', 'host_scheduled_downtime_depth', 'host_state', 'host_accept_passive_checks', 'icon_image', 'icon_image_alt', 'icon_image_expanded', 'is_executing', 'is_flapping', 'last_check', 'last_notification', 'last_state_change', 'latency', 'long_plugin_output', 'low_flap_threshold', 'max_check_attempts', 'next_check', 'notes', 'notes_expanded', 'notes_url', 'notes_url_expanded', 'notification_interval', 'notification_period', 'notifications_enabled', 'obsess_over_service', 'percent_state_change', 'perf_data', 'plugin_output', 'process_performance_data', 'retry_interval', 'scheduled_downtime_depth', 'state', 'state_type', 'modified_attributes_list', 'last_time_critical', 'last_time_ok', 'last_time_unknown', 'last_time_warning', 'display_name', 'host_display_name', 'host_custom_variable_names', 'host_custom_variable_values', 'in_check_period', 'in_notification_period', 'host_parents', 'is_impact', 'source_problems', 'impacts', 'criticity', 'is_problem', 'poller_tag', 'got_business_rule', 'parent_dependencies']
stats_columns []
filter_columns ['host_name']
is_stats False
is_cacheable False

[1431540095] DEBUG: [broker-master] [Livestatus Query Metainfo] ge_contains_filters: []
[1431540095] DEBUG: [broker-master] [Livestatus Query Metainfo] unique_ge_contains_filters: []
[1431540095] DEBUG: [broker-master] [Livestatus Regenerator] Hint is 3
[1431540095] DEBUG: [broker-master] [Livestatus] Request duration 0.0098s
[1431540095] INFO: [broker-master] [Logstore Null] Open LiveStatusLogStoreNull ok
[1431540096] DEBUG: [broker-master] [Livestatus Query Metainfo] I cannot cache this table status
columns ['accept_passive_host_checks', 'accept_passive_service_checks', 'check_external_commands', 'check_host_freshness', 'check_service_freshness', 'enable_event_handlers', 'enable_flap_detection', 'enable_notifications', 'execute_host_checks', 'execute_service_checks', 'last_command_check', 'last_log_rotation', 'livestatus_version', 'nagios_pid', 'obsess_over_hosts', 'obsess_over_services', 'process_performance_data', 'program_start', 'program_version', 'interval_length']
stats_columns []
filter_columns []
is_stats False
is_cacheable False

[1431540096] DEBUG: [broker-master] [Livestatus Query Metainfo] ge_contains_filters: []
[1431540096] DEBUG: [broker-master] [Livestatus Query Metainfo] unique_ge_contains_filters: []
[1431540096] DEBUG: [broker-master] [Livestatus] Request duration 0.0038s
[1431540096] INFO: [broker-master] [Logstore Null] Open LiveStatusLogStoreNull ok
[1431540098] DEBUG: [broker-master] [Livestatus Query Metainfo] I cannot cache this table contactgroups
columns ['name']
stats_columns []
filter_columns ['members']
is_stats False
is_cacheable False

[1431540098] DEBUG: [broker-master] [Livestatus Query Metainfo] ge_contains_filters: ['members']
[1431540098] DEBUG: [broker-master] [Livestatus Query Metainfo] unique_ge_contains_filters: ['members']
[1431540098] DEBUG: [broker-master] [Livestatus Regenerator] Hint is 0
[1431540098] DEBUG: [broker-master] [Livestatus] Request duration 0.0030s
[1431540173] INFO: [broker-master] [Logstore Null] Open LiveStatusLogStoreNull ok




Olivier Hanesse

May 13, 2015, 2:40:20 PM
to shinke...@lists.sourceforge.net
Could you launch a daemon in "cli" mode and paste the output?


David Good

May 13, 2015, 3:17:45 PM
to shinke...@lists.sourceforge.net
Which daemon?  Do you mean just run something like /usr/bin/shinken-poller from the commandline?

Olivier Hanesse

May 13, 2015, 4:00:48 PM
to shinke...@lists.sourceforge.net
A poller, for example.

And yes, do a "ps -ef", copy the command line, and run it without the "-d" (daemon mode).

David Good

May 13, 2015, 4:12:13 PM
to shinke...@lists.sourceforge.net
Here's everything up to when it starts waiting for its configuration from the arbiter:

[root@shinken1 shinken]# /usr/bin/shinken-poller -r -c /etc/shinken/daemons/pollerd.ini
[1431547694] INFO: [Shinken] Stale pidfile exists at invalid literal for int() with base 10: '' (/var/run/shinken/pollerd.pid). Reusing it.
[1431547694] INFO: [Shinken] Opening HTTP socket at http://0.0.0.0:7771
Bottle server starting up (using CherryPyServer(ssl_key='', ssl_cert='', daemon_thread_pool_size=50, ca_cert='', use_ssl=False))...
Listening on http://0.0.0.0:7771/
Use Ctrl-C to quit.

[1431547694] INFO: [Shinken] Initializing a CherryPy backend with 50 threads
Shutting down...
[1431547694] INFO: [Shinken] Shinken 2.2
[1431547694] INFO: [Shinken] Copyright (c) 2009-2014:
[1431547694] INFO: [Shinken] Gabes Jean (napa...@gmail.com)
[1431547694] INFO: [Shinken] Gerhard Lausser, Gerhard...@consol.de
[1431547694] INFO: [Shinken] Gregory Starck, g.st...@gmail.com
[1431547694] INFO: [Shinken] Hartmut Goebel, h.go...@goebel-consult.de
[1431547694] INFO: [Shinken] License: AGPL
[1431547694] INFO: [Shinken] Stale pidfile exists at invalid literal for int() with base 10: '' (/var/run/shinken/pollerd.pid). Reusing it.
[1431547694] INFO: [Shinken] Opening HTTP socket at http://0.0.0.0:7771
[1431547694] INFO: [Shinken] Initializing a CherryPy backend with 50 threads
[1431547694] INFO: [Shinken] Using the local log file '/var/log/shinken/pollerd.log'
[1431547694] INFO: [Shinken] Starting HTTP daemonList to register :[('__init__', <bound method IForArbiter.__init__ of <shinken.satellite.IForArbiter object at 0x26ff450>>), ('api', <bound method IForArbiter.api of <shinken.satellite.IForArbiter object at 0x26ff450>>), ('api_full', <bound method IForArbiter.api_full of <shinken.satellite.IForArbiter object at 0x26ff450>>), ('get_external_commands', <bound method IForArbiter.get_external_commands of <shinken.satellite.IForArbiter object at 0x26ff450>>), ('get_log_level', <bound method IForArbiter.get_log_level of <shinken.satellite.IForArbiter object at 0x26ff450>>), ('get_running_id', <bound method IForArbiter.get_running_id of <shinken.satellite.IForArbiter object at 0x26ff450>>), ('get_start_time', <bound method IForArbiter.get_start_time of <shinken.satellite.IForArbiter object at 0x26ff450>>), ('got_conf', <bound method IForArbiter.got_conf of <shinken.satellite.IForArbiter object at 0x26ff450>>), ('have_conf', <bound method IForArbiter.have_conf of <shinken.satellite.IForArbiter object at 0x26ff450>>), ('ping', <bound method IForArbiter.ping of <shinken.satellite.IForArbiter object at 0x26ff450>>), ('push_broks', <bound method IForArbiter.push_broks of <shinken.satellite.IForArbiter object at 0x26ff450>>), ('push_host_names', <bound method IForArbiter.push_host_names of <shinken.satellite.IForArbiter object at 0x26ff450>>), ('put_conf', <bound method IForArbiter.put_conf of <shinken.satellite.IForArbiter object at 0x26ff450>>), ('remove_from_conf', <bound method IForArbiter.remove_from_conf of <shinken.satellite.IForArbiter object at 0x26ff450>>), ('set_log_level', <bound method IForArbiter.set_log_level of <shinken.satellite.IForArbiter object at 0x26ff450>>), ('wait_new_conf', <bound method IForArbiter.wait_new_conf of <shinken.satellite.IForArbiter object at 0x26ff450>>), ('what_i_managed', <bound method IForArbiter.what_i_managed of <shinken.satellite.IForArbiter object at 0x26ff450>>)]

Registering api [] <shinken.satellite.IForArbiter object at 0x26ff450>
Registering api_full [] <shinken.satellite.IForArbiter object at 0x26ff450>
Registering get_external_commands [] <shinken.satellite.IForArbiter object at 0x26ff450>
Registering get_log_level [] <shinken.satellite.IForArbiter object at 0x26ff450>
Registering get_running_id [] <shinken.satellite.IForArbiter object at 0x26ff450>
Registering get_start_time [] <shinken.satellite.IForArbiter object at 0x26ff450>
Registering got_conf [] <shinken.satellite.IForArbiter object at 0x26ff450>
Registering have_conf [] <shinken.satellite.IForArbiter object at 0x26ff450>
Registering ping [] <shinken.satellite.IForArbiter object at 0x26ff450>
Registering push_broks ['broks'] <shinken.satellite.IForArbiter object at 0x26ff450>
Registering push_host_names ['sched_id', 'hnames'] <shinken.satellite.IForArbiter object at 0x26ff450>
Registering put_conf ['conf'] <shinken.satellite.IForArbiter object at 0x26ff450>
Registering remove_from_conf ['sched_id'] <shinken.satellite.IForArbiter object at 0x26ff450>
Registering set_log_level ['loglevel'] <shinken.satellite.IForArbiter object at 0x26ff450>
Registering wait_new_conf [] <shinken.satellite.IForArbiter object at 0x26ff450>
Registering what_i_managed [] <shinken.satellite.IForArbiter object at 0x26ff450>
picking only bound methods of class and not parents
List to register :[('get_broks', <bound method IBroks.get_broks of <shinken.satellite.IBroks object at 0x26ff590>>)]
Registering get_broks ['bname'] <shinken.satellite.IBroks object at 0x26ff590>
picking only bound methods of class and not parents
List to register :[('get_returns', <bound method ISchedulers.get_returns of <shinken.satellite.ISchedulers object at 0x26ff610>>), ('push_actions', <bound method ISchedulers.push_actions of <shinken.satellite.ISchedulers object at 0x26ff610>>)]
Registering get_returns ['sched_id'] <shinken.satellite.ISchedulers object at 0x26ff610>
Registering push_actions ['actions', 'sched_id'] <shinken.satellite.ISchedulers object at 0x26ff610>
picking only bound methods of class and not parents
List to register :[('get_raw_stats', <bound method IStats.get_raw_stats of <shinken.satellite.IStats object at 0x26ff4d0>>)]
Registering get_raw_stats [] <shinken.satellite.IStats object at 0x26ff4d0>
[1431547694] INFO: [Shinken] Modules directory: /var/lib/shinken/modules
[1431547694] INFO: [Shinken] Modules directory: /var/lib/shinken/modules
[1431547694] INFO: [Shinken] Waiting for initial configuration

David Good

May 13, 2015, 4:45:27 PM
to shinke...@lists.sourceforge.net


OK, things seem to be stable now. I discovered that several of the
schedulers were using massive amounts of memory (over 30 GB), causing the
kernel to try to kill them or their children. I restarted them, then
restarted anything that showed up as a problem in the arbiter log, and
since then it's been stable.

One odd thing, though, is that some of the daemons wouldn't die normally
-- I had to use 'kill -KILL' on them.

Olivier Hanesse

May 13, 2015, 4:51:26 PM
to shinke...@lists.sourceforge.net
And nothing after "[1431547694] INFO: [Shinken] Waiting for initial configuration"?
On the arbiter side, I guess there is an error log saying that it didn't succeed in pushing the configuration to that poller?

David Good

May 13, 2015, 4:58:46 PM
to shinke...@lists.sourceforge.net
No, it kept going after that, I just didn't want to post too much :-)

The problem isn't that it never gets a config; it's that it'll get one and then a bit later the arbiter will time out trying to give it another.  Then it'll recover, and one of the other daemons will time out (broker, receiver, reactionner, scheduler, another poller), and it never seems to stop dispatching and re-dispatching configs.  It finally seems to have settled down after my latest round of restarts.  I have a new config to load (some server changes were made recently), so we'll see if it goes any more smoothly this time.

Felipe openglx

May 13, 2015, 5:01:41 PM
to shinke...@lists.sourceforge.net
I've noticed that Shinken 2 doesn't go down easily with a plain kill. I've always done "pkill -9 -f shinken-" when I need to restart the daemons.

Glad to hear you got something working, David. All hosts or just a fraction of them?

Regards

David Good

May 13, 2015, 5:20:05 PM
to shinke...@lists.sourceforge.net
It was all hosts, but I just reloaded with a new config, so we'll see if my luck holds :-)

Felipe openglx

May 13, 2015, 5:52:21 PM
to shinke...@lists.sourceforge.net
Play the lotto just in case ;)
My suspicion would be that your previous "restart" to adjust the thread pool (or other testing) didn't kill all the threads, which is why you had some very unusual situations going on.
Let us know how it goes, and best of luck getting the project delivered!

Regards

David Good

May 14, 2015, 5:51:03 PM
to shinke...@lists.sourceforge.net

I doubt that was the case -- I was careful to make sure everything was stopped before restarting.

And now my problems have started up again.  I may be forced to upgrade to 2.4 to see if it helps any.  Very frustrating.  If that doesn't fix it, I may have to fall back to Nagios and Gearman.  I'd hate to do that, as we had promised that Shinken would scale better than Nagios.

David Good

May 14, 2015, 6:14:33 PM
to shinke...@lists.sourceforge.net
Here's another example of what I'm seeing -- In the arbiter log I'll see something like this:

[1431641122] INFO: [Shinken] [All] Trying to send configuration to poller poller-1
[1431641242] ERROR: [Shinken] Failed sending configuration for poller-1: Connexion error to http://shinken1.dc1.eharmony.com:7771/ : Operation timed out after 120001 milliseconds with 0 bytes received

And then just a few seconds later:

[1431641291] INFO: [Shinken] [All] Trying to send configuration to poller poller-1
[1431641291] INFO: [Shinken] [All] Dispatch OK of configuration 1 to poller poller-1

And this poller is on the same server as the arbiter.  I see this happening sporadically for pretty much every daemon, causing the configuration to be constantly in the process of being re-dispatched.  This is especially frustrating as I'm trying to test out some new configs adding and removing hosts and services from monitoring.  If it can't finish dispatching it makes it hard to test :-/
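
One crude way to quantify the churn over time, assuming the arbiter writes a local log next to the other daemon logs (the file name is a guess based on pollerd.log):

grep -c "Trying to send configuration" /var/log/shinken/arbiterd.log
grep -c "Failed sending configuration" /var/log/shinken/arbiterd.log

On a healthy setup the first count should stop growing shortly after a reload and the second should stay near zero.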

David Good

May 14, 2015, 6:25:28 PM
to shinke...@lists.sourceforge.net
Here's the poller.ini file I'm using:

[daemon]

#-- Global Configuration
#user=shinken         ; if not set then by default it's the current user.
#group=shinken        ; if not set then by default it's the current group.
# Set to 0 if you want to make this daemon NOT run
daemon_enabled=1

# Larger configurations need more threads (default is 8?)
daemon_thread_pool_size=50

#-- Path Configuration
# The daemon will chdir into the directory workdir when launched
# paths variables values, if not absolute paths, are relative to workdir.
# using default values for following config variables value:
workdir = /var/run/shinken
logdir  = /var/log/shinken
pidfile=%(workdir)s/pollerd.pid

#-- Network configuration
# host=0.0.0.0
# port=7771
# http_backend=auto
# idontcareaboutsecurity=0

#-- SSL configuration --
use_ssl=0
# WARNING : Put full paths for certs
#ca_cert=/etc/shinken/certs/ca.pem
#server_cert=/etc/shinken/certs/server.cert
#server_key=/etc/shinken/certs/server.key
#hard_ssl_name_check=0

#-- Local log management --
# Enabled by default to ease troubleshooting
use_local_log=1
local_log=%(logdir)s/pollerd.log
# accepted log level values= DEBUG,INFO,WARNING,ERROR,CRITICAL
log_level=INFO
#log_level=DEBUG

And here's the poller.cfg file:

#===============================================================================
# POLLER (S1_Poller)
#===============================================================================
# Description: The poller is responsible for:
# - Active data acquisition
# - Local passive data acquisition
# https://shinken.readthedocs.org/en/latest/08_configobjects/poller.html
#===============================================================================
define poller {
    poller_name     poller-1
    address         shinken1.dc1.example.com
    port            7771

    ## Optional
    spare               0   ; 1 = is a spare, 0 = is not a spare
    manage_sub_realms   0   ; Does it take jobs from schedulers of sub-Realms?
    min_workers         0   ; Starts with N processes (0 = 1 per CPU)
    max_workers         0   ; No more than N processes (0 = 1 per CPU)
    processes_by_worker 256 ; Each worker manages N checks
    polling_interval    1   ; Get jobs from schedulers each N seconds
    timeout             3  ; Ping timeout
    data_timeout        120 ; Data send timeout
    max_check_attempts  3   ; If ping fails N or more, then the node is dead
    check_interval      60  ; Ping node every N seconds

    ## Interesting modules that can be used:
    # - booster-nrpe     = Replaces the check_nrpe binary. Therefore it
    #                     enhances performances when there are lot of NRPE
    #                     calls.
    # - named-pipe     = Allow the poller to read a nagios.cmd named pipe.
    #                     This permits the use of distributed check_mk checks
    #                     should you desire it.
    # - SnmpBooster     = Snmp bulk polling module
    modules     named-pipe, booster-nrpe

    ## Advanced Features
    #passive         0       ; For DMZ monitoring, set to 1 so the connections
                             ; will be from scheduler -> poller.
   
    # Poller tags are the tag that the poller will manage. Use None as tag name to manage
    # untaggued checks
    #poller_tags     None

    # Enable https or not
    use_ssl              0
    # enable certificate/hostname check, will avoid man in the middle attacks
    hard_ssl_name_check   0
   
 
    realm   All
}


On 5/14/15 3:13 PM, David Good wrote:
Here's another example of what I'm seeing -- In the arbiter log I'll see something like this:

[1431641122] INFO: [Shinken] [All] Trying to send configuration to poller poller-1
[1431641242] ERROR: [Shinken] Failed sending configuration for poller-1: Connexion error to http://shinken1.dc1.example.com:7771/ : Operation timed out after 120001 milliseconds with 0 bytes received

Felipe openglx

May 15, 2015, 6:44:49 AM
to shinke...@lists.sourceforge.net
Sorry, David, I'm out of ideas.
The only difference between your config and mine is the use of modules on the poller: I don't use any.


Have you tried upgrading to 2.4 just in case? Or maybe downgrading to 2.0.3 (that's the version I'm on)?

Regards