What is the latency between your nodes?
Have you restarted the scheduler after changing that setting?
Are you using CherryPy?
[1431461784] INFO: [Shinken] Initializing a CherryPy backend with 50 threads
I don't think there's any benefit in having too many schedulers unless you have pretty good retention set up between them. I'd recommend two plus one spare for a setup your size.
On 5/12/15 2:46 PM, Felipe openglx wrote:
The devs will be able to give more specifics (maybe even confirm if 2.4 performs better for your case?) but I faced similar issues with timeout because of the time it took to "slice and dice" the amount of objects. If you can enable debug mode on all nodes and provide some captures it would be great.

OK -- I'll see about setting that up.
[1431539957] INFO: [Shinken] [All] Trying to send configuration to poller poller-1
[1431539960] ERROR: [Shinken] Failed sending configuration for poller-1: Connexion error to http://shinken1.dc1.example.com:7771/ : Operation timed out after 3001 milliseconds with 0 bytes received
[1431539960] INFO: [Shinken] [All] Trying to send configuration to poller poller-4
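To see whether a 3-second timeout is even plausible for the network path, one option is to time a plain TCP connection from the arbiter host to the poller port. This is only a rough sketch: the function name `time_tcp_connect` and the injectable `connect` parameter are made up for illustration, and a fast connect does not prove the daemon answers HTTP requests promptly.

```python
import socket
import time


def time_tcp_connect(host, port, timeout=3.0, connect=socket.create_connection):
    """Return the seconds taken to open a TCP connection, or None on failure.

    `connect` is injectable (hypothetical parameter) so the function can be
    exercised without a real network.
    """
    start = time.monotonic()
    try:
        sock = connect((host, port), timeout)
    except OSError:
        # Covers refused connections, DNS failures and socket timeouts.
        return None
    elapsed = time.monotonic() - start
    sock.close()
    return elapsed
```

Calling `time_tcp_connect("shinken1.dc1.example.com", 7771)` from the arbiter host would use the host and port from the error message above.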
[1431539922] DEBUG: [Shinken] Error on SSL shutdown : library=missing reason=missing : [] ; Traceback (most recent call last):
File "/var/lib/shinken/modules/booster-nrpe/module.py", line 220, in close
break
Error: []
[1431539957] DEBUG: [Shinken] Error on SSL shutdown : library=missing reason=missing : [] ; Traceback (most recent call last):
File "/var/lib/shinken/modules/booster-nrpe/module.py", line 220, in close
break
Error: []
[1431539957] DEBUG: [Shinken] Error on SSL shutdown : library=missing reason=missing : [] ; Traceback (most recent call last):
File "/var/lib/shinken/modules/booster-nrpe/module.py", line 220, in close
break
Error: []
[1431540097] DEBUG: [Shinken] socket.shutdown failed: [Errno 107] Transport endpoint is not connected
[1431540097] DEBUG: [Shinken] socket.shutdown failed: [Errno 9] Bad file descriptor
[1431540097] DEBUG: [Shinken] Error on SSL shutdown : library=missing reason=missing : [] ; Traceback (most recent call last):
File "/var/lib/shinken/modules/booster-nrpe/module.py", line 220, in close
break
Error: []
There doesn't seem to be anything going on with the poller at the time that the arbiter is complaining about it not responding.

I haven't set up the automatic reload time yet. I'm thinking probably once an hour, but only if the configuration has changed. No point in reloading if the configuration is the same.

[1431476926] INFO: [Shinken] Shinken 2.2
[1431476926] INFO: [Shinken] Copyright (c) 2009-2014:
...
[1431476927] INFO: [Shinken] Begin to dispatch configurations to satellites
...
[1431477071] INFO: [Shinken] Dispatching Realm All
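On the idea of reloading only when the configuration has changed: one possible approach is to fingerprint the configuration tree and compare it with the fingerprint from the previous run. The sketch below is an assumption about how that could be done, not something Shinken provides. The names `config_fingerprint`, `needs_reload` and the state file are hypothetical, and how the reload itself is triggered is left out.

```python
import hashlib
import os


def config_fingerprint(root):
    """Return a SHA-256 hex digest over every file path and file body under root."""
    digest = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the walk order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            digest.update(os.path.relpath(path, root).encode("utf-8"))
            digest.update(b"\0")
            with open(path, "rb") as handle:
                digest.update(handle.read())
            digest.update(b"\0")
    return digest.hexdigest()


def needs_reload(root, state_file):
    """Return True, and record the new fingerprint, if the tree changed.

    state_file must live outside root, otherwise writing it would change
    the fingerprint on every run.
    """
    current = config_fingerprint(root)
    try:
        with open(state_file) as handle:
            previous = handle.read().strip()
    except FileNotFoundError:
        previous = None
    if current == previous:
        return False
    with open(state_file, "w") as handle:
        handle.write(current)
    return True
```

An hourly job could call `needs_reload("/etc/shinken", "/var/tmp/shinken-conf.sha256")` and only reload the arbiter when it returns True. Both paths are examples.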
[1431540048] INFO: [Shinken] [All] Dispatching broker satellite with order: broker-master (spare:False), broker-spare (spare:True),broker:
[1431540048] INFO: [Shinken] [All] Trying to send configuration to broker broker-master
[1431540048] DEBUG: [Shinken] Posting to http://shinken1.dc1.example.com:7772/put_conf: 2019B
[1431540168] ERROR: [Shinken] Failed sending configuration for broker-master: Connexion error to http://shinken1.dc1.example.com:7772/ : Operation timed out after 120001 milliseconds with 0 bytes received
[1431540000] DEBUG: [broker-master] [Livestatus] Request duration 0.0890s
[1431540089] INFO: [broker-master] [Logstore Null] Open LiveStatusLogStoreNull ok
[1431540091] DEBUG: [broker-master] [Livestatus Query Metainfo] I cannot cache this table services
columns ['host_has_been_checked', 'host_name', 'host_state', 'host_scheduled_downtime_depth', 'host_acknowledged', 'has_been_checked', 'state', 'scheduled_downtime_depth', 'acknowledged']
stats_columns []
filter_columns ['host_name']
is_stats False
is_cacheable False
[1431540091] DEBUG: [broker-master] [Livestatus Query Metainfo] ge_contains_filters: []
[1431540091] DEBUG: [broker-master] [Livestatus Query Metainfo] unique_ge_contains_filters: []
[1431540091] DEBUG: [broker-master] [Livestatus Regenerator] Hint is 3
[1431540091] DEBUG: [broker-master] [Livestatus] Request duration 0.0033s
[1431540091] INFO: [broker-master] [Logstore Null] Open LiveStatusLogStoreNull ok
[1431540092] DEBUG: [broker-master] [Livestatus Query Metainfo] I cannot cache this table comments
columns ['author', 'comment', 'entry_time', 'entry_type', 'expires', 'expire_time', 'host_name', 'id', 'persistent', 'service_description', 'source', 'type']
stats_columns []
filter_columns []
is_stats False
is_cacheable False
[1431540092] DEBUG: [broker-master] [Livestatus Query Metainfo] ge_contains_filters: []
[1431540092] DEBUG: [broker-master] [Livestatus Query Metainfo] unique_ge_contains_filters: []
[1431540092] DEBUG: [broker-master] [Livestatus] Request duration 0.0203s
[1431540092] INFO: [broker-master] [Logstore Null] Open LiveStatusLogStoreNull ok
[1431540093] DEBUG: [broker-master] [Livestatus Query Metainfo] I cannot cache this table downtimes
columns ['author', 'comment', 'end_time', 'entry_time', 'fixed', 'host_name', 'id', 'start_time', 'service_description', 'triggered_by', 'duration']
stats_columns []
filter_columns []
is_stats False
is_cacheable False
[1431540093] DEBUG: [broker-master] [Livestatus Query Metainfo] ge_contains_filters: []
[1431540093] DEBUG: [broker-master] [Livestatus Query Metainfo] unique_ge_contains_filters: []
[1431540093] DEBUG: [broker-master] [Livestatus] Request duration 0.0158s
[1431540093] INFO: [broker-master] [Logstore Null] Open LiveStatusLogStoreNull ok
[1431540094] DEBUG: [broker-master] [Livestatus Query Metainfo] I cannot cache this table services
columns []
stats_columns ['description']
filter_columns ['host_name']
is_stats True
is_cacheable False
[1431540094] DEBUG: [broker-master] [Livestatus Query Metainfo] ge_contains_filters: []
[1431540094] DEBUG: [broker-master] [Livestatus Query Metainfo] unique_ge_contains_filters: []
[1431540094] DEBUG: [broker-master] [Livestatus Regenerator] Hint is 3
[1431540094] DEBUG: [broker-master] [Livestatus] Request duration 0.0026s
[1431540094] INFO: [broker-master] [Logstore Null] Open LiveStatusLogStoreNull ok
[1431540095] DEBUG: [broker-master] [Livestatus Query Metainfo] I cannot cache this table services
columns ['accept_passive_checks', 'acknowledged', 'action_url', 'action_url_expanded', 'active_checks_enabled', 'check_command', 'check_interval', 'check_options', 'check_period', 'check_type', 'checks_enabled', 'comments', 'current_attempt', 'current_notification_number', 'description', 'event_handler', 'event_handler_enabled', 'custom_variable_names', 'custom_variable_values', 'execution_time', 'first_notification_delay', 'flap_detection_enabled', 'groups', 'has_been_checked', 'high_flap_threshold', 'host_acknowledged', 'host_action_url_expanded', 'host_active_checks_enabled', 'host_address', 'host_alias', 'host_checks_enabled', 'host_check_type', 'host_latency', 'host_plugin_output', 'host_perf_data', 'host_current_attempt', 'host_check_command', 'host_comments', 'host_groups', 'host_has_been_checked', 'host_icon_image_expanded', 'host_icon_image_alt', 'host_is_executing', 'host_is_flapping', 'host_name', 'host_notes_url_expanded', 'host_notifications_enabled', 'host_scheduled_downtime_depth', 'host_state', 'host_accept_passive_checks', 'icon_image', 'icon_image_alt', 'icon_image_expanded', 'is_executing', 'is_flapping', 'last_check', 'last_notification', 'last_state_change', 'latency', 'long_plugin_output', 'low_flap_threshold', 'max_check_attempts', 'next_check', 'notes', 'notes_expanded', 'notes_url', 'notes_url_expanded', 'notification_interval', 'notification_period', 'notifications_enabled', 'obsess_over_service', 'percent_state_change', 'perf_data', 'plugin_output', 'process_performance_data', 'retry_interval', 'scheduled_downtime_depth', 'state', 'state_type', 'modified_attributes_list', 'last_time_critical', 'last_time_ok', 'last_time_unknown', 'last_time_warning', 'display_name', 'host_display_name', 'host_custom_variable_names', 'host_custom_variable_values', 'in_check_period', 'in_notification_period', 'host_parents', 'is_impact', 'source_problems', 'impacts', 'criticity', 'is_problem', 'poller_tag', 'got_business_rule', 'parent_dependencies']
stats_columns []
filter_columns ['host_name']
is_stats False
is_cacheable False
[1431540095] DEBUG: [broker-master] [Livestatus Query Metainfo] ge_contains_filters: []
[1431540095] DEBUG: [broker-master] [Livestatus Query Metainfo] unique_ge_contains_filters: []
[1431540095] DEBUG: [broker-master] [Livestatus Regenerator] Hint is 3
[1431540095] DEBUG: [broker-master] [Livestatus] Request duration 0.0098s
[1431540095] INFO: [broker-master] [Logstore Null] Open LiveStatusLogStoreNull ok
[1431540096] DEBUG: [broker-master] [Livestatus Query Metainfo] I cannot cache this table status
columns ['accept_passive_host_checks', 'accept_passive_service_checks', 'check_external_commands', 'check_host_freshness', 'check_service_freshness', 'enable_event_handlers', 'enable_flap_detection', 'enable_notifications', 'execute_host_checks', 'execute_service_checks', 'last_command_check', 'last_log_rotation', 'livestatus_version', 'nagios_pid', 'obsess_over_hosts', 'obsess_over_services', 'process_performance_data', 'program_start', 'program_version', 'interval_length']
stats_columns []
filter_columns []
is_stats False
is_cacheable False
[1431540096] DEBUG: [broker-master] [Livestatus Query Metainfo] ge_contains_filters: []
[1431540096] DEBUG: [broker-master] [Livestatus Query Metainfo] unique_ge_contains_filters: []
[1431540096] DEBUG: [broker-master] [Livestatus] Request duration 0.0038s
[1431540096] INFO: [broker-master] [Logstore Null] Open LiveStatusLogStoreNull ok
[1431540098] DEBUG: [broker-master] [Livestatus Query Metainfo] I cannot cache this table contactgroups
columns ['name']
stats_columns []
filter_columns ['members']
is_stats False
is_cacheable False
[1431540098] DEBUG: [broker-master] [Livestatus Query Metainfo] ge_contains_filters: ['members']
[1431540098] DEBUG: [broker-master] [Livestatus Query Metainfo] unique_ge_contains_filters: ['members']
[1431540098] DEBUG: [broker-master] [Livestatus Regenerator] Hint is 0
[1431540098] DEBUG: [broker-master] [Livestatus] Request duration 0.0030s
[1431540173] INFO: [broker-master] [Logstore Null] Open LiveStatusLogStoreNull ok
[root@shinken1 shinken]# /usr/bin/shinken-poller -r -c /etc/shinken/daemons/pollerd.ini
[1431547694] INFO: [Shinken] Stale pidfile exists at invalid literal for int() with base 10: '' (/var/run/shinken/pollerd.pid). Reusing it.
[1431547694] INFO: [Shinken] Opening HTTP socket at http://0.0.0.0:7771
Bottle server starting up (using CherryPyServer(ssl_key='', ssl_cert='', daemon_thread_pool_size=50, ca_cert='', use_ssl=False))...
Listening on http://0.0.0.0:7771/
Use Ctrl-C to quit.
[1431547694] INFO: [Shinken] Initializing a CherryPy backend with 50 threads
Shutting down...
[1431547694] INFO: [Shinken] Shinken 2.2
[1431547694] INFO: [Shinken] Copyright (c) 2009-2014:
[1431547694] INFO: [Shinken] Gabes Jean (napa...@gmail.com)
[1431547694] INFO: [Shinken] Gerhard Lausser, Gerhard...@consol.de
[1431547694] INFO: [Shinken] Gregory Starck, g.st...@gmail.com
[1431547694] INFO: [Shinken] Hartmut Goebel, h.go...@goebel-consult.de
[1431547694] INFO: [Shinken] License: AGPL
[1431547694] INFO: [Shinken] Stale pidfile exists at invalid literal for int() with base 10: '' (/var/run/shinken/pollerd.pid). Reusing it.
[1431547694] INFO: [Shinken] Opening HTTP socket at http://0.0.0.0:7771
[1431547694] INFO: [Shinken] Initializing a CherryPy backend with 50 threads
[1431547694] INFO: [Shinken] Using the local log file '/var/log/shinken/pollerd.log'
[1431547694] INFO: [Shinken] Starting HTTP daemon
List to register :[('__init__', <bound method IForArbiter.__init__ of <shinken.satellite.IForArbiter object at 0x26ff450>>), ('api', <bound method IForArbiter.api of <shinken.satellite.IForArbiter object at 0x26ff450>>), ('api_full', <bound method IForArbiter.api_full of <shinken.satellite.IForArbiter object at 0x26ff450>>), ('get_external_commands', <bound method IForArbiter.get_external_commands of <shinken.satellite.IForArbiter object at 0x26ff450>>), ('get_log_level', <bound method IForArbiter.get_log_level of <shinken.satellite.IForArbiter object at 0x26ff450>>), ('get_running_id', <bound method IForArbiter.get_running_id of <shinken.satellite.IForArbiter object at 0x26ff450>>), ('get_start_time', <bound method IForArbiter.get_start_time of <shinken.satellite.IForArbiter object at 0x26ff450>>), ('got_conf', <bound method IForArbiter.got_conf of <shinken.satellite.IForArbiter object at 0x26ff450>>), ('have_conf', <bound method IForArbiter.have_conf of <shinken.satellite.IForArbiter object at 0x26ff450>>), ('ping', <bound method IForArbiter.ping of <shinken.satellite.IForArbiter object at 0x26ff450>>), ('push_broks', <bound method IForArbiter.push_broks of <shinken.satellite.IForArbiter object at 0x26ff450>>), ('push_host_names', <bound method IForArbiter.push_host_names of <shinken.satellite.IForArbiter object at 0x26ff450>>), ('put_conf', <bound method IForArbiter.put_conf of <shinken.satellite.IForArbiter object at 0x26ff450>>), ('remove_from_conf', <bound method IForArbiter.remove_from_conf of <shinken.satellite.IForArbiter object at 0x26ff450>>), ('set_log_level', <bound method IForArbiter.set_log_level of <shinken.satellite.IForArbiter object at 0x26ff450>>), ('wait_new_conf', <bound method IForArbiter.wait_new_conf of <shinken.satellite.IForArbiter object at 0x26ff450>>), ('what_i_managed', <bound method IForArbiter.what_i_managed of <shinken.satellite.IForArbiter object at 0x26ff450>>)]
Registering api [] <shinken.satellite.IForArbiter object at 0x26ff450>
Registering api_full [] <shinken.satellite.IForArbiter object at 0x26ff450>
Registering get_external_commands [] <shinken.satellite.IForArbiter object at 0x26ff450>
Registering get_log_level [] <shinken.satellite.IForArbiter object at 0x26ff450>
Registering get_running_id [] <shinken.satellite.IForArbiter object at 0x26ff450>
Registering get_start_time [] <shinken.satellite.IForArbiter object at 0x26ff450>
Registering got_conf [] <shinken.satellite.IForArbiter object at 0x26ff450>
Registering have_conf [] <shinken.satellite.IForArbiter object at 0x26ff450>
Registering ping [] <shinken.satellite.IForArbiter object at 0x26ff450>
Registering push_broks ['broks'] <shinken.satellite.IForArbiter object at 0x26ff450>
Registering push_host_names ['sched_id', 'hnames'] <shinken.satellite.IForArbiter object at 0x26ff450>
Registering put_conf ['conf'] <shinken.satellite.IForArbiter object at 0x26ff450>
Registering remove_from_conf ['sched_id'] <shinken.satellite.IForArbiter object at 0x26ff450>
Registering set_log_level ['loglevel'] <shinken.satellite.IForArbiter object at 0x26ff450>
Registering wait_new_conf [] <shinken.satellite.IForArbiter object at 0x26ff450>
Registering what_i_managed [] <shinken.satellite.IForArbiter object at 0x26ff450>
picking only bound methods of class and not parents
List to register :[('get_broks', <bound method IBroks.get_broks of <shinken.satellite.IBroks object at 0x26ff590>>)]
Registering get_broks ['bname'] <shinken.satellite.IBroks object at 0x26ff590>
picking only bound methods of class and not parents
List to register :[('get_returns', <bound method ISchedulers.get_returns of <shinken.satellite.ISchedulers object at 0x26ff610>>), ('push_actions', <bound method ISchedulers.push_actions of <shinken.satellite.ISchedulers object at 0x26ff610>>)]
Registering get_returns ['sched_id'] <shinken.satellite.ISchedulers object at 0x26ff610>
Registering push_actions ['actions', 'sched_id'] <shinken.satellite.ISchedulers object at 0x26ff610>
picking only bound methods of class and not parents
List to register :[('get_raw_stats', <bound method IStats.get_raw_stats of <shinken.satellite.IStats object at 0x26ff4d0>>)]
Registering get_raw_stats [] <shinken.satellite.IStats object at 0x26ff4d0>
[1431547694] INFO: [Shinken] Modules directory: /var/lib/shinken/modules
[1431547694] INFO: [Shinken] Modules directory: /var/lib/shinken/modules
[1431547694] INFO: [Shinken] Waiting for initial configuration
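A side note on the "Stale pidfile exists at invalid literal for int() with base 10: ''" line in the output above: the middle part is the text of a Python ValueError, which is what `int()` raises for an empty string. That suggests /var/run/shinken/pollerd.pid existed but was empty when the poller started. The snippet below only reproduces the message. It is not Shinken's own pidfile code, and `read_pid` is a made-up helper.

```python
def read_pid(pidfile_text):
    """Parse the contents of a pidfile; return (pid, error_message)."""
    try:
        return int(pidfile_text.strip()), None
    except ValueError as exc:
        # For empty input the message matches the one embedded in the log line.
        return None, str(exc)


print(read_pid(""))        # empty pidfile
print(read_pid("1234\n"))  # normal pidfile
```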
[daemon]
#-- Global Configuration
#user=shinken ; if not set then by default it's the current user.
#group=shinken ; if not set then by default it's the current group.
# Set to 0 if you want to make this daemon NOT run
daemon_enabled=1
# Larger configurations need more threads (default is 8?)
daemon_thread_pool_size=50
#-- Path Configuration
# The daemon will chdir into the directory workdir when launched
# paths variables values, if not absolute paths, are relative to workdir.
# using default values for following config variables value:
workdir = /var/run/shinken
logdir = /var/log/shinken
pidfile=%(workdir)s/pollerd.pid
#-- Network configuration
# host=0.0.0.0
# port=7771
# http_backend=auto
# idontcareaboutsecurity=0
#-- SSL configuration --
use_ssl=0
# WARNING : Put full paths for certs
#ca_cert=/etc/shinken/certs/ca.pem
#server_cert=/etc/shinken/certs/server.cert
#server_key=/etc/shinken/certs/server.key
#hard_ssl_name_check=0
#-- Local log management --
# Enabled by default to ease troubleshooting
use_local_log=1
local_log=%(logdir)s/pollerd.log
# accepted log level values= DEBUG,INFO,WARNING,ERROR,CRITICAL
log_level=INFO
#log_level=DEBUG

And here's the poller.cfg file:
#===============================================================================
# POLLER (S1_Poller)
#===============================================================================
# Description: The poller is responsible for:
# - Active data acquisition
# - Local passive data acquisition
# https://shinken.readthedocs.org/en/latest/08_configobjects/poller.html
#===============================================================================
define poller {
poller_name poller-1
address shinken1.dc1.example.com
port 7771
## Optional
spare 0 ; 1 = is a spare, 0 = is not a spare
manage_sub_realms 0 ; Does it take jobs from schedulers of sub-Realms?
min_workers 0 ; Starts with N processes (0 = 1 per CPU)
max_workers 0 ; No more than N processes (0 = 1 per CPU)
processes_by_worker 256 ; Each worker manages N checks
polling_interval 1 ; Get jobs from schedulers each N seconds
timeout 3 ; Ping timeout
data_timeout 120 ; Data send timeout
max_check_attempts 3 ; If ping fails N or more, then the node is dead
check_interval 60 ; Ping node every N seconds
## Interesting modules that can be used:
# - booster-nrpe = Replaces the check_nrpe binary. Therefore it
# enhances performances when there are lot of NRPE
# calls.
# - named-pipe = Allow the poller to read a nagios.cmd named pipe.
# This permits the use of distributed check_mk checks
# should you desire it.
# - SnmpBooster = Snmp bulk polling module
modules named-pipe, booster-nrpe
## Advanced Features
#passive 0 ; For DMZ monitoring, set to 1 so the connections
; will be from scheduler -> poller.
# Poller tags are the tag that the poller will manage. Use None as tag name to manage
# untaggued checks
#poller_tags None
# Enable https or not
use_ssl 0
# enable certificate/hostname check, will avoid man in the middle attacks
hard_ssl_name_check 0
realm All
}
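The `timeout 3` and `data_timeout 120` values in the poller definition above appear to line up with the "timed out after 3001 milliseconds" and "120001 milliseconds" errors in the arbiter log, although that mapping is my reading of the logs rather than something confirmed here. A small sketch for pulling those values out of a `define` block so they can be compared with the log; `parse_define_block` is a hypothetical helper, not part of Shinken, and it handles only a single flat block.

```python
def parse_define_block(text):
    """Parse one 'define <type> { ... }' block into (type, {key: value})."""
    obj_type = None
    settings = {}
    for raw in text.splitlines():
        line = raw.split(";", 1)[0].strip()  # drop trailing ';' comments
        if not line or line.startswith("#"):
            continue
        if line.startswith("define"):
            obj_type = line[len("define"):].replace("{", "").strip()
            continue
        if line == "}":
            break
        parts = line.split(None, 1)
        settings[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return obj_type, settings


# Shortened copy of the definition above.
SAMPLE = """
define poller {
poller_name poller-1
timeout 3 ; Ping timeout
data_timeout 120 ; Data send timeout
}
"""

kind, poller = parse_define_block(SAMPLE)
# The config values are in seconds; the arbiter errors report milliseconds.
print(kind, poller["poller_name"],
      int(poller["timeout"]) * 1000, int(poller["data_timeout"]) * 1000)
```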
Here's another example of what I'm seeing -- In the arbiter log I'll see something like this:
[1431641122] INFO: [Shinken] [All] Trying to send configuration to poller poller-1
[1431641242] ERROR: [Shinken] Failed sending configuration for poller-1: Connexion error to http://shinken1.dc1.example.com:7771/ : Operation timed out after 120001 milliseconds with 0 bytes received
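The bracketed numbers are Unix epoch seconds, so the gap between the attempt and the failure can be computed directly. For the two lines above it is 120 seconds, the same as the timeout reported in the error. A minimal sketch for converting and subtracting the timestamps follows; the helper names are made up, and the error message is shortened in the example.

```python
import re
from datetime import datetime, timezone

# Matches lines of the form "[epoch] LEVEL: message".
LOG_LINE = re.compile(r"^\[(\d+)\]\s+(\w+):\s+(.*)$")


def parse_line(line):
    """Split a log line into (epoch_seconds, level, message), or None."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    return int(match.group(1)), match.group(2), match.group(3)


def to_utc(epoch):
    """Format epoch seconds as a readable UTC timestamp."""
    moment = datetime.fromtimestamp(epoch, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S UTC")


attempt = parse_line(
    "[1431641122] INFO: [Shinken] [All] Trying to send configuration to poller poller-1")
failure = parse_line(
    "[1431641242] ERROR: [Shinken] Failed sending configuration for poller-1")
print(to_utc(attempt[0]), "->", to_utc(failure[0]))
print("elapsed seconds:", failure[0] - attempt[0])
```

Continuation lines without a timestamp, such as "Error: []", return None from `parse_line` and can be skipped.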