I have a Nagios XI server monitoring 648 hosts and 23490 services with the help of one Gearman worker running on another server. Everything worked as expected for weeks,
but a few days ago I started getting orphaned service and host check messages in the logs.
I have gone through a lot of threads on this forum and tried different things, but I can't narrow down or fix the issue.
What I have noticed is that whenever I restart Nagios, the worker picks up a lot of work and executes it, but after a few minutes it stops doing anything. From then on, the worker log shows only the following:
[2013-07-28 16:05:38][6006][DEBUG] child started with pid: 6006
[2013-07-28 16:05:38][6007][DEBUG] child started with pid: 6007
[2013-07-28 16:05:38][6008][DEBUG] child started with pid: 6008
[2013-07-28 16:05:38][6009][DEBUG] child started with pid: 6009
[2013-07-28 16:05:38][6005][DEBUG] child started with pid: 6005
[2013-07-28 16:05:39][6010][DEBUG] child started with pid: 6010
[2013-07-28 16:05:40][6011][DEBUG] child started with pid: 6011
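In case it helps anyone reproduce this, here is a rough sketch of how the queue state could be checked when the worker goes quiet. It assumes gearmand answers the plain-text admin command "status" on port 4730 with one tab-separated line per queue (name, total jobs, running jobs, available workers) followed by a line containing a single dot. The function names are my own, and I have not run this against my setup.

```python
import socket


def parse_status(text):
    """Parse gearmand 'status' output into {queue: (total, running, workers)}."""
    queues = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line == ".":
            continue  # skip blank lines and the terminating dot
        parts = line.split("\t")
        if len(parts) != 4:
            continue  # not a queue line in the assumed format
        name, total, running, workers = parts
        queues[name] = (int(total), int(running), int(workers))
    return queues


def fetch_status(host="localhost", port=4730, timeout=5.0):
    """Send 'status' to gearmand and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"status\n")
        data = b""
        # The reply is assumed to end with a line holding only a dot.
        while not data.endswith(b".\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    return data.decode("ascii", errors="replace")


def stalled_queues(queues):
    """Return queues that have waiting jobs but no registered workers."""
    return sorted(
        name
        for name, (total, running, workers) in queues.items()
        if total > running and workers == 0
    )
```

Running stalled_queues(parse_status(fetch_status())) on the Nagios server right after the worker stops should show whether jobs are still queued with no worker registered for them, or whether the queues are simply empty.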
The following are my configuration files. I have made a lot of changes while trying to fix the issue, and some of them may have made matters worse.
/etc/mod_gearman/mod_gearman_neb.conf:
debug=1
logfile=/var/log/mod_gearman/mod_gearman_neb.log
server=localhost:4730
eventhandler=yes
services=yes
do_hostchecks=no
encryption=yes
key=**********
use_uniq_jobs=on
localhostgroups=localhost_group
localservicegroups=
result_workers=1
perfdata=no
perfdata_mode=1
orphan_host_checks=yes
orphan_service_checks=yes
accept_clear_results=no
mod_gearman_worker.conf:
debug=1
logfile=/var/log/mod_gearman/mod_gearman_worker.log
eventhandler=yes
services=yes
hosts=yes
do_hostchecks=yes
encryption=yes
key=**********
job_timeout=60
min-worker=20
max-worker=30
idle-timeout=20
max-jobs=1000
spawn-rate=1
fork_on_exec=no
show_error_output=yes
enable_embedded_perl=on
use_embedded_perl_implicitly=off
use_perl_cache=on
workaround_rc_25=off
I tried changing the values of the following settings without any luck:
min-worker=20
max-worker=30
idle-timeout=20
use_uniq_jobs=on
do_hostchecks=no
Can anyone spot anything obvious?
Thanks.