Service & host check orphaned


Mejo

Jul 28, 2013, 6:50:32 AM
to mod_g...@googlegroups.com
I have a Nagios XI server monitoring 648 hosts and 23490 services with the help of one gearman worker running on another server. Everything worked as expected for weeks, but a few days ago I started getting service and host check orphaned messages in the logs.

I have gone through a lot of messages on this forum and tried different things, but I can't seem to narrow down or fix the issue.

What I have noticed is that whenever I restart Nagios, the worker picks up a lot of work and executes it, but after a few minutes it stops doing anything. After that, all I can see in the logs is the following:

[2013-07-28 16:05:38][6006][DEBUG] child started with pid: 6006
[2013-07-28 16:05:38][6007][DEBUG] child started with pid: 6007
[2013-07-28 16:05:38][6008][DEBUG] child started with pid: 6008
[2013-07-28 16:05:38][6009][DEBUG] child started with pid: 6009
[2013-07-28 16:05:38][6005][DEBUG] child started with pid: 6005
[2013-07-28 16:05:39][6010][DEBUG] child started with pid: 6010
[2013-07-28 16:05:40][6011][DEBUG] child started with pid: 6011


The following are my configuration files. I have made a lot of changes to see if I could fix the issue, but that may have made matters worse.

/etc/mod_gearman/mod_gearman_neb.conf:
debug=1
logfile=/var/log/mod_gearman/mod_gearman_neb.log
server=localhost:4730
eventhandler=yes
services=yes
do_hostchecks=no
encryption=yes
key=**********
use_uniq_jobs=on
localhostgroups=localhost_group
localservicegroups=
result_workers=1
perfdata=no
perfdata_mode=1
orphan_host_checks=yes
orphan_service_checks=yes
accept_clear_results=no


mod_gearman_worker.conf:
debug=1
logfile=/var/log/mod_gearman/mod_gearman_worker.log
eventhandler=yes
services=yes
hosts=yes
do_hostchecks=yes
encryption=yes
key=**********
job_timeout=60
min-worker=20
max-worker=30
idle-timeout=20
max-jobs=1000
spawn-rate=1
fork_on_exec=no
show_error_output=yes
enable_embedded_perl=on
use_embedded_perl_implicitly=off
use_perl_cache=on
p1_file=/usr/share/mod_gearman/mod_gearman_p1.pl
workaround_rc_25=off


I tried changing the values of the following without any luck:

min-worker=20
max-worker=30
idle-timeout=20

use_uniq_jobs=on
do_hostchecks=no



Anyone spot anything obvious?

Thanks.

sven

Jul 28, 2013, 7:25:48 AM
to mod_g...@googlegroups.com
Hi,

Which libgearman version and which Mod-Gearman version are you using?

Regards,
 Sven
--
You received this message because you are subscribed to the Google Groups "mod_gearman" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mod_gearman...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Mejo

Jul 28, 2013, 10:17:45 AM
to mod_g...@googlegroups.com


gearmand.x86_64                       0.25-1                          
gearmand-devel.x86_64                 0.25-1                     
gearmand-server.x86_64                0.25-1                     
libgearman.x86_64                     0.14-3.el5                    
mod_gearman.x86_64                    1.3.8-1                     

...on CentOS 5.8.


Thanks.

sven

Jul 28, 2013, 11:12:01 AM
to mod_g...@googlegroups.com
Hi,

I would suggest first updating Mod-Gearman to the latest release (1.4.8 at the moment) and libgearman to 0.25.
Besides that, I have no access to any Nagios XI installation, so the help I can offer is limited.

 Sven

Mejo Jose

Jul 29, 2013, 9:50:46 AM
to mod_g...@googlegroups.com
Hi Sven,

Updating the packages didn't make any difference, so I did a re-install using the Nagios XI script, but I kept getting the same messages.

As a last resort, I stopped Nagios, cleared retention.dat, and started it again, and everything went back to normal! I was under the impression that it was a mod_gearman issue, but if clearing retention.dat worked, it must have had something to do with Nagios itself?

Thanks for your help anyway.

Alessandro Ren

Jul 29, 2013, 9:59:40 AM
to mod_g...@googlegroups.com

    Mejo,

    Next time this happens, try setting use_retained_scheduling_info to 0, restart Nagios, and then set it back to 1.
    I have a Nagios installation with 24k services where I sometimes get orphaned checks as well.
    It seems to me that Nagios gets lost in the scheduling info cache with this many services.
    I am running Nagios 3.2.3.

    []s.
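The toggle described above can be scripted so that it is applied the same way each time. This is a minimal sketch, not an official Nagios tool: `set_option` is a hypothetical helper that edits the text of nagios.cfg, and reading and writing the file and restarting Nagios are left to the caller because paths and service names vary between installations.

```python
import re


def set_option(cfg_text, name, value):
    """Return cfg_text with `name=value` set.

    An existing uncommented `name=...` line is replaced in place;
    if there is none, the option is appended at the end.
    """
    # [ \t]* instead of \s* so the match never swallows preceding blank lines.
    pattern = re.compile(r"^[ \t]*" + re.escape(name) + r"[ \t]*=.*$", re.MULTILINE)
    line = f"{name}={value}"
    if pattern.search(cfg_text):
        return pattern.sub(lambda m: line, cfg_text)
    sep = "" if (not cfg_text or cfg_text.endswith("\n")) else "\n"
    return cfg_text + sep + line + "\n"
```

For example, `set_option(text, "use_retained_scheduling_info", "0")` before the restart and the same call with `"1"` afterwards mirrors the manual procedure.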

Mejo

Jul 30, 2013, 2:19:31 AM
to mod_g...@googlegroups.com
Thanks. I will try this if I run into a similar issue in the future.

Mejo

Jul 30, 2013, 1:36:26 PM
to mod_g...@googlegroups.com
Alessandro,

I had the same issue a few hours ago and tried what you suggested, which seems to have fixed it.

Is there a permanent fix for this?

Thanks.



Alessandro Ren

Jul 30, 2013, 2:03:23 PM
to mod_g...@googlegroups.com

    Mejo,

    I don't know; I don't think so. Which version of Nagios are you using?
    Maybe you could try the nagios-devel mailing list? Apparently, it happens on huge installations.
    Have you tried leaving use_retained_scheduling_info at 0 permanently?

    []s.

Mejo Jose

Jul 30, 2013, 3:17:35 PM
to mod_g...@googlegroups.com
The issue has popped up again so I'm back to square one now. I have use_retained_scheduling_info=0 in nagios.cfg. 

Example message from the log:
Warning: The check of service 'Port 21 Status' on host 'Switch-12045' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the service...

Once this happens, the gearman worker server stops getting any jobs to execute. I haven't made any configuration changes since I fixed this yesterday, so I'm not sure where to look.

I am using Nagios XI 2012R2.2
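To see whether the orphaned checks cluster on particular hosts or are spread evenly, the warnings can be counted per host. This is a rough sketch: the pattern is modelled on the service warning quoted above, and it assumes host-check warnings use the same wording with the service part left out.

```python
import re
from collections import Counter

# Matches "The check of service '<svc>' on host '<host>' looks like it was orphaned"
# and, by assumption, "The check of host '<host>' looks like it was orphaned".
ORPHAN_RE = re.compile(
    r"The check of (?:service '(?P<service>[^']*)' on )?host '(?P<host>[^']*)' "
    r"looks like it was orphaned"
)


def count_orphans(lines):
    """Count orphaned-check warnings per host in an iterable of log lines."""
    per_host = Counter()
    for line in lines:
        match = ORPHAN_RE.search(line)
        if match:
            per_host[match.group("host")] += 1
    return per_host
```

Passing an open nagios.log file object to `count_orphans` and calling `.most_common(10)` on the result shows the worst-affected hosts. An even spread across all hosts suggests a systemic cause rather than a few slow checks.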

Alessandro Ren

Jul 30, 2013, 3:23:40 PM
to mod_g...@googlegroups.com

    Mejo,

    I don't know which Nagios version XI uses. I am using:

    - mod_gearman 1.3.8
    - gearmand-0.23
    - libgearman-0.23

    Everything is running on standard CentOS 5 with four nodes running checks.
    Did you try the option use_uniq_jobs=off in the mod_gearman NEB module?

    []s.

Mejo

Jul 31, 2013, 8:34:03 AM
to mod_g...@googlegroups.com
4 workers to run 24k checks? I'm using a single worker to do that. Is that okay?

Yes, I tried use_uniq_jobs=off, but it doesn't make any difference. It runs okay for a few hours and then the errors start again.
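Whether a single worker host is enough comes down to simple arithmetic. The sketch below uses the 23490 services and max-worker=30 from the first post; the 300-second check interval is an assumption (it was not stated in the thread), and host checks are ignored.

```python
def required_rate(checks, interval_s):
    """Checks per second needed to run every check once per interval."""
    return checks / interval_s


def max_avg_check_time(worker_procs, checks, interval_s):
    """Longest average check duration (seconds) that worker_procs
    parallel worker processes can sustain without falling behind."""
    return worker_procs / required_rate(checks, interval_s)


SERVICES = 23490        # from the first post
MAX_WORKER = 30         # max-worker in mod_gearman_worker.conf
CHECK_INTERVAL_S = 300  # assumption: 5-minute check interval

rate = required_rate(SERVICES, CHECK_INTERVAL_S)
budget = max_avg_check_time(MAX_WORKER, SERVICES, CHECK_INTERVAL_S)
print(f"{rate:.1f} checks/s needed, {budget:.2f} s average per check at most")
```

Under these assumptions the worker has to complete roughly 78 checks per second, so each of the 30 processes can spend well under half a second per check on average. If checks take longer than that, the queue grows and results arrive late.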

Alessandro Ren

Jul 31, 2013, 9:51:02 AM
to mod_g...@googlegroups.com

    Mejo,

    I forgot to mention that my gearmand is compiled against libboost 1.46, which is not the standard CentOS 5 version.
    If you want to try our CentOS 5 gearmand, mod_gearman and libboost, you can download the packages from our repo, or set the repo up on your server and install them with yum.
    All RPMs are vanilla builds from the original sources. If I remember correctly, we had to recompile gearmand with debug enabled, and it would not compile against the original CentOS 5 libboost.

    []s.

[OpMon-extras]
name=OpMon-53- Extras
baseurl=http://repo.opservices.com.br/rpms/opmon53/extras/$basearch/
enabled=1
gpgcheck=0

    Just set up the repo, or access it directly and download what you need: http://repo.opservices.com.br/rpms/opmon53/extras/

    []s.

Sven Nierlein

Jul 31, 2013, 11:03:01 AM
to mod_g...@googlegroups.com
Hi,

Oh nice, you have Mod-Gearman packages for CentOS 4 in your repository too. Do you mind if I link
them on the mod-gearman.org download page?

By the way, there is already a repository for Mod-Gearman, but only for CentOS/Red Hat 5 and 6 (and SUSE, Debian and Ubuntu).
Any reason why you don't use it?

Sven

Alessandro Ren

Jul 31, 2013, 11:04:17 AM
to mod_g...@googlegroups.com

Sven,

All packages in our repo are for CentOS 5.

[]s.

Mejo

Aug 23, 2013, 4:34:53 AM
to mod_g...@googlegroups.com
I don't want to jinx it, but it seems that I have 'fixed' this issue, and it had nothing to do with software.

It looks like I didn't give the gearman or Nagios server enough hardware resources. Both were running on VMs, and the load average on the gearman worker was around 40 all the time. Since the system was responsive, I thought this would be okay. I assumed that without enough hardware resources things would just run slowly, not end up in a mess.

After trying several things to troubleshoot, I set up Nagios XI on a physical machine with much faster disk access and a lot more CPUs and memory, and it has been working well for a couple of weeks now. The gearman worker is still running on a VM, but I made sure the load average is within an acceptable level on both machines.
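A load average only means something relative to the number of CPUs, so a quick check like the following can flag an overloaded worker before checks start going missing. It is a sketch only: the threshold of 1.0 load per CPU is a common rule of thumb rather than anything from Nagios or Mod-Gearman, and the CPU count of 4 in the example is hypothetical because the thread does not say how many the VM had.

```python
import os


def load_per_cpu(load, cpus):
    """Normalize a load average by the number of CPUs."""
    return load / cpus


def is_overloaded(load, cpus, threshold=1.0):
    """True if the load per CPU exceeds the threshold (rule of thumb)."""
    return load_per_cpu(load, cpus) > threshold


def current_load_per_cpu():
    """Five-minute load average of this machine divided by its CPU count (Unix only)."""
    _one_min, five_min, _fifteen_min = os.getloadavg()
    return load_per_cpu(five_min, os.cpu_count() or 1)


# A load average of 40 on a hypothetical 4-CPU VM is ten times the rule-of-thumb limit.
print(is_overloaded(40.0, 4))
```

Running `current_load_per_cpu()` periodically on both the Nagios and worker hosts gives a comparable figure for each machine.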

Thanks for all the responses.
