Make UNKNOWN state of (service check orphaned, is the mod-gearman worker on queue 'service' running?)

Björn Frostberg

Sep 23, 2012, 4:55:39 PM
to mod_g...@googlegroups.com
I suppose it's Nagios that decides orphaned service checks enter the CRITICAL state? Is there a way to make them UNKNOWN instead? I just want to avoid a flood of notifications going out if this happens again.

I was trying to tune the number of gearman workers, which resulted in service checks being orphaned.

Sven Nierlein

Sep 24, 2012, 11:46:02 AM
to mod_g...@googlegroups.com
On 23.09.2012 22:55, Björn Frostberg wrote:
> I suppose it's Nagios that decides orphaned service checks enter the CRITICAL state? Is there a way to make them UNKNOWN instead? I just want to avoid a flood of notifications going out if this happens again.
>
> I was trying to tune the number of gearman workers, which resulted in service checks being orphaned.

No, it's Mod-Gearman which sets and sends this state and message. Nagios knows nothing about the gearman queues being used. But there is currently
no switch to set that state.

Did the automatic worker leveling not work?

Sven

Björn Frostberg

Sep 24, 2012, 12:38:37 PM
to mod_g...@googlegroups.com
Hi Sven,

Not sure what happened. I had issues with high service queues and increased the workers to 300, and maybe I restarted things in the wrong order or something. As a side note, the high service queues seem to be caused by Nagios scheduling service checks quite unevenly.

Anyway, it would be great if orphaned checks could be configured to return the "UNKNOWN" state. I wasn't very popular given the number of notifications that went out, and I would like to make sure it doesn't happen again, for whatever reason.
  
Regards, Bjorn

Björn Frostberg

Sep 26, 2012, 6:22:57 AM
to mod_g...@googlegroups.com
On Monday, September 24, 2012 5:46:03 PM UTC+2, Sven Nierlein wrote:
Would these settings have any effect on this? Meaning, if they are set to "no", do I still risk getting the CRITICAL state for orphaned services back in Nagios?

# The Mod-Gearman NEB module will submit a fake result for orphaned host
# checks with a message saying there is no worker running for this
# queue. Use this option to get better reporting results, otherwise your
# hosts will keep their last state as long as there is no worker
# running.
# Default: yes
orphan_host_checks=yes

# Same like 'orphan_host_checks' but for services.
# Default: yes
orphan_service_checks=yes
 

Sven Nierlein

Sep 28, 2012, 4:27:47 AM
to mod_g...@googlegroups.com
Disabling this totally disables orphan state detection and you would have to do your own checks to see if all hosts/services got a result in time.
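
Doing your own checks, as described above, could look roughly like the sketch below. It assumes the Nagios 3 status.dat layout ("servicestatus { ... }" blocks holding key=value lines). The function names, the sample data and the 900 second threshold are made up for illustration and are not part of Mod-Gearman.

```python
import re

# Matches one "servicestatus { ... }" block from a Nagios status.dat file.
BLOCK_RE = re.compile(r"servicestatus\s*\{(.*?)\n\s*\}", re.DOTALL)


def parse_service_blocks(status_text):
    """Return one dict of key=value pairs per servicestatus block."""
    services = []
    for match in BLOCK_RE.finditer(status_text):
        fields = {}
        for line in match.group(1).splitlines():
            line = line.strip()
            if "=" in line:
                key, value = line.split("=", 1)
                fields[key] = value
        services.append(fields)
    return services


def stale_services(status_text, now, max_age):
    """List (host, service) pairs whose last_check is older than max_age seconds."""
    stale = []
    for svc in parse_service_blocks(status_text):
        last_check = int(svc.get("last_check", "0"))
        if now - last_check > max_age:
            stale.append((svc.get("host_name"), svc.get("service_description")))
    return stale


# Hypothetical sample data, not taken from a real system.
SAMPLE = """
servicestatus {
\thost_name=hostA
\tservice_description=CPU Usage
\tlast_check=1000
\t}

servicestatus {
\thost_name=hostB
\tservice_description=Disk
\tlast_check=1900
\t}
"""

print(stale_services(SAMPLE, now=2000, max_age=900))
```

In real use the text would come from the status file configured in nagios.cfg and "now" from time.time(). A block without a last_check line is treated as never checked and is reported as stale too.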

cosmin nistor

Feb 28, 2013, 2:58:44 AM
to mod_g...@googlegroups.com
Hi Sven,
I am implementing mod_gearman and really liked how the orphan_service_checks option sounded, but unfortunately it does not work in my environment. I am still in a testing phase, so I wanted to test just this feature.
I have 4 servicegroups: one servicegroup (hw) is passed directly to nagios, one servicegroup (app) is passed to a worker (which actually runs on the same machine as the nagios server), and 2 servicegroups are passed to a worker that I've intentionally left stopped. The jobs for that unavailable worker are kept by gearmand in a waiting queue and do not time out, so the status in nagios is pending. I have left this for more than 12 hours and the status in nagios is still not updated. Have I got it wrong? Isn't this how it's supposed to work?
 
Thank you,
Cosmin.

Sven Nierlein

Feb 28, 2013, 3:45:54 AM
to mod_g...@googlegroups.com
Orphaned checks are more of a nagios feature itself. Do you have orphaned hosts/services enabled in your nagios.cfg and the mod-gearman module config?

Sven

cosmin nistor

Feb 28, 2013, 4:15:20 AM
to mod_g...@googlegroups.com
I forgot to mention that gearmand and libgearman are from the package gearmand-0.25-1.rhel5.x86_64.rpm and mod_gearman is version 1.4.2.

cosmin nistor

Feb 28, 2013, 4:36:40 AM
to mod_g...@googlegroups.com
Yes, they are enabled:
# cat nagios.cfg  | grep -Ev '^#|^$' | egrep -i 'orph|timeout'
service_check_timeout=15
host_check_timeout=15
event_handler_timeout=30
notification_timeout=30
ocsp_timeout=5
perfdata_timeout=5
check_for_orphaned_services=1
check_for_orphaned_hosts=1
service_check_timeout_state=u
 
 
# cat mod_gearman_neb.conf  | egrep -v '^#|^$'
debug=3
logfile=/opt/mod_gearman/var/mod_gearman/mod_gearman_neb.log
server=localhost:4730
eventhandler=yes
services=yes
hosts=yes
servicegroups=hw,os,db,app
do_hostchecks=yes
encryption=no
key=nagios
use_uniq_jobs=on
localhostgroups=
localservicegroups=hw
result_workers=5
perfdata=no
perfdata_mode=1
orphan_host_checks=yes
orphan_service_checks=yes
accept_clear_results=yes
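
The same comparison of the two files can be scripted. The sketch below is only an illustration: the helper names are hypothetical, and it only recognises the literal values "1" and "yes" even though the real programs may accept other spellings. It checks that orphan detection is switched on both in nagios.cfg and in the NEB module config.

```python
def parse_config(text):
    """Parse simple key=value config text, skipping comments and blank lines."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def orphan_detection_problems(nagios_cfg, neb_cfg):
    """Return a list of problems; an empty list means both sides are enabled."""
    nagios = parse_config(nagios_cfg)
    neb = parse_config(neb_cfg)
    problems = []
    for key in ("check_for_orphaned_services", "check_for_orphaned_hosts"):
        # This sketch requires the Nagios options to be set explicitly to 1.
        if nagios.get(key) != "1":
            problems.append("nagios.cfg: %s is not 1" % key)
    for key in ("orphan_service_checks", "orphan_host_checks"):
        # The NEB config comments give "yes" as the default, so missing counts as yes.
        if neb.get(key, "yes") != "yes":
            problems.append("mod_gearman_neb.conf: %s is not yes" % key)
    return problems


# Values copied from the two config excerpts above.
NAGIOS_CFG = """
check_for_orphaned_services=1
check_for_orphaned_hosts=1
"""

NEB_CFG = """
orphan_host_checks=yes
orphan_service_checks=yes
"""

print(orphan_detection_problems(NAGIOS_CFG, NEB_CFG))
```

For the settings posted here the list comes back empty, so at least on paper both sides have orphan handling enabled.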

cosmin nistor

Feb 28, 2013, 4:43:59 AM
to mod_g...@googlegroups.com
This is what gearman_top looks like:
 
2013-02-28 09:32:16  -  localhost:4730   -  v0.25
 Queue Name      | Worker Available | Jobs Waiting | Jobs Running
------------------------------------------------------------------
 check_results   |               5  |           0  |           0
 eventhandler    |               5  |           0  |           0
 host            |               5  |           0  |           0
 service         |               5  |           0  |           0
 servicegroup_os |               0  |         100  |           0
 worker_cn_x28   |               1  |           0  |           0
------------------------------------------------------------------
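
A queue like servicegroup_os above (jobs waiting, no worker) can also be spotted from a script. The sketch below assumes the gearmand text admin protocol, where the "status" command replies with one tab-separated line per queue (name, total jobs, running jobs, available workers) followed by a line containing a single dot. The function names are hypothetical and the sample reply is built by hand from the table above rather than captured from a server.

```python
import socket


def fetch_status(host="localhost", port=4730, timeout=5.0):
    """Send the 'status' admin command to gearmand and return the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(b"status\n")
        data = b""
        # The reply ends with a line containing a single dot.
        while not (data == b".\n" or data.endswith(b"\n.\n")):
            chunk = conn.recv(4096)
            if not chunk:
                break
            data += chunk
    return data.decode("utf-8", "replace")


def parse_status(reply):
    """Parse status lines into (queue, waiting, running, workers) tuples."""
    rows = []
    for line in reply.splitlines():
        parts = line.split("\t")
        if len(parts) != 4:
            continue  # skips the terminating "." and anything unexpected
        name, total, running, workers = parts
        # gearmand reports the total number of jobs; waiting = total - running.
        rows.append((name, int(total) - int(running), int(running), int(workers)))
    return rows


def unserved_queues(rows):
    """Return queues that have waiting jobs but no worker attached."""
    return [name for name, waiting, _running, workers in rows
            if waiting > 0 and workers == 0]


# Reply text built by hand from the gearman_top table above.
SAMPLE_REPLY = (
    "check_results\t0\t0\t5\n"
    "service\t0\t0\t5\n"
    "servicegroup_os\t100\t0\t0\n"
    "worker_cn_x28\t0\t0\t1\n"
    ".\n"
)

print(unserved_queues(parse_status(SAMPLE_REPLY)))
```

Against a live server, parse_status(fetch_status()) would replace the sample reply. The example does not call fetch_status itself, so it runs without a gearmand instance.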

cosmin nistor

Feb 28, 2013, 8:19:42 AM
to mod_g...@googlegroups.com
I ran a test with use_uniq_jobs=off, still no result in nagios. Services have been kept in the Pending state for more than 3 hours.
I will attach part of the logs:
 
=== mod_gearman_neb.log ===
 
[2013-02-28 12:56:11][21856][TRACE] handle_svc_check(13, data)
[2013-02-28 12:56:11][21856][TRACE] service is member of servicegroup: os
[2013-02-28 12:56:11][21856][DEBUG] received job for queue servicegroup_os: coni1 - CPU Usage
[2013-02-28 12:56:11][21856][DEBUG] service: 'coni1' - 'CPU Usage', next_check is at 2013-02-28 12:56:07, latency so far: 4
[2013-02-28 12:56:11][21856][TRACE] cmd_line: /opt/anritsu/selfmon/nagios/libexec/check_nrpe -t 120 -H 172.28.45.28 -c check_cpu -a 90 95
[2013-02-28 12:56:11][21856][TRACE] add_job_to_queue(servicegroup_os, (null), 3, 1, 2, 1)
[2013-02-28 12:56:11][21856][TRACE] 279 --->type=service
result_queue=check_results
host_name=coni1
service_description=CPU Usage
start_time=1362056167.0
next_check=1362056167.0
core_time=1362056171.48620
timeout=15
command_line=/opt/anritsu/selfmon/nagios/libexec/check_nrpe -t 120 -H 172.28.45.28 -c check_cpu -a 90 95

<---
[2013-02-28 12:56:11][21856][TRACE] 372 +++>
dHlwZT1zZXJ2aWNlCnJlc3VsdF9xdWV1ZT1jaGVja19yZXN1bHRzCmhvc3RfbmFtZT1jb25pMQpzZXJ2aWNlX2Rlc2NyaXB0aW9uPUNQVSBVc2FnZQpzdGFydF90aW1lPTEzNjIwNTYxNjcuMApuZXh0X2NoZWNrPTEzNjIwNTYxNjcuMApjb3JlX3RpbWU9MTM2MjA1NjE3MS40ODYyMAp0aW1lb3V0PTE1CmNvbW1hbmRfbGluZT0vb3B0L2Fucml0c3Uvc2VsZm1vbi9uYWdpb3MvbGliZXhlYy9jaGVja19ucnBlIC10IDEyMCAtSCAxNzIuMjguNDUuMjggLWMgY2hlY2tfY3B1IC1hIDkwIDk1CgoK
<+++
[2013-02-28 12:56:11][21856][TRACE] add_job_to_queue() finished successfully: 0 0
[2013-02-28 12:56:11][21856][TRACE] handle_svc_check() finished successfully
[2013-02-28 12:56:11][21856][TRACE] handle_svc_check() finished successfully -> 206
 
 
=== gearmand.log ===
full of messages like:
 
ERROR [    10 ] lost connection to client recv(peer has closed connection) 127.0.0.1:34855 -> libgearman-server/io.cc:607
 
 
 
=== nagios.log ===
 
[1362056401] Warning: The check of service 'Check NTP Sync' on host 'coni5' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the service...
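
With encryption=no, the "+++>" block in the mod_gearman_neb.log excerpt above appears to be just the base64 form of the plain-text job shown between "--->" and "<---". A small sketch to decode it and confirm what was actually queued (the helper name is hypothetical, the payload is copied from the log):

```python
import base64

# Payload copied from the mod_gearman_neb.log excerpt above.
PAYLOAD = "dHlwZT1zZXJ2aWNlCnJlc3VsdF9xdWV1ZT1jaGVja19yZXN1bHRzCmhvc3RfbmFtZT1jb25pMQpzZXJ2aWNlX2Rlc2NyaXB0aW9uPUNQVSBVc2FnZQpzdGFydF90aW1lPTEzNjIwNTYxNjcuMApuZXh0X2NoZWNrPTEzNjIwNTYxNjcuMApjb3JlX3RpbWU9MTM2MjA1NjE3MS40ODYyMAp0aW1lb3V0PTE1CmNvbW1hbmRfbGluZT0vb3B0L2Fucml0c3Uvc2VsZm1vbi9uYWdpb3MvbGliZXhlYy9jaGVja19ucnBlIC10IDEyMCAtSCAxNzIuMjguNDUuMjggLWMgY2hlY2tfY3B1IC1hIDkwIDk1CgoK"


def decode_job(payload):
    """Decode a base64 job payload into a dict of its key=value fields."""
    text = base64.b64decode(payload).decode("utf-8")
    fields = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            fields[key] = value
    return fields


job = decode_job(PAYLOAD)
print(job["host_name"], "|", job["service_description"], "|", job["timeout"])
```

This only works for unencrypted payloads. With encryption enabled the decoded bytes would not be readable text.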

cosmin nistor

Feb 28, 2013, 4:27:48 PM
to mod_g...@googlegroups.com
Hi, I still cannot figure this out. I'm sure it is a stupid configuration issue, but I cannot get to the bottom of it. Do you have any idea? Thank you.

cosmin nistor

Mar 1, 2013, 4:14:14 PM
to mod_g...@googlegroups.com
Searching the internet for possible causes of my problem, I came across the suggestion that multiple nagios instances might be running, but this is not the case.

cosmin nistor

Mar 4, 2013, 9:24:09 AM
to mod_g...@googlegroups.com
Hi all, I've upgraded to gearmand v1.1.5, no improvement in this issue though.
I've done several tests. For example, with the neb module disabled and the check command replaced by a "while true, sleep 1" loop, nagios reports that the service check timed out. With the gearman neb module enabled, after a while nagios.log only shows messages that the service is orphaned, but the service keeps the last status that was available, which can be a status from yesterday and so not up to date. It would make sense to mark the service as unknown (worker not available), as specified in the documentation, but in my scenario that does not happen.
So, any suggestion is welcome.
Sorry for the many posts.

Steve Shipway

Mar 12, 2013, 5:03:57 PM
to mod_g...@googlegroups.com
I would also prefer the NEB to use 'unknown' in the event of an orphaned check.

By trial and error, I have discovered that we get more orphaned checks if min-worker and idle-timeout on the worker nodes are set too high.  By setting min-worker to the minimum value I've ever seen for active workers, and idle-timeout to only 20, I've practically eliminated the problem (though I suspect I've only really eliminated the *symptom* :) )

Annoyingly, the only way to fix the issue once it happens is to restart the worker that is holding the orphaned check - as I have no way to know which worker it is, I need to restart all workers...

cosmin nistor

Mar 13, 2013, 5:57:43 PM
to mod_g...@googlegroups.com
Unfortunately, in my system, once I put it all together, that would mean restarting about 50 workers, so this won't work for me. This is why I am trying to make it 'fail proof', and at least not have nagios stuck showing those services as OK when they have long been CRITICAL.

Sven Nierlein

May 6, 2013, 8:07:43 AM
to mod_g...@googlegroups.com
Hi,

could you try the latest version?

Sven

cosmin nistor

May 8, 2013, 7:55:01 AM
to mod_g...@googlegroups.com
Hi Sven,

Sorry, I've only just seen your message. I will try it this weekend and come back with feedback.

Thank you,
Cosmin.

Jelle Smet

May 28, 2013, 8:03:36 AM
to mod_g...@googlegroups.com
+1 for making orphaned checks return UNKNOWN status ... I think it makes sense in this case.

Sven Nierlein

May 28, 2013, 8:45:00 AM
to mod_g...@googlegroups.com
On 5/28/13 14:03, Jelle Smet wrote:
> +1 for making orphaned checks return UNKNOWN status ... I think it makes sense in this case.

I put that on my todo list. I think a new config setting would be best to not change current
behaviour.

Sven

Ankur Vishwakarma

Oct 26, 2016, 6:16:44 AM
to mod_gearman
Hi Cosmin,

In case you are alive :P please help me.

I have configured mod-gearman and am currently in the testing phase.

What is happening is this: I have installed mod-gearman on a Nagios XI server and have 1 worker server at a remote site.

All configuration is done, but the problem is that when I stop the worker at the remote site, it takes a good amount of time, around 30-45 minutes, to be reflected on the Nagios XI monitoring screen, whereas when I start the remote worker process again, within seconds it shows the host as up and OK.

Do you have any idea why there is such a delay in the notification from the worker to the Nagios XI system?

Thanks,
Ankur Vishwakarma