I've been trying to solve this issue for a long time now, but troubleshooting it is rather complicated since I'm something of a novice with this setup. At our company we have a Nagios XI (Core 4.2.4) instance offloaded by a Gearman server, which is accessed through port forwarding by workers at remote sites, each processing a single hostgroup queue in its own local network.
The problem starts after around 3 to 4 hours of processing: "Jobs Waiting" begin to pile up in the check_results queue, and this value keeps growing by the minute without ever going down. Meanwhile, the Nagios XI services and hosts stop being processed completely and indefinitely until we restart the gearmand and nagios services. We tried installing the latest gearmand server version provided by https://assets.nagios.com/downloads/nagiosxi/docs/Integrating_Mod_Gearman_with_Nagios_XI.pdf as well as the one from the Consol Labs repositories, but nothing seems to change this behaviour.
While the jobs are stuck, we've found that a large number of CLOSE_WAIT connections show up for each of the workers in the output of "netstat -anp | grep 4730". Our setup consists of around 1050 services, of which an average of 800 are handled by a total of 15 workers. Could you please shed some light on what's going on?
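In case it helps anyone reproduce the count, here is a minimal sketch of how the netstat output could be tallied per remote address and TCP state. It assumes the usual Linux column layout of "netstat -anp" (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State, PID/Program); the function name count_states and the sample lines are made up for illustration and are not taken from our server.

```python
from collections import Counter

GEARMAN_PORT = "4730"  # port used in the netstat command above


def count_states(netstat_output, port=GEARMAN_PORT):
    """Count TCP connections on local `port`, keyed by (remote_ip, state)."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed layout: Proto Recv-Q Send-Q Local Foreign State PID/Program
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, foreign, state = fields[3], fields[4], fields[5]
        # Keep only sockets whose local port is the gearmand port.
        if local.rsplit(":", 1)[-1] != port:
            continue
        remote_ip = foreign.rsplit(":", 1)[0]
        counts[(remote_ip, state)] += 1
    return counts


# Hypothetical sample lines for illustration only, not real captured output.
SAMPLE = """\
tcp        0      0 0.0.0.0:4730        0.0.0.0:*           LISTEN      1234/gearmand
tcp        1      0 10.0.0.1:4730       10.0.1.5:51000      CLOSE_WAIT  1234/gearmand
tcp        1      0 10.0.0.1:4730       10.0.1.5:51002      CLOSE_WAIT  1234/gearmand
tcp        0      0 10.0.0.1:4730       10.0.2.7:40210      ESTABLISHED 1234/gearmand
"""

if __name__ == "__main__":
    for (ip, state), n in sorted(count_states(SAMPLE).items()):
        print(f"{ip:15} {state:12} {n}")
```

On the live server the same function could be fed the text returned by subprocess.run(["netstat", "-anp"], capture_output=True, text=True).stdout. Note that with port forwarding the workers may all appear under the forwarder's address, in which case the per-address grouping would collapse into a single row.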
Thank you very much for your attention and time!
Ramiro Fróes Ferrão