Nagios service not restarting when removing a host from the database

John Santana

Aug 5, 2013, 10:51:07 AM
to puppet...@googlegroups.com
Nagios is restarted every time a host or service is added, but never when removing hosts or services.

The client resources:

@@nagios_host { "$::fqdn":
    ensure    => 'present',
    alias     => "$::hostname",
    address   => "$::ipaddress",
    use       => 'linux-server',
}

@@nagios_service { "check_ssh_${::hostname}":
    check_command       => 'check_ssh',
    use                 => 'generic-service',
    host_name           => "$::fqdn",
    service_description => 'SSH',
}

The nagios server resources:

service { 'nagios':
  ensure    => 'running',
  hasstatus => true,
  enable    => true,
}

resources { [ "nagios_host", "nagios_service" ]:
  purge => true,
}

Nagios_host     <<||>> { notify  => Service['nagios'] }
Nagios_service  <<||>> { notify  => Service['nagios'] }

Based on variations I have seen out there I have also tried the following:

- Have the service subscribe to File['/etc/nagios'] with checksum => mtime
- Added before => File['/etc/nagios'] to the Nagios_host and Nagios_service collections
- Tried checksum => mtime on the /etc/nagios/nagios_*.cfg file resources

When I add a host and then run puppet agent --test on the nagios server I see this:
notice: /Stage[main]/Nagios::Monitor/Nagios_host[test1.tld]/ensure: created
notice: /Stage[main]/Nagios::Monitor/Nagios_service[check_ssh_test1]/ensure: created
notice: /Stage[main]/Nagios::Monitor/Service[nagios]: Triggered 'refresh' from 2 events

When I remove the host from the database via

delete from fact_values where host_id='N';
delete from resources where host_id='N';
delete from hosts where id='N';

The next run of puppet on the nagios server produces:
notice: /Nagios_service[check_ssh_test1]/ensure: removed
notice: /Nagios_host[test1.tld]/ensure: removed

nagios_host.cfg and nagios_service.cfg are properly updated, but the service will not restart.

This is CentOS 6.3 with EPEL puppet-2.6.17 on both the agents and the master. Any ideas?

Gabriel Filion

Aug 5, 2013, 4:22:41 PM
to puppet...@googlegroups.com, John Santana
Hi there,

On 05/08/13 10:51 AM, John Santana wrote:
> When I remove the host from the database via
>
> delete from fact_values where host_id='N';
> delete from resources where host_id='N';
> delete from hosts where id='N';

If you remove the host's exported resources from the manifests and the DB,
then the nagios server is no longer collecting anything about them:
that's why the service doesn't get notified.

You need to export the resources with ensure => absent, run puppet on
the host, and then run it on the nagios server so that everything works out.
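As a rough sketch of that workflow (the class name and the SSH check are just examples, not something you posted), you'd apply something like this to a node before retiring it, run the agent there once, then run it on the nagios server:

```puppet
# Sketch only: apply to a node being decommissioned. The agent run on the
# node exports these 'absent' resources; the next run on the nagios server
# collects them, removes the entries, and fires the notify on the service.
class nagios::decommission {
  @@nagios_host { $::fqdn:
    ensure => absent,
  }

  @@nagios_service { "check_ssh_${::hostname}":
    ensure => absent,
  }
}
```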


However, in your example you don't seem to be redefining the "target"
parameter when collecting, so you might consider using purge => true to
achieve what you want with the workflow you described above (i.e. without
needing to export with ensure => absent).

--
Gabriel Filion


puppe...@downhomelinux.com

Aug 5, 2013, 4:33:11 PM
to Gabriel Filion, puppet...@googlegroups.com
On Mon, Aug 05, 2013 at 04:22:41PM -0400, Gabriel Filion wrote:
>
> you need to export the resource with ensure => absent and run puppet on
> the host, then on the nagios server so that everything runs fine.

Dozens of VMs are routinely destroyed every week, in an automated
fashion based on load. The nagios_*.cfg files are updated
automatically, so why is the notify not triggering?

> however in your example, you seem not to be redefining the "target" when
> collecting, so you might consider using purge => true. to achieve what
> you want with the workflow you mentioned above (e.g. without the need to
> export with ensure => absent)

I am purging, unless you are referring to a different resource stanza.
From my OP:

resources { [ "nagios_host", "nagios_service" ]:
  purge => true,
}

Gabriel Filion

Aug 5, 2013, 6:35:14 PM
to puppe...@downhomelinux.com, puppet...@googlegroups.com
Ah, so that's why the config is updated automatically, then.

I can't use purging in my environment because of the annoying limitation
with the "target" argument, so the best I can do for now is pull one
suggestion out of my hat: if you add the notify here, maybe it will catch
removals too?

> resources { [ "nagios_host", "nagios_service" ]:
>   purge  => true,
>   notify => Service['nagios'],
> }

It might complain about the notify lines at the collection points; I'm not
entirely sure. If so, try removing those lines where the resources are
collected.

--
Gabriel Filion


jcbollinger

Aug 6, 2013, 9:20:09 AM
to puppet...@googlegroups.com


On Monday, August 5, 2013 3:33:11 PM UTC-5, John Santana wrote:
> On Mon, Aug 05, 2013 at 04:22:41PM -0400, Gabriel Filion wrote:
> > you need to export the resource with ensure => absent and run puppet on
> > the host, then on the nagios server so that everything runs fine.
>
> Dozens of VMs are routinely destroyed on a weekly basis and in an
> automated fashion based on load. The nagios_*.cfg files are
> automatically changed, why is the notify not triggering?



Because there is a difference between unmanaged resources and resources that are managed 'absent'.  Only resources that are actually declared for the target node are managed, whether declared directly or via collection of virtual or exported resources.  Declaring that unmanaged resources of a given type should be purged does not make the affected resources managed.

You declare that Service['nagios'] must be notified whenever a managed Nagios_host or Nagios_service is modified (including going from absent to present, or vice versa), but that is not what happens when you remove the [exported] resource declaration altogether.
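To make the distinction concrete (an illustration, not code from the original posts; the hostname is made up):

```puppet
# A resource managed 'absent' IS in the catalog, so its removal is an
# event that relationships such as notify can react to:
nagios_host { 'retired.example.com':
  ensure => absent,
  notify => Service['nagios'],
}

# By contrast, a host entry deleted by
#   resources { 'nagios_host': purge => true }
# never appears in the catalog at all; it is swept away as an unmanaged
# leftover, so there is no resource for a notify relationship to attach to.
```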

 
> > however in your example, you seem not to be redefining the "target" when
> > collecting, so you might consider using purge => true. to achieve what
> > you want with the workflow you mentioned above (e.g. without the need to
> > export with ensure => absent)
>
> I am purging unless you are referring to a different resource stanza.
> From my OP:
>
> resources { [ "nagios_host", "nagios_service" ]:
>   purge => true,
> }




By definition, that causes unmanaged resources of the specified types to be removed from the system.  Because they are unmanaged, their removal does not cause nagios to be notified.

You could try having that Resources resource notify Service['nagios'] itself:


resources { [ "nagios_host", "nagios_service" ]:
  purge => true,
  notify => Service['nagios']
}

I don't actually know whether that would work, but it seems right.  Be sure to check both that it causes nagios to be notified when hosts or services are purged, and that it doesn't do so otherwise.  And please let us know what happens. I'm very curious.


John

puppe...@downhomelinux.com

Aug 6, 2013, 12:14:12 PM
to jcbollinger, puppet...@googlegroups.com
On Tue, Aug 06, 2013 at 06:20:09AM -0700, jcbollinger wrote:
>
> Because there is a difference between unmanaged resources and resources
> that are managed 'absent'. Only resources that are actually declared for
> the target node are managed, whether declared directly or via collection of
> virtual or exported resources. Declaring that unmanaged resources of a
> given type should be purged does not make the affected resources managed.

Your explanation just solved my problem. For the record, using notify on
the resources purge does not work, as attested by several Google searches
(and confirmed here), producing errors like:

warning: /Nagios_service[check_ssh_test1]: Service[nagios] still depends on me -- not purging

However, provided you do the following:

1) Name your services "servicename_${::hostname}"
2) Name your hosts "$::fqdn"
3) Make sure every @@nagios_* resource has ensure => 'present'

then the following does work to both remove the resources from the
nagios_*.cfg files __and__ restart nagios:

Get the resource ids for the node:

select id, title from resources where title like '%_<yourhostname>' or title = '<yourfqdn>';

Then flip present to absent for all of that node's service checks and its host check (the ids below are examples; use the ids returned by the query above):

update param_values set value = 'absent' where value = 'present' and (resource_id = 1 or resource_id = 2);

Since there may be a use case for turning off a single service instead of
blowing away all of a node's service and host checks, I'll probably
populate a checklist form. Granted, that will require that the host not
redefine the resource again (a different problem).

Anyway, thanks so much for the unmanaged vs managed comment. That was
totally slipping by me while I pounded my head into my desk trying to
figure out why the files were changing but the service wasn't being
notified.