How to use Prometheus to alert on server reboot events


Yong Zhang

May 2, 2017, 4:53:06 AM
to Prometheus Users
Hi, All

I'm trying to find a way to alert on server reboot events with node_exporter. The "up" and "node_boot_time" metrics seem hard to use for this; any ideas? A server reboot usually completes within one minute.

Ben Kochie

May 2, 2017, 5:09:12 AM
to Yong Zhang, Prometheus Users
You could alert on changes in node_boot_time.

ALERT NodeRebooted
  IF changes(node_boot_time[1h]) > 0

You would also want to combine that with a down alert like this:

ALERT NodeDown
  IF up{job="node"} == 0
  FOR 5m

This way you cover both failure modes: a node that is completely down and a node that rebooted.


--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-users+unsubscribe@googlegroups.com.
To post to this group, send email to prometheus-users@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/61560c30-20d9-497b-b7bf-e8d0286c7012%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Cemalettin Koc

May 2, 2017, 6:50:44 PM
to Prometheus Users, hisca...@gmail.com
Why did you use 1h and not 5m? Is there a particular reason to choose a long window?



Julius Volz

May 2, 2017, 6:53:20 PM
to Cemalettin Koc, Prometheus Users, Yong Zhang
Prometheus evaluates range expressions over a sliding window, so if Prometheus is interrupted for as little as 5 minutes, an alert on foo[5m] would miss its firing condition. And in general, it would only fire for 5 minutes and then resolve, as the reboot slides out of the 5-minute window.

I guess 1h is an OK compromise: you ensure with reasonable certainty that you will get the alert, but the alert will also resolve itself after 1h (you can silence it until then if you have faster notification repeat intervals).
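
To illustrate the window trade-off, here is a minimal Python sketch. It is a toy model, not Prometheus code: it only mimics how changes() counts value changes among the samples that fall inside the window. The scrape interval, the boot-time values, the 5-minute evaluation gap, and the helper names are all assumptions made up for the illustration.

```python
# Toy model of PromQL's changes() over a sliding window; not Prometheus code.
# Samples are (timestamp_seconds, value) pairs; all numbers are made up.

def changes(samples, now, window):
    """Count value changes among samples with now - window <= t <= now."""
    in_window = [v for t, v in samples if now - window <= t <= now]
    return sum(1 for prev, cur in zip(in_window, in_window[1:]) if prev != cur)


def firing_minutes(samples, window, eval_times):
    """Return the evaluation times (in minutes) at which changes() > 0."""
    return [t // 60 for t in eval_times if changes(samples, t, window) > 0]


# One scrape per minute for 3 hours. The node reboots before minute 30,
# so the boot-time value jumps from 1000 to 2800 from that scrape onward.
samples = [(m * 60, 1000 if m < 30 else 2800) for m in range(180)]

# Rule evaluations every minute, except a hypothetical 5-minute gap
# (minutes 30..34) during which alert evaluation is interrupted.
eval_times = [m * 60 for m in range(180) if not 30 <= m <= 34]

short_fires = firing_minutes(samples, 5 * 60, eval_times)
long_fires = firing_minutes(samples, 60 * 60, eval_times)

print("5m window fires at minutes:", short_fires)
if long_fires:
    print("1h window fires from minute", long_fires[0], "to", long_fires[-1])
```

In this toy run the 5m rule never fires, because every evaluation that could have seen the change falls inside the gap, while the 1h rule fires as soon as evaluation resumes and resolves once the change leaves the window.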


Yong Zhang

May 2, 2017, 9:00:12 PM
to Prometheus Users, hisca...@gmail.com
Wonderful! Thanks everyone!



Yong Zhang

May 18, 2017, 9:14:58 PM
to Prometheus Users, hisca...@gmail.com
Hi, Ben

I've been facing a tricky issue with the changes(node_boot_time[1h]) > 0 alert: some servers that have been up for several days without a reboot still trigger reboot alerts. I can't find out why this happens; do you have any ideas? I receive such alerts almost every day, on different servers...



Ben Kochie

May 19, 2017, 2:31:25 AM
to Yong Zhang, Prometheus Users
We have noticed that some VM environments have problems with unstable clocks that cause node_boot_time to wander. We still haven't fully explained why this happens only in some cases.

Can you give us more information about the underlying environment?
* Cloud provider?
* Custom VM setup?
* Do you run an NTP client?



Yong Zhang

May 21, 2017, 9:20:09 PM
to Prometheus Users, hisca...@gmail.com
* Cloud provider?  Tencent Cloud of China
* Custom VM setup? Cloud image (Ubuntu 16.04)
* Do you run an NTP client? Yes, syncing time from the default time servers. Is this the cause?

Ben Kochie

May 22, 2017, 4:05:19 AM
to Yong Zhang, Prometheus Users

It may be; we're still not sure, as it only happens with some combinations of VM guests and hosts. Time inside VMs is very tricky due to the time sharing of cores, TSC stability, etc. Combined with the fact that Linux uses the TSC to calculate time since boot, this makes things even trickier.

I assume you're using standard ntpd.

The first CPU entry from a node's /proc/cpuinfo may reveal some more information. If you know for sure which hypervisor is used (KVM, Xen, etc.), that would also help.

Another question: what is the output of the following command?
sudo systemctl status systemd-timesyncd.service

You may have a conflict between systemd's timesyncd and ntpd.

The only real workaround would be to make your own pseudo boot time.
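
The thread does not say how to build such a pseudo boot time, so the following is only one possible sketch, under stated assumptions: a script that runs once at startup and writes a fixed timestamp to a .prom file for node_exporter's textfile collector. The metric name node_pseudo_boot_time, the file name, and the function name are hypothetical, and the target directory must be whatever directory your textfile collector is configured to read.

```python
import os
import tempfile
import time


def write_pseudo_boot_time(directory, now=None,
                           metric="node_pseudo_boot_time",     # hypothetical name
                           filename="pseudo_boot_time.prom"):  # hypothetical file
    """Write one fixed timestamp, in Prometheus text format, into directory."""
    timestamp = int(time.time() if now is None else now)
    body = (
        f"# HELP {metric} Unix time recorded once at boot by a startup script.\n"
        f"# TYPE {metric} gauge\n"
        f"{metric} {timestamp}\n"
    )
    # Write to a temporary file first and rename it into place, so that a
    # scrape never reads a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(body)
    os.chmod(tmp_path, 0o644)  # mkstemp creates 0600; the exporter must read it
    final_path = os.path.join(directory, filename)
    os.replace(tmp_path, final_path)
    return final_path


if __name__ == "__main__":
    # Demo only: write into a throwaway directory and show the result.
    with tempfile.TemporaryDirectory() as demo_dir:
        with open(write_pseudo_boot_time(demo_dir)) as handle:
            print(handle.read(), end="")
```

Because the value is written once per boot and then left alone, later clock adjustments should not change it, so an expression such as changes(node_pseudo_boot_time[1h]) > 0 would only see a change after an actual restart. How the script is triggered at boot (for example a systemd oneshot unit or an @reboot cron entry) is left open here.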
 