Alert if system no longer up to date (apt or Windows Update)

R. Diez

Nov 13, 2021, 3:25:12 PM
to Prometheus Users
Hi all:

One of the things I regularly check is whether Windows Update has got stuck, or whether it lists some interesting optional updates.

On Linux, I also check whether the system has pending security updates that are not getting installed automatically.

This is a typical thing to check. When you log onto Cockpit (https://cockpit-project.org/), you see "Checking for package updates...", and then you may get a message like "Security Updates Available".

I would like Prometheus to generate an alert whenever a system is no longer updating itself in a timely manner.

Can someone help? The Node Exporter does export some metrics under "apt_upgrades_pending", but I am not sure how to make an alert out of these simple counters.

Unfortunately, I could not find anything related in the Windows Exporter.

Thanks in advance,
  rdiez

Brian Candler

Nov 14, 2021, 4:54:03 AM
to Prometheus Users
node_exporter itself doesn't generate apt_upgrades_pending.

Therefore, I expect you're running a script from cron that writes this metric to a file, together with node_exporter's textfile_collector.

This isn't a counter, it's a gauge.  You can alert on it with any simple PromQL expression, e.g.

apt_upgrades_pending > 0
# alert as soon as updates available

min_over_time(apt_upgrades_pending[48h]) > 0
# only alert if updates have been pending for 48 hours continuously

I don't know anything about Windows package management, but if you can make a batch file or powershell script to determine the information you want, you should be able to do the same sort of thing.

R. Diez

Nov 21, 2021, 3:02:09 PM
to Prometheus Users
Many thanks for your help.

I'm afraid I cannot get it right. I am still confused by Prometheus' aggregations. Can you help me further?

I am using the Node Exporter version 0.18.1 that comes packaged with Ubuntu 20.04, which I believe is using the apt.sh collector. The metrics I am getting for apt_upgrades_pending over a few days are:

{instance="MyHostname",job="node-exporter"}
{arch="amd64",instance="MyHostname",job="node-exporter",origin="Ubuntu:20.04/focal-updates,Ubuntu:20.04/focal-security"}
{arch="amd64",instance="MyHostname",job="node-exporter",origin="Ubuntu:20.04/focal-updates"}
{arch="all",instance="MyHostname",job="node-exporter",origin="Ubuntu:20.04/focal-updates,Ubuntu:20.04/focal-security"}
{arch="all",instance="MyHostname",job="node-exporter",origin="Ubuntu:20.04/focal-updates"}
{arch="all",instance="MyHostname",job="node-exporter",origin="Ubuntu:20.04/focal-updates"}

I thought arch="all" would include all categories, but look at this scrape:

apt_upgrades_pending{arch="all",origin="Ubuntu:20.04/focal-updates"} 5
apt_upgrades_pending{arch="all",origin="Ubuntu:20.04/focal-updates,Ubuntu:20.04/focal-security"} 3
apt_upgrades_pending{arch="amd64",origin="Ubuntu:20.04/focal-updates"} 2
apt_upgrades_pending{arch="amd64",origin="Ubuntu:20.04/focal-updates,Ubuntu:20.04/focal-security"} 27

And this other scrape:

apt_upgrades_pending{arch="amd64",origin="Ubuntu:20.04/focal-updates"} 10

Those scrapes are for the same computer.

Sometimes, arch="all" is there, sometimes not. I am guessing that which arch="xxx" and origin="xxx" metrics are returned depends on which updates are available at that point in time.

This is the alert you suggested:

min_over_time(apt_upgrades_pending[48h]) > 0

I tested it, and I am getting many alerts for each computer.

For the purposes of this alert, I guess I could group (sum) all those 'apt_upgrades_pending' values by 'instance', before applying 'min_over_time', but I cannot get the expression right.

Thanks in advance,
  rdiez

Brian Candler

Nov 21, 2021, 4:27:56 PM
to Prometheus Users
> For the purposes of this alert, I guess I could group (sum) all those 'apt_upgrades_pending' values by 'instance', before applying 'min_over_time', but I cannot get the expression right.

The min_over_time was just a suggestion to prevent alerts if the packages became available, and were promptly installed.  That is: alert only if packages have been outstanding for 48 hours continuously.  But actually there's an easier way to do that, with "for: 48h" on the alerting rule.

So really:

expr: sum by (instance) (apt_upgrades_pending) > 0
for: 48h

is most likely all you need.
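
As a rough sketch of how that expr/for pair could sit in a complete rule file: the group name, alert name, severity label and annotation text below are assumptions made for illustration, not something taken from this thread's setup.

```yaml
groups:
  - name: apt_updates                 # hypothetical group name
    rules:
      - alert: AptUpdatesPending      # hypothetical alert name
        # One result per instance, however many arch/origin series exist.
        expr: sum by (instance) (apt_upgrades_pending) > 0
        # Fire only after the expression has been true for 48 hours.
        for: 48h
        labels:
          severity: warning           # assumed label, adjust to your routing
        annotations:
          summary: "{{ $labels.instance }} has {{ $value }} package updates pending"
```

A file like this can be validated with "promtool check rules" before Prometheus loads it.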

For completeness, suppose you do want to use min_over_time as well.  If you want to take an instant vector expression like the above and turn it into a range vector, then you need a subquery.  But the other way round doesn't need a subquery, because the output of min_over_time is already an instant vector.

sum by (instance) (min_over_time(apt_upgrades_pending[48h])) > 0

R. Diez

Nov 21, 2021, 5:24:44 PM
to Prometheus Users
> sum by (instance) (apt_upgrades_pending) > 0

I think that your suggestion will work. I'll try it out over the next few days. Many thanks.

About the other way, for completeness: Instead of using a subquery, I wonder whether this would work too:

  - record: apt_upgrades_pending:sum_grouped_by_instance
    expr: sum by ( instance ) ( apt_upgrades_pending )

  - alert: AptNotUpdatingPackages
    expr: min_over_time( apt_upgrades_pending:sum_grouped_by_instance[7d] ) > 0
    for: 0m
    labels:
      severity: warning

Prometheus seems unable to do a min_over_time() from a sum(). Is that kind of intermediate step with "record" a generic way to overcome this limitation? Or have I missed something?

Brian Candler

Nov 22, 2021, 3:38:13 AM
to Prometheus Users
Yes, using a recording rule will work fine too.

> Prometheus seems unable to do a min_over_time() from a sum()

That's because the input to min_over_time is a range vector - which covers multiple points over a period of time, not a single instant of time.

As I said before, you can turn an instant vector into a range vector using a subquery. Something like:

    min_over_time((sum by (instance) (apt_upgrades_pending))[48h:5m])

A subquery is just like specifying a time range on a metric, but you have to specify the sampling interval (here "5m"). The instant query is repeated at this interval to generate a range of data points.  You don't need to do this on a plain metric because it just uses all the individual samples which were scraped.
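
For illustration, the subquery could be used in an alerting rule roughly as follows. This is only a sketch: the group name, alert name and severity label are assumptions, and the "> 0" comparison is carried over from the earlier expressions in this thread.

```yaml
groups:
  - name: apt_updates_subquery            # hypothetical group name
    rules:
      - alert: AptUpdatesPendingSubquery  # hypothetical alert name
        # Inner query: total pending upgrades per instance.
        # [48h:5m]: evaluate it every 5 minutes over the last 48 hours.
        # min_over_time: lowest of those totals, so the result is > 0
        # only if every evaluated point had pending upgrades.
        expr: min_over_time((sum by (instance) (apt_upgrades_pending))[48h:5m]) > 0
        labels:
          severity: warning               # assumed label
```

Note that min_over_time only looks at the points that exist in the window, so an instance with less than 48 hours of data can still match; "for: 48h" does not have that property.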