Mount Point Missing Alarm

s.saurab...@gmail.com

Mar 1, 2021, 8:56:56 AM
to Prometheus Users

Hi Everyone,

I have a specific requirement from the client that Prometheus should generate an alert if any mount point on a server goes missing.

For example: if a server has three mount points, /data1, /NFS1, and /NFS2, and /NFS2 gets delinked from the server for any reason, Prometheus should generate an alert.

When I tried the query below, it worked fine (this metric goes missing when /NFS2 gets delinked from the server):

absent(node_filesystem_readonly{device="XX:/NFS2",fstype="nfs2",hostname="EAST_WB_XX",instance="XX:9100",job="XX",mountpoint="/NFS2"}) == 1

However, there are 800 servers that need to be monitored, so it is not practical to add 800 rules, one per IP, to rules.yml.

When I added the rule below, it did not generate the missing-mount alert:

absent(node_filesystem_readonly{mountpoint="/NFS2"}) == 1

Please advise whether we can achieve this by tweaking the query so that it is generic for all servers.

Looking forward to your response.

Thanks,
Saurabh

Julius Volz

Mar 1, 2021, 10:43:59 AM
to s.saurab...@gmail.com, Prometheus Users
Hi Saurabh,

For any calculation where you compare the current state to a past state as a correctness check (where the past state represents the desired / expected state), you always have some limitations.

First, mountpoints may already be missing before you even start collecting data, in which case you would never be able to notice those missing mountpoints.

Second, Prometheus is a sliding-window system, so any comparison of the current state to the past will "slide" over your data, and eventually your current state will become the past, whether or not it is in the originally desired state (thus you will stop noticing problems at that point / alerts will stop firing). For example, you can compare the current set of mountpoints to the set 10 minutes ago, and you will get an alert if some mountpoint went missing. But if you wait another 10 minutes, then when the alert calculation runs again, neither the current nor the old state used for the comparison will contain the now-missing mountpoint, so the alert will stop firing.

Still, given those caveats, you could write an alert expression like this:

    node_filesystem_readonly offset 1h
unless
    node_filesystem_readonly

This basically says "alert me if there was a filesystem 1h ago, unless it is also currently present". But beware that this alert will auto-resolve after 1 hour, due to the sliding-window effect described above. To be somewhat more resilient, you could increase the 1h to 1d or similar. But be aware that for the alert to work, you then need at least 1d of history in your TSDB, so in a fresh Prometheus, the alert will need at least 1d before it starts working.
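
As a rough sketch, the expression above could be wrapped in a single generic rule in rules.yml along the following lines. Unlike absent(node_filesystem_readonly{mountpoint="/NFS2"}), which only returns a result when no series at all matches the selector (so it stays silent as long as any one of the 800 servers still has /NFS2), the unless comparison is evaluated per series, so one rule covers every server and mountpoint. The group name, alert name, "for" duration, severity label, and annotation text are assumptions chosen for illustration, not something from this thread.

```yaml
groups:
  - name: filesystem-mounts            # hypothetical group name
    rules:
      - alert: MountPointMissing       # hypothetical alert name
        # Series that existed 1d ago but have no counterpart now.
        # "unless" matches on all labels, so each instance/mountpoint
        # pair is checked separately.
        expr: |
          node_filesystem_readonly offset 1d
          unless
          node_filesystem_readonly
        for: 5m                        # assumed debounce interval
        labels:
          severity: warning            # assumed label
        annotations:
          summary: "Mount point {{ $labels.mountpoint }} is missing on {{ $labels.instance }}"
```

Note that this would also fire for every mountpoint of a host whose node exporter cannot be scraped at all. If that is not wanted, one possible refinement is to append "and on (instance, job) up == 1" to the expression so that only currently reachable targets are considered.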

In the end, it's always better to have a proper authoritative source of truth somewhere that tells the monitoring system which mountpoints are expected (possibly, for each type of server or so), rather than relying on past/current comparisons, but this can be a workaround.
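
One way such a source of truth might look, purely as a sketch: publish a metric listing the mountpoints each server is supposed to have (for example through the node exporter's textfile collector, generated by your configuration management), and alert when an expected mountpoint has no matching filesystem series. The metric name expected_mountpoint, its labels, and the rule names below are hypothetical and would need to match whatever you actually export.

```yaml
groups:
  - name: filesystem-mounts-expected     # hypothetical group name
    rules:
      - alert: ExpectedMountPointMissing # hypothetical alert name
        # expected_mountpoint{instance="...", mountpoint="/NFS2"} 1 is an
        # assumed metric, one series per mountpoint that should exist.
        # The alert fires for every expected series that has no
        # node_filesystem_readonly series with the same instance and
        # mountpoint labels.
        expr: |
          expected_mountpoint
          unless on (instance, mountpoint)
          node_filesystem_readonly
        for: 5m                          # assumed debounce interval
        annotations:
          summary: "Expected mount point {{ $labels.mountpoint }} is missing on {{ $labels.instance }}"
```

Because the expected set does not depend on past data, this variant neither auto-resolves after the offset window nor needs history in the TSDB before it starts working.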

Regards,
Julius

--
Julius Volz
PromLabs - promlabs.com