Prometheus Disk usage (BlackBox Exporter + AlertManager)


Yudin Dmitriy

Dec 27, 2024, 3:43:11 PM
to Prometheus Users
Hi Friends!

I need your help. I've created an alert in AlertManager to monitor disk space on Windows.

It works fine except for one annoying thing: the rule also monitors the System Volume, which is almost full (and that's normal), and I don't know how to exclude it. Maybe someone can help with this?
I want to exclude this volume from the check: HarddiskVolume4
Rule:
- alert: DiskSpaceUsage
  expr: 100.0 - 100 * (windows_logical_disk_free_bytes / windows_logical_disk_size_bytes) > 80
  for: 10m
  labels:
    severity: high
  annotations:
    summary: "Disk Space Usage (instance {{ $labels.instance }})"
    description: "Disk Space on Drive is used more than 80%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"


Brian Candler

Dec 28, 2024, 5:11:22 AM
to Prometheus Users
Go into the Prometheus web interface (PromQL query editor), type "windows_logical_disk_free_bytes", and look at the vector of results you get.  I don't use Windows, but I'm guessing you'll see something like:

windows_logical_disk_free_bytes{instance="server1", filesystem="c:"} 12345
windows_logical_disk_free_bytes{instance="server1", filesystem="d:"} 98765
windows_logical_disk_free_bytes{instance="server2", filesystem="c:"} 42424
windows_logical_disk_free_bytes{instance="server2", filesystem="d:"} 24242

You then write a query which excludes the filesystems that you want to exclude, or includes the ones you want to include.  There are various ways to do this.

If you want to exclude all the c: drives, then it would be

expr: 100.0 - 100 * (windows_logical_disk_free_bytes{filesystem!="c:"} / windows_logical_disk_size_bytes) > 80

If you want to exclude all drives less than 1GB then it would be

expr: 100.0 - 100 * (windows_logical_disk_free_bytes / windows_logical_disk_size_bytes) > 80 unless windows_logical_disk_size_bytes < 1000000000

If you want to exclude just {instance="server1", filesystem="c:"} then you can do:

expr: 100.0 - 100 * (windows_logical_disk_free_bytes / windows_logical_disk_size_bytes) > 80 unless windows_logical_disk_size_bytes{instance="server1", filesystem="c:"}

You can try out all of these expressions in the PromQL query browser: any value which shows in the table or graph view will generate an alert (regardless of its value).
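
Applied to the original question, the label-matcher approach might look like the sketch below. It assumes the volume name is carried in a label called `volume` (windows_exporter's logical disk metrics normally use that name rather than `filesystem`) and that the label value is exactly "HarddiskVolume4"; both should be checked in the query editor first. The group name is a placeholder.

```yaml
groups:
  - name: windows-disk   # hypothetical group name
    rules:
      - alert: DiskSpaceUsage
        # Assumption: the label is `volume` and its value is exactly "HarddiskVolume4".
        # Filtering one side of the division is enough, because the
        # unmatched size series is dropped by one-to-one vector matching.
        expr: 100.0 - 100 * (windows_logical_disk_free_bytes{volume!="HarddiskVolume4"} / windows_logical_disk_size_bytes) > 80
        for: 10m
        labels:
          severity: high
```

To skip every unnamed volume of that kind rather than a single one, a negative regex matcher such as volume!~"HarddiskVolume.*" should work in the same position.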

If you want to have different percentage thresholds for different drives, then you can create static timeseries for each drive (e.g. using the node_exporter textfile collector, or simply using a static web page containing metrics that you scrape). This is described in the following article:
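
Independently of that article, a minimal sketch of the per-drive threshold idea could look like this. The metric name `disk_free_threshold_ratio` is hypothetical, as is the assumption that it is scraped from the same target as the disk metrics (so it carries the same `instance` label) and that volumes are identified by a `volume` label.

```yaml
# Assumption: each host exposes a hypothetical static series such as
#   disk_free_threshold_ratio{volume="C:"} 0.1
# scraped from the same target, so it shares that host's `instance` label.
- alert: DiskSpaceLow
  # Compare each volume's free fraction against its own threshold series.
  expr: windows_logical_disk_free_bytes / windows_logical_disk_size_bytes < on (instance, volume) disk_free_threshold_ratio
  for: 10m
  labels:
    severity: high
```

With this shape, a volume that has no threshold series never matches and so never alerts, which is another way to leave a volume out of the check.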

----------

Note 1: static thresholds are a pain. You can instead alert on the *rate of growth* of the filesystem, to predict when it will be full. Search the archives of this list for "predict_linear" and "node_filesystem_avail_bytes"
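
As a rough illustration of the growth-based approach, here is a sketch using predict_linear on the Windows metric from this thread. The 6h lookback, the 24h horizon, the `for` duration and the alert name are illustrative assumptions, not recommended values.

```yaml
- alert: DiskWillFillSoon
  # Fit a linear trend over the last 6h of free-space samples and
  # extrapolate 24h (86400 s) ahead; both windows are assumptions.
  expr: predict_linear(windows_logical_disk_free_bytes[6h], 86400) < 0
  for: 30m
  labels:
    severity: high
```

A volume that is nearly full but not growing, like the system volume in the original question, would not trigger this rule.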

Note 2: personally I keep the metrics as fractions from 0 to 1, using humanizePercentage when rendering them; also keep them in their natural format (i.e. "free space" rather than "used space"). So:

- alert: DiskSpaceUsage
  expr: windows_logical_disk_free_bytes / windows_logical_disk_size_bytes < 0.2
  for: 10m
  labels:
    severity: high
  annotations:
    summary: "Disk Space Usage (instance {{ $labels.instance }})"
    description: "Free Disk Space on Drive is less than 20%\n  VALUE = {{ $value | humanizePercentage }}\n  LABELS: {{ $labels }}"

Apart from keeping the expressions simpler, there are cases where this is more accurate. For example, in Linux at least, there are separate metrics for "free space" and "available space". "free space" is larger, because "available space" is reduced by the amount of space reserved for root use.  You therefore have to be very careful:

- calculating "used space" needs to be 100% minus "free space"
- however, what you actually want to alert on is "available space" reaching zero, because that's the point when applications are unable to write to the disk any more (other than those running as root)
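
For the Linux case, a sketch of alerting on available rather than free space with the node_exporter metrics could look like this. The 20% threshold and the alert name are assumptions chosen to mirror the Windows rule above.

```yaml
- alert: FilesystemAvailableLow
  # node_filesystem_avail_bytes excludes the blocks reserved for root,
  # so it reflects what ordinary applications can still write.
  expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.2
  for: 10m
  labels:
    severity: high
  annotations:
    description: "Available space is {{ $value | humanizePercentage }} on {{ $labels.mountpoint }}"
```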