The expression you've written doesn't really make much sense. If you have a metric "disk_used_percent", which presumably runs between 0 and 100, why are you summing it by host? This means that if one host had three disks, each 40% used, the result would be "120% used" and trigger an alert unnecessarily.
I would expect the expression to be simply:
expr: disk_used_percent > 85
> Now for that I need to create 2 rules, one for each severity. Now I have a question: can we create one query for both severities, like 85-95% is warning and 95% and up is critical?
No, you were right the first time: you need one rule for 85%+ and one for 95%+.
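For example (a rough sketch only: the rule names, the `for:` durations and the summary text are illustrative, and I'm assuming your `disk_used_percent` metric carries an `instance` label from your exporter):

- alert: DiskUsageWarning
  expr: disk_used_percent > 85
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: 'Disk usage above 85% on {{ $labels.instance }}'
- alert: DiskUsageCritical
  expr: disk_used_percent > 95
  for: 15m
  labels:
    severity: critical
  annotations:
    summary: 'Disk usage above 95% on {{ $labels.instance }}'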
You can then use inhibit rules in Alertmanager so that if the 95%+ alert is firing, it inhibits sending the 85%+ one. To do this you'll need to add labels to your alerts, and set up the inhibit rules appropriately.
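A minimal sketch of the Alertmanager side, assuming the `severity` labels above and that both alerts carry matching `instance` and `mountpoint` labels (which depends on your exporter and rule expressions):

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    # Only suppress the warning when it is about the same host and filesystem
    equal: ['instance', 'mountpoint']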
Personally though, I find such rules difficult to maintain and irritating. Suppose you have one machine which is sitting at 88% disk full, but is working perfectly normally. Do you want it to be continuously alerting? Suppose you've already done all the free space tidying you can. Are you *really* going to add more disk space to this machine, just to bring the usage under 85% to silence the alert? Probably not (unless it's a VM and can be grown easily). However, once you start to accept continuously firing alerts, then you'll find that everyone ignores them, and then *real* problems get lost amongst the noise.
You might decide you want to have different thresholds for each filesystem. But then either you end up with lots of alerting rules, or you need to put the thresholds in their own timeseries, as described here:
- and this is a pain to maintain.
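For what it's worth, here is a rough sketch of that per-filesystem-threshold approach. The `disk_used_threshold_percent` series and its label values are entirely made up; you would have to publish and maintain them yourself, for instance as recording rules:

- name: DiskThresholds
  rules:
  # Hand-maintained per-filesystem thresholds, one series per filesystem
  - record: disk_used_threshold_percent
    expr: vector(90)
    labels:
      instance: 'db1:9100'
      mountpoint: '/var/lib/mysql'
  - record: disk_used_threshold_percent
    expr: vector(70)
    labels:
      instance: 'web1:9100'
      mountpoint: '/'
- name: DiskThresholdAlerts
  rules:
  - alert: DiskUsageOverThreshold
    expr: |
      disk_used_percent > on(instance, mountpoint) group_left disk_used_threshold_percent
    for: 15m
    labels:
      severity: warning

Every filesystem you care about needs its own threshold entry, and anything without one silently gets no alert at all - which is exactly the maintenance burden I mean.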
Personally, I've ditched all static alerting thresholds on disk space. Instead I have rules for when the filesystem is completely full(*), plus rules which look at how fast each filesystem is growing and predict when it will be full if it continues to grow at the current rate. Examples:
- name: DiskRate10m
  interval: 1m
  rules:
  # Warn if rate of growth over last 10 minutes means filesystem will fill in 2 hours
  - alert: DiskFilling10m
    expr: |
      node_filesystem_avail_bytes / (node_filesystem_avail_bytes -
        (predict_linear(node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"}[10m], 7200) < 0)) * 7200
    for: 20m
    labels:
      severity: warning
    annotations:
      summary: 'Filesystem will be full in {{ $value | humanizeDuration }} at current 10m growth rate'
- name: DiskRate3h
  interval: 10m
  rules:
  # Warn if rate of growth over last 3 hours means filesystem will fill in 2 days
  - alert: DiskFilling3h
    expr: |
      node_filesystem_avail_bytes / (node_filesystem_avail_bytes -
        (predict_linear(node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"}[3h], 172800) < 0)) * 172800
    for: 6h
    labels:
      severity: warning
    annotations:
      summary: 'Filesystem will be full in {{ $value | humanizeDuration }} at current 3h growth rate'
- name: DiskRate12h
  interval: 1h
  rules:
  # Warn if rate of growth over last 12 hours means filesystem will fill in 7 days
  - alert: DiskFilling12h
    expr: |
      node_filesystem_avail_bytes / (node_filesystem_avail_bytes -
        (predict_linear(node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"}[12h], 604800) < 0)) * 604800
    for: 24h
    labels:
      severity: warning
    annotations:
      summary: 'Filesystem will be full in {{ $value | humanizeDuration }} at current 12h growth rate'
(*) In practice I also alert at *just below* full, e.g.
- name: DiskSpace
  interval: 1m
  rules:
  # Alert if any filesystem has less than 100MB available space
  # (except for filesystems which are smaller than 150MB)
  - alert: DiskFull
    expr: |
      node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"} < 100000000 unless node_filesystem_size_bytes{fstype!~"fuse.*|nfs.*"} < 150000000
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: 'Filesystem full or less than 100MB free space'
I find this helpful for /boot partitions, which are tricky to fix if they do get completely full with partially-installed kernel updates. But I still wouldn't "alert" in the sense of getting someone out of bed at 3am - unless the system is failing in a way that your users or customers would notice (which is something you should be checking and alerting on separately), this is something that can be fixed at leisure.
Finally, I can strongly recommend this "philosophy on alerting":
You might want to consider whether some of these system checks would be better off in dashboards or daily reports, rather than being sent out immediately as alerts.