This should be tempered with an expectation that there are many failures that you *can't* predict. The ones you may be able to predict mostly involve running out of a resource, primarily disk space or RAM/swap (see the example at the end).
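For the RAM case, a rough sketch of the same idea might look like the rule below. It assumes a Linux host scraped by node_exporter, which exposes `node_memory_MemAvailable_bytes`; the group name, alert name, window and thresholds are made up for illustration. Memory usage is usually far more volatile than disk usage, so treat this as something to tune rather than a proven rule.

```yaml
groups:
  - name: MemoryTrend          # hypothetical group name
    rules:
      # Warn if the trend over the last hour suggests available memory
      # reaches zero within 4 hours. "for:" is twice the 1h window so that
      # a single step change does not fire the alert.
      - alert: MemoryFilling   # hypothetical alert name
        expr: |
          predict_linear(node_memory_MemAvailable_bytes[1h], 4*3600) < 0
        for: 2h
        labels:
          severity: warning
        annotations:
          summary: 'Available memory will run out in less than 4h at current 1h trend'
```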
This "philosophy on alerting" from an ex-Google site engineer is well worth reading:
The most important takeaway is: focus your alerting effort on symptoms (i.e. problems seen by the end user) rather than causes. Alerts then become immediately actionable. Put another way: if the problem isn't bad enough for end users to notice it, then it's not worth getting somebody out of bed to deal with. Also, try to design your systems so that they are resilient to the failure of a single node.
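As an illustration of a symptom-based alert, here is a minimal sketch that assumes you probe your service from the outside with blackbox_exporter, which exports `probe_success` (1 if the probe passed, 0 if it failed). The job label `blackbox_http`, the group name and the alert name are assumptions; substitute whatever your scrape config uses.

```yaml
groups:
  - name: UserFacingSymptoms   # hypothetical group name
    rules:
      # Page only when the service has looked broken from the outside
      # for 5 minutes in a row, whatever the underlying cause is.
      - alert: SiteDown        # hypothetical alert name
        expr: probe_success{job="blackbox_http"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: 'End-user probe of {{ $labels.instance }} has been failing for 5m'
```

The rule says nothing about CPU, disk or processes: it fires on what the user would see, and the cause is investigated once somebody is looking.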
Sometimes analysis of these end-user measurements can give you predictions of future failures - e.g. if you find a particular server's response time is higher than normal, this may be a precursor to it failing completely.
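One possible way to express "higher than normal" is to compare a short-term average against a longer baseline. The sketch below again assumes blackbox_exporter and its `probe_duration_seconds` metric; the job label, the factor of 2 and the windows are arbitrary starting values, not recommendations.

```yaml
groups:
  - name: ResponseTimeDrift    # hypothetical group name
    rules:
      # Warn if the average probe duration over the last 15 minutes is more
      # than double the average over the last day for the same target.
      - alert: SlowerThanUsual # hypothetical alert name
        expr: |
          avg_over_time(probe_duration_seconds{job="blackbox_http"}[15m])
            > 2 * avg_over_time(probe_duration_seconds{job="blackbox_http"}[1d])
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: 'Probe of {{ $labels.instance }} is taking more than twice its 1d average'
```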
Anyway, here are some rules you can use as a starting point. For the linear predictive ones, note that the "for:" clause is set to twice the length of the range window that predict_linear looks at (20m for the 10m window, 6h for the 3h window). A sudden step in the disk usage only stays inside the window for one window length, so if usage then goes back to flat, the condition clears before "for:" expires and it doesn't send an alert.
groups:
  - name: DiskSpace
    interval: 1m
    rules:
      # Alert if any filesystem has less than 100MB available space.
      - alert: DiskFull
        expr: |
          node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"} < 100000000 and node_filesystem_size_bytes{fstype!~"fuse.*|nfs.*"} > 100000000
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: 'Filesystem full or less than 100MB available space'
  - name: DiskRate10m
    interval: 1m
    rules:
      # Warn if rate of growth over last 10 minutes means filesystem will fill in 2 hours
      - alert: DiskFilling
        expr: |
          predict_linear(node_filesystem_free_bytes{fstype!~"fuse.*|nfs.*"}[10m], 7200) < 100000000
        for: 20m
        labels:
          severity: warning
        annotations:
          summary: 'Filesystem will be full in less than 2h at current 10m growth rate'
  - name: DiskRate3h
    interval: 10m
    rules:
      # Warn if rate of growth over last 3 hours means filesystem will fill in 2 days
      - alert: DiskFilling
        expr: |
          predict_linear(node_filesystem_free_bytes{fstype!~"fuse.*|nfs.*"}[3h], 2*86400) < 100000000
        for: 6h
        labels:
          severity: warning
        annotations:
          summary: 'Filesystem will be full in less than 2d at current 3h growth rate'
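If you want to check the step behaviour before deploying, a unit test for `promtool test rules` can be sketched as below. It assumes the rules above are saved as `disk_rules.yml` (a made-up file name) next to the test file, and it uses an invented series in which free space drops once from 50GB to 40GB and then stays flat. The test expects that no DiskFilling alert is firing at any of the checked times, because the step never stays in the window for as long as "for:" requires.

```yaml
# disk_rules_test.yml (hypothetical name); run with: promtool test rules disk_rules_test.yml
rule_files:
  - disk_rules.yml   # assumed file name for the rules above

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # 31 samples at 50GB free, then a single step down to 40GB, then flat for 7h.
      - series: 'node_filesystem_free_bytes{instance="host1", mountpoint="/data", fstype="ext4"}'
        values: '50000000000+0x30 40000000000+0x420'
    alert_rule_test:
      # Shortly after the step: the condition may be pending, but nothing fires.
      - eval_time: 45m
        alertname: DiskFilling
        exp_alerts: []
      # Well after the step has left the 10m window.
      - eval_time: 2h
        alertname: DiskFilling
        exp_alerts: []
      # After the step has left the 3h window as well.
      - eval_time: 7h
        alertname: DiskFilling
        exp_alerts: []
```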