Hi All,
I have a monitoring requirement for per-user disk usage and alerting, and I am wondering whether Prometheus is the right tool for it, or whether
a custom Python script (using the os, subprocess, and smtplib modules) that handles both monitoring and alerting would be the better solution in this context.
Here is the problem description:
Our setup has 3 servers, each with a single mount point "/". Each user's directory, such as "/home/user1", "/home/user2", and so forth, resides within this mount point.

We enforce disk quotas for individual users, and our goal is to monitor each user's disk usage and send alerts to the top 10 users when overall disk usage on the mount point exceeds 90%.
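To make the goal concrete, here is a rough Python sketch of the selection logic the custom-script option would need. The function name `users_to_alert` and its parameters are my own placeholders, and the per-user numbers are made up for illustration; this is not a finished script.

```python
import shutil


def users_to_alert(usage_by_user, total_bytes, used_bytes, threshold=0.9, top_n=10):
    """Return the top_n heaviest users as (username, bytes) pairs.

    An empty list is returned while overall usage is at or below the threshold.
    """
    if total_bytes <= 0 or used_bytes / total_bytes <= threshold:
        return []
    # Sort users by bytes used, largest first, and keep the first top_n.
    ranked = sorted(usage_by_user.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


if __name__ == "__main__":
    # shutil.disk_usage reports total, used, and free bytes for the filesystem.
    total, used, _free = shutil.disk_usage("/")
    sample = {"ravi": 390000, "user1": 120000}  # hypothetical per-user usage
    print(users_to_alert(sample, total, used))
```

The per-user figures would still have to come from somewhere (du, quota tooling, or similar), and the emailing step is left out here.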
Challenges:
1. As far as I know, Prometheus monitors the overall storage status and mount point information, so an individual user's disk consumption is not tracked.

a) Do I need to write a custom exporter that uses du -sh to determine each user's disk usage and exposes something like:
user_disk_usage_bytes{username="ravi"} 390000
b) Or can node exporter do this?
After data collection, I need to deal with the alerting rule.
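As a rough sketch of option (a), the snippet below renders per-user usage in the Prometheus text exposition format. One way to use it, as I understand it, would be to write the output to a .prom file in the directory given to node exporter's --collector.textfile.directory flag instead of running a separate exporter. The helper names `du_bytes` and `render_metrics` are placeholders of mine, and I am assuming GNU du, whose -sb option prints a plain byte count (du -sh prints human-readable sizes, which are harder to parse).

```python
import subprocess


def du_bytes(path):
    """Return the size of `path` in bytes, assuming GNU `du -sb` is available."""
    result = subprocess.run(
        ["du", "-sb", path], capture_output=True, text=True, check=True
    )
    # Output looks like "<bytes>\t<path>"; the first field is the byte count.
    return int(result.stdout.split()[0])


def render_metrics(usage_by_user):
    """Render {username: bytes} as Prometheus text-format gauge samples."""
    lines = [
        "# HELP user_disk_usage_bytes Disk space used by each user's home directory.",
        "# TYPE user_disk_usage_bytes gauge",
    ]
    for username in sorted(usage_by_user):
        # Note: usernames containing quotes or backslashes would need escaping.
        lines.append(
            f'user_disk_usage_bytes{{username="{username}"}} {usage_by_user[username]}'
        )
    return "\n".join(lines) + "\n"
```

For example, render_metrics({"ravi": du_bytes("/home/ravi")}) would produce the HELP and TYPE lines followed by one sample line for ravi. Running du over large home directories can be slow, so this would probably be run from cron rather than on every scrape.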
2. Here are the alert conditions, built on the custom exporter's metric:
condition1: identifies the users with the highest usage
topk(10, user_disk_usage_bytes / scalar(node_filesystem_size_bytes{instance="server1:9100",mountpoint='/'}))
condition2: determines whether usage has reached 90% (available space less than 10%)
(node_filesystem_avail_bytes{instance="server1:9100",mountpoint='/'} / node_filesystem_size_bytes{instance="server1:9100",mountpoint='/'}) < 0.1
I don't think "condition1 and condition2" will work, because the labels and label values returned by condition1 and condition2 are different.
Is there a way to achieve this with PromQL?
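To illustrate the concern, here is a toy Python model of how I understand PromQL's "and" to match samples by label set. It is a simplification written for this post (the function `promql_and` is hypothetical, and real series would also carry labels such as instance and job), not how Prometheus is implemented. The `on` parameter mimics the on(...) modifier, which as far as I understand restricts matching to the listed labels, so an empty list would let condition2 act as a simple gate.

```python
def promql_and(left, right, on=None):
    """Toy model of PromQL `and`: keep left samples that have a match on the right.

    Each sample is a (labels_dict, value) pair. With on=None every label must
    match; with on=[...] only the listed labels are compared, so on=[] matches
    any right-hand sample.
    """
    def match_key(labels):
        names = sorted(labels) if on is None else sorted(on)
        return tuple((name, labels.get(name, "")) for name in names)

    right_keys = {match_key(labels) for labels, _ in right}
    return [(labels, value) for labels, value in left if match_key(labels) in right_keys]


# condition1 carries a username label; condition2 carries instance and mountpoint.
per_user = [({"username": "ravi"}, 0.8), ({"username": "user1"}, 0.9)]
low_space = [({"instance": "server1:9100", "mountpoint": "/"}, 0.05)]

print(promql_and(per_user, low_space))         # label sets differ, nothing matches
print(promql_and(per_user, low_space, on=[]))  # gate only: all users pass through
```

If this model is right, the PromQL equivalent would be something like "condition1 and on() condition2", but I have not verified that against a real server.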
Now, assuming I am able to get a list of users when system utilization reaches 90%, such as:
{username="ravi"} 80
{username="user1"} 90
{username="user2"} 70
{username="user3"} 80
{username="user4"} 90
the alerting rule would be:
groups:
  - name: example
    rules:
      - alert: Storage space is low on server1
        expr: condition1 and condition2
        for: 10m
        labels:
          alertname: "Server1's storage space is running low, please clean up the disk space - {{ $labels.username }}"
        annotations:
          summary: "You are using {{ $value }}% of the space on /. Please clean up."
So I need 3 rules, one each for server1, server2, and server3.
3. Alertmanager is then responsible for sending out the alerts.
To send the alerts, I think this should be the configuration in the current context:

As I have already included the username in the alert name, and as far as I understand alerts are grouped by alertname by default, I think this setting will send one email to each user.
Apologies for the lengthy post, but I have tried to lay out the flow for solving this problem based on my understanding of Prometheus so far.
I would greatly appreciate any insights, recommendations, or best practices you can offer for achieving dynamic per-user disk usage monitoring with Prometheus and Alertmanager.
Thank you in advance.
Best regards,
Puneet