User-level disk usage monitoring and notification with Prometheus and Alertmanager

Puneet Singh

Feb 28, 2024, 10:55:11 AM

to Prometheus Users

Hi All,
I have a monitoring requirement for user-level disk usage and alerting. I am wondering whether Prometheus is the right tool for this, or whether a custom Python script (using the os, subprocess and smtplib modules) that handles both monitoring and alerting would be a better fit.

Here is the problem description -
We have 3 servers. Each has a single mount point "/", and each user's directory, such as "/home/user1", "/home/user2", and so forth, resides within this mount point.

We enforce disk quotas for individual users. Our goal is to monitor each user's disk usage and, when overall usage of the filesystem exceeds 90%, send alerts to the top 10 users by disk usage.

Challenges:
1. As far as I know, Prometheus (via node exporter) monitors overall storage status and mount point information, so an individual user's disk consumption is not tracked. For example -

a) Do I need to write a custom exporter that uses du -sh to work out the disk usage and exposes something like
user_disk_usage_bytes{username="ravi"} 390000

b) or can node exporter do this?

After data collection, I need to deal with the alerting rule.
2. Here are the alert conditions, based on the custom exporter -

condition1: determines the users with the highest usage (topk needs the number of series to keep as its first argument, 10 here)
topk(10, user_disk_usage_bytes / scalar(node_filesystem_size_bytes{instance="server1:9100",mountpoint="/"}))

condition2: determines whether usage has reached 90% (available space less than 10%)
(node_filesystem_avail_bytes{instance="server1:9100",mountpoint="/"} / node_filesystem_size_bytes{instance="server1:9100",mountpoint="/"}) < 0.1

I don't think  condition1 and condition2 will work as labels and label values returned by condition1 and condition2 are different.

Is there a way to achieve this with PromQL?

Now, assuming I am able to get a list of users when filesystem utilization reaches 90%, such as -
{username="ravi"} 80
{username="user1"} 90

{username="user2"} 70
{username="user3"} 80
{username="user4"} 90

The alerting rule will be:
groups:
- name: example
  rules:
  - alert: Storage space is low on server1
    expr: condition1 and condition2
    for: 10m
    labels:
      alertname: "Server1's Storage space is running low, Please cleanup the disk space - {{ $labels.username }}"
    annotations:
      summary: "you are using {{ $value }}% space on the / space.please cleanup."

So I need 3 rules - one each for server1, server2 and server3.

3. Now Alertmanager is responsible for sending out the alerts.

And to send the alert, I think this should be the configuration in the current context -

As I have already included the username in the alert name, and I believe alerts are grouped by alertname by default, I think with this setting a 1:1 email should be sent to each user.

Apologies for the lengthy post, but I have tried to lay out the flow for solving this problem based on my understanding of Prometheus so far.

I would greatly appreciate any insights, recommendations, or best practices you can offer for dynamic per-user disk usage monitoring with Prometheus and Alertmanager.

Thank you in advance.

Best regards,
Puneet

Brian Candler

Feb 29, 2024, 4:23:01 AM

to Prometheus Users

> I don't think condition1 and condition2 will work as labels and label values returned by condition1 and condition2 are different.

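condition1 and on (instance, mountpoint) condition2

This assumes that both expressions have "instance" and "mountpoint" labels; these are the only ones considered when matching. The set operator "and" keeps every element on the left-hand side (users) that has a matching element on the right-hand side (filesystem), so the many-to-1 relationship from users to filesystem is fine, and the "username" label is carried forward from the LHS into the result. group_left is only needed (and only accepted) with arithmetic and comparison operators, for example when you want to copy an extra label across from the "one" side of a join.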
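
To make the matching concrete, here is a minimal Python sketch of how a filter restricted to the "instance" and "mountpoint" labels behaves, modelled on PromQL's "and" set operator. It is only an illustration of the semantics, not how Prometheus implements it; the sample values and the fstype/device labels are made up.

```python
def and_on(lhs, rhs, on):
    """Keep each LHS sample whose `on` labels match at least one RHS sample.

    Samples are (labels_dict, value) pairs. Labels and values of the result
    come from the LHS only, as with PromQL's `and`.
    """
    rhs_keys = {tuple(labels.get(name, "") for name in on) for labels, _ in rhs}
    return [
        (labels, value)
        for labels, value in lhs
        if tuple(labels.get(name, "") for name in on) in rhs_keys
    ]


# condition1: per-user share of the filesystem (illustrative numbers)
users = [
    ({"instance": "server1:9100", "mountpoint": "/", "username": "ravi"}, 0.40),
    ({"instance": "server1:9100", "mountpoint": "/", "username": "user1"}, 0.30),
    ({"instance": "server2:9100", "mountpoint": "/", "username": "user2"}, 0.50),
]

# condition2: only filesystems with less than 10% available survive the "< 0.1"
low_space = [
    ({"instance": "server1:9100", "mountpoint": "/", "fstype": "ext4",
      "device": "/dev/sda1"}, 0.08),
]

alerts = and_on(users, low_space, on=("instance", "mountpoint"))
for labels, value in alerts:
    print(labels["instance"], labels["username"], value)
```

Only the two server1 users come out, each still carrying its "username" label; the extra fstype and device labels on the right-hand side do not prevent the match because they are not in the "on" list.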

> So i need 3 rules - 1 each for server1,server2 and server3

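I don't think so. The vector of results can include values for each (user, filesystem, instance) on the LHS and each (filesystem, instance) on the RHS, so a single rule can alert separately for every filesystem that reaches 90%. For that to work you would drop the hard-coded instance="server1:9100" matchers, and also the scalar() call, which collapses the filesystem size to a single number with no labels.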

Puneet Singh

Feb 29, 2024, 6:46:05 AM

to Prometheus Users

Thank you Brian,
I will explore the group_left query once I have the data flowing in.

Can the default node exporter report disk usage at user level in my context, by extending it via some flag? (I came across the textfile collector and plan to explore that.)
Or would writing a custom exporter be the optimal workaround?

Regards
Puneet

Brian Candler

Feb 29, 2024, 10:35:35 AM

to Prometheus Users

On Thursday 29 February 2024 at 18:46:05 UTC+7 Puneet Singh wrote:

> Can the default node exporter report disk usage at user level in my context, by extending it via some flag? (I came across the textfile collector and plan to explore that.)
> Or would writing a custom exporter be the optimal workaround?

I don't know what the "optimal" solution would be: you haven't said which filesystem you're using, or whether you're actually enforcing quotas at the filesystem level. If you are, the filesystem will be keeping track of them, and you can just ask the filesystem for each user's current quota usage.

If not, then periodically running du -sk /home/* sounds reasonable as long as it's not done too often. And yes, if you reformat the output into Prometheus metrics you can just drop them into a file for the textfile collector to pick up. Prometheus itself will add the "instance" label, so you only need to add "username" and "mountpoint" labels. The mountpoint value has to match the one on the node_filesystem_* series you join against; in the setup described, where /home lives on the root filesystem, that is "/" rather than "/home".

Don't use du -sh because you'll get metrics like "25M" or "304K" and it will be up to you to normalise them.

For alerting, the simple template expansion you have is almost certainly not going to work: your users' addresses are almost certainly not all of the form user...@gmail.com. You could, however, make a static set of metrics mapping username to email:

{username="fred",email="fr...@flintstone.com"} 1

{username="wilma",email="w...@rubble.com"} 1

Then scrape this, and do another join in your PromQL to pick up the email label. This is similar to the approach from

https://www.robustperception.io/using-time-series-as-alert-thresholds
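
One way such a static mapping could be produced is sketched below in Python. The metric name user_email_info and the example.com addresses are hypothetical; the PromQL in the comment assumes there is exactly one mapping series per username (for example, the file is served from a single place rather than from every server).

```python
def render_email_map(emails, metric="user_email_info"):
    """Render {username: email} as static Prometheus metrics with value 1."""
    lines = [
        f"# HELP {metric} Static mapping from username to notification address.",
        f"# TYPE {metric} gauge",
    ]
    for username in sorted(emails):
        lines.append(
            f'{metric}{{username="{username}",email="{emails[username]}"}} 1'
        )
    return "\n".join(lines) + "\n"


# Hypothetical users and addresses, for illustration only.
text = render_email_map({
    "fred": "fred@example.com",
    "wilma": "wilma@example.com",
})
print(text, end="")

# A join along these lines would then copy the email label onto the alert
# series; multiplying by 1 leaves the value unchanged:
#   (<alert expression>) * on (username) group_left(email) user_email_info
```

The output can be written to a .prom file for the textfile collector or served by any other scrape target.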

Once you have the E-mail address as a label, then see

https://www.robustperception.io/using-labels-to-direct-email-notifications/
