> [...]
> (1) as I said, it's a bad way to build a monitoring system
It is of course cleaner, in a theoretical way, to place the thresholds in a separate location, and not change the disk metrics every time you relabel them in order to move disks to a separate threshold alert.
But I am not convinced that your solution is better in practice, especially for small networks like mine.
What you are suggesting is actually a work-around. It feels like Prometheus is missing an easy way to assign alerts to an arbitrary set of metrics, so you have to simulate metrics to provide thresholds. Then you can use existing PromQL syntax to check those thresholds against their disks.
If I understood the idea correctly, these virtual metrics would just provide the same values for all timestamps, because only the disk instance is relevant. That is an indication that the concept is not clean, just a work-around. Those "virtual" metrics are going to waste storage space, because they are real time-series as far as Prometheus is concerned. They are going to double the number of time-series related to windows_logical_disk_free_bytes, because each disk instance metric will need a threshold counterpart. If you have thousands of disks, you can argue that this solution does not scale well either.
Associating disks with alert thresholds on an arbitrary basis is a very common requirement. I think that is going to happen all over the place. For example, you may have many thermometers measuring temperatures in the same way, but each temperature may require its own alert threshold. I am surprised that Prometheus makes this hard to achieve.
Your solution seems to be designed to assign an independent threshold per disk, but the most common scenario is that you will only have a small number of thresholds. For example, all Windows system disks (normally C:) will need one alert threshold, and all Linux system disks will need another. There will probably be a small number of data disk categories, say log disks, photo disks and document disks, and each one will need a separate alert threshold. But it is improbable that every disk will need a custom threshold. Similarly, if you are alerting based on temperatures, you will probably have groups too, like ambient temperature, fridge temperature and freezer temperature. Not every thermometer will need a custom alert threshold.
So you need one alarm per threshold, and then a way to assign arbitrary disks or thermometers to one alarm. The easiest way now is probably to use labels. But in fact you are looking for a switch statement:
switch ( computer-instance, disk-volume )
{
case PC1, Volume1: Assign to Alert A.
case PC3, Volume3: Assign to Alert K.
default: Assign to Alert M.
}
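To make the idea concrete, here is a minimal sketch in Python of the kind of lookup I mean. This is not Prometheus configuration, and the names ALERT_BY_DISK, DEFAULT_ALERT and alert_for are made up for illustration. The point is only that each disk is listed once, and everything not listed falls through to a default alert.

```python
# Hypothetical mapping from (computer instance, disk volume) to an alert group.
# The entries mirror the switch statement above; none of this is Prometheus syntax.
ALERT_BY_DISK = {
    ("PC1", "Volume1"): "A",
    ("PC3", "Volume3"): "K",
}

# Alert group used for every disk that is not listed explicitly.
DEFAULT_ALERT = "M"


def alert_for(instance: str, volume: str) -> str:
    """Return the alert group for a disk, falling back to the default."""
    return ALERT_BY_DISK.get((instance, volume), DEFAULT_ALERT)


if __name__ == "__main__":
    for disk in [("PC1", "Volume1"), ("PC3", "Volume3"), ("PC2", "Volume7")]:
        print(disk, "-> Alert", alert_for(*disk))
```

With only a handful of alert groups, the table grows by one line per disk that deviates from the default, which is about the same amount of work as one relabeling rule per disk.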
> (2) in the limit, you will end up with a separate rewriting rule for every instance+volume combination
That's not too bad. The rewriting rules are just adding a label to each disk. They are perhaps rather verbose, as Prometheus syntax tends to be, but those rewriting rules are (or can be) close to the disks they apply to. After all, you have to decide which threshold to apply to each disk, and where exactly you do that, or how verbose it is, does not make much difference in my opinion.
> This doesn't scale to thousands of alerting rules, but neither does metric relabeling with thousands of rules.
- If you solve this problem with alarms, you have to write or modify an alarm per disk you add. You may end up with many alarms.
- If you solve this problem with relabeling, you have to create or modify a label rewriting rule per disk you add. You may end up with many rewriting rules.
- If you solve this problem with virtual threshold metrics, you have to create a virtual metric per disk you add. You may end up with many metrics.
The difference in scalability is not great, as far as I can see (with my rather limited Prometheus knowledge).
Regards,
rdiez