Thank you both for taking the time to answer my questions!
The main use case I've been thinking about is being able to differentiate between flapping alerts in the alert generator (Prometheus) and flapping alerts in the alert receiver (Alertmanager). In the former case, the alert is flapping because the data is alternating around the condition without stabilizing. In the latter case, the alert generator is failing to keep the alert receiver informed about the state of the alert before its expiration time (EndsAt).
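To illustrate the second case, here's a rough sketch of the expiry contract as I understand it: the generator keeps an alert alive at the receiver by resending it with a fresh EndsAt, and if the resends stop, the receiver resolves the alert once EndsAt passes. This is my own toy model, not the actual Prometheus/Alertmanager code, and the 4x-resend-delay validity window is an assumption based on commonly cited defaults:

```go
// Toy model (not actual Prometheus source): the generator keeps an alert
// "alive" at the receiver by resending it with a fresh EndsAt before the
// previous EndsAt passes. If resends stop, the receiver expires the alert.
package main

import (
	"fmt"
	"time"
)

type Alert struct {
	Labels   map[string]string
	StartsAt time.Time
	EndsAt   time.Time // the receiver treats the alert as resolved after this
}

// refresh extends the alert's lifetime; the 4x resend-delay window here is
// an illustrative assumption, not a value taken from the source.
func refresh(a *Alert, now time.Time, resendDelay time.Duration) {
	a.EndsAt = now.Add(4 * resendDelay)
}

// expiredAt reports whether the receiver would consider the alert resolved.
func expiredAt(a *Alert, now time.Time) bool {
	return now.After(a.EndsAt)
}

func main() {
	now := time.Now()
	a := &Alert{Labels: map[string]string{"alertname": "HighLatency"}, StartsAt: now}
	refresh(a, now, time.Minute)

	// If the generator misses its resends (e.g. it is overloaded or
	// partitioned from the receiver), the alert expires and later reappears,
	// which the receiver cannot distinguish from genuine flapping.
	fmt.Println(expiredAt(a, now.Add(3*time.Minute))) // false: still within EndsAt
	fmt.Println(expiredAt(a, now.Add(5*time.Minute))) // true: resends stopped, alert "resolves"
}
```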
In either case, I'm not proposing that alerts become more responsive to flapping. However, based on what I've learned about Prometheus and Alertmanager so far, and on the answers above, differentiating between the two is not a goal of Prometheus; rather, the goal is the opposite: to make them look the same.
> For example, this is important for the Alertmanager to see alerts from multiple Prometheus servers as identical if they have the same label set, even if they began and were resolved at slightly different times.
Indeed! The other use case is making it easier to debug flapping alerts, including when there are multiple Prometheus servers sending alerts to an Alertmanager. The motivation here is that I've been debugging a number of flapping alerts, and it can be hard to understand where the flapping is coming from.
> An "alert" in that sense is different from an "incident" or particular time-based instance of an alert, which Prometheus does not explicitly model. The closest thing to that is the Alertmanager taking in varying alert states over time and turning them into discrete notifications while applying throttling and grouping mechanisms. Those can prevent some flapping on the notification front, and careful alerting rules (averaging over large enough durations, using "for" durations, etc.) can do their part as well.
Thanks for the explanation here! I think this was the main design choice I wanted to understand.
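To make sure I understand the "for" duration part of that: here's a toy state machine showing how a continuous-hold requirement absorbs short flaps in the underlying data. This is my own illustrative logic, not Prometheus internals:

```go
// Illustrative sketch (an assumption of mine, not Prometheus internals):
// a "for" duration requires the condition to hold continuously before the
// alert fires, which absorbs short flaps in the underlying data.
package main

import (
	"fmt"
	"time"
)

type State int

const (
	Inactive State = iota // 0
	Pending                // 1
	Firing                 // 2
)

type rule struct {
	forDuration time.Duration
	activeSince time.Time
	state       State
}

// eval is called once per evaluation interval with the condition's result.
func (r *rule) eval(now time.Time, conditionTrue bool) State {
	switch {
	case !conditionTrue:
		r.state = Inactive // any false evaluation resets the pending timer
	case r.state == Inactive:
		r.state, r.activeSince = Pending, now
	case r.state == Pending && now.Sub(r.activeSince) >= r.forDuration:
		r.state = Firing
	}
	return r.state
}

func main() {
	r := &rule{forDuration: 5 * time.Minute}
	t := time.Now()
	// A brief flap (true, false, true) never reaches Firing, because the
	// false evaluation resets the timer. Prints 0=Inactive, 1=Pending, 2=Firing.
	for i, cond := range []bool{true, false, true, true, true, true, true, true} {
		fmt.Println(i, r.eval(t.Add(time.Duration(i)*time.Minute), cond))
	}
}
```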
But I have to ask: what is the purpose of Prometheus sending a StartsAt time to Alertmanager? This creates a time-based instance of an alert, because alerts end up with definitive StartsAt times, so Prometheus is, in a way, both modelling and not modelling time-based alerts at the same time.
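For context, this is roughly the shape of what gets posted. The `/api/v2/alerts` endpoint and the `startsAt`/`endsAt` field names come from the Alertmanager v2 API; everything else in this standalone sketch (alert names, label values, timings) is made up for illustration:

```go
// Sketch of pushing an alert to Alertmanager's v2 API by hand. The endpoint
// (/api/v2/alerts) and the startsAt/endsAt fields are part of the real API;
// the specific labels and durations below are illustrative assumptions.
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
	"time"
)

type postableAlert struct {
	Labels      map[string]string `json:"labels"`
	Annotations map[string]string `json:"annotations,omitempty"`
	StartsAt    time.Time         `json:"startsAt,omitempty"`
	EndsAt      time.Time         `json:"endsAt,omitempty"`
}

func main() {
	now := time.Now()
	alerts := []postableAlert{{
		Labels:   map[string]string{"alertname": "HighLatency", "severity": "page"},
		StartsAt: now.Add(-10 * time.Minute), // a definitive start time...
		EndsAt:   now.Add(4 * time.Minute),   // ...but a rolling expiry
	}}

	body, err := json.Marshal(alerts)
	if err != nil {
		log.Fatal(err)
	}
	// localhost:9093 is the conventional Alertmanager address; adjust as needed.
	resp, err := http.Post("http://localhost:9093/api/v2/alerts", "application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println(resp.Status)
}
```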
I think the StartsAt time of an alert can also go backwards when running Prometheus in HA, because different Prometheus servers will have different offsets for the same evaluation group depending on when the Prometheus process first started.
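Here's a toy illustration of the offset effect I mean, under the assumption that each replica anchors its evaluation schedule differently. The replica whose tick lands later records a later StartsAt for the same underlying condition, so the deduplicated alert's StartsAt can shift depending on which replica's notification arrives:

```go
// Toy illustration (assumption: each HA replica evaluates the same rule
// group on its own offset schedule). Two replicas observing the same
// condition at different ticks record different StartsAt values.
package main

import (
	"fmt"
	"time"
)

// firstEvalAtOrAfter returns the first evaluation timestamp >= t on a
// schedule with the given interval and per-server offset.
func firstEvalAtOrAfter(t time.Time, interval, offset time.Duration) time.Time {
	anchor := t.Truncate(interval).Add(offset)
	for anchor.Before(t) {
		anchor = anchor.Add(interval)
	}
	return anchor
}

func main() {
	interval := time.Minute
	conditionBecameTrue := time.Date(2024, 1, 1, 12, 0, 5, 0, time.UTC)

	// Offsets of 10s and 40s are arbitrary; the point is only that they
	// differ between replicas, yielding different StartsAt for the "same" alert.
	for _, offset := range []time.Duration{10 * time.Second, 40 * time.Second} {
		startsAt := firstEvalAtOrAfter(conditionBecameTrue, interval, offset)
		fmt.Printf("offset %v -> StartsAt %v\n", offset, startsAt)
	}
}
```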