Alert for high-frequency changes of a metric?

Moses Moore

Feb 28, 2020, 11:21:53 AM
to Prometheus Users
(Looks like my previous ask of this question got spamblocked because I included a screenshot.  c'est la vie.)

I have alerts for when a metric's value passes above or below a threshold.  I can ask for the minimum or maximum over a time range, and I can ask for a prediction based on the slope of a graph.

I have some resources that I know will fail soon after their metrics fluctuate wildly over a short period of time.  They may never exceed the absolute value of 85% during their fluctuations, or they may exceed it briefly but not long enough to cause concern if it were a smooth line.  For example, if the samples over time were [30, 30, 31, 70, 5, 69, 6, 71, 5, 69, null, null, null], I want to detect the problem before the metric goes absent (because the resource crashed).

Setting the threshold at ">69" doesn't work because the value drops below the threshold on the next scrape, closing the alert; besides, if it were at a steady 69 that would be healthy.
Setting the threshold on "avg_over_time(metric[interval])" doesn't work because the average of an oscillating metric will be well within the healthy range.
I thought of setting an alert for "max_over_time - min_over_time > 50" but that would also trigger on a smooth ascent -- a false positive.
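
To make the two failure modes above concrete, here is a minimal Python sketch (not PromQL) that applies the same arithmetic to the sample values from this post. The `smooth_ramp` series is a made-up example of a healthy, steadily rising metric, and the function names are illustrative only.

```python
# Samples from the post (trailing nulls omitted) and a hypothetical
# smooth ramp that stays under the 85% threshold.
oscillating = [30, 30, 31, 70, 5, 69, 6, 71, 5, 69]
smooth_ramp = [30, 36, 42, 48, 54, 60, 66, 72, 78, 84]  # assumption

def average(samples):
    """Plain mean of the window, in the spirit of an average over time."""
    return sum(samples) / len(samples)

def spread(samples):
    """Maximum minus minimum of the window."""
    return max(samples) - min(samples)

# The average of the oscillating series looks healthy.
print("average, oscillating:", average(oscillating))

# The spread exceeds 50 for both series, so it cannot tell them apart.
print("spread, oscillating:", spread(oscillating))
print("spread, smooth ramp:", spread(smooth_ramp))
```

The average hides the oscillation, and the spread flags the healthy ramp as well as the unhealthy series.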

What question should I ask Prometheus to detect a metric that vibrates too much?

Łukasz Mierzwa

Feb 28, 2020, 11:26:54 AM
to Prometheus Users
Did you try something like "changes(foo[30m]) > 10" ? That would alert if the value changed 10 times in the last 30 minutes.
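
For reference, a rough Python sketch of what a count of value changes looks like on the samples from the original question. It assumes that a "change" means a sample differing from the one before it. The `count_changes` helper and the `smooth_ramp` series are hypothetical and only illustrate the idea outside of Prometheus.

```python
# Samples from the original question (trailing nulls omitted) and a
# hypothetical healthy ramp for comparison.
oscillating = [30, 30, 31, 70, 5, 69, 6, 71, 5, 69]
smooth_ramp = [30, 36, 42, 48, 54, 60, 66, 72, 78, 84]  # assumption

def count_changes(samples):
    """Count how many samples differ from the sample before them."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur != prev)

print("changes, oscillating:", count_changes(oscillating))
print("changes, smooth ramp:", count_changes(smooth_ramp))
```

A count like this ignores the size and direction of each change, so a gently rising gauge can score as high as a wildly oscillating one.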

Moses Moore

Mar 2, 2020, 10:49:50 AM
to promethe...@googlegroups.com
>Did you try something like "changes(foo[30m]) > 10" ?
> That would alert if the value changed 10 times in the last 30 minutes.

Good guess.  If it were a boolean ([0, 1]) then your idea would detect
flip-flopping (flapping?) over time.

A metric can be healthy while changing constantly, but gently -- a
smoothly increasing slope or a long sine-wave.  I'm hoping to detect
high frequency of deviant amplitudes of change.  Think of a temperature
sensor that rises and falls over the course of a day, but when something
overheats it will spike, then the device throttles down until the
temperature is under a threshold, then spikes again as it tries
operating at full capacity... Each time, it drops back into the
"healthy" range, so an alert that looks merely for exceeding a
threshold will never leave alertstate="pending".

Really wish I could post a picture to illustrate the pattern I'm looking
for.  It's obvious to a human eye.

Maybe something like a combination of changes() and rate(), with some
*_over_time aggregation?  Seems too complex; I hope I'm overthinking it.
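
One way to picture such a combination is to add up the size of every step in the window and subtract the net movement from first to last sample. What is left over is movement that cancelled itself out. The Python sketch below is only an illustration of that arithmetic, not a tested PromQL rule. The names `total_variation` and `excess_movement`, the `smooth_ramp` series, and the threshold of 100 are all assumptions.

```python
# Samples from the first message (trailing nulls omitted) and a
# hypothetical healthy ramp for comparison.
oscillating = [30, 30, 31, 70, 5, 69, 6, 71, 5, 69]
smooth_ramp = [30, 36, 42, 48, 54, 60, 66, 72, 78, 84]  # assumption

def total_variation(samples):
    """Sum of the absolute differences between consecutive samples."""
    return sum(abs(cur - prev) for prev, cur in zip(samples, samples[1:]))

def excess_movement(samples):
    """Movement that did not contribute to the net change over the window."""
    net_change = abs(samples[-1] - samples[0])
    return total_variation(samples) - net_change

THRESHOLD = 100  # hypothetical value, to be tuned per metric

for name, series in [("oscillating", oscillating), ("smooth ramp", smooth_ramp)]:
    excess = excess_movement(series)
    print(name, "excess:", excess, "alert:", excess > THRESHOLD)
```

A monotonic ramp has zero excess movement no matter how steep it is, while back-and-forth swings accumulate quickly. A PromQL analogue might be built from a subquery over abs(delta(...)), but that is an untested assumption here.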
> --
> You received this message because you are subscribed to a topic in the
> Google Groups "Prometheus Users" group.
> To unsubscribe from this topic, visit
> https://groups.google.com/d/topic/prometheus-users/6BCaoU4WCS8/unsubscribe.
> To unsubscribe from this group and all its topics, send an email to
> prometheus-use...@googlegroups.com
> <mailto:prometheus-use...@googlegroups.com>.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/prometheus-users/5f7ef3b8-a311-4d98-8bcb-2594e6aaef80%40googlegroups.com
> <https://groups.google.com/d/msgid/prometheus-users/5f7ef3b8-a311-4d98-8bcb-2594e6aaef80%40googlegroups.com?utm_medium=email&utm_source=footer>.

Murali Krishna Kanagala

Mar 2, 2020, 11:24:26 AM
to Moses Moore, Prometheus Users

Ben Kochie

Mar 2, 2020, 12:16:25 PM
to Moses Moore, Prometheus Users
You're basically talking about anomaly detection. There are lots of articles on the subject.

Your thermal example is also easily solved by using `predict_linear()`.
