How to auto-resolve an alarm?


Loïc

Jun 22, 2022, 4:27:52 AM
to Prometheus Users
Hi,

I use the mtail exporter to alert when a pattern matches in the Kubernetes logs. When my alarm is firing, I would like it to auto-resolve. I searched for a way to use the endsAt parameter in my rule but couldn't find one.

Also, I tried to use the PromQL function rate, but in that case my first occurrence is missed.

Do you have any ideas?

Thanks 
Loïc

Brian Candler

Jun 22, 2022, 6:11:40 AM
to Prometheus Users
> When my alarm is firing, I would like it to auto-resolve

Alerts are generated by a PromQL expression ("expr:").  For as long as this returns a non-empty instant vector, the alert is firing.  When the result is empty, the alert stops.

For example: I want to get an alert whenever the metric "megaraid_pd_media_errors" increases by more than 200.  But if it has been stable for 72 hours, I want the alert to go away.  This is what I do:

  - alert: megaraid_pd_media_errors_rate
    expr: increase(megaraid_pd_media_errors[72h]) > 200
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: 'Megaraid Physical Disk media error count increased by {{$value | humanize}} over 72h'


Every time the expr is evaluated, it's looking over the most recent 72 hours.  "increase" is like "rate", but its output is scaled up to the time period in question - i.e. instead of rate per second, it gives rate per 72 hours in this case.
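As a rough illustration of that scaling (this isn't from the rule above, and it ignores the extrapolation Prometheus does at the edges of the range window), the two functions are related approximately like this:

  increase(megaraid_pd_media_errors[72h])
    is approximately
  rate(megaraid_pd_media_errors[72h]) * 72 * 3600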

> I tried to use the PromQL function rate, but in that case my first occurrence is missed.

"rate" (and "increase") calculate the rate between two data points.  If the timeseries has only one data point, it cannot give a result.  It cannot assume that the previous data point was zero, because in general that may not be the case: prometheus could have been started when the counter was already above zero.

You should make your timeseries spring into existence with value 0 at the start.
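For example, a minimal mtail sketch of that idea (the program name and pattern are invented, and this only works cleanly for a counter without per-event labels): a counter declared at the top level is exported with value 0 as soon as mtail starts, so the first matching log line produces a visible 0 -> 1 increase.

  # errors.mtail - hypothetical example program
  counter dbms_errors_total

  /ERROR/ {
    dbms_errors_total++
  }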

Loïc

Jun 22, 2022, 8:36:31 AM
to Prometheus Users
Thanks Brian for your reply. 

In my use case, if I want to send the error log in the generated alarm, I have to add the error message as a label of my metric. The metric created by mtail is: test_dbms_error[$container,$namespace,$pod_name,$domain,$productname,$setname,$message]
As the error message is part of the metric, I can't create my sample with value 0 at the start. Indeed, the content of the error message is registered dynamically from the log, so I can't create the metric sample beforehand.

This is why I would like to use an Alertmanager or Prometheus parameter to auto-resolve my rule. Is that really not possible?

Loïc

Julien Pivotto

Jun 22, 2022, 8:51:26 AM
to Loïc, Prometheus Users
On 22 Jun 05:36, Loïc wrote:
> Thanks Brian for your reply.
>
> In my use case, if I want to send the error log in the generated alarm, I
> have to add the error message as a label of my metric. The metric created
> by mtail is:
> test_dbms_error[$container,$namespace,$pod_name,$domain,$productname,$setname,$message]
> As the error message is part of the metric, I can't create my sample with
> value 0 at the start. Indeed, the content of the error message is
> registered dynamically from the log, so I can't create the metric sample
> beforehand.
>
> This is why I would like to use an Alertmanager or Prometheus parameter
> to auto-resolve my rule. Is that really not possible?


This is generally not recommended in prometheus, but you could do

del test_dbms_error[$container,$namespace,$pod_name,$domain,$productname,$setname,$message] after 5m

in mtail.

Note the "after 5m"


--
Julien Pivotto
@roidelapluie

Loïc

Jun 22, 2022, 9:17:47 AM
to Prometheus Users
Hi Julien,

If there is no solution on the Prometheus configuration side, I will indeed consider deleting the sample with mtail; that would resolve the alarm... Do you know if this could cause problems at the Prometheus level?
Are you using mtail/prometheus with this configuration?

Thanks
Loïc

Brian Candler

Jun 22, 2022, 9:24:19 AM
to Prometheus Users
> if I want to send the error log in the generated alarm, I have to add the error message as a label of my metric.

That gives you a high cardinality label, which is not what Prometheus is designed for.  Every distinct combination of labels defines a new timeseries.

I can see two solutions here:

1. Use a log storage system like Loki or ElasticSearch/OpenSearch, rather than Prometheus

2. Include the error message as an "exemplar".  When you have multiple events in the same timeseries and time window, then you'll only get one exemplar.  But it may be good enough to give you an "example" of the type of error you're seeing, and it keeps the cardinality of your counters low. (Exemplars are experimental and need to be turned on with a feature flag, and I don't know if mtail supports them)
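For reference, and as far as I know, exemplar storage is switched on with a server feature flag like this (check the documentation for the Prometheus version you run):

  prometheus --enable-feature=exemplar-storage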

Loïc

Jun 22, 2022, 10:44:37 AM
to Prometheus Users
Thanks for your reply Brian :)

Loïc

Jun 23, 2022, 3:13:40 AM
to Prometheus Users
Hi,

If I use a label to store the message field, do you know the maximum string length that should not be exceeded?
Is there a recommendation on the maximum size?

Thanks
Loïc

Brian Candler

Jun 23, 2022, 3:34:53 AM
to Prometheus Users
The length of the label doesn't really matter in this discussion: you should not be putting a log message in a label at all.  *Any* label which varies from request to request is a serious problem, because each unique value of that label will generate a new timeseries in Prometheus, and you'll get a cardinality explosion.

Internally, Prometheus maintains a mapping of
     {bag of labels} => timeseries

Whether the labels themselves are short or long makes very little difference.  It's the number of distinct values of that label which is important, because that defines the number of timeseries.  Each timeseries has impacts on RAM usage and chunk storage.

If you have a limited set of log categories - say a few dozen values - then using that as a label is fine.  The problem is a label whose value varies from event to event, e.g. it contains a timestamp or an IP address or any varying value.  You will cause yourself great pain if you use such things as labels.

But don't take my word for it - please read

"CAUTION: Remember that every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values."

I completely understand your desire to get specific log messages in alerts. If you need to do that, then as I said before, use Loki instead of Prometheus.  Loki stores the entire log message, as well as labels.  It has its own LogQL query language inspired by PromQL, and integrates with Grafana and alerting.  It's what you need for handling logs, rather than metrics.
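To give a flavour of it, a hedged LogQL sketch (the label selector and filter string are invented): the first expression selects and filters log lines, the second counts matches over five minutes, which is the sort of expression you could alert on in the Loki ruler.

  {namespace="my-namespace", container="dbms"} |= "ERROR"

  count_over_time({namespace="my-namespace", container="dbms"} |= "ERROR" [5m]) > 0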

(If you still want to do this with prometheus, it would be an interesting project to see if you can get exemplars in an alert.  But I suspect this would involve hacking mtail, alertmanager and even prometheus itself.  This is something only to be attempted by a serious Go coder)

Loïc

Jun 24, 2022, 10:59:11 AM
to Prometheus Users
Thanks Brian for your very interesting reply.

As you indicated, I will avoid having too many time series.
Regarding the length of the label, I searched for a recommendation but didn't find one. Do you have any information on this topic?

Thanks
Loïc

Brian Candler

Jun 24, 2022, 1:17:42 PM
to Prometheus Users
By default the actual length is unlimited, although if you use stupidly long label names or values, you will have impacts on the size of RAM used, the size of API responses etc.

Since v2.27 there are now configuration options to limit them to protect against misbehaving exporters, see:
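For illustration, I believe these are the relevant per-scrape-config settings (names as I remember them - verify them against the docs for the release you run):

  scrape_configs:
    - job_name: 'my-job'               # hypothetical job name
      label_limit: 64                  # max number of labels accepted per scraped series
      label_name_length_limit: 128     # max length of a label name
      label_value_length_limit: 1024   # max length of a label value
      static_configs:
        - targets: ['localhost:9090']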