working of AND in rule expr


rashmi Rashmitha

Jan 9, 2022, 12:44:31 PM
to Prometheus Users
Hi Team,

I'm having this rule expr: 
up{appName="app"} == 1 and ON(appName) histogram_quantile(.99, sum(rate(http_server_requests_seconds_bucket{uri="<uri>",status="200",appName="app"}[10m])) by (le,appName)) > 0.01

It fires as expected, i.e. when the app is up and the 99th percentile for the specific uri is > 10ms (0.01). When I receive a firing alert, I do some processing.
However, a resolved alert is sent when either up is 0, the percentile drops below 10ms, or the http_server_requests_seconds_bucket metric is not available.
Is it possible to get a resolved alert only when the app is back up and running, the percentile is < 10ms (0.01), and the metric is also present?

Thanks,
Rashmitha Chatla

Brian Candler

Jan 9, 2022, 1:19:19 PM
to Prometheus Users
An alert is "resolved" exactly when the alert expression no longer generates any value.  Since you have "up == 1" as part of the expression, then when up != 1 it will no longer fire - and therefore it resolves.  Similarly if the RHS has no available value.

I think the best advice here is simply: "don't send resolved messages".  See:

The philosophy is that if something went wrong, you should always investigate why it went wrong, and only treat the incident as closed when you've established manually what the issue was, having made whatever fix is required to stop it happening again.  If you send out resolved notifications, you're inviting people to say "oh OK, it's resolved by itself; nothing for me to do now" and ignore the problem.

So personally I'd turn off the resolved notifications, and instead make a dashboard where people can see the status of the service: is it up? Has it been handling any requests? If it has, what's the 99th percentile response time?  How has that changed over time?  Then they can examine this (and other metrics) to determine if the service is now healthy.

There is more good advice on alerting in general in this excellent document from a Google site reliability engineer:

rashmi Rashmitha

Jan 9, 2022, 1:59:31 PM
to Brian Candler, Prometheus Users
Hi,

Essentially, an automated workflow is triggered when I receive the above firing alert, which I capture via a webhook.
In that automated workflow I stop the app and restart the whole EC2 instance. So here, up becomes 0 when I stop the app,
and the http_server_requests_seconds_bucket metric is no longer available, which results in the alert resolving (the expected behaviour, given how Prometheus resolves alerts).

So I'm planning to use these rules (assuming A1 fires first and A2 later in the sequence):
groups:
  - name: <name>
    rules:
    - alert: 'A1'
      expr: histogram_quantile(.99, sum(rate(http_server_requests_seconds_bucket{uri="<uri>",status="200",appName="app"}[5m])) by (le,appName)) > bool 0.01
      labels:
        app_name: app
    - alert: A2
      expr: up{job="X"} == 0
      labels:
        app_name: app

1. A1 fires when its condition is met and calls automated_workflow_1, which stops the app.
2. When the app is stopped, A2's condition is met and it fires, calling automated_workflow_2, which restarts the instance.
3. Once the instance is up, the app is started, which results in a resolved alert for A2.
(In between, a resolved alert can be sent for A1, since its metric is not available while the app is stopped.)

So is there a way I can silence/ignore the resolved alert for A1 only in Prometheus/Alertmanager, so that I get a resolved alert for A2 alone once the app is back?

Thanks,
Rashmitha


Brian Candler

Jan 9, 2022, 3:43:04 PM
to Prometheus Users
On Sunday, 9 January 2022 at 18:59:31 UTC rashmira...@gmail.com wrote:
So is there a way I can silence/ignore the resolved alert for A1 only in Prometheus/Alertmanager, so that I get a resolved alert for A2 alone once the app is back?

Sure.  Create two alert receivers, one with send_resolved: true and one without.  Route the A1 alert to one, and the A2 alert to the other.
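
For illustration, a minimal sketch of such a config (hedged: the receiver names and webhook URLs are placeholders, the alertname values A1/A2 are taken from the rules above, and send_resolved defaults to true, so it only needs an explicit false where resolved notifications should be suppressed):

receivers:
  - name: a1_webhook
    webhook_configs:
    - url: "<webhook_url_for_A1>"
      send_resolved: false    # firing notifications only for A1
  - name: a2_webhook
    webhook_configs:
    - url: "<webhook_url_for_A2>"
      send_resolved: true     # firing and resolved notifications for A2

route:
  receiver: a1_webhook        # default receiver
  group_by: ['alertname']
  routes:
    - match:
        alertname: A1
      receiver: a1_webhook
    - match:
        alertname: A2
      receiver: a2_webhook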

rashmi Rashmitha

Jan 10, 2022, 1:56:10 PM
to Prometheus Users
Hi

I have the Alertmanager config below.

global:
  resolve_timeout: "5m"

receivers:
  - name: response_time_webhook
    webhook_configs:
    - send_resolved: false
      URL: "<webhook_url>"

  - name: up_mtr_webhook
    webhook_configs:
    - send_resolved: true
      url: "<webhook_url>"

route:
  receiver: response_time_webhook
  repeat_interval: 3h
  group_by: ['alertname']
  routes:
    - match:
        alertname: response_time_mtr
        receiver: response_time_webhook
    - match:
        alertname: up_mtr
        receiver: up_mtr_webhook

and the following alerting rules:

groups:
  - name: Spring-boot
    rules:
    - alert: response_time_mtr
      expr: histogram_quantile(.99, sum(rate(http_server_requests_seconds_bucket{uri="<uri>",status="200",appName="app"}[5m])) by (le,appName)) > bool 2
      labels:
        app_name: app
      annotations:
        summary: '(instance {{ $labels.instance }})'
    - alert: up_mtr

      expr: up{job="X"} == 0
      labels:
        app_name: app
      annotations:
        summary: "app down alert"

I would like to route my alerts based on alert name: for response_time_mtr I don't require resolved alerts, and for up_mtr I do.
But the above is not working as expected; resolved alerts are not sent for either alert. Am I going wrong anywhere? Please suggest.

Thanks,
Rashmitha Chatla


Brian Candler

Jan 10, 2022, 4:40:02 PM
to Prometheus Users
You have a routing rule that matches on multiple labels:

    - match:
        alertname: response_time_mtr
        receiver: response_time_webhook

... but you've not set the "receiver" label anywhere as far as I can see, so this rule will fail to match.  It will only match if *all* the conditions are true.

You can add extra labels to your alert in the rule that generates it, hence:

groups:
  - name: Spring-boot
    rules:
    - alert: response_time_mtr
      expr: histogram_quantile(.99, sum(rate(http_server_requests_seconds_bucket{uri="<uri>",status="200",appName="app"}[5m])) by (le,appName)) > bool 2
      labels:
        app_name: app
        receiver: response_time_webhook

      annotations:
        summary: '(instance {{ $labels.instance }})'

    - alert: up_mtr
      expr: up{job="X"} == 0
      labels:
        app_name: app
        receiver: up_mtr_webhook
      annotations:
        summary: "app down alert"

However, you almost certainly don't want

    foo > bool 2

as your alerting expression.  This will always generate a value as long as "foo" exists, and therefore will always generate an alert.

Alerts are triggered by the *presence* of a value - any value.  Note that:

    foo > 2

is a filter; it will generate an instant vector containing all the timeseries with metric name "foo" whose value is greater than 2.  If none of them pass this filter, then it will be an empty instant vector - and hence no alert is generated.

    foo > bool 2

will generate as many values in its instant vector result as the number of timeseries for "foo", and will always alert unless there are zero timeseries with metric name "foo".
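
So for a firing alert only when the 99th percentile actually exceeds the threshold, here is a hedged sketch of the rule with the bool modifier dropped (uri and appName remain the placeholders from the original rule):

    - alert: response_time_mtr
      expr: histogram_quantile(0.99, sum(rate(http_server_requests_seconds_bucket{uri="<uri>",status="200",appName="app"}[5m])) by (le,appName)) > 2
      labels:
        app_name: app
      annotations:
        summary: '(instance {{ $labels.instance }})'

With the plain "> 2" comparison the expression returns no samples while the percentile stays at or below 2 seconds, so the alert fires only when the threshold is crossed and resolves once it is no longer exceeded (or the metric disappears).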

rashmi Rashmitha

Jan 10, 2022, 10:55:50 PM
to Brian Candler, Prometheus Users
Hi,

Thanks Brian.
So how can I modify the above rule expr? I want a firing alert to be generated when the 99th percentile is > 2 sec.

Thanks

rashmi Rashmitha

Jan 10, 2022, 11:53:02 PM
to Prometheus Users
I got it now, Thanks