Alert manager looping in firing -> resolved -> firing


piyush sharma

Apr 28, 2020, 3:56:21 AM
to Prometheus Users
Hi all,

I am badly stuck on a problem.
Alertmanager sends a resolved notification on its own even though the alert is still active.
I want to disable this behaviour: a "resolved" notification should be sent only when the alert is really resolved.

Below is my alert manager configuration

apiVersion: v1
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 12h
    receivers:
    - name: alertnow
      slack_configs:
      - channel: '#stage_dict_app_events_and_alerts'
        send_resolved: true
        text: |-
          {{ range .Alerts }}
             *Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
             *Description:* {{ .Annotations.description }}
             *Details:*
             {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
             {{ end }}
            {{ end }}
        title: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing
          | len }}{{ end }}] Cloud|Azure|Monitoring Event Notification'
      webhook_configs:
      - send_resolved: true
    route:
      group_by:
      - alertname
      - locale
      group_interval: 5m
      group_wait: 5m
      receiver: alertnow
      repeat_interval: 8h
kind: ConfigMap

Prometheus config : 

 prometheus.yaml.tmpl: |
    global:
      evaluation_interval: 1m
      external_labels:
        region: EastUS
        replica: $(POD_NAME)
        tier: stg
      scrape_interval: 1m
      scrape_timeout: 10s


Please help.

Brian Candler

Apr 28, 2020, 4:25:26 AM
to Prometheus Users
On Tuesday, 28 April 2020 08:56:21 UTC+1, piyush sharma wrote:
I am badly stuck in a problem .
One main thing is that .. alert manager sends resolve notification on its own but the alert is still active.
I want to disable this feature. I want "resolved " alert to be sent only when alert is really resolved.


That's not true.  Alertmanager only sends resolved messages when the alert is resolved.

You therefore need to describe your setup further, in particular:

- what is the alerting rule which is generating this alert?

- are you using any sort of clustering?

And of course, it never hurts to mention the versions of prometheus and alertmanager you are running.

piyush sharma

Apr 28, 2020, 5:12:46 AM
to Brian Candler, Prometheus Users
Dear Brian,

Thanks for the response. Here is one of the alerts that is causing this behaviour:

apiVersion: v1
data:
  alerting_rules.yml: |
    groups:
    - name: k8s.rules
      rules:
      - alert: Health down Alert
        annotations:
          description: Attention !!! Health of dict-service in {{ $labels.locale }} is
            down !!!. Current value is {{ $value}} percent
          summary: Health of  dict-service for  {{ $labels.locale }} is down !!!
        expr: sum(stackdriver_aws_ec_2_instance_logging_googleapis_com_user_azure_qaops_client_health_successrate_sum{tier=~".*",rampcode=~"dict",
          region=~".*"}) by (locale) / sum(stackdriver_aws_ec_2_instance_logging_googleapis_com_user_azure_qaops_client_health_successrate_count{tier=~".*",rampcode=~"dict",
          region=~".*"}) by (locale) >= 80 < 99
        for: 5m
        labels:
          severity: warning
      - alert: Health down Alert
        annotations:
          description: Attention !!! Health of dict-service for {{ $labels.locale }}  is
            down !!!. Current value is {{ $value}} %
          summary: Health of  dict-service for  {{ $labels.locale }} is down !!!
        expr: sum(stackdriver_aws_ec_2_instance_logging_googleapis_com_user_azure_qaops_client_health_successrate_sum{tier=~".*",rampcode=~"dict",
          region=~".*"}) by (locale) / sum(stackdriver_aws_ec_2_instance_logging_googleapis_com_user_azure_qaops_client_health_successrate_count{tier=~".*",rampcode=~"dict",
          region=~".*"}) by (locale) >= 0 < 80
        for: 5m
        labels:
          severity: critical

--> Alertmanager version is 0.20
--> Prometheus version (we are actually working with Thanos): 0.11
--> Both Alertmanager and Prometheus are deployed as StatefulSets and accessed through a stateless Service



--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/6980ded2-9541-4334-b7a5-b62a7fe44058%40googlegroups.com.

Brian Candler

Apr 28, 2020, 5:52:22 AM
to Prometheus Users
That's a complicated expression.

I suggest you paste the whole expression into the promql browser (i.e. prometheus port 9090) and look at the graph.  If you see gaps in the graph, that's where the expression does not have any value, and that's where the alert is getting resolved.

Note that while you can configure prometheus to require an alert to be firing for a certain amount of time before generating an alert ("for: 5m"), you cannot configure it for an alert to be *not firing* for a certain amount of time before it is resolved.  As soon as your expression does not generate a value, even for one evaluation cycle, it will be considered resolved.
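One possible workaround (my suggestion, not something discussed above) is to evaluate the alert expression over a trailing window with a subquery, so a single missing evaluation does not immediately resolve the alert. A sketch, using a hypothetical metric:

```
# Fires if the threshold was crossed at any point in the last 10 minutes,
# so short gaps in the data no longer resolve the alert straight away.
# http_errors_total is a hypothetical metric name.
max_over_time(
  (sum(rate(http_errors_total{code=~"5.."}[5m])))[10m:1m]
) > 5
```

The trade-off is that resolution is then delayed by the window length.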

piyush sharma

Apr 28, 2020, 6:03:45 AM
to Brian Candler, Prometheus Users
Hey,

You are truly a rockstar.

Yeah, in fact the data was not coming.

Is there any way I can treat the no-data condition as alerting?



Brian Candler

Apr 28, 2020, 6:34:11 AM
to Prometheus Users
On Tuesday, 28 April 2020 11:03:45 UTC+1, piyush sharma wrote:
Is there any way I can treat the no-data condition as alerting?

You can use "or" to give a default value, e.g.

(.... expr ....) or (up * 99)

(assuming that 'expr' and 'up' have the same set of labels - if not, then you can use grouping terms).  See also the tail end of

Or you can alert using the absent() function: see
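A minimal sketch of the absent() approach, reusing the metric name from the alerting rules above (the alert name, duration, and severity are assumptions):

```yaml
- alert: HealthDataMissing
  expr: absent(stackdriver_aws_ec_2_instance_logging_googleapis_com_user_azure_qaops_client_health_successrate_sum{rampcode="dict"})
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: No health data received for dict-service
```

Note that absent() only tells you that no series matched the selector at all; it cannot tell you which locale disappeared.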

Maybe also useful:

piyush sharma

Apr 28, 2020, 3:01:58 PM
to Brian Candler, Prometheus Users
Thanks for all your help, I really appreciate it.
One more doubt:

My application metrics are based on locales, such as ru-RU, en-GB, es-US.
There is some issue with the application: metrics are coming in twice, like

 ru-RU = 10
 ru-ru = 10

I want to filter out the series whose locale is entirely lower case.

How can I achieve this?


Brian Candler

Apr 28, 2020, 3:26:36 PM
to Prometheus Users
You haven't shown real examples of what these metrics look like.  I'm guessing you're not talking about metric names, but label values.

If you want to filter out metrics which have particular labels or label patterns, then you need to use metric_relabel_configs - although it's better to fix your exporters not to generate the bad metrics in the first place.
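As a sketch, assuming the duplicate series differ only in an all-lowercase `locale` label value (the job name is hypothetical):

```yaml
scrape_configs:
  - job_name: dict-service
    # ... targets etc. ...
    metric_relabel_configs:
      # Relabel regexes are fully anchored, so this matches ru-ru but not ru-RU.
      - source_labels: [locale]
        regex: '[a-z]{2}-[a-z]{2}'
        action: drop
```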

piyush sharma

Apr 29, 2020, 3:20:48 AM
to Brian Candler, Prometheus Users
Hello again,

Sorry for so many queries, but I am a newbie to Prometheus.
I am running Alertmanager in a cluster.
The problem is that both Alertmanagers are sending alerts, so duplication is happening.
Below is my configuration:

--cluster.peer-timeout=25s

global:
  resolve_timeout: 12h
  slack_api_url: 'https://hooks.slack.com/services/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'
route:
  receiver: "alertnow"
  group_by: [alertname, locale]
  group_wait:      5m
  group_interval:  1m
  repeat_interval: 1h

Please suggest some idea to remove the duplication.

I am getting alerts like this 

Alert: LatencyOfFirstASR in es-US is high - critical
  Description: LatencyOfFirstASR is high . Current value is 6144 ms
  Details:
   • alertname: LatencyOfFirstASR
   • locale: es-US
   • region: EastUS
   • replica: promthanos-monitoring-thanos-sts-0
   • severity: critical
   • tier: stg
 
 
  Alert: LatencyOfFirstASR in es-US is high - critical
  Description: LatencyOfFirstASR is high . Current value is 6144 ms
  Details:
   • alertname: LatencyOfFirstASR
   • locale: es-US
   • region: EastUS
   • replica: promthanos-monitoring-thanos-sts-1
   • severity: critical
   • tier: stg




Brian Candler

Apr 29, 2020, 4:05:43 AM
to Prometheus Users
I don't use alertmanager clustering myself, but if you search this group for "alertmanager duplicates" or "alertmanager alert_relabel_configs" you'll find the answer.  Example:
https://groups.google.com/d/topic/prometheus-users/S9Xmg8209xE/discussion

As I understand it, it's your responsibility to remove the "replica" label using alert_relabel_configs, using "action: labeldrop".  Otherwise, a different label set makes these look like different alerts.

You also need to make sure the alertmanagers themselves are properly configured to gossip to each other, so they'll deduplicate alerts amongst themselves.

piyush sharma

Apr 29, 2020, 5:06:21 AM
to Brian Candler, Prometheus Users
Dear Brian,

Thanks for your valuable advice. I have one doubt about the functionality of labeldrop.
"replica" is my global external label and needs to go with each and every metric (it's a prerequisite for Thanos to work).
Now my question to you is:

When labeldrop is applied, will it drop the label as a whole (i.e. not even store the label along with the metrics in external storage), or will it only remove the label from alerts?

If only alerts are affected, is there a way to remove this label globally from all alerts?

Regards


Brian Candler

Apr 29, 2020, 6:19:01 AM
to Prometheus Users
You remove it only from alerts, using alert_relabel_configs. The forum link I posted before has a working example config.

piyush sharma

Apr 29, 2020, 7:07:45 AM
to Brian Candler, Prometheus Users
Hi,

I just want you to verify my configuration.

These are my external labels that are defined 

 alerts: |
    {}

  prometheus.yaml.tmpl: |
    global:
      evaluation_interval: 1m
      external_labels:
        region: EastUS
        replica: $(POD_NAME)
        tier: stg

I get alerts like 

Details:
   • alertname: LatencyOfFirstASR
   • locale: pt-br

   • region: EastUS
   • replica: promthanos-monitoring-thanos-sts-0
   • severity: critical
   • tier: stg 

I want to remove replica label so below is my configuration

alertRelabelConfigs:
  alert_relabel_configs:
  - source_labels: replica
    regex: promthanos-monitoring-thanos-sts.*
    action: drop

Am I doing it right? When I tried, I got an error that it was unable to convert "replica" to label.Name format.


Brian Candler

Apr 29, 2020, 7:46:31 AM
to Prometheus Users
(1) labeldrop, not drop.

(2) source_labels is a list:  source_labels: [replica]

(3) as per the documentation, the alert_relabel_configs goes under the "alerting" section, as a sibling to "alertmanagers"

alerting:
  alert_relabel_configs:
    [ - <relabel_config> ... ]
  alertmanagers:
    [ - <alertmanager_config> ... ]

Brian Candler

Apr 29, 2020, 8:09:00 AM
to Prometheus Users
(4) with the labeldrop action, the regex matches against each of the label names (not values).  So I think what you want is (untested):

- action: labeldrop
  regex: replica
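Putting points (1)-(4) together, the whole alerting section would look roughly like this (untested; the alertmanager target is an assumption):

```yaml
alerting:
  alert_relabel_configs:
    - action: labeldrop
      regex: replica
  alertmanagers:
    - static_configs:
        - targets: ['promthanos-alertmanager:9093']  # hypothetical service name
```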

piyush sharma

Apr 29, 2020, 10:28:54 AM
to Brian Candler, Prometheus Users
Now I am getting the below error:

level=error ts=2020-04-29T14:23:14.189Z caller=main.go:740 err="error loading config from \"/etc/config-shared/prometheus.yaml\": couldn't load configuration (--config.file=\"/etc/config-shared/prometheus.yaml\"): parsing YAML file /etc/config-shared/prometheus.yaml: labeldrop action requires only 'regex', and no other fields"

My configuration is as below

alertRelabelConfigs:
   alert_relabel_configs:
   - source_labels: [replica]
     regex: replica
     action: labeldrop
Should I remove source_labels?


Brian Candler

Apr 29, 2020, 11:35:57 AM
to Prometheus Users
Sorry, I got this wrong initially and corrected it in point (4) in a reply to myself.

piyush sharma

Apr 30, 2020, 2:47:13 AM
to Brian Candler, Prometheus Users
Thanks a lot for your guidance.

Is there any way we can set up an admin username and password for Alertmanager?


Brian Candler

Apr 30, 2020, 3:23:01 AM
to Prometheus Users
You put a reverse proxy in front, like Apache or nginx.  Same for Prometheus itself.  Same for adding HTTPS.

If you want to proxy them on a particular path, like /alertmanager or /prometheus, then there are command-line flags you can set:

--web.external-url=https://mon.example.net/prometheus --web.route-prefix=/prometheus

--web.external-url=https://mon.example.net/alertmanager --web.route-prefix=/alertmanager
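A minimal nginx sketch of that setup, matching the flags above (the hostname, certificate paths, and htpasswd file are assumptions):

```nginx
server {
    listen 443 ssl;
    server_name mon.example.net;
    ssl_certificate     /etc/nginx/tls/mon.crt;
    ssl_certificate_key /etc/nginx/tls/mon.key;

    location /alertmanager/ {
        auth_basic           "Alertmanager";
        auth_basic_user_file /etc/nginx/.htpasswd;
        # route-prefix is set on alertmanager, so pass the path through unchanged
        proxy_pass           http://127.0.0.1:9093;
    }

    location /prometheus/ {
        auth_basic           "Prometheus";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass           http://127.0.0.1:9090;
    }
}
```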

piyush sharma

Apr 30, 2020, 2:42:18 PM
to Brian Candler, Prometheus Users
Hello,

I am not receiving any alerts in the Slack channel, nor is anything related to Slack appearing in the Alertmanager logs, even though alerts are firing.
What level of error logging is required for this? My current log level is debug.

Regards
Piyush


Brian Candler

Apr 30, 2020, 3:07:03 PM
to Prometheus Users
That's the maximum.

Prometheus web interface (9090) shows them firing? What about alertmanager web interface (9093)?  Otherwise check your routing rules. Maybe run tcpdump to see if it's trying to connect to slack.

piyush sharma

May 1, 2020, 9:57:25 AM
to Brian Candler, Prometheus Users
Hello all,

Is there any way we can send SMS notifications for alerts, free of cost, using Alertmanager?


Stuart Clark

May 1, 2020, 10:45:58 AM
to piyush sharma, Brian Candler, Prometheus Users
On 2020-05-01 14:57, piyush sharma wrote:
> Hello All,
>
> Is there any way we can send SMS notification but free of cost for
> alerts using alertmanager. ?

The cost depends on whatever service/system you are using for SMS
messages.

Alertmanager does not send SMS messages, it just sends requests to other
systems which do.

--
Stuart Clark

piyush sharma

May 2, 2020, 9:57:32 AM
to Stuart Clark, Brian Candler, Prometheus Users
Hello,

I want to set an alert within a given range of values.

Just wanted to check if

health_value > 0 < 80

is a valid expression for a critical alert.

Brian Candler

May 2, 2020, 4:52:14 PM
to Prometheus Users
Why not try it in the PromQL expression browser built into Prometheus (in the Prometheus web interface at port 9090)?

piyush sharma

May 7, 2020, 3:52:36 AM
to Brian Candler, Prometheus Users
Hey guys,

I have a doubt about how the result of an alert condition is evaluated.
Below is my configuration for Prometheus:

 evaluation_interval: 1m
 scrape_interval: 1m

Now my query is as below:

expr: avg(metric_first_asr{locale=~"en-gb"}) by (locale) >= 80 AND avg(metric_first_asr{locale=~"en-gb"}) by (locale) < 95 OR absent(metric_first_asr{locale=~"en-gb"}) == 1
for: 5m
labels:
  severity: warning

Here I want to define a warning threshold for when the condition evaluates to a value of 80-95.
I have specified the absent condition so that when there is no data, the value in that case is equal to 1.
Now, given my evaluation interval is 1 minute and I am taking the average over 5 minutes, will my condition be evaluated like this:

(100 + 100 + 1 + 100 + 100) / 5

Considering data is evaluated every minute, the value was 100 in 4 cases, and there was no data in one instance (so that value is replaced by 1), my actual value would be

401 / 5 = 80.2

Am I doing the right calculations? Or does Prometheus calculate the value differently? Please suggest.



Brian Candler

May 7, 2020, 4:19:46 AM
to Prometheus Users
Firstly, comparison operators don't work the way you imagine.  They are more like filters.  The expression "foo" is a vector of zero or more timeseries all with the metric name "foo".  So for example:

foo >= 80

returns all the timeseries for metric "foo" whose value is >= 80.  If none of the timeseries have this value, it returns nothing.  Try it in the PromQL browser in prometheus, and look at the graph view: you'll see timeseries values at the times where they are over 80, and gaps where they are below.

To filter to a range is therefore easy: you filter the results of the filter.

foo >= 80 < 95

Secondly, an alert is generated if the timeseries is present with any value.  If there's no value, there's no alert.  You can think of it as the presence of any value is treated as "true" from the point of view of generating an alert.

Thirdly:

expr: avg(...)
for: 5m

does not mean "taking average for 5 min" as you said.  What it means is:

- the expression is tested every 1 minute (your "evaluation_interval" for the rule group - defaults to global evaluation interval if not set)
- if the expression returns a value *every time* over a 5 minute period (i.e. for 6 evaluations consecutively), the alert is generated
- if there are any gaps, the alert is not generated

Fourthly, the AND, OR and UNLESS logical operators don't work how you imagine either; they are documented here.  For example:

foo AND bar

returns all the timeseries for metric "foo" for which there is a metric "bar" with an exactly matching label set (disregarding the value of "bar").

Filling in "default" values is not straightforward, because a metric like "foo" refers to a variable set of timeseries - each combination of labels is a different timeseries, and these can come and go over time.  So what you need is some other metric which you know is always present with the same set of labels, and can be used to force the missing value.  For example,

foo OR (up * 0 + 1)

The metric "up" is generated on every scrape, with the value 1 if scrape is successful and 0 if not successful, so it reflects all the labels in your scrape job plus the "job" and "instance" labels added automatically.  If your metric foo has the same set of labels, then the expression above will fill in gaps with the value 1.

For more information see:

However I *strongly* recommend you play around with this in the PromQL expression browser - and try not to be distracted by pre-existing ideas about how booleans work.  Prometheus expressions work with vectors (i.e. multiple timeseries with different labels), not individual values.

Brian Candler

May 7, 2020, 4:23:52 AM
to Prometheus Users
P.S. if you want to get "an average over 5 minutes" you need to use a range vector, which is a collection of metrics with all their values over a range of time: then you can do

avg_over_time( ... range vector ...)

You can get a range vector directly from an individual metric:

foo[5m]

Or you can use a subquery on an arbitrary PromQL expression:

( ... some expression ...)[5m:1m]

The latter will evaluate whatever subexpression you give, over the previous 5 minutes at 1 minute intervals, giving a range vector.
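Applied to the rule from earlier in this thread, a 5-minute average per locale could be written as a subquery (untested sketch):

```
avg_over_time(
  (avg by (locale) (metric_first_asr{locale=~"en-gb"}))[5m:1m]
) >= 80 < 95
```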

piyush sharma

May 7, 2020, 8:37:43 AM
to Brian Candler, Prometheus Users
Thanks for such a comprehensive answer to the query :)

I have one more doubt.
When there is an alerting situation, I get an alert on Slack with the alerting value.
But after some time, when the alert is resolved, I still get the old value, which is in the alerting range, yet the notification says resolved.

Alert: Overall Health of  dict-service  is down - warning
  Description: Attention !!!  Overall Health of dict-service is down !!!. Current value is 93.99 %
  Details:
   • alertname: Overall Health down Alert
   • rampcode: dict

I want the new value after the alert gets resolved.


Brian Candler

May 7, 2020, 8:46:22 AM
to Prometheus Users
On Thursday, 7 May 2020 13:37:43 UTC+1, piyush sharma wrote:

Actually when there is an alerting situation , I get an alert on slack with the alerting value .
But after sometime when alert is resolved , I still get the old value which is below the threshold ( ideally should be in alerting state) but still I get notification as resolved

This one has been answered recently on the group:

but hopefully from what you've just learned you'll understand why.

An expression "foo > 90" has the value 93, when foo has the value 93

An expression "foo > 90" has no value, when foo has the value 85

So in an alerting rule,

expr: foo > 90

will have no value when the metric drops below 90, and so prometheus stops generating alerts.  Therefore there is no value for alertmanager to report. All it can tell you is that an alert which *was* firing, is no longer firing; and it can tell you the labels and annotations the alert had when it was last active.

Brian Candler

May 7, 2020, 8:48:09 AM
to Prometheus Users
P.S. If you change your annotation from "Current value is X" to "Most recent triggering value is X", then the resolved message may make more sense.

piyush sharma

May 7, 2020, 9:06:30 AM
to Brian Candler, Prometheus Users
So is there any way that the alerting notification includes the value, but the resolved notification contains only the message and not the value?


Brian Candler

May 7, 2020, 10:41:52 AM
to Prometheus Users
I don't know an easy way.  I guess you can do it using custom templates, since they are passed the list of firing and resolved alerts:

piyush sharma

May 7, 2020, 11:45:25 AM
to Brian Candler, Prometheus Users
Hello,

I am having an issue, but nothing related to it comes up in the logs.

Below is my Alertmanager config:

alertmanager.yml:
----
global:
  resolve_timeout: 12h
receivers:
- name: slack-production
  slack_configs:
  - api_url: https://hooks.slack.com/services/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
    channel: '#azure-dict-prod-alerts'
    send_resolved: true
    text: |-
      {{ range .Alerts }}
         *Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
         *Description:* {{ .Annotations.description }}
         *Details:*
         {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
         {{ end }}
        {{ end }}
    title: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing
      | len }}{{ end }}] Alert|Azure|Dict|Prod|WestUS2'
- name: slack-staging
  slack_configs:
  - api_url: https://hooks.slack.com/services/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
    channel: '#azure-dict-stg-alerts'
    send_resolved: true
    text: |-
      {{ range .Alerts }}
         *Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
         *Description:* {{ .Annotations.description }}
         *Details:*
         {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
         {{ end }}
        {{ end }}
    title: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing
      | len }}{{ end }}] Alert|Azure|Dict|Stg|EastUS'
- name: alertnow
  webhook_configs:
  - send_resolved: true
    url: https://alertnowitgr.sec-alertnow.com/integration/prometheus/v1/67a54ac114e9d111ea4860650ac112ba32ebsfr43
route:
  group_by:
  - alertname
  - locale
  group_interval: 5m
  group_wait: 5m
  receiver: alertnow
  repeat_interval: 4h
  routes:
  - match:
      tier: prod
    receiver: slack-production
  - match:
      tier: stg
    receiver: slack-staging
  - match:
      region: EastUS
    receiver: alertnow


The issue here is that I am getting alerts on Slack but not on the webhook that I have configured. Is there something wrong with my config?
PS: the webhook URL is correct, and there are no logs related to the webhook :(


Brian Candler

May 7, 2020, 12:15:04 PM
to Prometheus Users
The first matching route wins.  If you want a matched route to continue onto subsequent matches, add "continue: true".  You would need to set this on both of your first two routes, if you want those alerts to go to alertnow as well as to slack.


"Every alert enters the routing tree at the configured top-level route, which must match all alerts (i.e. not have any configured matchers). It then traverses the child nodes. If continue is set to false, it stops after the first matching child. If continue is true on a matching node, the alert will continue matching against subsequent siblings. If an alert does not match any children of a node (no matching child nodes, or none exist), the alert is handled based on the configuration parameters of the current node."

piyush sharma

May 7, 2020, 12:54:47 PM
to Brian Candler, Prometheus Users
Hey,

I have a query like this:

if  sum by (locale) ( expression1)/sum by (locale) ( expression2) >=0

This gives me no data points, and hence a broken graph.

sum_over_time did not work in this case, as it does not take the "by" option.
Is there any way I can rewrite this? Can I make it an avg of sum by locale, or something like that?

Please guide.


Brian Candler

May 7, 2020, 1:20:05 PM
to Prometheus Users
On Thursday, 7 May 2020 17:54:47 UTC+1, piyush sharma wrote:
Hey 

I have a query like this 

if  sum by (locale) ( expression1)/sum by (locale) ( expression2) >=0

This gives me no data points and hence a broken graph 


Looks like a reasonable expression (without the "if" on the front).  Always check operator precedence (or add extra parentheses), but that looks OK: "/" binds more tightly than ">="

The way I'd recommend debugging this is to run the two parts of the query separately in the PromQL web interface:

sum by (locale) ( expression1)
sum by (locale) ( expression2)

Use the "console" view.  Check if these two sub-expressions both generate results.  Check if they both have the same set of labels - which of course should be "locale" in this case - because a bare "/" will only combine LHS and RHS values with exactly the same label set.

If you want further help, you should show your *actual* query expression, and some examples of the *actual* metrics you are working on (complete with labels and values).  The Console view in prometheus' expression interface can help you do this.

 
sum_over_time did not work in this case as it does not take the "by" option 
Any way I can re write this ... can I make it avg of sum by locale or something like that ?


Well, you've not described in concrete terms what you're trying to achieve, nor what the input data looks like.

avg_over_time needs a range vector as its input: this is two-dimensional.  It has a bunch of time series, and each timeseries has multiple data points at different times.  I mentioned before how to make a range vector out of an instant vector.

"sum" and "sum by" works over the seires dimension (i.e. combining values at the same time, but taken from different timeseries, meaning with different labels).

"sum_over_time" works over the time dimension, and sums separately for each timeseries.  "by" makes no sense with this, because each sum is across the same timeseries - in other words, every point being summed has the same set of labels.

piyush sharma

May 7, 2020, 1:57:48 PM
to Brian Candler, Prometheus Users
So here are some more details.

The label locale is there in the result from both expressions.

The query is:

Both queries, when executed independently, give us data but always have some missing data points, like this. This is the graphical representation of the numerator of the query:

image.png

This is the denominator of the query; it also has missing data points:

image.png

And the overall ratio (a/b) is as below; it also has missing data points:

image.png

Is there any way we can get rid of the no-data points, give them a default value, or at least reduce them?


piyush sharma

May 8, 2020, 4:26:12 AM
to Brian Candler, Prometheus Users
Dear Brian,

Please have a look at this once.

Brian Candler

May 8, 2020, 8:42:13 AM
to Prometheus Users
(Aside: this thread has already taken a large chunk of this group's bandwidth, so this will be my last post on it)

I don't see missing data points - every point has a value.  I do see dips in the graphs.  You have chosen to graph:

    sum by (foo) (X) / sum by (foo) (Y)

You haven't described what X and Y represent, so I have no idea even if this is a sensible ratio, but let's assume it is.

If you see dips in the numerator or the denominator, it could be because:
- the individual values are lower
- the number of values being summed is lower - i.e. the number of timeseries where foo="some value" has gone down.

You can check if it's the latter by graphing:

count by (foo) (X)

count by (foo) (Y)

Let's suppose that the number of timeseries is indeed reducing.  I wouldn't be asking "how can I get rid of the dips in my graphs?"  I would be asking: "What are these dips in the graphs telling me?"

They might be telling you:

1. There's an intermittent problem with the system you're monitoring [if so, I'd want to find and fix it]
2. There's an intermittent problem with data collection, or the metrics themselves are bad [ditto]
3. There's an occasional normal event which naturally causes these dips [if so, I'd want to understand it]
4. The expression I'm graphing is the wrong one, i.e. it's calculating the wrong value

In any of these cases, I'd want to understand it and if possible fix the root cause.  If you just paper over the cracks in your graphs, you're pretending the problem doesn't exist.

Regards,

Brian.