Alert manager looping in firing -> resolved -> firing

piyush sharma

unread,

Apr 28, 2020, 3:56:21 AM4/28/20

to Prometheus Users

Hi All ,

I am badly stuck in a problem .

One main thing is that .. alert manager sends resolve notification on its own but the alert is still active.

I want to disable this feature. I want "resolved " alert to be sent only when alert is really resolved.

Below is my alert manager configuration

apiVersion: v1

data:

alertmanager.yml: |

global:

resolve_timeout: 12h

slack_api_url: https://hooks.slack.com/services/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

receivers:

- name: alertnow

slack_configs:

- channel: '#stage_dict_app_events_and_alerts'

send_resolved: true

text: |-

*Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`

*Description:* {{ .Annotations.description }}

*Details:*

{{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`

title: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing

| len }}{{ end }}] Cloud|Azure|Monitoring Event Notification'

webhook_configs:

- send_resolved: true

url: https://alertnowitgr.sec-alertnow.com/integration/prometheus/v1/67a54ac114e9d111ea4860650ac112ba32eb

route:

group_by:

- alertname

- locale

group_interval: 5m

group_wait: 5m

receiver: alertnow

repeat_interval: 8h

kind: ConfigMap

Prometheus config :

prometheus.yaml.tmpl: |

global:

evaluation_interval: 1m

external_labels:

region: EastUS

replica: $(POD_NAME)

tier: stg

scrape_interval: 1m

scrape_timeout: 10s

Please help

Brian Candler

unread,

Apr 28, 2020, 4:25:26 AM4/28/20

to Prometheus Users

On Tuesday, 28 April 2020 08:56:21 UTC+1, piyush sharma wrote:

I am badly stuck in a problem .
One main thing is that .. alert manager sends resolve notification on its own but the alert is still active.
I want to disable this feature. I want "resolved " alert to be sent only when alert is really resolved.

That's not true. Alertmanager only sends resolved messages when the alert is resolved.

You therefore need to describe your setup further, in particular:

- what is the alerting rule which is generating this alert?

- are you using any sort of clustering?

And of course, it never hurts to mention the versions of prometheus and alertmanager you are running.

piyush sharma

unread,

Apr 28, 2020, 5:12:46 AM4/28/20

to Brian Candler, Prometheus Users

Dear Brain,

Thanks for the response, here is one of the alert that is causing this behaviour

apiVersion: v1
data:
alerting_rules.yml: |
groups:
- name: k8s.rules
rules:
- alert: Health down Alert
annotations:
description: Attention !!! Health of dict-service in {{ $labels.locale }} is
down !!!. Current value is {{ $value}} percent
summary: Health of dict-service for {{ $labels.locale }} is down !!!
expr: sum(stackdriver_aws_ec_2_instance_logging_googleapis_com_user_azure_qaops_client_health_successrate_sum{tier=~".*",rampcode=~"dict",
region=~".*"}) by (locale) / sum(stackdriver_aws_ec_2_instance_logging_googleapis_com_user_azure_qaops_client_health_successrate_count{tier=~".*",rampcode=~"dict",
region=~".*"}) by (locale) >= 80 < 99
for: 5m
labels:
severity: warning
- alert: Health down Alert
annotations:
description: Attention !!! Health of dict-service for {{ $labels.locale }} is
down !!!. Current value is {{ $value}} %
summary: Health of dict-service for {{ $labels.locale }} is down !!!
expr: sum(stackdriver_aws_ec_2_instance_logging_googleapis_com_user_azure_qaops_client_health_successrate_sum{tier=~".*",rampcode=~"dict",
region=~".*"}) by (locale) / sum(stackdriver_aws_ec_2_instance_logging_googleapis_com_user_azure_qaops_client_health_successrate_count{tier=~".*",rampcode=~"dict",
region=~".*"}) by (locale) >= 0 < 80
for: 5m
labels:
severity: critical

--> Alertmanager version is 0.20

--> Prometheus version is ( Actually we are working with Thanos ) Version : 0.11

--> Both alertmanager and Prometheus are hosted as a stateful set and accessed as a stateless service

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/6980ded2-9541-4334-b7a5-b62a7fe44058%40googlegroups.com.

Brian Candler

unread,

Apr 28, 2020, 5:52:22 AM4/28/20

to Prometheus Users

That's a complicated expression.

I suggest you paste the whole expression into the promql browser (i.e. prometheus port 9090) and look at the graph. If you see gaps in the graph, that's where the expression does not have any value, and that's where the alert is getting resolved.

Note that while you can configure prometheus to require an alert to be firing for a certain amount of time before generating an alert ("for: 5m"), you cannot configure it for an alert to be *not firing* for a certain amount of time before it is resolved. As soon as your expression does not generate a value, even for one evaluation cycle, it will be considered resolved.

piyush sharma

unread,

Apr 28, 2020, 6:03:45 AM4/28/20

to Brian Candler, Prometheus Users

Hey ,

You are truly a rockstar.

Yeah infact the data was not coming.

Is there any way. I can set no data condition as alerting .

--

You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/de883b91-8e96-4538-b911-4afdbb083dc6%40googlegroups.com.

Brian Candler

unread,

Apr 28, 2020, 6:34:11 AM4/28/20

to Prometheus Users

On Tuesday, 28 April 2020 11:03:45 UTC+1, piyush sharma wrote:

Is there any way. I can set no data condition as alerting .

You can use "or" to give a default value, e.g.

(.... expr ....) or (up * 99)

(assuming that 'expr' and 'up' have the same set of labels - if not, then you can use grouping terms). See also the tail end of

https://www.robustperception.io/existential-issues-with-metrics

Or you can alert using the absent() function: see

https://www.robustperception.io/absent-alerting-for-jobs

Maybe also useful:

https://www.robustperception.io/booleans-logic-and-math

https://www.robustperception.io/combining-alert-conditions

piyush sharma

unread,

Apr 28, 2020, 3:01:58 PM4/28/20

to Brian Candler, Prometheus Users

Thanks for all your help. I really appreciate

One more doubt

My application metrics are based on locales

As in ru-RU , en-GB, es-US.

Actually there is some issue with application vai metrics are coming twice

Like

ru-RU = 10

ru-ru = 10

I want to filter out values only having all alphabets in lower case

How can I achieve this.

e

--

You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/9aa379c2-9a95-4f56-a7a0-1c6483e17fb0%40googlegroups.com.

Brian Candler

unread,

Apr 28, 2020, 3:26:36 PM4/28/20

to Prometheus Users

You haven't shown real examples of what these metrics look like. I'm guessing you're not talking about metric names, but label values.

If you want to filter out metrics which have particular labels or label patterns, then you need to use metric_relabeling - although it's better to fix your exporters not to generate the bad metrics in the first place.

piyush sharma

unread,

Apr 29, 2020, 3:20:48 AM4/29/20

to Brian Candler, Prometheus Users

Hello again,

Sorry for so many queries but i am a newbie to prometheus .

I am running alertmanager in cluster.

Problem is both alert managers are sending alerts and due to this duplication is happening.

Below is my configuration

Cluster-peer-timeout = 25s

global:
resolve_timeout: 12h
slack_api_url: 'https://hooks.slack.com/services/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'
route:
receiver: "alertnow"
group_by: [alertname,locale]
group_wait: 5m
group_interval: 1m
repeat_interval: 1h

Please suggest some idea to remove duplication

I am getting alerts like this

[FIRING:2] Cloud|Azure|Monitoring Event Notification

Alert: LatencyOfFirstASR in es-US is high - critical
  Description: LatencyOfFirstASR is high . Current value is 6144 ms
  Details:
   • alertname: LatencyOfFirstASR
   • locale: es-US
   • region: EastUS
   • replica: promthanos-monitoring-thanos-sts-0
   • severity: critical
   • tier: stg

  Alert: LatencyOfFirstASR in es-US is high - critical
  Description: LatencyOfFirstASR is high . Current value is 6144 ms
  Details:
   • alertname: LatencyOfFirstASR
   • locale: es-US
   • region: EastUS
   • replica: promthanos-monitoring-thanos-sts-1
   • severity: critical
   • tier: stg

On Wed, Apr 29, 2020 at 12:56 AM Brian Candler <b.ca...@pobox.com> wrote:

You haven't shown real examples of what these metrics look like. I'm guessing you're not talking about metric names, but label values.

If you want to filter out metrics which have particular labels or label patterns, then you need to use metric_relabeling - although it's better to fix your exporters not to generate the bad metrics in the first place.

--

You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/17f066f8-49d0-484a-800b-2d872cd2dc79%40googlegroups.com.

Brian Candler

unread,

Apr 29, 2020, 4:05:43 AM4/29/20

to Prometheus Users

I don't use alertmanager clustering myself, but if you search this group for "alertmanager duplicates" or "alertmanager alert_relabel_configs" you'll find the answer. Example:

https://groups.google.com/d/topic/prometheus-users/S9Xmg8209xE/discussion

As I understand it, it's your responsibility to remove the "replica" label using alert_relabel_configs, using "action: labeldrop". Otherwise, a different label set makes these look like different alerts.

You also need to make sure the alertmanagers themselves are properly configured to gossip to each other, so they'll deduplicate alerts amongst themselves.

piyush sharma

unread,

Apr 29, 2020, 5:06:21 AM4/29/20

to Brian Candler, Prometheus Users

Dear Brian,

Thanks for your valuable advise. I have one doubt about the functionality of labeldrop.

"replica" is my global label and requires to go with each and every metric ( Its a pre requisite for working of thanos)

Now my question to you is :

When label drop is applied , will it stop ignoring the label as a whole (i.e. will not even store the label along with metrics in the external data storage) or will it only remove labels from the alerts ?

If only alerts are effected, then is it a way to remove this label globally from all alerts ?

Regards

--

You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/14201529-86e8-4ff1-903a-89d93fd2ffad%40googlegroups.com.

Brian Candler

unread,

Apr 29, 2020, 6:19:01 AM4/29/20

to Prometheus Users

You remove it only from alerts, using alert_relabel_configs. The forum link I posted before has a working example config.

piyush sharma

unread,

Apr 29, 2020, 7:07:45 AM4/29/20

to Brian Candler, Prometheus Users

Hi ,

Just want you to verify my configuration

These are my external labels that are defined

alerts: |
{}

prometheus.yaml.tmpl: |
global:
evaluation_interval: 1m
external_labels:
region: EastUS
replica: $(POD_NAME)
tier: stg

I get alerts like

Details:
• alertname: LatencyOfFirstASR
• locale: pt-br

   • region: EastUS
   • replica: promthanos-monitoring-thanos-sts-0
   • severity: critical
   • tier: stg

I want to remove replica label so below is my configuration

alertRelabelConfigs:
alert_relabel_configs:
- source_labels: replica
regex: promthanos-monitoring-thanos-sts.*

action: drop

Am i doing it right ? Because when i tried , i got error that unable to convert replica to label.name format

On Wed, Apr 29, 2020 at 3:49 PM Brian Candler <b.ca...@pobox.com> wrote:

You remove it only from alerts, using alert_relabel_configs. The forum link I posted before has a working example config.

--

You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/363c41b7-96f9-4548-85bb-ab56299f7447%40googlegroups.com.

Brian Candler

unread,

Apr 29, 2020, 7:46:31 AM4/29/20

to Prometheus Users

(1) labeldrop, not drop.

(2) source_labels is a list: source_labels: [replica]

(3) as per the documentation, the alert_relabel_configs goes under the "alerting" section, as a sibling to "alertmanagers"

alerting:
  alert_relabel_configs:
    [ - <relabel_config> ... ]
  alertmanagers:
    [ - <alertmanager_config> ... ]

Brian Candler

unread,

Apr 29, 2020, 8:09:00 AM4/29/20

to Prometheus Users

(4) with the labeldrop action, the regex matches against each of the label names (not values). So I think what you want is (untested):

- action: labeldrop
regex: replica

piyush sharma

unread,

Apr 29, 2020, 10:28:54 AM4/29/20

to Brian Candler, Prometheus Users

Now i am getting the below error

level=error ts=2020-04-29T14:23:14.189Z caller=main.go:740 err="error loading config from \"/etc/config-shared/prometheus.yaml\": couldn't load configuration (--config.file=\"/etc/config-shared/prometheus.yaml\"): parsing YAML file /etc/config-shared/prometheus.yaml: labeldrop action requires only 'regex', and no other fields"

My configuration is as below

alertRelabelConfigs:
alert_relabel_configs:
- source_labels: [replica]

regex: replica
action: labeldrop

Should I remove source_label ?

--

You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/f88d6534-f673-4589-a535-80c5a0c3e762%40googlegroups.com.

Brian Candler

unread,

Apr 29, 2020, 11:35:57 AM4/29/20

to Prometheus Users

Sorry, I got this wrong initially and corrected it in point (4) in a reply to myself.

piyush sharma

unread,

Apr 30, 2020, 2:47:13 AM4/30/20

to Brian Candler, Prometheus Users

Thanks a lot for your guidance.

Is there any way , we can create admin user and password for alertmanager ?

On Wed, Apr 29, 2020 at 9:06 PM Brian Candler <b.ca...@pobox.com> wrote:

Sorry, I got this wrong initially and corrected it in point (4) in a reply to myself.

--

You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/f33ae51e-b5f9-4198-b708-f5e87621c7fb%40googlegroups.com.

Brian Candler

unread,

Apr 30, 2020, 3:23:01 AM4/30/20

to Prometheus Users

You put a reverse proxy in front, like apache or nginx. Same for prometheus itself. Same for adding HTTPS.

If you want to proxy them on a particular path, like /alertmanager or /prometheus, then there are command-line flags you can set:

--web.external-url=https://mon.example.net/prometheus --web.route-prefix=/prometheus

--web.external-url=https://mon.example.net/alertmanager --web.route-prefix=/alertmanager

piyush sharma

unread,

Apr 30, 2020, 2:42:18 PM4/30/20

to Brian Candler, Prometheus Users

Hello.

I am not receiving any alerts in slack channel

Nor I am getting any thing related to slack in the alert manager logs.

Though alerts are firing.

What leave of error logging is required for this. My current log level is debug.

Regards

Piyush

--

You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/e134cd7e-fa42-4726-9dac-b05b85473eff%40googlegroups.com.

Brian Candler

unread,

Apr 30, 2020, 3:07:03 PM4/30/20

to Prometheus Users

That's the maximum.

Prometheus web interface (9090) shows them firing? What about alertmanager web interface (9093)? Otherwise check your routing rules. Maybe run tcpdump to see if it's trying to connect to slack.

piyush sharma

unread,

May 1, 2020, 9:57:25 AM5/1/20

to Brian Candler, Prometheus Users

Hello All,

Is there any way we can send SMS notification but free of cost for alerts using alertmanager. ?

On Fri, 1 May, 2020, 12:37 am Brian Candler, <b.ca...@pobox.com> wrote:

That's the maximum.

Prometheus web interface (9090) shows them firing? What about alertmanager web interface (9093)? Otherwise check your routing rules. Maybe run tcpdump to see if it's trying to connect to slack.

--

You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/8bf16e28-846e-4a8e-ac3d-bcecba1967e0%40googlegroups.com.

Stuart Clark

unread,

May 1, 2020, 10:45:58 AM5/1/20

to piyush sharma, Brian Candler, Prometheus Users

On 2020-05-01 14:57, piyush sharma wrote:
> Hello All,
>
> Is there any way we can send SMS notification but free of cost for
> alerts using alertmanager. ?

The cost depends on whatever service/system you are using for SMS
messages.

Alertmanager does not send SMS messages, it just sends requests to other
systems which do.

--
Stuart Clark

piyush sharma

unread,

May 2, 2020, 9:57:32 AM5/2/20

to Stuart Clark, Brian Candler, Prometheus Users

Hello ,

I want to set alert with in a given range of values

Just wanted to check if

critical if

health_value > 0 < 80

is a valid statement

Brian Candler

unread,

May 2, 2020, 4:52:14 PM5/2/20

to Prometheus Users

Why not try it in the PromQL expression browser built in to prometheus (in the prometheus web interface at port 9090)

piyush sharma

unread,

May 7, 2020, 3:52:36 AM5/7/20

to Brian Candler, Prometheus Users

Hey Guys ,

I have a doubt on how the result of an alert condition is evaluated.

Below is my configuration for prometheus

evaluation_interval: 1m

scrape_interval: 1m

Now my query is as below

avg(metric_first_asr{locale=~"en-gb"}) by (locale) >= 80 AND avg(metric_first_asr{locale=~"en-gb"}) by (locale) < 95 OR absent(metric_first_asr{locale=~"en-gb"}) == 1

for: 5m
labels:
severity: warning

Here i want to define a warning threshold when condition evaluates to a value of 80-95

I have specified absent condition so that when there is no data , value in that case is equal to 1

now given my evaluation period is 1 min and I am taking average for 5 min ... will my condition evaluated like this as below

100 + 100 + 1 + 100 +100 / 5
Considering data is evaluated after one minute and after each minute value was 100 in 4 cases and there was no data in one instance so that value will be replaced by 1 so my actual value will be

401/5 = 80.02

Am I doing he right calculations ? or the way Prometheus calculates value is different ? Please suggest

On Sun, May 3, 2020 at 2:22 AM Brian Candler <b.ca...@pobox.com> wrote:

Why not try it in the PromQL expression browser built in to prometheus (in the prometheus web interface at port 9090)

--

You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/0de10032-0ae8-43d5-8ea1-faef17a3e07b%40googlegroups.com.

Brian Candler

unread,

May 7, 2020, 4:19:46 AM5/7/20

to Prometheus Users

Firstly, comparison operators don't work the way you imagine. They are more like filters. The expression "foo" is a vector of zero or more timeseries all with the metric name "foo". So for example:

foo >= 80

returns all the timeseries for metric "foo" whose value is >= 80. If none of the timeseries have this value, it returns nothing. Try it in the PromQL browser in prometheus, and look at the graph view: you'll see timeseries values at the times where they are over 80, and gaps where they are below.

To filter to a range is therefore easy: you filter the results of the filter.

foo >= 80 < 95

Secondly, an alert is generated if the timeseries is present with any value. If there's no value, there's no alert. You can think of it as the presence of any value is treated as "true" from the point of view of generating an alert.

Thirdly:

expr: avg(...)

for: 5m

does not mean "taking average for 5 min" as you said. What it means is:

- the expression is tested every 1 minute (your "evaluation_interval" for the rule group - defaults to global evaluation interval if not set)

- if the expression returns a value *every time* over a 5 minute period (i.e. for 6 evaluations consecutively), the alert is generated

- if there are any gaps, the alert is not generated

Fourthly, the AND, OR and UNLESS logical operators don't work how you imagine either; they are documented here. For example:

foo AND bar

returns all the timeseries for metric "foo" for which there is a metric "bar" with an exactly matching label set (disregarding the value of "bar").

Filling in "default" values is not straightforward, because a metric like "foo" refers to a variable set of timeseries - each combination of labels is a different timeseries, and these can come and go over time. So what you need is some other metric which you know is always present with the same set of labels, and can be used to force the missing value. For example,

foo OR ((up * 0 + 1)

The metric "up" is generated on every scrape, with the value 1 if scrape is successful and 0 if not successful, so it reflects all the labels in your scrape job plus the "job" and "instance" labels added automatically. If your metric foo has the same set of labels, then the expression above will fill in gaps with the value 1.

For more information see:

https://www.robustperception.io/existential-issues-with-metrics

https://www.robustperception.io/left-joins-in-promql

However I *strongly* recommend you play around with this in the PromQL expression browser - and try not to be distracted by pre-existing ideas about how booleans work. Prometheus expressions work with vectors (i.e. multiple timeseries with different labels), not individual values.

Brian Candler

unread,

May 7, 2020, 4:23:52 AM5/7/20

to Prometheus Users

P.S. if you want to get "an average over 5 minutes" you need to use a range vector, which is a collection of metrics with all their values over a range of time: then you can do

avg_over_time( ... range vector ...)

You can get a range vector directly from an individual metric:

foo[5m]

Or you can use a subquery on an arbitrary PromQL expression:

( ... some expression ...)[5m:1m]

The latter will evaluate whatever subexpression you give, over the previous 5 minutes at 1 minute intervals, giving a range vector.

piyush sharma

unread,

May 7, 2020, 8:37:43 AM5/7/20

to Brian Candler, Prometheus Users

Thanks for such a comprehensive answer to the query :)

I have one more doubt

Actually when there is an alerting situation , I get an alert on slack with the alerting value .

But after sometime when alert is resolved , I still get the old value which is below the threshold ( ideally should be in alerting state) but still I get notification as resolved

[RESOLVED] Cloud|Azure|Monitoring Event Notification(opens in new tab)

Alert: Overall Health of  dict-service  is down - warning
  Description: Attention !!!  Overall Health of dict-service is down !!!. Current value is 93.99 %
  Details:
   • alertname: Overall Health down Alert
   • rampcode: dict

I want the new value when after the alert gets resolved

--

You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/bd3c2c7e-36ea-4af2-a8ba-d6e168f2879c%40googlegroups.com.

Brian Candler

unread,

May 7, 2020, 8:46:22 AM5/7/20

to Prometheus Users

On Thursday, 7 May 2020 13:37:43 UTC+1, piyush sharma wrote:

Actually when there is an alerting situation , I get an alert on slack with the alerting value .
But after sometime when alert is resolved , I still get the old value which is below the threshold ( ideally should be in alerting state) but still I get notification as resolved

This one has been answered recently on the group:

https://groups.google.com/d/topic/prometheus-users/LLsPBIvLIME/discussion

but hopefully from what you've just learned you'll understand why.

An expression "foo > 90" has the value 93, when foo has the value 93

An expression "foo > 90" has no value, when foo has the value 85

So in an alerting rule,

expr: foo > 90

will have no value when the metric drops below 90, and so prometheus stops generating alerts. Therefore there is no value for alertmanager to report. All it can tell you is that an alert which *was* firing, is no longer firing; and it can tell you the labels and annotations the alert had when it was last active.

Brian Candler

unread,

May 7, 2020, 8:48:09 AM5/7/20

to Prometheus Users

P.S. If you change your annotation from "Current value is X" to "Most recent triggering value is X", then the resolved message may make more sense.

piyush sharma

unread,

May 7, 2020, 9:06:30 AM5/7/20

to Brian Candler, Prometheus Users

So is there any way out so that while alerting I get the value and when I get the resolved message ,,, i get only the message and not the value ?

On Thu, May 7, 2020 at 6:18 PM Brian Candler <b.ca...@pobox.com> wrote:

P.S. If you change your annotation from "Current value is X" to "Most recent triggering value is X", then the resolved message may make more sense.

--

You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/c3400b51-6840-4989-996e-301c414bd5ba%40googlegroups.com.

Brian Candler

unread,

May 7, 2020, 10:41:52 AM5/7/20

to Prometheus Users

I don't know an easy way. I guess you can do it using custom templates, since they are passed the list of firing and resolved alerts:

https://prometheus.io/docs/alerting/notifications/

piyush sharma

unread,

May 7, 2020, 11:45:25 AM5/7/20

to Brian Candler, Prometheus Users

Hello ,

I am having an issue but nothing related to this comes up in logs

Below is my alertmanager config

alertmanager.yml:
----
global:
resolve_timeout: 12h
receivers:
- name: slack-production
slack_configs:
- api_url: https://hooks.slack.com/services/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
channel: '#azure-dict-prod-alerts'
send_resolved: true
text: |-
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
*Description:* {{ .Annotations.description }}
*Details:*
{{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
{{ end }}
{{ end }}
title: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing
| len }}{{ end }}] Alert|Azure|Dict|Prod|WestUS2'
- name: slack-staging
slack_configs:
- api_url: https://hooks.slack.com/services/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
channel: '#azure-dict-stg-alerts'
send_resolved: true
text: |-
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
*Description:* {{ .Annotations.description }}
*Details:*
{{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
{{ end }}
{{ end }}
title: '[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing
| len }}{{ end }}] Alert|Azure|Dict|Stg|EastUS'
- name: alertnow
webhook_configs:
- send_resolved: true
url: https://alertnowitgr.sec-alertnow.com/integration/prometheus/v1/67a54ac114e9d111ea4860650ac112ba32ebsfr43
route:
group_by:
- alertname
- locale
group_interval: 5m
group_wait: 5m
receiver: alertnow
repeat_interval: 4h
routes:
- match:
tier: prod
receiver: slack-production
- match:
tier: stg
receiver: slack-staging
- match:
region: EastUS
receiver: alertnow

The issue here is that I am getting alerts on slack but not on the webhook that I have configured . Is there something wrong with my config ?

PS : Webhook URL is correct and there are no logs related to webhook :(

On Thu, May 7, 2020 at 8:11 PM Brian Candler <b.ca...@pobox.com> wrote:

I don't know an easy way. I guess you can do it using custom templates, since they are passed the list of firing and resolved alerts:
https://prometheus.io/docs/alerting/notifications/

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/1032dce7-25e2-43b9-b064-f7a507c56626%40googlegroups.com.

Brian Candler

unread,

May 7, 2020, 12:15:04 PM5/7/20

to Prometheus Users

The first matching route wins. If you want a matched route to continue onto subsequent matches, add "continue: true". You would need to set this on both of your first two routes, if you want those alerts to go to alertnow as well as to slack.

The documentation you need is here: https://prometheus.io/docs/alerting/configuration/#route

"Every alert enters the routing tree at the configured top-level route, which must match all alerts (i.e. not have any configured matchers). It then traverses the child nodes. If continue is set to false, it stops after the first matching child. If continue is true on a matching node, the alert will continue matching against subsequent siblings. If an alert does not match any children of a node (no matching child nodes, or none exist), the alert is handled based on the configuration parameters of the current node."

piyush sharma

unread,

May 7, 2020, 12:54:47 PM5/7/20

to Brian Candler, Prometheus Users

Hey

I have a query like this

if sum by (locale) ( expression1)/sum by (locale) ( expression2) >=0

This gives me no data points and hence a broken graph

sum_over_time did not work in this case as it does not take the "by" option

Any way I can re write this ... can I make it avg of sum by locale or something like that ?

Please guide

--

You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/a43ba829-1460-46a2-adb1-5516fdf726c2%40googlegroups.com.

Brian Candler

unread,

May 7, 2020, 1:20:05 PM5/7/20

to Prometheus Users

On Thursday, 7 May 2020 17:54:47 UTC+1, piyush sharma wrote:

Hey

I have a query like this

if sum by (locale) ( expression1)/sum by (locale) ( expression2) >=0

This gives me no data points and hence a broken graph

Looks like a reasonable expression (without the "if" on the front). Always check operator precedence (or add extra parentheses), but that looks OK: "/" binds more tightly than ">="

The way I'd recommend debugging this is to run the two parts of the query separately in the PromQL web interface:

sum by (locale) ( expression1)

sum by (locale) ( expression2)

Use the "console" view. Check if these two sub-expressions both generate results. Check if they both have the same set of labels - which of course should be "locale" in this case - because a bare "/" will only combine LHS and RHS values with exactly the same label set.

If you want further help, you should show your *actual* query expression, and some examples of the *actual* metrics you are working on (complete with labels and values). The Console view in prometheus' expression interface can help you do this.

sum_over_time did not work in this case as it does not take the "by" option
Any way I can re write this ... can I make it avg of sum by locale or something like that ?

Well, you've not described in concrete terms what you're trying to achieve, nor what the input data looks like.

avg_over_time needs a range vector as its input: this is two-dimensional. It has a bunch of time series, and each timeseries has multiple data points at different times. I mentioned before how to make a range vector out of an instant vector.

"sum" and "sum by" works over the seires dimension (i.e. combining values at the same time, but taken from different timeseries, meaning with different labels).

"sum_over_time" works over the time dimension, and sums separately for each timeseries. "by" makes no sense with this, because each sum is across the same timeseries - in other words, every point being summed has the same set of labels.

piyush sharma

unread,

May 7, 2020, 1:57:48 PM5/7/20

to Brian Candler, Prometheus Users

So here are some more details

The label locale is there in result from both the expressions

The query is

sum by(locale) (stackdriver_aws_ec_2_instance_logging_googleapis_com_user_azure_qaops_client_health_successrate_sum{rampcode=~"dict",region=~".*",tier=~".*"}) / sum by(locale) (stackdriver_aws_ec_2_instance_logging_googleapis_com_user_azure_qaops_client_health_successrate_count{rampcode=~"dict",region=~".*",tier=~".*"})

-> Both queries when executed independently give us data but always have some missing data points.like this . This is the graphical representation of the numerator of the query

sum by(locale) (stackdriver_aws_ec_2_instance_logging_googleapis_com_user_azure_qaops_client_health_successrate_count{rampcode=~"dict",region=~".*",tier=~".*"})

This is the denominator of the query , also has missing data points

and now overall ( a/b) is as below . have missing data points

Is there any way we can get rid of no data points or give them a default value or atleast reduce them

--

You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/5057f221-f67c-48fe-a33a-a6f31e6c0581%40googlegroups.com.

piyush sharma

unread,

May 8, 2020, 4:26:12 AM5/8/20

to Brian Candler, Prometheus Users

Dear Brian.

Please have a look on this once

Brian Candler

unread,

May 8, 2020, 8:42:13 AM5/8/20

to Prometheus Users

(Aside: this thread has already taken a large chunk of this group's bandwidth, so this will be my last post on it)

I don't see missing data points - every point has a value. I do see dips in the graphs. You have chosen to graph:

sum by (foo) (X) / sum by (foo) (Y)

You haven't described what X and Y represent, so I have no idea even if this is a sensible ratio, but let's assume it is.

If you see dips in the numerator or the denominator, it could be because:

- the individual values are lower

- the number of values being summed is lower - i.e. the number of timeseries where foo="some value" has gone down.

You can check if it's the latter by graphing:

count by (foo) (X)

count by (foo) (Y)

Let's suppose that the number of timeseries is indeed reducing. I wouldn't be asking "how can I get rid of the dips in my graphs?" I would be asking: "What are these dips in the graphs telling me?"

They might be telling you:

1. There's an intermittent problem with the system you're monitoring [if so, I'd want to find and fix it]

2. There's an intermittent problem with data collection, or the metrics themselves are bad [ditto]

3. There's an occasional normal event which naturally causes these dips [if so, I'd want to understand it]

4. The expression I'm graphing is the wrong one, i.e. it's calculating the wrong value

In any of these cases, I'd want to understand it and if possible fix the root cause. If you just paper over the cracks in your graphs, you're pretending the problem doesn't exist.