Alert goes to Firing --> Resolved --> Firing immediately.


yagyans...@gmail.com

Nov 25, 2020, 6:00:59 AM
to Prometheus Users

Hi. I am using Alertmanager 0.21.0. Occasionally, active alerts go to the resolved state for a second and then come back to the firing state immediately. There is no pattern to this; it happens randomly, and I haven't been able to identify why.
Any thoughts here? Where should I start looking? I checked Alertmanager's logs and everything seems normal.

Thanks in advance!

Matthias Rampke

Nov 25, 2020, 6:33:42 AM
to yagyans...@gmail.com, Prometheus Users
This could be many things; most likely it has to do with the formulation of the alert. What does it look like in Prometheus? Specifically:

- the ALERTS metric shows what is pending or firing over time
- evaluate the alert expression in Prometheus for the given time period. Are there gaps, or does, for example, a label change before and after the gap? (See the example queries below.)
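
As an illustration, queries along these lines in the Prometheus expression browser cover both checks; the ExporterDown name and the up == 0 expression are taken from the rule posted later in this thread:

# Alert state over time; ALERTS carries an alertstate label ("pending" or "firing")
ALERTS{alertname="ExporterDown"}

# The raw alert expression, graphed over the affected time range
up == 0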

/MR


yagyans...@gmail.com

Nov 25, 2020, 6:46:38 AM
to Prometheus Users
The alert formulation doesn't seem to be the problem here, because it happens randomly for different alerts. Below is the alert for the exporter being down, for which it has happened three times today.

  - alert: ExporterDown
    expr: up == 0
    for: 10m
    labels:
      severity: "CRITICAL"
    annotations:
      summary: "Exporter down on *{{ $labels.instance }}*"
      description: "Not able to fetch application metrics from *{{ $labels.instance }}*"

- the ALERTS metric shows what is pending or firing over time
>> But the problem is that one of my ExporterDown alerts has been active for the past 10 days; there is no genuine reason for the alert to go to a resolved state.

- evaluate the alert expression in Prometheus for the given time period. Are there gaps, or does, for example, a label change before and after the gap?
>> There are no gaps in the Prometheus GUI console for that time period. The value of up has been zero constantly for the last 6 hours, but the alert still went to the resolved state during that time and then came back to firing.
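
For what it's worth, a quick query-side check for scrape gaps would be a sketch like the following (add the relevant instance/job matchers for the affected target):

# Samples per 10-minute window; with a constant scrape interval this should be a flat line,
# and any dip indicates missed scrapes for that series
count_over_time(up[10m])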

Stuart Clark

Nov 25, 2020, 8:26:47 AM
to yagyans...@gmail.com, Prometheus Users
On 25/11/2020 11:46, yagyans...@gmail.com wrote:
> The alert formation doesn't seem to be a problem here, because it
> happens for different alerts randomly. Below is the alert for Exporter
> being down for which it has happened thrice today.
>
>   - alert: ExporterDown
>     expr: up == 0
>     for: 10m
>     labels:
>       severity: "CRITICAL"
>     annotations:
>       summary: "Exporter down on *{{ $labels.instance }}*"
>       description: "Not able to fetch application metrics from *{{
> $labels.instance }}*"
>
> - the ALERTS metric shows what is pending or firing over time
> >> But the problem is that one of my ExporterDown alerts is active
> since the past 10 days, there is no genuine reason for the alert to go
> to a resolved state.
>
What do you have evaluation_interval set to in Prometheus, and
resolve_timeout in Alertmanager?

Is the alert definitely being resolved, as in you are getting a resolved
email/notification, or could it just be a notification for a long-running
alert? You should get another notification every now and then based on
repeat_interval.
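
For reference, repeat_interval is part of the Alertmanager route configuration; a minimal sketch, with an assumed receiver name:

# alertmanager.yml (sketch)
route:
  receiver: team-email     # hypothetical receiver name, for illustration only
  repeat_interval: 4h      # re-notify for still-firing alerts; Alertmanager's default is 4h
receivers:
  - name: team-email       # integration details omitted in this sketch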


Yagyansh S. Kumar

Nov 25, 2020, 9:08:00 AM
to Stuart Clark, Prometheus Users
Hi Stuart.


On Wed, 25 Nov, 2020, 6:56 pm Stuart Clark, <stuart...@jahingo.com> wrote:
What do you have evaluation_interval set to in Prometheus, and
resolve_timeout in Alertmanager?
>> My evaluation interval is 1m, whereas my scrape timeout and scrape interval are 25s. Resolve timeout in Alertmanager is 5m.
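
In config terms those values would sit roughly as follows; a sketch of the two files, not the actual configs:

# prometheus.yml
global:
  scrape_interval: 25s
  scrape_timeout: 25s
  evaluation_interval: 1m

# alertmanager.yml
global:
  resolve_timeout: 5m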

Is the alert definitely being resolved, as in you are getting a resolved email/notification, or could it just be a notification for a long-running alert? You should get another notification every now and then based on repeat_interval.
>> Yes, I suspected that too at the beginning, but I am logging each and every alert notification, and I have found that I am indeed getting a resolved notification for that alert and then a firing notification again the very next second.


Stuart Clark

Nov 25, 2020, 9:56:13 AM
to promethe...@googlegroups.com, Yagyansh S. Kumar, Prometheus Users
How many Alertmanager instances are there? Can they talk to each other and is Prometheus configured and able to push alerts to them all?

Yagyansh S. Kumar

Nov 25, 2020, 9:59:08 AM
to Stuart Clark, Prometheus Users


On Wed, 25 Nov, 2020, 8:26 pm Stuart Clark, <stuart...@jahingo.com> wrote:
How many Alertmanager instances are there? Can they talk to each other and is Prometheus configured and able to push alerts to them all?
>> Single instance as of now. I did set up an Alertmanager mesh of 2 Alertmanagers, but I am facing a duplicate-alert issue in that setup; that is another issue still pending for me. Hence, currently only a single Alertmanager is receiving alerts from my Prometheus instance.

Stuart Clark

Nov 25, 2020, 11:04:33 AM
to Yagyansh S. Kumar, Prometheus Users
Is the second instance still running?

If you are having some cluster communication issues, that could result in what you are seeing. Both instances learn of an alert, but then one instance misses some of the renewal messages, so it resolves the alert. Then it gets updated and the alert is fired again.

If you look in Prometheus (UI or ALERTS metric) does the alert continue for the whole period or does it have a gap?

Yagyansh S. Kumar

Nov 25, 2020, 11:27:52 AM
to Stuart Clark, Prometheus Users


On Wed, 25 Nov, 2020, 9:34 pm Stuart Clark, <stuart...@jahingo.com> wrote:
Is the second instance still running?

If you are having some cluster communication issues, that could result in what you are seeing. Both instances learn of an alert, but then one instance misses some of the renewal messages, so it resolves the alert. Then it gets updated and the alert is fired again.
>> Sorry, my bad. I forgot that I had enabled the mesh again. I have 2 Alertmanager instances running, and Prometheus is sending alerts to both Alertmanagers.

Instance 1 -  /usr/local/bin/alertmanager --config.file /etc/alertmanager/alertmanager.yml --storage.path /mnt/vol2/alertmanager --data.retention=120h --log.level=debug --web.listen-address=x.x.x.x:9093 --cluster.listen-address=x.x.x.x:9094 --cluster.peer=y.y.y.y:9094

Instance 2 - /usr/local/bin/alertmanager --config.file /etc/alertmanager/alertmanager.yml --storage.path /mnt/vol2/alertmanager --data.retention=120h --log.level=debug --web.listen-address=y.y.y.y:9093 --cluster.listen-address=y.y.y.y:9094 --cluster.peer=x.x.x.x:9094

Snippet from Prometheus config where both the alertmanagers are defined.
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 'x.x.x.x:9093'
      - 'y.y.y.y:9093'

If you look in Prometheus (UI or ALERTS metric) does the alert continue for the whole period or does it have a gap?
>> In the last day I do see one gap, but the timing of that gap does not match the resolved notification.
[screenshot attached]

Stuart Clark

Nov 25, 2020, 11:40:13 AM
to Yagyansh S. Kumar, Prometheus Users

If the alert did continue throughout, that suggests either a Prometheus -> Alertmanager communication issue (if enough updates are missed, Alertmanager will assume the alert has been resolved) or a clustering issue (as mentioned, an instance can end up out of sync and again assume an alert is resolved due to lack of updates).
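
The first case can be checked on the Prometheus side; for example, a sketch using Prometheus's notification-queue metrics:

# Alerts pushed to each Alertmanager, plus send errors and drops
rate(prometheus_notifications_sent_total[5m])
rate(prometheus_notifications_errors_total[5m])
rate(prometheus_notifications_dropped_total[5m])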

Alertmanager does expose various metrics, including ones about the clustering. Do you see anything within those that roughly matches the times you saw the blip?
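
For example, assuming both Alertmanager instances are scraped by Prometheus, metrics along these lines cover cluster health and notification activity:

# Gossip/cluster state as seen by each instance
alertmanager_cluster_members
alertmanager_cluster_failed_peers
alertmanager_cluster_health_score

# Notification and alert activity per instance
rate(alertmanager_notifications_total[5m])
rate(alertmanager_alerts_received_total[5m])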

Yagyansh S. Kumar

Nov 25, 2020, 11:50:36 AM
to Stuart Clark, Prometheus Users
>> Cluster metrics look perfectly fine for the last 24 hours.
[screenshot attached]

Although I do see a difference in the number of notifications between the two instances. Is that normal?
[screenshot attached]

But then again, the number of alerts (firing + resolved) received by both instances is exactly the same.
[screenshot attached]
