On 25/11/2020 11:46, yagyans...@gmail.com wrote:
> The alert definition doesn't seem to be the problem here, because this
> happens randomly for different alerts. Below is the ExporterDown alert,
> for which it has happened three times today.
>
> - alert: ExporterDown
>   expr: up == 0
>   for: 10m
>   labels:
>     severity: "CRITICAL"
>   annotations:
>     summary: "Exporter down on *{{ $labels.instance }}*"
>     description: "Not able to fetch application metrics from *{{ $labels.instance }}*"
>
> - the ALERTS metric shows what is pending or firing over time
> >> But the problem is that one of my ExporterDown alerts has been active
> for the past 10 days; there is no genuine reason for the alert to go to
> a resolved state.
>
What do you have evaluation_interval set to in Prometheus, and
resolve_timeout in Alertmanager?
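For reference, these are the settings I mean; the values below are only examples, not necessarily what you have:

# prometheus.yml
global:
  evaluation_interval: 1m

# alertmanager.yml
global:
  resolve_timeout: 5m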
Is the alert definitely being resolved, as in you are getting a resolved
email/notification, or could it just be a repeat notification for a
long-running alert? You should get another email/notification every now
and then based on repeat_interval.
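For reference, repeat_interval lives on the route in alertmanager.yml, something like the below; the receiver name and timings here are just placeholders:

route:
  receiver: 'my-email-receiver'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h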
How many Alertmanager instances are there? Can they talk to each other and is Prometheus configured and able to push alerts to them all?
Is the second instance still running?
If you are having cluster communication issues, that could result in what you are seeing: both instances learn of an alert, but then one instance misses some of the renewal messages and so resolves it. Then it gets updated and the alert fires again.
If you look in Prometheus (UI or ALERTS metric) does the alert continue for the whole period or does it have a gap?
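For example, graphing something like this in the Prometheus UI over the affected window (with the alertname matching your rule) should make any gap obvious:

ALERTS{alertname="ExporterDown", alertstate="firing"}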

On Wed, 25 Nov, 2020, 9:34 pm Stuart Clark, <stuart...@jahingo.com> wrote:
Is the second instance still running?
If you are having cluster communication issues, that could result in what you are seeing: both instances learn of an alert, but then one instance misses some of the renewal messages and so resolves it. Then it gets updated and the alert fires again.
>> Sorry, my bad. I forgot that I had enabled the mesh again. I have 2 Alertmanager instances running, and Prometheus is sending alerts to both of them.
Instance 1 - /usr/local/bin/alertmanager --config.file /etc/alertmanager/alertmanager.yml --storage.path /mnt/vol2/alertmanager --data.retention=120h --log.level=debug --web.listen-address=x.x.x.x:9093 --cluster.listen-address=x.x.x.x:9094 --cluster.peer=y.y.y.y:9094
Instance 2 - /usr/local/bin/alertmanager --config.file /etc/alertmanager/alertmanager.yml --storage.path /mnt/vol2/alertmanager --data.retention=120h --log.level=debug --web.listen-address=y.y.y.y:9093 --cluster.listen-address=y.y.y.y:9094 --cluster.peer=x.x.x.x:9094
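For what it's worth, each instance's status endpoint (assuming the v2 API, which is enabled by default) should list the other as a cluster peer, e.g.:

curl -s http://x.x.x.x:9093/api/v2/status
curl -s http://y.y.y.y:9093/api/v2/status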
Snippet from the Prometheus config where both Alertmanagers are defined:
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'x.x.x.x:9093'
            - 'y.y.y.y:9093'
If you look in Prometheus (UI or ALERTS metric) does the alert continue for the whole period or does it have a gap?
>> In the last day I do see one gap, but the timing of that gap does not match the resolved notification.
If the alert did continue throughout, that suggests either a Prometheus -> Alertmanager communication issue (if enough updates are missed, Alertmanager would assume the alert has been resolved) or a clustering issue (as mentioned, you can end up with an instance being out of sync, again assuming an alert is resolved due to lack of updates).
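On the Prometheus side, the notification queue exposes counters that would show failed or dropped pushes to Alertmanager; graphing something like these over the same window should confirm or rule that out:

rate(prometheus_notifications_sent_total[5m])
rate(prometheus_notifications_errors_total[5m])
rate(prometheus_notifications_dropped_total[5m])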
Alertmanager does expose various metrics, including ones about
the clustering. Do you see anything within those that matches
roughly the times you saw the blip?
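Off the top of my head, the ones I would graph for the same window are something like:

alertmanager_cluster_members
alertmanager_cluster_failed_peers
alertmanager_cluster_health_score
rate(alertmanager_notifications_failed_total[5m])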


