alertmanager alert is duplicated


Annonyme1

Sep 29, 2020, 11:09:02 AM
to Prometheus Users
Can anyone help with how to avoid receiving the same alert three times in a row (duplicate alerts)?

Brian Candler

Sep 29, 2020, 11:14:17 AM
to Prometheus Users
For assistance, you should:

1. Explain the circumstances that led up to the alert; give some samples of the alerts; explain what makes you think they are "duplicates" not fresh alerts.

2. Show the alerting rule which generated this alert, and your alertmanager configuration.

3. Describe whether you have any sort of HA configuration (multiple prometheus servers and/or multiple alertmanager servers)

Annonyme1

Sep 29, 2020, 11:45:18 AM
to Prometheus Users
1- I set up an alert for SSL expiry: I receive an alert when the SSL certificate will expire soon, but I get the same alert, with the same time and date, 3 times.
2-
alertmanager.yml
global:
  smtp_smarthost: 'mx2.so.com:25'
  smtp_from: 'alertm...@so.com'
  smtp_require_tls: false
  smtp_hello: 'alertmanager'
  smtp_auth_username: 'username'
  smtp_auth_password: 'password'


route:
  group_by: ['instance', 'alert']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 15m
  receiver: so

receivers:
  - name: 'so'
    email_configs:
      - to: 'md....@so.com'
      - to: 'infrast...@so.com'
    slack_configs:
      - channel: 'monitoring'
      - username: 'AlertManager'
      - icon_emoji: ':joy:'
alert.rules.yml

groups:
- name: example
  rules:
  - alert: SSLCertExpiringSoon

    expr: probe_ssl_earliest_cert_expiry{job="blackbox"} - time() < 86400 * 15

    for: 10s

    labels:

      severity: critical CASE sifast Sites

    annotations:

       summary: 'SSL certificat should be renewed as soon as possible '
3- I don't have an HA setup (no HAProxy); I'm working with only one Prometheus server.

Annonyme1

Sep 29, 2020, 11:59:03 AM
to Prometheus Users

Brian Candler

Sep 29, 2020, 12:18:12 PM
to Prometheus Users
Without seeing the content of those alert mails it's hard to know what's going on.  Do the alerts have identical labels?  Do they have identical timestamps, down to the second?

Is it possible that md..@so.com and infrast...@so.com are being expanded as aliases and forward to the same destinations?  Since you are relaying via your own smarthost ('mx2.so.com:25') it should be possible to look at logs on this host and check whether it's receiving one message from alertmanager or three separate copies.

What's your rule evaluation interval?  "for: 10s" is very short.  For test purposes, I'd be inclined to add "send_resolved: true" to your receiver.  This would let you see if the alert is triggering, resolving, and triggering again.  Having said that, I think it's unlikely that this is happening with the SSL expiry rule you posted.
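
For example, a rough sketch of what that could look like in the 'so' receiver you posted (untested; receiver name and address copied from your config):

receivers:
  - name: 'so'
    email_configs:
      - to: 'md....@so.com'
        send_resolved: true   # also send a notification when the alert resolves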

Do you get any relevant logs in alertmanager? Try "journalctl -eu alertmanager".

Annonyme1

Sep 30, 2020, 3:56:11 AM
to Prometheus Users



On Tuesday, 29 September 2020 17:18:12 UTC+1, Brian Candler wrote:
Without seeing the content of those alert mails it's hard to know what's going on.  Do the alerts have identical labels?  Do they have identical timestamps, down to the second?

Is it possible that m...@so.com and infrast...@so.com are being expanded as aliases and forward to the same destinations?  Since you are relaying via your own smarthost ('mx2.so.com:25') it should be possible to look at logs on this host and check whether it's receiving one message from alertmanager or three separate copies.

Annonyme1

Sep 30, 2020, 4:03:21 AM
to Prometheus Users

Brian Candler

Sep 30, 2020, 5:01:33 AM
to Prometheus Users
OK, if you get the slack alert 3 times too, that probably rules out an E-mail issue.

I think the next thing is to look at the stdout/stderr from alertmanager.  If you are running alertmanager under systemd, then:
journalctl -eu alertmanager

Annonyme1

Sep 30, 2020, 5:17:32 AM
to Prometheus Users
 alertmanager[6531]: level=info ts=2020-09-30T09:06:37.230Z caller=silence.go:379 component=silences msg="Running maintenance failed" err="open data/silences.54bf81a81abc

Annonyme1

Sep 30, 2020, 5:25:03 AM
to Prometheus Users

[screenshot attached]

Brian Candler

Sep 30, 2020, 6:36:26 AM
to Prometheus Users
On Wednesday, 30 September 2020 10:17:32 UTC+1, Annonyme1 wrote:
 alertmanager[6531]: level=info ts=2020-09-30T09:06:37.230Z caller=silence.go:379 component=silences msg="Running maintenance failed" err="open data/silences.54bf81a81abc



That shows a permissions problem (file permissions, SELinux, or AppArmor), which means that alertmanager isn't able to update its state on disk.  You certainly want to fix that; it might be the underlying problem causing your alerts to be resent.

I was also wondering if you saw any other messages around the time that the alerts are sent out.  They normally have "component=dispatcher".

Annonyme1

Sep 30, 2020, 10:00:12 AM
to Prometheus Users
I solved that permission error, but I still get the 3 alerts in a row. I have already removed "repeat_interval" from the .yml file.

Brian Candler

Sep 30, 2020, 10:26:58 AM
to Prometheus Users
I'll ask one last time for the logs from alertmanager around the time of the deliveries:

journalctl -eu alertmanager

Or: run alertmanager in the foreground, and watch it generate logs to stdout/stderr.

Annonyme1

Sep 30, 2020, 12:02:40 PM
to Prometheus Users
Sorry for disturbing you, sir!
But I got this error in the logs and couldn't solve it:
- systemctl status alertmanager:
alertmanager[8182]: level=info ts=2020-09-30T14:52:18.288Z caller=silence.go:379 component=silences msg="Running maintenance failed" err="open data/silences.9c966613546cce2: permission denied"
- journalctl -eu alertmanager:
alertmanager[8303]: level=info ts=2020-09-30T11:00:08.753Z caller=cluster.go:632 component=cluster msg="gossip not settled but continuing anyway" polls=0 elapsed=32.5215
alertmanager[8303]: level=info ts=2020-09-30T11:00:08.753Z caller=silence.go:388 component=silences msg="Creating shutdown snapshot failed" err="open data/silences.1f7ca

Brian Candler

Sep 30, 2020, 3:43:31 PM
to Prometheus Users
You will need to sort out the permissions problem yourself.  I don't have visibility of your system for file ownership and permissions, SELinux settings etc.  If necessary, ask a local system administrator for help.

I don't get those errors:

Sep 30 14:46:15 prometheus alertmanager[409]: level=debug ts=2020-09-30T14:46:15.718Z caller=nflog.go:336 component=nflog msg="Running maintenance"
Sep 30 14:46:15 prometheus alertmanager[409]: level=debug ts=2020-09-30T14:46:15.750Z caller=nflog.go:338 component=nflog msg="Maintenance done" duration=31.326992ms size=413

As for getting additional logs when delivering alerts, I expect you need to run alertmanager with flag "--log.level=debug".  I am running with that flag, and I get messages like

Sep 30 14:22:00 prometheus alertmanager[409]: level=debug ts=2020-09-30T14:22:00.559Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=PPPoE[6cf3489][active]
Sep 30 14:22:30 prometheus alertmanager[409]: level=debug ts=2020-09-30T14:22:30.558Z caller=dispatch.go:135 component=dispatcher msg="Received alert" alert=PPPoE[6cf3489][resolved]

Annonyme1

Oct 1, 2020, 7:16:31 AM
to Prometheus Users
I solved the permission problem!
Now when I run:
[root@grafana ~]# alertmanager --log.level=debug
level=info ts=2020-10-01T11:10:14.872Z caller=main.go:197 msg="Starting Alertmanager" version="(version=0.18.0, branch=HEAD, revision=1ace0f76b7101cccc149d7298022df36039858ca)"
level=info ts=2020-10-01T11:10:14.872Z caller=main.go:198 build_context="(go=go1.12.6, user=root@868685ed3ed0, date=20190708-14:31:49)"
level=debug ts=2020-10-01T11:10:14.872Z caller=cluster.go:149 component=cluster msg="resolved peers to following addresses" peers=
level=warn ts=2020-10-01T11:10:14.873Z caller=cluster.go:154 component=cluster err="couldn't deduce an advertise address: no private IP found, explicit advertise addr not provided"
level=debug ts=2020-10-01T11:10:14.873Z caller=cluster.go:306 component=cluster memberlist="2020/10/01 12:10:14 [DEBUG] memberlist: Got bind error: Failed to start TCP listener on \"0.0.0.0\" port 9094: listen tcp 0.0.0.0:9094: bind: address already in use\n"
level=error ts=2020-10-01T11:10:14.873Z caller=main.go:222 msg="unable to initialize gossip mesh" err="create memberlist: Could not set up network transport: failed to obtain an address: Failed to start TCP listener on \"0.0.0.0\" port 9094: listen tcp 0.0.0.0:9094: bind: address already in use"
and journalctl -eu alertmanager doesn't show any errors:
systemd[1]: Started Alertmanager.
alertmanager[10888]: level=info ts=2020-10-01T08:01:39.035Z caller=main.go:197 msg="Starting Alertmanager" version="(version=0.18.0, branch=HEAD, revision=1ace0f76b7101cc
alertmanager[10888]: level=info ts=2020-10-01T08:01:39.035Z caller=main.go:198 build_context="(go=go1.12.6, user=root@868685ed3ed0, date=20190708-14:31:49)"
alertmanager[10888]: level=info ts=2020-10-01T08:01:39.041Z caller=cluster.go:623 component=cluster msg="Waiting for gossip to settle..." interval=2s
alertmanager[10888]: level=info ts=2020-10-01T08:01:39.070Z caller=coordinator.go:119 component=configuration msg="Loading configuration file" file=/etc/alertmanager/aler
alertmanager[10888]: level=info ts=2020-10-01T08:01:39.070Z caller=coordinator.go:131 component=configuration msg="Completed loading of configuration file" file=/etc/aler
alertmanager[10888]: level=info ts=2020-10-01T08:01:39.074Z caller=main.go:429 msg=Listening address=:9093
alertmanager[10888]: level=info ts=2020-10-01T08:01:41.041Z caller=cluster.go:648 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.000123808s
alertmanager[10888]: level=info ts=2020-10-01T08:01:49.042Z caller=cluster.go:640 component=cluster msg="gossip settled; proceeding" elapsed=10.001040498s
systemd[1]: Stopping Alertmanager...
alertmanager[10888]: level=info ts=2020-10-01T10:29:39.684Z caller=main.go:468 msg="Received SIGTERM, exiting gracefully..."
Stopped Alertmanager.
systemd[1]: Started Alertmanager.
alertmanager[7983]: level=info ts=2020-10-01T10:29:39.750Z caller=main.go:197 msg="Starting Alertmanager" version="(version=0.18.0, branch=HEAD, revision=1ace0f76b7101ccc
alertmanager[7983]: level=info ts=2020-10-01T10:29:39.750Z caller=main.go:198 build_context="(go=go1.12.6, user=root@868685ed3ed0, date=20190708-14:31:49)"
alertmanager[7983]: level=info ts=2020-10-01T10:29:39.752Z caller=cluster.go:623 component=cluster msg="Waiting for gossip to settle..." interval=2s
alertmanager[7983]: level=info ts=2020-10-01T10:29:39.779Z caller=coordinator.go:119 component=configuration msg="Loading configuration file" file=/etc/alertmanager/alert
alertmanager[7983]: level=info ts=2020-10-01T10:29:39.779Z caller=coordinator.go:131 component=configuration msg="Completed loading of configuration file" file=/etc/alert
alertmanager[7983]: level=info ts=2020-10-01T10:29:39.782Z caller=main.go:429 msg=Listening address=:9093
alertmanager[7983]: level=info ts=2020-10-01T10:29:41.752Z caller=cluster.go:648 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.000137598s
alertmanager[7983]: level=info ts=2020-10-01T10:29:49.763Z caller=cluster.go:640 component=cluster msg="gossip settled; proceeding" elapsed=10.01103275s

but I still have the same duplicated-alert problem.

Brian Candler

Oct 1, 2020, 7:37:16 AM
to Prometheus Users
On Thursday, 1 October 2020 12:16:31 UTC+1, Annonyme1 wrote:
level=debug ts=2020-10-01T11:10:14.873Z caller=cluster.go:306 component=cluster memberlist="2020/10/01 12:10:14 [DEBUG] memberlist: Got bind error: Failed to start TCP listener on \"0.0.0.0\" port 9094: listen tcp 0.0.0.0:9094: bind: address already in use\n"
level=error ts=2020-10-01T11:10:14.873Z caller=main.go:222 msg="unable to initialize gossip mesh" err="create memberlist: Could not set up network transport: failed to obtain an address: Failed to start TCP listener on \"0.0.0.0\" port 9094: listen tcp 0.0.0.0:9094: bind: address already in use"

You have another instance of alertmanager running.  You need to stop that one (e.g. "systemctl stop alertmanager"), before running alertmanager at the command line.

Alternatively, if you want to run your systemd instance of alertmanager with --log.level=debug, then you need to edit the unit file.


but I still have the same duplicated-alert problem.


Let's see logs from alertmanager, with --log.level=debug enabled, at the time the alerts are sent out.

Annonyme1

Oct 1, 2020, 10:26:42 AM
to Prometheus Users
1- I couldn't find the other instance, sir! When I stop alertmanager.service I don't receive any alerts.
2- How can I edit the file?
3- I added --log.level=debug, then:
level=debug ts=2020-10-01T14:15:19.477Z caller=cluster.go:233 component=cluster msg="joined cluster" peers=0
alertmanager[1430]: level=info ts=2020-10-01T14:15:19.510Z caller=cluster.go:623 component=cluster msg="Waiting for gossip to settle..." interval=2s
alertmanager[1430]: level=info ts=2020-10-01T14:15:19.550Z caller=coordinator.go:119 component=configuration msg="Loading configuration file" file=/etc/alertmanager/alert
alertmanager[1430]: level=info ts=2020-10-01T14:15:19.551Z caller=coordinator.go:131 component=configuration msg="Completed loading of configuration file" file=/etc/alert
alertmanager[1430]: level=info ts=2020-10-01T14:15:19.553Z caller=main.go:429 msg=Listening address=:9093
level=info ts=2020-10-01T14:15:21.510Z caller=cluster.go:648 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.000135331s
level=debug ts=2020-10-01T14:15:23.510Z caller=cluster.go:645 component=cluster msg="gossip looks settled" elapsed=4.000586485s
level=debug ts=2020-10-01T14:15:25.510Z caller=cluster.go:645 component=cluster msg="gossip looks settled" elapsed=6.000806992s
 level=debug ts=2020-10-01T14:15:27.511Z caller=cluster.go:645 component=cluster msg="gossip looks settled" elapsed=8.001014456s
level=info ts=2020-10-01T14:15:29.515Z caller=cluster.go:640 component=cluster msg="gossip settled; proceeding" elapsed=10.005216728s
level=debug ts=2020-10-01T14:16:15.562Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=SSLCertExpiringSoon[a5cda38][active]
level=debug ts=2020-10-01T14:16:15.563Z caller=dispatch.go:430 component=dispatcher aggrGroup="{}:{instance=\"https://lacompagnie.com\"}" msg=fl
level=debug ts=2020-10-01T14:18:15.561Z caller=dispatch.go:104 component=dispatcher msg="Received alert" alert=SSLCertExpiringSoon[a5cda38][active]

Brian Candler

Oct 1, 2020, 10:39:23 AM
to Prometheus Users
I think we're getting off topic here; we are now just into general system administration. Briefly:

* "systemctl stop alertmanager" will stop the systemd-managed version of alertmanager, so you can run a separate instance at the CLI.

* The file which contains the systemd config depends on where you (or your packager) put it.  It might be /etc/systemd/system/alertmanager.service. After editing this file, you need to do "systemctl daemon-reload" to pick up the changes.
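
For example, the relevant part of the unit file might look roughly like this (a sketch only; the binary path and file location below are guesses and may differ on your system):

# e.g. /etc/systemd/system/alertmanager.service (location may vary)
[Service]
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --log.level=debug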

Your logs show a single alert at 14:16:15 and another single alert at 14:18:15.  Did you get three copies of both of these delivered?

I'm afraid I don't really have any idea what's happening.  Your system sends out alerts 3 times; my system (and apparently everybody else's) only sends out alerts once.

I can only suggest that you check the whole of your prometheus configuration from top to bottom.  Could you be installing three separate alerting rules, all of which alert on the same condition?  The problem will be something silly like that.

FWIW, I have alerts very similar to yours:

# In prometheus.yml
rule_files:
  - /etc/prometheus/rules.d/*.yml

# /etc/prometheus/rules.d/alert-certificate.yml
groups:
  - name: Certificates
    interval: 1m
    rules:
      - alert: CertificateInvalid
        expr: probe_success{module="certificate"} != 1
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: 'Certificate is invalid or service could not be reached'
  - name: CertificateLifetime
    interval: 60m
    rules:
      - alert: CertificateExpiring
        expr: (probe_ssl_earliest_cert_expiry - time())/86400 < 14
        for: 120m
        labels:
          severity: warning
        annotations:
          summary: 'Certificate is expiring soon: {{ $value }} days'

and I don't get duplicates.

Annonyme1

Oct 1, 2020, 11:58:19 AM
to Prometheus Users
Yes, I got three alerts.
In my case I did:
# In prometheus.yml
# Global config
global:
  scrape_interval:     1m # Set the scrape interval to every 1 minute (the default).
  evaluation_interval: 1m # Evaluate rules every 1 minute (the default).
  scrape_timeout: 15s  # scrape_timeout raised from the global default (10s).

rule_files:
 - 'alert.rules.yml'

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090','localhost:9115','localhost:9182']


# /etc/prometheus/alert.rules.yml
groups:
- name: example
  rules:
  - alert: SSLCertExpiringSoon

    expr: probe_ssl_earliest_cert_expiry{job="blackbox"} - time() < 86400 * 30

    for: 10s

    labels:

      severity: critical CASE sifast Sites

    annotations:

       summary: 'SSL certificat should be renewed as soon as possible '

#in /etc/alertmanager/alertmanager.yml
global:
  smtp_smarthost: 'mx2.sot.com:25'
  smtp_from: 'alertm...@so.com'
  smtp_require_tls: false
  smtp_hello: 'alertmanager'
  smtp_auth_username: 'username'
  smtp_auth_password: 'password'


route:
  group_by: ['instance', 'alert']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 5d
  receiver: so

receivers:
  - name: 'so'
    email_configs:
      - to: 'infrast...@so.com'
    slack_configs:
      - channel: 'monitoring'
      - username: 'AlertManager-user'
      - icon_emoji: ':joy:'
I notice that the duplicated alerts come with 2 different usernames, "AlertManager-user" and "alertmanager".
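
As a sketch: in YAML, every leading '-' under slack_configs starts a separate slack config entry, so the separate '- channel:', '- username:' and '- icon_emoji:' lines are parsed as three entries. A single entry would be written like this (values copied from the config above):

receivers:
  - name: 'so'
    email_configs:
      - to: 'infrast...@so.com'
    slack_configs:
      - channel: 'monitoring'          # one '-' = one slack config;
        username: 'AlertManager-user'  # these three fields all belong
        icon_emoji: ':joy:'            # to the same single entry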
