Cluster Listen address vs Advertise address in AlertManager


sweta Das

Jun 28, 2019, 2:29:09 PM
to Prometheus Users

I am trying to set up Alertmanager in HA mode, using docker-compose to start the instances. Here are the configs for the two instances:

alertmanager:
  image: prom/alertmanager
  restart: always
  logging:
    # limit logs retained on host to 25MB
    driver: "json-file"
    options:
      max-size: "500k"
      max-file: "50"
  volumes:
    - ./config:/prometheus
    - /var/lib/grafana/alertmanager:/data
  command:
    - '--config.file=/prometheus/alertmanager.yml'
    - '--storage.path=/data'
    - '--cluster.listen-address=localhost:9093'
    - '--cluster.peer=1xx.xx.xx.136:9093'
  ports:
    - 9093:9093
    - 9094:9094/udp
alertmanager:
  image: prom/alertmanager
  restart: always
  logging:
    # limit logs retained on host to 25MB
    driver: "json-file"
    options:
      max-size: "500k"
      max-file: "50"
  volumes:
    - ./config:/prometheus
    - /var/lib/grafana/alertmanager:/data
  command:
    - '--config.file=/prometheus/alertmanager.yml'
    - '--storage.path=/data'
    - '--cluster.listen-address=localhost:9093'
    - '--cluster.peer=1xx.xx.xx.137:9093'
  ports:
    - 9093:9093
    - 9094:9094/udp

Each instance fails to join the other with the errors below (this log is from just one Alertmanager):

level=warn ts=2019-06-28T16:38:58.104296695Z caller=cluster.go:154 component=cluster err="couldn't deduce an advertise address: failed to parse bind addr 'localhost'"
level=warn ts=2019-06-28T16:39:08.107555731Z caller=cluster.go:226 component=cluster msg="failed to join cluster" err="1 error occurred:\n\t* Failed to join 1xx.xx.xx.136: read tcp 1xx.19.0.5:41214->1xx.xx.xx.136: i/o timeout\n\n"
level=info ts=2019-06-28T16:39:08.107599804Z caller=cluster.go:228 component=cluster msg="will retry joining cluster every 10s"
level=warn ts=2019-06-28T16:39:08.107631853Z caller=main.go:230 msg="unable to join gossip mesh" err="1 error occurred:\n\t* Failed to join 1xx.xx.xx.136: read tcp 1xx.19.0.5:41214->1xx.xx.xx.136:9093: i/o timeout\n\n"
level=info ts=2019-06-28T16:39:08.107693688Z caller=cluster.go:613 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2019-06-28T16:39:08.140619467Z caller=coordinator.go:119 component=configuration msg="Loading configuration file" file=/prometheus/alertmanager.yml
level=info ts=2019-06-28T16:39:08.141617461Z caller=coordinator.go:131 component=configuration msg="Completed loading of configuration file" file=/prometheus/alertmanager.yml
level=info ts=2019-06-28T16:39:08.145128833Z caller=main.go:365 msg=Listening address=:9093
level=error ts=2019-06-28T16:39:08.145275648Z caller=main.go:367 msg="Listen error" err="listen tcp :9093: bind: address already in use"

I checked that port 9093 is used only by Alertmanager on that host; nothing else is using it. There is also connectivity between the hosts on port 9093, as telnet works fine. The UDP connection is fine too.

And if I remove the listen or advertise parameters, I get the errors below:

level=info ts=2019-06-28T16:57:54.175757472Z caller=main.go:141 build_context="(go=go1.12.4, user=root@932a86a52b76, date=20190503-09:10:07)"
level=info ts=2019-06-28T16:57:54.1764299Z caller=cluster.go:161 component=cluster msg="setting advertise address explicitly" addr=172.19.0.5 port=9094
level=warn ts=2019-06-28T16:57:54.18422936Z caller=cluster.go:226 component=cluster msg="failed to join cluster" err="1 error occurred:\n\t* Failed to join 1xx.xx.xx.136: received invalid msgType (72), expected pushPullMsg (6) from=1xx.xx.xx.136:9093\n\n"
level=info ts=2019-06-28T16:57:54.184265727Z caller=cluster.go:228 component=cluster msg="will retry joining cluster every 10s"
level=warn ts=2019-06-28T16:57:54.184284236Z caller=main.go:230 msg="unable to join gossip mesh" err="1 error occurred:\n\t* Failed to join 1xx.xx.xx.136: received invalid msgType (72), expected pushPullMsg (6) from=172.17.21.137:9093\n\n"
level=info ts=2019-06-28T16:57:54.191170679Z caller=cluster.go:613 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2019-06-28T16:57:54.222369961Z caller=coordinator.go:119 component=configuration msg="Loading configuration file" file=/prometheus/alertmanager.yml
level=info ts=2019-06-28T16:57:54.222773958Z caller=coordinator.go:131 component=configuration msg="Completed loading of configuration file" file=/prometheus/alertmanager.yml
level=info ts=2019-06-28T16:57:54.225423449Z caller=main.go:365 msg=Listening address=:9093
level=info ts=2019-06-28T16:57:56.191493442Z caller=cluster.go:638 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.000213756s
level=info ts=2019-06-28T16:58:04.193151572Z caller=cluster.go:630 component=cluster msg="gossip settled; proceeding" elapsed=10.001876299s
level=warn ts=2019-06-28T16:58:09.1931086Z caller=cluster.go:428 component=cluster msg=refresh result=failure addr=1xx.xx.xx.136:9093

Can anyone confirm whether I am using the listen and advertise address parameters incorrectly?

Simon Pasquier

Jul 4, 2019, 8:35:04 AM
to sweta Das, Prometheus Users
By default, port 9093 is for HTTP and 9094 is for clustering. I suggest
you use the default setting, "--cluster.listen-address=:9094". Using
"localhost" won't work because, if I understand correctly, the
Alertmanager instances are running in different containers.
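
As an illustration of the suggestion above, here is a minimal sketch of how one of the two compose services might look with clustering moved to port 9094. It is not a tested configuration. HOST_IP is a hypothetical placeholder for the routable address of the host this instance runs on, and the "--cluster.advertise-address" line is an assumption: the second log shows the instance advertising a container-internal address (172.19.0.5), which the peer on the other host probably cannot reach. Publishing 9094 over TCP as well as UDP is also an assumption, based on the gossip layer using both. The logging section is omitted for brevity.

```yaml
# Sketch for the host that peers with 1xx.xx.xx.136; the other host would
# mirror it with the peer and advertise addresses swapped.
alertmanager:
  image: prom/alertmanager
  restart: always
  volumes:
    - ./config:/prometheus
    - /var/lib/grafana/alertmanager:/data
  command:
    - '--config.file=/prometheus/alertmanager.yml'
    - '--storage.path=/data'
    # Cluster traffic on the default port, on all container interfaces,
    # so it no longer collides with the HTTP listener on 9093.
    - '--cluster.listen-address=:9094'
    # Assumption: advertise the host's address rather than the container's
    # bridge address. HOST_IP is a placeholder, not a real value.
    - '--cluster.advertise-address=HOST_IP:9094'
    # The peer is addressed on its cluster port, not its HTTP port.
    - '--cluster.peer=1xx.xx.xx.136:9094'
  ports:
    - "9093:9093"       # HTTP API and UI
    - "9094:9094/tcp"   # cluster (assumed to be needed alongside UDP)
    - "9094:9094/udp"
```

This would also be consistent with both sets of errors in the original post. With "--cluster.listen-address=localhost:9093", the cluster listener and the HTTP listener compete for port 9093 inside the same container, which matches the "address already in use" message. With the peer set to port 9093, the join attempt reaches the peer's HTTP server; msgType 72 is the ASCII code for "H", which is what the first byte of an HTTP response would be.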