AlertManager Mesh - Behaviour & Advantages


Rahul Srivastava

unread,
May 8, 2017, 5:25:23 AM5/8/17
to Prometheus Developers
Hi,

It seems AlertManager supports HA by creating a mesh of AlertManagers [1]. The Prometheus server can then point to each of these instances in the AlertManager mesh by specifying their URLs in alertmanager.url, as follows:
[[
./prometheus -config.file=prometheus.yml -alertmanager.url http://localhost:9095,http://localhost:9094,http://localhost:9093
]]

However, when the config file for one of the AlertManagers in the mesh is changed, it seems the change is not replicated to the other instances in the mesh?
I would appreciate it if someone could highlight the advantages of using a mesh of AlertManagers.


Thanks,
Rahul.

Brian Brazil

unread,
May 8, 2017, 5:36:43 AM5/8/17
to Rahul Srivastava, Prometheus Developers
On 8 May 2017 at 10:25, Rahul Srivastava <norm...@gmail.com> wrote:
Hi,

It seems AlertManager supports HA by creating a mesh of AlertManagers [1]. The Prometheus server can then point to each of these instances in the AlertManager mesh by specifying their URLs in alertmanager.url, as follows:
[[
./prometheus -config.file=prometheus.yml -alertmanager.url http://localhost:9095,http://localhost:9094,http://localhost:9093
]]

However, when the config file for one of the AlertManagers in the mesh is changed, it seems the change is not replicated to the other instances in the mesh?

That's correct. You need to update all of them.
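For example, one minimal way to do that (a sketch only; the hostnames and paths below are placeholders): push the same alertmanager.yml to every instance and trigger a reload on each. Alertmanager re-reads its configuration on SIGHUP or on an HTTP POST to /-/reload.

# hypothetical hosts; substitute your own
for host in am1.example.com am2.example.com am3.example.com; do
  scp alertmanager.yml "$host":/etc/alertmanager/alertmanager.yml
  # trigger a config reload on the running Alertmanager process
  ssh "$host" 'pkill -HUP alertmanager'
done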
 
I would appreciate it if someone could highlight the advantages of using a mesh of AlertManagers.

If one AM goes down or there's a network partition, notifications will still happen.

Brian
 


Rahul Srivastava

unread,
May 8, 2017, 6:03:58 AM5/8/17
to Prometheus Developers, norm...@gmail.com
On Monday, 8 May 2017 15:06:43 UTC+5:30, Brian Brazil wrote:
On 8 May 2017 at 10:25, Rahul Srivastava <norm...@gmail.com> wrote:
Hi,

It seems AlertManager supports HA by creating a mesh of AlertManagers [1]. The Prometheus server can then point to each of these instances in the AlertManager mesh by specifying their URLs in alertmanager.url, as follows:
[[
./prometheus -config.file=prometheus.yml -alertmanager.url http://localhost:9095,http://localhost:9094,http://localhost:9093
]]

However, when the config file for one of the AlertManagers in the mesh is changed, it seems the change is not replicated to the other instances in the mesh?

That's correct. You need to update all of them.

So it seems they are just a set of individual AlertManagers running.
 
 
I would appreciate it if someone could highlight the advantages of using a mesh of AlertManagers.

If one AM goes down or there's a network partition, notifications will still happen.

Well, IIUC, the AlertManagers need not necessarily be in a mesh from that perspective. Prometheus is configured to point to the URLs of all the AlertManagers, so if one AlertManager goes down, Prometheus can still send alerts to the others. How does it really matter whether those AlertManagers are part of a mesh or not? A notification would be sent out as long as at least one AlertManager is alive. So that brings us back to: what advantage does a *mesh* offer over individual AlertManagers?

Thanks,
Rahul.

 

Brian Brazil

unread,
May 8, 2017, 6:06:20 AM5/8/17
to Rahul Srivastava, Prometheus Developers
On 8 May 2017 at 11:03, Rahul Srivastava <norm...@gmail.com> wrote:
On Monday, 8 May 2017 15:06:43 UTC+5:30, Brian Brazil wrote:
On 8 May 2017 at 10:25, Rahul Srivastava <norm...@gmail.com> wrote:
Hi,

It seems AlertManager supports HA by creating a mesh of AlertManagers [1]. The Prometheus server can then point to each of these instances in the AlertManager mesh by specifying their URLs in alertmanager.url, as follows:
[[
./prometheus -config.file=prometheus.yml -alertmanager.url http://localhost:9095,http://localhost:9094,http://localhost:9093
]]

However, when the config file for one of the AlertManagers in the mesh is changed, it seems the change is not replicated to the other instances in the mesh?

That's correct. You need to update all of them.

So it seems they are just a set of individual AlertManagers running.

Largely.
 
 
 
I would appreciate it if someone could highlight the advantages of using a mesh of AlertManagers.

If one AM goes down or there's a network partition, notifications will still happen.

Well, IIUC, the AlertManagers need not necessarily be in a mesh from that perspective. Prometheus is configured to point to the URLs of all the AlertManagers, so if one AlertManager goes down, Prometheus can still send alerts to the others. How does it really matter whether those AlertManagers are part of a mesh or not? A notification would be sent out as long as at least one AlertManager is alive. So that brings us back to: what advantage does a *mesh* offer over individual AlertManagers?

In normal operations, you only get one notification no matter how many AMs are in the mesh. With n separate AMs, you'd get n notifications.

Brian 

Thanks,
Rahul.

 


Julius Volz

unread,
May 8, 2017, 6:11:59 AM5/8/17
to Brian Brazil, Rahul Srivastava, Prometheus Developers
See how this works in Fabian's 5-minute lightning talk from PromCon 2016: https://www.youtube.com/watch?v=XvqaYbiTOMg


Rahul Srivastava (र।हुल श्रीवास्तव)

unread,
May 8, 2017, 6:46:23 AM5/8/17
to Brian Brazil, Prometheus Developers
On Mon, May 8, 2017 at 3:36 PM, Brian Brazil <brian....@robustperception.io> wrote:
On 8 May 2017 at 11:03, Rahul Srivastava <norm...@gmail.com> wrote:
On Monday, 8 May 2017 15:06:43 UTC+5:30, Brian Brazil wrote:
On 8 May 2017 at 10:25, Rahul Srivastava <norm...@gmail.com> wrote:
Hi,

It seems AlertManager supports HA by creating a mesh of AlertManagers [1]. The Prometheus server can then point to each of these instances in the AlertManager mesh by specifying their URLs in alertmanager.url, as follows:
[[
./prometheus -config.file=prometheus.yml -alertmanager.url http://localhost:9095,http://localhost:9094,http://localhost:9093
]]

However, when the config file for one of the AlertManagers in the mesh is changed, it seems the change is not replicated to the other instances in the mesh?

That's correct. You need to update all of them.

So it seems they are just a set of individual AlertManagers running.

Largely.

So it seems the participants in an AlertManager mesh can all have different configs? If so, how does the aggregation of alerts work? Say the group_wait, etc. may have different values in each AlertManager participating in the mesh.
 
 
 
 
I would appreciate it if someone could highlight the advantages of using a mesh of AlertManagers.

If one AM goes down or there's a network partition, notifications will still happen.

Well, IIUC, the AlertManagers need not necessarily be in a mesh from that perspective. Prometheus is configured to point to the URLs of all the AlertManagers, so if one AlertManager goes down, Prometheus can still send alerts to the others. How does it really matter whether those AlertManagers are part of a mesh or not? A notification would be sent out as long as at least one AlertManager is alive. So that brings us back to: what advantage does a *mesh* offer over individual AlertManagers?

In normal operations, you only get one notification no matter how many AMs are in the mesh. With n separate AMs, you'd get n notifications.

That makes a lot of sense. 

Btw, is there a primary node in the AlertManager mesh? If so, what happens when that primary goes down?

Primary in Mesh: ./alertmanager ... -mesh.listen-address=:8001 
Participant in Mesh: ./alertmanager ... -mesh.peer=127.0.0.1:8001 

Thanks,
Rahul.

 

Brian 

Thanks,
Rahul.

 


Rahul Srivastava (र।हुल श्रीवास्तव)

unread,
May 8, 2017, 6:49:50 AM5/8/17
to Julius Volz, Brian Brazil, Prometheus Developers
That's a great 5-min talk. Thanks for sharing.

From the talk it seems there is information that is propagated to the other nodes in the mesh. Why, then, are config changes not synced across the mesh?

Thanks,
Rahul.


On Mon, May 8, 2017 at 3:41 PM, Julius Volz <juliu...@gmail.com> wrote:
See how this works in Fabian's 5-minute lightning talk from PromCon 2016: https://www.youtube.com/watch?v=XvqaYbiTOMg

Brian Brazil

unread,
May 8, 2017, 6:57:45 AM5/8/17
to Rahul Srivastava (र।हुल श्रीवास्तव), Prometheus Developers


On 8 May 2017 at 11:46, Rahul Srivastava (र।हुल श्रीवास्तव) <norm...@gmail.com> wrote:

On Mon, May 8, 2017 at 3:36 PM, Brian Brazil <brian.brazil@robustperception.io> wrote:
On 8 May 2017 at 11:03, Rahul Srivastava <norm...@gmail.com> wrote:
On Monday, 8 May 2017 15:06:43 UTC+5:30, Brian Brazil wrote:
On 8 May 2017 at 10:25, Rahul Srivastava <norm...@gmail.com> wrote:
Hi,

It seems AlertManager supports HA by creating a mesh of AlertManagers [1]. The Prometheus server can then point to each of these instances in the AlertManager mesh by specifying their URLs in alertmanager.url, as follows:
[[
./prometheus -config.file=prometheus.yml -alertmanager.url http://localhost:9095,http://localhost:9094,http://localhost:9093
]]

However, when the config file for one of the AlertManagers in the mesh is changed, it seems the change is not replicated to the other instances in the mesh?

That's correct. You need to update all of them.

So it seems they are just a set of individual AlertManagers running.

Largely.

So it seems the participants in an AlertManager mesh can all have different configs?

Yes, don't do that.
 
If so, how does the aggregation of alerts work? Say the group_wait, etc. may have different values in each AlertManager participating in the mesh.

You may get duplicate alerts.
 
 
 
 
 
I would appreciate it if someone could highlight the advantages of using a mesh of AlertManagers.

If one AM goes down or there's a network partition, notifications will still happen.

Well, IIUC, the AlertManagers need not necessarily be in a mesh from that perspective. Prometheus is configured to point to the URLs of all the AlertManagers, so if one AlertManager goes down, Prometheus can still send alerts to the others. How does it really matter whether those AlertManagers are part of a mesh or not? A notification would be sent out as long as at least one AlertManager is alive. So that brings us back to: what advantage does a *mesh* offer over individual AlertManagers?

In normal operations, you only get one notification no matter how many AMs are in the mesh. With n separate AMs, you'd get n notifications.

That makes a lot of sense. 

Btw, is there a primary node in the AlertManager mesh? If so, what happens when that primary goes down?

Kinda. That's handled gracefully; there'll be a new AM that gets first shot at sending notifications.

Brian
 

Primary in Mesh: ./alertmanager ... -mesh.listen-address=:8001 
Participant in Mesh: ./alertmanager ... -mesh.peer=127.0.0.1:8001 

Thanks,
Rahul.

 

Brian 

Thanks,
Rahul.

 


Rahul Srivastava (र।हुल श्रीवास्तव)

unread,
May 8, 2017, 7:15:22 AM5/8/17
to Brian Brazil, Prometheus Developers
On Mon, May 8, 2017 at 4:27 PM, Brian Brazil <brian....@robustperception.io> wrote:


On 8 May 2017 at 11:46, Rahul Srivastava (र।हुल श्रीवास्तव) <norm...@gmail.com> wrote:
On Mon, May 8, 2017 at 3:36 PM, Brian Brazil <brian.brazil@robustperception.io> wrote:
On 8 May 2017 at 11:03, Rahul Srivastava <norm...@gmail.com> wrote:
On Monday, 8 May 2017 15:06:43 UTC+5:30, Brian Brazil wrote:
On 8 May 2017 at 10:25, Rahul Srivastava <norm...@gmail.com> wrote:
Hi,

It seems AlertManager supports HA by creating a mesh of AlertManagers [1]. The Prometheus server can then point to each of these instances in the AlertManager mesh by specifying their URLs in alertmanager.url, as follows:
[[
./prometheus -config.file=prometheus.yml -alertmanager.url http://localhost:9095,http://localhost:9094,http://localhost:9093
]]

However, when the config file for one of the AlertManagers in the mesh is changed, it seems the change is not replicated to the other instances in the mesh?

That's correct. You need to update all of them.

So it seems they are just a set of individual AlertManagers running.

Largely.

So it seems the participants in an AlertManager mesh can all have different configs?

Yes, don't do that.

Not intentionally, but say the update succeeded in one AM but failed in another, leaving the AMs in an inconsistent state (for lack of a better word).
It would have been great if the configs were synced among all participants in the mesh automatically :-) -- create a receiver in one and it is available in all the AMs!
 
 
If so, how does the aggregation of alerts work? Say the group_wait, etc. may have different values in each AlertManager participating in the mesh.

You may get duplicate alerts.

Sure.
 
 
 
 
 
 
I would appreciate it if someone could highlight the advantages of using a mesh of AlertManagers.

If one AM goes down or there's a network partition, notifications will still happen.

Well, IIUC, the AlertManagers need not necessarily be in a mesh from that perspective. Prometheus is configured to point to the URLs of all the AlertManagers, so if one AlertManager goes down, Prometheus can still send alerts to the others. How does it really matter whether those AlertManagers are part of a mesh or not? A notification would be sent out as long as at least one AlertManager is alive. So that brings us back to: what advantage does a *mesh* offer over individual AlertManagers?

In normal operations, you only get one notification no matter how many AMs are in the mesh. With n separate AMs, you'd get n notifications.

That makes a lot of sense. 

Btw, is there a primary node in the AlertManager mesh? If so, what happens when that primary goes down?

Kinda. That's handled gracefully; there'll be a new AM that gets first shot at sending notifications.

Interesting -- so the mesh is not broken when the (so-called) primary goes down. If that's the case, then the primary isn't really a primary (I guess that's what you meant by "kinda" :-))

Sorry, but I didn't understand this completely -- when the primary goes down (which was listening on 8001, say), port 8001 no longer exists. So how come the other participants are still gossiping on 8001 in the mesh?

Thanks,
Rahul.

 





Julius Volz

unread,
May 8, 2017, 7:20:48 AM5/8/17
to Rahul Srivastava (र।हुल श्रीवास्तव), Brian Brazil, Prometheus Developers
On Mon, May 8, 2017 at 1:15 PM, Rahul Srivastava (र।हुल श्रीवास्तव) <norm...@gmail.com> wrote:

On Mon, May 8, 2017 at 4:27 PM, Brian Brazil <brian.brazil@robustperception.io> wrote:


On 8 May 2017 at 11:46, Rahul Srivastava (र।हुल श्रीवास्तव) <norm...@gmail.com> wrote:
Btw, is there a primary node in the AlertManager mesh? If so, what happens when that primary goes down?

Kinda. That's handled gracefully; there'll be a new AM that gets first shot at sending notifications.

Interesting -- so the mesh is not broken when the (so-called) primary goes down. If that's the case, then the primary isn't really a primary (I guess that's what you meant by "kinda" :-))

Sorry, but I didn't understand this completely -- when the primary goes down (which was listening on 8001, say), port 8001 no longer exists. So how come the other participants are still gossiping on 8001 in the mesh?

Since it's a mesh, each AM is talking to all other AMs. They construct an order among themselves, but there is no traditional primary/secondary relationship, just a sorted list of AMs with notification delays between them. That's all nicely explained in https://www.youtube.com/watch?v=XvqaYbiTOMg 
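To make that concrete with made-up numbers (the delay values below are purely illustrative, not documented defaults): if the sorted order is A, B, C, the timeline for a single notification might look roughly like this:

t=0s    A sends the notification and gossips that it has done so
t~15s   B's delay expires; it sees A's notification in the gossiped state and stays silent
t~30s   C likewise stays silent
If A is down, nothing has been gossiped by the time B's delay expires, so B sends instead.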

Rahul Srivastava (र।हुल श्रीवास्तव)

unread,
May 8, 2017, 7:32:18 AM5/8/17
to Julius Volz, Brian Brazil, Prometheus Developers
Right. Is the creation of the mesh limited to command-line options, or can it be specified in the configuration file?

Thanks,
Rahul.


Julius Volz

unread,
May 8, 2017, 7:37:15 AM5/8/17
to Rahul Srivastava (र।हुल श्रीवास्तव), Brian Brazil, Prometheus Developers
That's limited to command-line options. 
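To sketch the split (hostnames here are hypothetical): the mesh topology lives entirely in flags, while alertmanager.yaml only carries routing, receivers, inhibition rules, and so on.

./alertmanager -config.file=alertmanager.yaml -mesh.listen-address=:8001 -mesh.peer=am2.example.com:8001 -mesh.peer=am3.example.com:8001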

Rahul Srivastava (र।हुल श्रीवास्तव)

unread,
May 8, 2017, 8:05:55 AM5/8/17
to Julius Volz, Brian Brazil, Prometheus Developers
Thanks!!

Rahul Srivastava

unread,
May 9, 2017, 2:14:50 AM5/9/17
to Prometheus Developers
Hi Again,

IIUC, there is no primary/secondary when an AlertManager mesh is created; rather, it's a fully connected mesh. Further, as per my understanding, once a new participant joins the mesh on startup, it gets connected to all the participants (or nodes) in the mesh; thereafter, it really doesn't matter if any one of the participants goes down, including the first one.

Now, when all the nodes in the mesh are up, a silence created in any one node is visible to all the other nodes in the mesh (almost immediately). However, I don't see the same behaviour in the following scenario...

The Scenario:
Using AlertManager 0.6.1, I create a mesh of 3 nodes (A, B, C) on localhost. "A" comes up first and listens on, say, 8001. B and C come up later (in that order) and connect to the mesh on 8001, which node "A" is listening on. At this point, a silence created in any of the nodes is visible to all. All is well so far.

But when node "A" goes down, nodes B and C stop syncing -- i.e. silences created in node B or C are not synced and therefore not visible in the other node. Bring up node "A" again, and the silences are synced almost immediately.

Question: If the mesh is a fully connected network of nodes, why does the sync stop when the node to which all the other nodes connect on startup goes down?

Additional Details...
$ /home/rahul/prometheus/alertmanager-0.6.1.linux-amd64/alertmanager -log.level=debug -web.listen-address=:9093 -mesh.peer-id=00:00:00:00:00:01 -mesh.nickname=A -mesh.listen-address=:8001 -config.file=examples/ha/alertmanager.yaml

$ /home/rahul/prometheus/alertmanager-0.6.1.linux-amd64/alertmanager -log.level=debug -web.listen-address=:9094 -mesh.peer-id=00:00:00:00:00:02 -mesh.nickname=B -mesh.listen-address=:8002 -mesh.peer=127.0.0.1:8001 -config.file=examples/ha/alertmanager.yaml

$ /home/rahul/prometheus/alertmanager-0.6.1.linux-amd64/alertmanager -log.level=debug -web.listen-address=:9095 -mesh.peer-id=00:00:00:00:00:03 -mesh.nickname=C -mesh.listen-address=:8003 -mesh.peer=127.0.0.1:8001 -config.file=examples/ha/alertmanager.yaml

Thanks,
Rahul.

Julius Volz

unread,
May 9, 2017, 6:42:35 AM5/9/17
to Rahul Srivastava, Prometheus Developers
See the help text description (./alertmanager -h) of the -mesh.peer flag: "initial peers (may be repeated)"

You will need to list that flag three times, once for each of the Alertmanagers. Otherwise indeed everything is only connected to the first one.
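For example, adapting the three commands from the earlier message so that every node lists the others as initial peers (a sketch reusing the same flags; only the -mesh.peer flags change):

$ ./alertmanager -log.level=debug -web.listen-address=:9093 -mesh.peer-id=00:00:00:00:00:01 -mesh.nickname=A -mesh.listen-address=:8001 -mesh.peer=127.0.0.1:8002 -mesh.peer=127.0.0.1:8003 -config.file=examples/ha/alertmanager.yaml

$ ./alertmanager -log.level=debug -web.listen-address=:9094 -mesh.peer-id=00:00:00:00:00:02 -mesh.nickname=B -mesh.listen-address=:8002 -mesh.peer=127.0.0.1:8001 -mesh.peer=127.0.0.1:8003 -config.file=examples/ha/alertmanager.yaml

$ ./alertmanager -log.level=debug -web.listen-address=:9095 -mesh.peer-id=00:00:00:00:00:03 -mesh.nickname=C -mesh.listen-address=:8003 -mesh.peer=127.0.0.1:8001 -mesh.peer=127.0.0.1:8002 -config.file=examples/ha/alertmanager.yaml

That way B and C have a direct connection even when A is down.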


Rahul Srivastava (र।हुल श्रीवास्तव)

unread,
May 9, 2017, 8:00:11 AM5/9/17
to Julius Volz, Prometheus Developers
On Tue, May 9, 2017 at 4:12 PM, Julius Volz <juliu...@gmail.com> wrote:
See the help text description (./alertmanager -h) of the -mesh.peer flag: "initial peers (may be repeated)"

You will need to list that flag three times, once for each of the Alertmanagers. Otherwise indeed everything is only connected to the first one.

So far I was under the impression that when a new participant joins an existing mesh, it gets connected to every single member of the mesh automatically. But it seems that is not the case. If every alertmanager has to explicitly specify *all* the alertmanagers it should connect to in the mesh, then any new participant joining the mesh must somehow figure out all the running alertmanagers and the mesh.listen-address of each of them, so that it can specify multiple mesh.peer flags on startup. This may get tricky when alertmanagers in a mesh are scaled dynamically based on load.

Thanks,
Rahul.

 


Julius Volz

unread,
May 9, 2017, 8:35:29 AM5/9/17
to Rahul Srivastava (र।हुल श्रीवास्तव), Prometheus Developers
On Tue, May 9, 2017 at 2:00 PM, Rahul Srivastava (र।हुल श्रीवास्तव) <norm...@gmail.com> wrote:
On Tue, May 9, 2017 at 4:12 PM, Julius Volz <juliu...@gmail.com> wrote:
See the help text description (./alertmanager -h) of the -mesh.peer flag: "initial peers (may be repeated)"

You will need to list that flag three times, once for each of the Alertmanagers. Otherwise indeed everything is only connected to the first one.

So far I was under the impression that when a new participant joins an existing mesh, it gets connected to every single member of the mesh automatically. But it seems that is not the case. If every alertmanager has to explicitly specify *all* the alertmanagers it should connect to in the mesh, then any new participant joining the mesh must somehow figure out all the running alertmanagers and the mesh.listen-address of each of them, so that it can specify multiple mesh.peer flags on startup. This may get tricky when alertmanagers in a mesh are scaled dynamically based on load.

Actually it seems I was misinformed on this. I had assumed that the Mesh library we're using only ensures that gossip messages make it from one node to all other nodes as long as there is a statically configured path between the nodes (not even necessarily a full mesh). I thought that would be ok, as AM would normally be operated as a relatively static cluster.

However, I've just learned from Peter (author of the Weave Mesh lib) and Fabian that memberships should also be gossiped, as you expected. So the question is why this is not working here.

Fabian Reinartz

unread,
May 9, 2017, 9:43:01 AM5/9/17
to Julius Volz, Rahul Srivastava (र।हुल श्रीवास्तव), Prometheus Developers
After some debugging and circling back with folks from Weave, the problem simply seems to be running all instances on the same host/IP.
The general assumption is that all peers use the same mesh port, or that all possible ports are part of the initial peer list. Three instances running on the same host on 3 different ports, where just one of them is provided in the initial peer list, doesn't fulfil that.
Hence B and C never initiate a connection between themselves and depend on A.

Summed up, this problem should never really occur in a valid setup across multiple nodes. I'm still not 100% sure how this technical limitation comes to be, but I don't think it affects any production deployments.
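For reference, a sketch of the kind of multi-node setup that assumption has in mind (hostnames are hypothetical): one instance per host, every instance listening on the same mesh port, with at least one common seed peer:

host am1: ./alertmanager -config.file=alertmanager.yaml -mesh.listen-address=:8001
host am2: ./alertmanager -config.file=alertmanager.yaml -mesh.listen-address=:8001 -mesh.peer=am1.example.com:8001
host am3: ./alertmanager -config.file=alertmanager.yaml -mesh.listen-address=:8001 -mesh.peer=am1.example.com:8001

Because every peer listens on the same port, membership gossip should let am2 and am3 discover each other even though neither lists the other explicitly; listing all peers on every node remains the more robust choice.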


Silas Snider

unread,
May 9, 2017, 9:47:29 AM5/9/17
to Fabian Reinartz, Julius Volz, Rahul Srivastava (र।हुल श्रीवास्तव), Prometheus Developers
Well, that ruins some plans I had for running Alertmanager on top of a cluster scheduler -- I get assigned a random port on each instance, so it seems like this would end up failing for my use case as well. Is there an upstream issue # to track this?

Fabian Reinartz

unread,
May 9, 2017, 9:50:17 AM5/9/17
to Silas Snider, Julius Volz, Rahul Srivastava (र।हुल श्रीवास्तव), Prometheus Developers
I haven't created an issue yet. Would you like to create one? (https://github.com/weaveworks/mesh) – always better if done by someone with a use case at hand :)

Silas Snider

unread,
May 9, 2017, 10:10:08 AM5/9/17
to Fabian Reinartz, Julius Volz, Rahul Srivastava (र।हुल श्रीवास्तव), Prometheus Developers

Rahul Srivastava (र।हुल श्रीवास्तव)

unread,
May 9, 2017, 10:28:55 AM5/9/17
to Fabian Reinartz, Julius Volz, Prometheus Developers
On Tue, May 9, 2017 at 7:12 PM, Fabian Reinartz <fab.re...@gmail.com> wrote:
After some debugging and circling back with folks from Weave, the problem simply seems to be running all instances on the same host/IP.
The general assumption is that all peers use the same mesh port

Well, in an on-prem kind of environment, one has full control over which hosts the instances are provisioned on, as the steps are manual. However, in the cloud this can get tricky because of the lack of control over choosing specific instances, ports, etc. In the cloud, say, when one requests one more instance of alertmanager to handle the load, one really doesn't know whether the new instance will be provisioned on the same host or on a different one. That decision depends on the capacity of the host and maybe some other factors, and may also vary from one cloud provider to another. So the general assumption of all participants using the same mesh listen port might not always hold true, even in a prod environment.
 
or all possible ports are part of the initial peer list.

Ports may be assigned dynamically, and the number of instances might not be known upfront, as instances are continuously provisioned and torn down based on load. So specifying all possible ports in the initial peer list might not always be possible.

 
Three instances running on the same host on 3 different ports, where just one of them is provided in the initial peer list, doesn't fulfil that.
Hence B and C never initiate a connection between themselves and depend on A.

Summed up, this problem should never really occur in a valid setup across multiple nodes. I'm still not 100% sure how this technical limitation comes to be, but I don't think it affects any production deployments.

IMHO, this problem looks real to me and may occur in a real prod environment, with a higher probability when the setup is in the cloud.

Thanks,
Rahul.

 

On Tue, May 9, 2017 at 2:35 PM Julius Volz <juliu...@gmail.com> wrote:
On Tue, May 9, 2017 at 2:00 PM, Rahul Srivastava (र।हुल श्रीवास्तव) <norm...@gmail.com> wrote:
On Tue, May 9, 2017 at 4:12 PM, Julius Volz <juliu...@gmail.com> wrote:
See the help text description (./alertmanager -h) of the -mesh.peer flag: "initial peers (may be repeated)"

You will need to list that flag three times, once for each of the Alertmanagers. Otherwise indeed everything is only connected to the first one.

So far I was under the impression that when a new participant joins an existing mesh, it gets connected to every single member of the mesh automatically. But it seems that is not the case. If every alertmanager has to explicitly specify *all* the alertmanagers it should connect to in the mesh, then any new participant joining the mesh must somehow figure out all the running alertmanagers and the mesh.listen-address of each of them, so that it can specify multiple mesh.peer flags on startup. This may get tricky when alertmanagers in a mesh are scaled dynamically based on load.

Actually it seems I was misinformed on this. I had assumed that the Mesh library we're using only ensures that gossip messages make it from one node to all other nodes as long as there is a statically configured path between the nodes (not even necessarily a full mesh). I thought that would be ok, as AM would normally be operated as a relatively static cluster.

However, I've just learned from Peter (author of the Weave Mesh lib) and Fabian that memberships should also be gossiped, as you expected. So the question is why this is not working here.


Julius Volz

unread,
May 9, 2017, 10:31:54 AM5/9/17
to Rahul Srivastava (र।हुल श्रीवास्तव), Fabian Reinartz, Prometheus Developers
Agreed.


Fabian Reinartz

unread,
May 9, 2017, 10:38:21 AM5/9/17
to Julius Volz, Rahul Srivastava (र।हुल श्रीवास्तव), Prometheus Developers
Okay, issue is filed – thanks Silas. Let's hope this gets addressed.
Agreed.
