Thanks for the response.
Firstly, when I said 2 replicas I meant one master and two replicas, so 3 in total. The web UI shows "+2" next to the queue name.
1. Fargate nodes do not disappear often; I have been running RabbitMQ on Fargate since October of last year without losing one. But it can happen, and I'd rather the cluster self-healed.
2. Restarted Fargate nodes get a new hostname, and therefore a new FQDN, which becomes the RabbitMQ node name. I guess I could use EFS (Amazon's NFS) to mount the same volume back onto a restarted node, but that means running 3 Fargate services, so as to set the 3 RABBITMQ_NODENAME values in the environment, which can only differ between services. This might turn out to be the best solution.
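For concreteness, each of the three services would pin its own node name via the environment, something like this (hostnames are hypothetical, and this is a sketch of the idea rather than a tested task definition):

```shell
# Service 1 of 3 -- the other two services differ only in the hostname.
# RABBITMQ_USE_LONGNAME tells RabbitMQ to accept an FQDN as the node name.
export RABBITMQ_NODENAME=rabbit@rabbitmq-1.internal.example
export RABBITMQ_USE_LONGNAME=true
```

Because each service keeps a stable name and remounts the same EFS volume, a replacement task rejoins the cluster as the same node instead of appearing as a new one.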
3. Yes, the post_start.sh would need to: find the old dead node (by piping rabbitmqctl cluster_status through awk and a couple of greps, and diffing the known-node list against the running-node list), run rabbitmqctl forget_cluster_node for the dead node, run rabbitmq-queues grow ... all on the new node, and then rebalance.
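A rough sketch of that script, assuming RabbitMQ 3.8+ (for the JSON formatter and quorum-queue commands) and jq in the image; the JSON key names and the overall shape are assumptions, not something I've run against a live cluster:

```shell
#!/bin/sh
# post_start.sh sketch: forget nodes that are known to the cluster but no
# longer running, then grow quorum queues onto this (new) node.

# Print nodes present in $1 (all known nodes) but absent from $2 (running
# nodes), one per line. Both arguments are whitespace-separated lists.
dead_nodes() {
    all=$1
    running=$2
    for n in $all; do
        case " $running " in
            *" $n "*) ;;           # still running -> keep
            *) echo "$n" ;;        # known but not running -> dead
        esac
    done
}

# --- everything below assumes rabbitmqctl/rabbitmq-queues are on PATH ---
# all=$(rabbitmqctl cluster_status --formatter json | jq -r '.disk_nodes[]')
# running=$(rabbitmqctl cluster_status --formatter json | jq -r '.running_nodes[]')
# for dead in $(dead_nodes "$all" "$running"); do
#     rabbitmqctl forget_cluster_node "$dead"
# done
# rabbitmq-queues grow "rabbit@$(hostname -f)" all
# rabbitmq-queues rebalance quorum
```

Keeping the list-diffing in a small pure function makes the dangerous part (deciding what to forget) easy to test in isolation, which matters given the concern in 4. about unexpected output.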
4. There's no equivalent exec command for Fargate (that I've found yet). I could install an SSH daemon in the container and shell in, although I'd rather not expose more ports than necessary. But I was really asking about automated commands that run after the RabbitMQ server is up, so that growing the queues and cleaning up dead nodes happens as part of the startup sequence, clearly after the node is up. I don't think I can add a post_start.sh without modifying your Docker image significantly and then needing to maintain my mutant version, which I'd rather avoid.
I have been considering abusing the Fargate (Docker) healthcheck to do this. Growing queues onto a node which already has them doesn't seem to harm the cluster, but cleaning up dead nodes is trickier: we need to be sure we don't accidentally forget live nodes when the part of the script working out which nodes to forget hits unexpected output.
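The safe shape of that healthcheck abuse might be to keep only the idempotent step in the automated path, something like the following (a guess, not tested; forget_cluster_node is deliberately left out):

```shell
#!/bin/sh
# healthcheck.sh -- hypothetical healthcheck wrapper for the task definition.
# Only the real liveness probe can fail the check, so a hiccup in the
# grow step can never mark a healthy task unhealthy.
rabbitmq-diagnostics -q ping || exit 1

# Idempotent: growing quorum queues onto a node that already holds them
# is a no-op, so repeating it on every healthcheck is harmless.
rabbitmq-queues grow "rabbit@$(hostname -f)" all >/dev/null 2>&1 || true

exit 0
```

Forgetting dead nodes would then stay a manual (or separately guarded) operation, which limits the blast radius of a parsing mistake.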
I have run Elasticsearch in a similar configuration to the suggestion in 2. (3 services running 1 node each, with EFS for permanent storage). I think this is likely to prove the least kludgey.
Thank you very much for your advice.
--
M