RabbitMQ cluster on AWS ECS Fargate proof of concept, review requested


Rasmus Larsson

May 7, 2020, 4:03:15 AM
to rabbitmq-users
Hi,

I was encouraged to write here to persist this knowledge for future generations (ty Luke for the help on slack). :)

We are currently doing a proof of concept (PoC) setup where we use AWS ECS and Fargate for a cluster. As of Fargate platform version 1.4.0 it supports using AWS EFS as persistent storage. As far as we can tell EFS is accessed from a Fargate task using NFS. So far we've done everything manually in the AWS Console but will move over to CloudFormation soon.

Our PoC setup is at time of writing as follows:

- 2 RMQ nodes (it's a PoC, we will expand to 3 in the final configuration)
- 2 ECS services named "rabbitmq-node1" and "rabbitmq-node2"
- 2 ECS tasks named "rabbitmq-node1" and "rabbitmq-node2"
- 2 EFS volumes named as above
- AWS Cloud Map for service discovery

Basic concepts:

- a service launches N instances of a task
- a task mounts zero or more EFS volumes
- a task consists of one or more containers
- a service maps to one logical domain name in Cloud Map/Service discovery

Constraints:

- nodename should be "known", i.e. must persist over restarts since the RMQ datastore files depend on nodename
- a task cannot expose several containers that listen on the same port
- Fargate requires using the `awsvpc` network mode, which basically means instances get dynamic IPs and hostnames on each restart and there's not much you can do about it, except AWS service discovery

Which leads to:

- the same task can't launch multiple instances and at the same time have known nodenames
- we cannot have two RMQ containers in one task
- we need one service and task per node in the cluster

Environment variables:

| Name                  | Value                                   |
| --------------------- | --------------------------------------- |
| RABBITMQ_NODENAME     | rabbit@rabbitmq-node{N}.example.private |
| RABBITMQ_USE_LONGNAME | true                                    |

Configuration:

| Config                                   | Value                                 |
| ---------------------------------------- | ------------------------------------- |
| cluster_formation.peer_discovery_backend | rabbit_peer_discovery_classic_config  |
| cluster_formation.classic_config.nodes.1 | rab...@rabbitmq-node1.example.private |
| cluster_formation.classic_config.nodes.2 | rab...@rabbitmq-node2.example.private |
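
For readers who want to generate these settings, here is a minimal sketch that prints them as `key = value` lines in the `rabbitmq.conf` format. The helper name `classic_config` and its arguments are hypothetical; the node naming follows the `rabbit@rabbitmq-node{N}.example.private` pattern from the environment table above.

```shell
# Hypothetical helper (not part of RabbitMQ): print a classic-config
# peer discovery fragment for nodes rabbit@rabbitmq-node1..N.
classic_config() {
    count="$1"    # number of nodes in the cluster
    domain="$2"   # private DNS domain, e.g. example.private
    echo "cluster_formation.peer_discovery_backend = rabbit_peer_discovery_classic_config"
    i=1
    while [ "$i" -le "$count" ]; do
        echo "cluster_formation.classic_config.nodes.$i = rabbit@rabbitmq-node$i.$domain"
        i=$((i + 1))
    done
}
```

Calling `classic_config 2 example.private` prints the three settings in the table, one per line.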

The resolution of hostnames has been one of the key issues for us. Each time a task instance is spun up it is dynamically allocated an IP address and hostname. Thankfully service discovery does allow us to assign a DNS A record to the instance once it has spun up. But there is a race condition between service discovery and the node trying to resolve its own name:

`ERROR: epmd error for host rabbitmq-node2.example.private: nxdomain (non-existing domain)`

In [this writeup](https://medium.com/thron-tech/clustering-rabbitmq-on-ecs-using-ec2-autoscaling-groups-107426a87b98) Andrea/THRON overcame the problem by adding an entry in `/etc/hosts`. We were unfortunately unable to do so: in our experiments we tried doing this as an `echo "rabbitmq-node1... localhost" >> /etc/hosts` in the docker CMD but were denied with an error stating it's a read-only filesystem. Doing this as part of the docker ENTRYPOINT failed with the same error. So we worked around it by doing a `sleep 60s` inside our container before spinning up rabbitmq. 60s coincides with the refresh rate we've set for the DNS A record in Cloud Map/Service Discovery.
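
One possible alternative to a fixed sleep, sketched here under assumptions rather than taken from the setup described above, is to poll until the node's own hostname resolves. The function name `wait_for_dns` and the `RESOLVE_CMD` override are hypothetical, and the sketch assumes `getent` is available in the container image.

```shell
# Poll until HOST resolves, giving up after MAX_TRIES attempts.
# RESOLVE_CMD is a hypothetical override hook; it defaults to `getent hosts`.
wait_for_dns() {
    host="$1"
    max_tries="${2:-30}"   # number of attempts before giving up
    delay="${3:-2}"        # seconds to wait between attempts
    tries=0
    while [ "$tries" -lt "$max_tries" ]; do
        # Succeeds as soon as the resolver returns an entry for the host.
        if ${RESOLVE_CMD:-getent hosts} "$host" >/dev/null 2>&1; then
            return 0
        fi
        tries=$((tries + 1))
        sleep "$delay"
    done
    return 1
}
```

An init script could call `wait_for_dns "${RABBITMQ_NODENAME#*@}"` before starting the server, so startup proceeds as soon as the A record is visible instead of always waiting the full 60 seconds.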

Current issues:

- the 60s sleep feels very hacky and is bad for failover, it would be nice to get THRON's solution working
- restarts from EFS are extremely slow, over 5 minutes, we have no idea why, but I'm starting to suspect Mnesia, example logs here: https://gist.github.com/stoft/e2bf69793563f36af5500db6234317ad
- we don't yet provide a single DNS name for our clients; we haven't looked into it yet

Other thoughts and comments:

- we haven't tried the AWS specific peering even though we are loading the plugin, we're unsure whether it would work on Fargate instead of EC2 and since classic peering works we may just stick with that

I will most likely be converting this into a blog post on dev.to but I'm guessing there will be a few more iterations on it first.

Any suggestions for improvement or solutions to our problems are warmly welcome.

Best regards,
Rasmus

Luke Bakken

May 7, 2020, 9:17:53 PM
to rabbitmq-users
Hi Rasmus,

I suspect you can fix DNS resolution for Erlang by using the inetrc feature: https://erlang.org/doc/apps/erts/inet_cfg.html

You can set the ERL_INETRC environment variable to point to an inetrc file, and add host entries to that file.

Let us know if you have questions about that!
Luke

Rasmus Larsson

May 12, 2020, 5:22:06 AM
to rabbitmq-users
Hi Luke,

Thank you for the tip. We've now solved it with this little config which we stick in a file that ERL_INETRC points to:

```
{file, hosts, "/tmp/hosts"}.
{lookup, [native, file]}.
```

and `echo "127.0.0.1    ${RABBITMQ_NODENAME#*@}" > /tmp/hosts` on startup in a custom init.sh that then launches rabbit.
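
Put together, the startup step might look roughly like the sketch below. The function name `write_inet_config`, the directory argument, and the file name `erl_inetrc` are hypothetical; the file contents follow the config and `echo` shown above.

```shell
# Write the hosts file and the Erlang inetrc file into DIR (default /tmp).
write_inet_config() {
    nodename="$1"           # e.g. rabbit@rabbitmq-node1.example.private
    dir="${2:-/tmp}"
    host="${nodename#*@}"   # strip everything up to and including "@"
    # Map the node's own hostname to the loopback address.
    printf '127.0.0.1    %s\n' "$host" > "$dir/hosts"
    # Tell Erlang to try the native resolver first, then the hosts file.
    printf '{file, hosts, "%s/hosts"}.\n{lookup, [native, file]}.\n' "$dir" > "$dir/erl_inetrc"
}

# Assumed usage inside init.sh (the launch command depends on the image):
#   write_inet_config "$RABBITMQ_NODENAME"
#   export ERL_INETRC=/tmp/erl_inetrc
#   exec rabbitmq-server
```

Because the hosts file is written under `/tmp` rather than `/etc`, this avoids the read-only filesystem error mentioned earlier in the thread.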

/Rasmus

Rasmus Larsson

May 12, 2020, 5:38:21 AM
to rabbitmq-users
We've now also solved the extremely slow startups, more details in this thread: https://groups.google.com/forum/#!topic/rabbitmq-users/ojq9qwmwLCo

Rasmus Larsson

May 26, 2020, 5:17:55 AM
to rabbitmq-users
Short update, we now have this configured using CloudFormation templates. AWS does not yet support mounting EFS on ECS/Fargate in CloudFormation so that is a manual step we'll have to do at deploy time.

We are still using classic peer discovery. I did some experiments with DNS peer discovery but it fails since RabbitMQ wants to do a reverse lookup from IP to node and Service Discovery does not add the necessary DNS PTR records to Route53, which means that the node names resolve to ipx-x-x-x.compute.internal instead of the service discovery name we're using. I'm guessing this could be worked around using a custom init script that does a `dig` command or similar, parses the result and writes custom records to /etc/hosts or erl_inetrc, but in our case it's not worth the effort so I tabled it for now.
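
An untested sketch of what the parsing half of such an init script could look like: it turns `dig +short` output into hosts-file lines. The helper name `hosts_lines` is hypothetical, and whether Erlang's `file` lookup then gives RabbitMQ the reverse mapping it wants is an assumption that would need verifying.

```shell
# Read `dig +short NAME A` output (one answer per line) on stdin and
# print one hosts-file line per IPv4 address, mapping it to NAME.
# Lines that are not IPv4 addresses (e.g. CNAME targets) are skipped.
hosts_lines() {
    name="$1"
    grep -E '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$' | while read -r ip; do
        printf '%s    %s\n' "$ip" "$name"
    done
}

# Assumed usage in an init script, appending to a writable hosts file:
#   dig +short rabbitmq-node2.example.private A \
#       | hosts_lines rabbitmq-node2.example.private >> /tmp/hosts
```

Since task IPs change on every restart, the records would also need refreshing whenever a peer is replaced, which is part of why this is more effort than classic peer discovery.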

Michael Klishin

May 26, 2020, 11:50:44 AM
to rabbitm...@googlegroups.com

Not objecting to your idea to use DNS but would it be easier to use Consul or etcd for discovery?

Or would such an additional dependency not be worth the effort?

 


Rasmus Larsson

May 27, 2020, 5:03:10 AM
to rabbitmq-users
Since classic is good enough for us we haven't looked into it. Our workload is quite small: today we're running on a single instance without problems, and we mostly want a cluster for availability and seamless upgrades. I like the idea of having a dynamic cluster instead of a cluster with a fixed number of nodes, which is why I peeked a bit at DNS peering, but it's not super important for us.

Avoiding an extra dependency is nice so it's probably not something we would spend time on. 

Rasmus Larsson

Oct 14, 2020, 2:36:51 AM
to rabbitmq-users
A short update. CloudFormation now supports EFS, we've updated our templates accordingly.

Sander Wilrycx

Oct 21, 2022, 3:02:05 AM
to rabbitmq-users
Hi Rasmus,
We're currently also trying to run RabbitMQ on Fargate. We seem to hit a block when trying to configure amqps over our network load balancer (management UI, webstomp and prometheus are all configured on an application load balancer and work just fine). When trying to connect to amqps we get the message `stream_socket_client() unable to connect to ssl://dns-before-nl.com:5671`. We tried TLS offloading on the NLB and got successful TLS handshakes using openssl, yet when trying to connect we keep getting the same error. Could you be so kind as to share your insights or CloudFormation templates for your Fargate setup?


David Whynot

Jul 29, 2023, 5:09:55 PM
to rabbitmq-users
I'm also interested in running RabbitMQ on Fargate and wondered if you were able to share more details about your cloudformation templates. I'm specifically wondering how things are configured to work when deploying new task definitions. How is data persisted in a blue green deployment or are you using some kind of rolling update for task def changes?