Extremely short netsplit on three node setup

Wenceslas des Deserts

Apr 27, 2021, 10:20:57 AM
to rabbitmq-users
Hi everybody,

I am seeing strange behavior on a RabbitMQ cluster and couldn't find an explanation, even after a fairly thorough investigation.

This is a 3-node cluster running on GCP, with all nodes in the same region, one per zone.
Each node has 8 vCPUs, 50 GB of RAM, and a 200 GB network-attached SSD.
The cluster handles around 1k messages per second (published + delivered). Queues are not currently replicated; enabling replication apparently made the issue worse.
RabbitMQ 3.8.7, Erlang 23.0.3.

Every few days, one node will think another one is down because of a timeout/net_ticktime error. The netsplit always resolves immediately, which makes me think this might not be a real network issue. It is annoying though because it triggers autoheal, and a lot of connections are terminated/recreated in the process.

Nothing unusual shows up in the monitoring tools I have (the RabbitMQ management interface, Datadog, GCP's monitoring tools): no load spike, no networking issues for other components on the VMs (Consul agent and Datadog agent), no dropped packets. CPU and RAM usage are well within acceptable ranges. Connection operations are < 1/s, and queue operations are < 10/s (lots of background workers create temporary queues when they spin up).

The Datadog agent shows that the management API always becomes unresponsive a few seconds before RabbitMQ itself declares the other node as down.

The only unusual thing I observed in the logs is somewhat frequent (at least every ~30 minutes) busy_dist_port warnings like this one:

2021-04-27 10:30:05.412 [warning] <0.278.0> rabbit_sysmon_handler busy_dist_port <0.1166.0> [{name,delegate_management_0},{initial_call,{delegate,init,1}},{gen_server2,process_next_msg,1},{message_queue_len,0}] <0.6829.5895>

The erlang_vm_dist_node_queue_size_bytes metric never shows high values (in fact, it is almost always zero).

I already tried increasing the distribution buffer size (up to 1 GB) and increasing the net_ticktime (up to 180 s). I also tried adding tracing in the cluster to understand what fills the buffer and whether it could be related to the netsplit issue, but I didn't really find a good place to observe that.
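
For reference, this is how those two settings are currently applied (the full config is at the bottom of this message); if I understand correctly, the distribution buffer size maps to the Erlang +zdbbl flag and is expressed in kilobytes, so 1024000 should be roughly 1 GB:

# rabbitmq.conf
net_ticktime = 180

# rabbitmq-env.conf
RABBITMQ_DISTRIBUTION_BUFFER_SIZE=1024000

To double-check that the buffer limit is actually picked up at runtime, I believe something like this should show it (in bytes):

rabbitmqctl eval 'erlang:system_info(dist_buf_busy_limit).'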

Could there be something, somewhere, freezing or delaying the process responsible for net_ticktime monitoring, which would explain what I see?
I'm really eager to understand what's going on here, and hopefully get the problem solved, so any ideas you might have are welcome! :)
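
If it helps, I can run probes on the live nodes around the time of an incident; for example (assuming net_kernel is actually the relevant process here, which I'm not sure about), something like this to see whether it has a backlog or is stuck:

rabbitmqctl eval 'erlang:process_info(erlang:whereis(net_kernel), [message_queue_len, current_function, status]).'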

I included relevant logs and current configuration below.

Thank you very much!
 
Node2 logs

2021-04-13 02:17:36.550 [info] <0.17505.6147> rabbit on node '<Node1>' down
2021-04-13 02:17:36.550 [error] <0.27290.6398> ** Node '<Node1>' not responding **
** Removing (timedout) connection **
2021-04-13 02:17:36.581 [info] <0.17505.6147> Node <Node1> is down, deleting its listeners
2021-04-13 02:17:36.593 [warning] <0.13334.6151> Management delegate query returned errors:
[{<21830.1323.0>,{exit,{nodedown,'<Node1>'},[]}}]
2021-04-13 02:17:36.636 [error] <0.14815.5137> Error in process <0.14815.5137> on node '<Node2>' with exit value:
{{badmatch,{error,closed}},[{inet_tls_dist,accept_loop,4,[{file,"inet_tls_dist.erl"},{line,254}]}]}
2021-04-13 02:17:36.636 [error] emulator Error in process <0.14815.5137> on node '<Node2>' with exit value:
{{badmatch,{error,closed}},[{inet_tls_dist,accept_loop,4,[{file,"inet_tls_dist.erl"},{line,254}]}]}
2021-04-13 02:17:36.720 [error] <0.1908.6147> Mnesia('<Node2>'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, '<Node1>'}
2021-04-13 02:17:39.995 [info] <0.17505.6147> node '<Node1>' down: net_tick_timeout
2021-04-13 02:17:39.995 [info] <0.17505.6147> node '<Node1>' up
2021-04-13 02:17:40.007 [info] <0.17505.6147> Autoheal request sent to '<Node1>'
2021-04-13 02:17:40.008 [error] <0.17505.6147> Partial partition detected:
 * We saw DOWN from <Node1>
 * We can still see <Node3> which can see <Node1>
We will therefore intentionally disconnect from <Node3>
2021-04-13 02:17:41.009 [info] <0.17505.6147> rabbit on node '<Node3>' down
2021-04-13 02:17:41.016 [info] <0.17505.6147> Node <Node3> is down, deleting its listeners
2021-04-13 02:17:41.320 [info] <0.17505.6147> node '<Node3>' down: disconnect
2021-04-13 02:17:41.366 [info] <0.17505.6147> node '<Node3>' up
2021-04-13 02:17:41.367 [error] <0.1908.6147> Mnesia('<Node2>'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, '<Node3>'}
2021-04-13 02:17:41.372 [info] <0.17505.6147> Autoheal request sent to '<Node1>'
2021-04-13 02:17:41.543 [warning] <0.17505.6147> Autoheal: we were selected to restart; winner is '<Node3>'
[...]

Node1 logs

2021-04-13 02:17:36.576 [info] <0.547.0> rabbit on node '<Node2>' down
2021-04-13 02:17:36.589 [warning] <0.1722.0> Management delegate query returned errors:
[{<24932.5977.5834>,{exit,{nodedown,'<Node2>'},[]}}]
2021-04-13 02:17:36.593 [info] <0.547.0> Node <Node2> is down, deleting its listeners
2021-04-13 02:17:36.648 [warning] <0.3333.0> Management delegate query returned errors:
[{<24932.5977.5834>,{exit,{nodedown,'<Node2>'},[]}}]
2021-04-13 02:17:36.706 [error] <0.381.0> Mnesia('<Node1>'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, '<Node2>'}
2021-04-13 02:17:40.554 [info] <0.547.0> node '<Node2>' down: {shutdown,transport_closed}
2021-04-13 02:17:40.554 [info] <0.547.0> node '<Node2>' up
2021-04-13 02:17:40.680 [info] <0.547.0> Autoheal request sent to '<Node1>'
2021-04-13 02:17:40.680 [info] <0.547.0> Autoheal request received from '<Node2>'
2021-04-13 02:17:41.540 [info] <0.547.0> Autoheal request denied: Remote nodes disconnected:
 [{'<Node2>',['<Node3>']}]
2021-04-13 02:17:41.540 [info] <0.547.0> Autoheal request received from '<Node1>'
2021-04-13 02:17:41.545 [info] <0.547.0> Autoheal decision
  * Partitions: [['<Node3>','<Node1>'],['<Node2>']]
  * Winner:     '<Node3>'
  * Losers:     ['<Node2>']

[...]

RabbitMQ config

cluster_formation.consul.acl_token = <REDACTED>
cluster_formation.consul.host = 127.0.0.1
cluster_formation.consul.port = 8500
cluster_formation.consul.scheme = http
cluster_formation.consul.svc = rabbitmq
cluster_formation.consul.svc_addr = <REDACTED>
cluster_formation.consul.svc_addr_auto = false
cluster_formation.consul.svc_port = 5673
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_consul
cluster_partition_handling = autoheal
listeners.ssl.default = 5673
listeners.tcp = none
log.dir = /var/log/rabbitmq
log.file = rabbit.log
net_ticktime = 180
ssl_options.certfile = <REDACTED>
ssl_options.keyfile = <REDACTED>

RabbitMQ Env config

CTL_ERL_ARGS="-pa $ERL_SSL_PATH -proto_dist inet_tls -ssl_dist_opt server_certfile <REDACTED> server_keyfile <REDACTED> -ssl_dist_opt server_secure_renegotiate true client_secure_renegotiate true"
ERL_SSL_PATH="/usr/lib/erlang/lib/ssl-9.5.1/ebin"
RABBITMQ_DISTRIBUTION_BUFFER_SIZE=1024000
RABBITMQ_NODENAME="<Node2>"
RABBITMQ_USE_LONGNAME=true
SERVER_ADDITIONAL_ERL_ARGS="-pa $ERL_SSL_PATH -proto_dist inet_tls -ssl_dist_opt server_certfile <REDACTED> server_keyfile <REDACTED> -ssl_dist_opt server_secure_renegotiate true client_secure_renegotiate true"

Enabled plugins

rabbitmq_federation,rabbitmq_federation_management,rabbitmq_management,rabbitmq_peer_discovery_consul,rabbitmq_prometheus,rabbitmq_shovel,rabbitmq_shovel_management

(note: federation is not used anymore in this cluster)

Wenceslas des Deserts

May 5, 2021, 4:24:54 AM
to rabbitmq-users
Hi,

No ideas? :) I have absolutely no clue what could possibly be going on here...

Regards,
Wenceslas des Déserts

Michal Kuratczyk

May 5, 2021, 5:19:11 AM
to rabbitm...@googlegroups.com
Hi,

It seems like you are using the Management API for monitoring. I've never seen it cause an issue like the one described, but with a large number of objects (queues, connections, channels), returning metrics through the API is pretty expensive. That would match your observation of the Management API seemingly being down just before the issue. I don't know how often you call the API to get the metrics or how often the issue occurs, but perhaps you can try querying it less frequently and see whether the problem still occurs (or at least occurs less often)? Ultimately you should migrate to the Prometheus endpoint anyway, which is much faster and doesn't cause any inter-node traffic (the management API does, because a single call returns data from all of the nodes, so each node has to request the details from the others).
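
Since rabbitmq_prometheus is already in your enabled plugins, each node should expose its own metrics locally (by default on port 15692, assuming you haven't changed prometheus.tcp.port), so you can check them with something like:

curl -s http://localhost:15692/metrics | grep erlang_vm_dist

Datadog can also scrape that endpoint directly with its OpenMetrics/Prometheus integration.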

You can also use Observer and try to see what's at the top, especially when the problem occurs: https://www.rabbitmq.com/monitoring.html#diagnostics-observer
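
Assuming the CLI tools can reach the node (they should, given your CTL_ERL_ARGS setup), that would be something along the lines of:

rabbitmq-diagnostics observer
# or, against a specific node:
rabbitmq-diagnostics -n <node-name> observer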

Relevant video:


--
Michał

Wenceslas des Deserts

May 6, 2021, 4:15:22 PM
to rabbitmq-users
Hi,

Thanks for taking the time to answer! This is an interesting lead; I am going to try changing how often the Management API is called (or even switch to the Prometheus endpoint entirely). I didn't know about Observer, so I will make sure to test it! I will let you know if it helps.

Thanks again!

Joachim Schuler

Jun 10, 2024, 4:00:05 AM
to rabbitmq-users
Hello Wenceslas,

We have recently been seeing similar behavior in our RabbitMQ environment. Although our infrastructure is quite different (we are on on-premise virtual machines, not on GCP), I was wondering whether you ever found a root cause, or at least some configuration that makes your cluster more resilient against such extremely short hiccups.

Any hint would be greatly appreciated!

Best regards,
- Joachim

jo...@cloudamqp.com

Jun 11, 2024, 12:17:01 PM
to rabbitmq-users
Hi,
Note that this thread is from 2021, and Erlang 23.x had some issues with netsplits that are not seen on later versions. I suggest you start a new thread with your exact details and we might be able to help from there.

/Johan

Wenceslas des Deserts

Jun 11, 2024, 4:01:53 PM
to rabbitmq-users
Hi everyone,

Wow, it has been a long time. We investigated a lot, including putting breakpoints in our production RabbitMQ instance, but never managed to find the root cause. We ended up replacing RabbitMQ entirely as it was a nightmare for us.

Very interesting to learn that Erlang 23.x had some netsplit issues; if you have any references about this, I'd like to see them! :D

Sorry I cannot help more,
Wenceslas.

Joachim Schuler

Jun 13, 2024, 12:05:06 PM
to rabbitmq-users
Hello both, and thanks for your feedback!
In our case we are on Erlang 25.3.2.2, so at least we can rule that out.
The status today is that we have stabilized the situation after moving the underlying VM to a different hypervisor.
So if things stay calm from now on, we will probably leave it as is and consider the (virtual) infrastructure to be the root cause.
Thanks again & best regards,
- Joachim