Extremely short netsplit on three node setup

Wenceslas des Deserts

Apr 27, 2021, 10:20:57 AM
to rabbitmq-users
Hi everybody,

I am seeing strange behavior on a RabbitMQ cluster and couldn't find an explanation, even after a fairly thorough investigation.

This is a 3-node cluster running on GCP, with all nodes in the same region, one per zone.
Each node has 8 vCPUs, 50 GB of RAM, and a 200 GB network-attached SSD.
The cluster handles around 1k messages per second (published + delivered). Queues are not currently replicated; enabling replication apparently made the issue worse.
RabbitMQ 3.8.7, Erlang 23.0.3.

Every few days, one node will think another one is down because of a timeout/net_ticktime error. The netsplit always resolves immediately, which makes me think this might not be a real network issue. It is annoying though because it triggers autoheal, and a lot of connections are terminated/recreated in the process.

Nothing unusual shows up in the monitoring tools I have (the RabbitMQ management interface, Datadog, GCP's monitoring tools): no load spike, no networking issues for other components on the VMs (Consul agent and Datadog agent), no dropped packets. CPU and RAM usage are well within acceptable ranges. Connection operations are < 1/s, and queue operations are < 10/s (lots of background workers create temporary queues when they spin up).

The Datadog agent shows that the management API always becomes unresponsive a few seconds before RabbitMQ itself declares the other node as down.

The only unusual thing I observed in the logs is somewhat frequent (at least every ~30 minutes) busy_dist_port warnings like this one:

2021-04-27 10:30:05.412 [warning] <0.278.0> rabbit_sysmon_handler busy_dist_port <0.1166.0> [{name,delegate_management_0},{initial_call,{delegate,init,1}},{gen_server2,process_next_msg,1},{message_queue_len,0}] <0.6829.5895>

The erlang_vm_dist_node_queue_size_bytes metric never shows high values (in fact, it is almost always zero).

I already tried increasing the distribution buffer size (up to 1 GB) and increasing the net_ticktime (up to 180 s). I also tried adding tracing in the cluster to understand what fills the buffer and whether it could be related to the netsplit issue, but I didn't really find a good place to observe that.
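
For reference, this is how those two settings are currently applied (the full config is at the bottom of this message); if I understand correctly, the distribution buffer size maps to the Erlang +zdbbl flag and is expressed in kilobytes, so 1024000 should be roughly 1 GB:

# rabbitmq.conf
net_ticktime = 180

# rabbitmq-env.conf
RABBITMQ_DISTRIBUTION_BUFFER_SIZE=1024000

To double-check that the buffer limit is actually picked up at runtime, I believe something like this should show it (in bytes):

rabbitmqctl eval 'erlang:system_info(dist_buf_busy_limit).'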

Could there be something, somewhere, freezing or delaying the process responsible for net_ticktime monitoring, which would explain what I see?
I'm really eager to understand what's going on here, and hopefully get the problem solved, so any ideas you might have are welcome! :)
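
If it helps, I can run probes on the live nodes around the time of an incident; for example (assuming net_kernel is actually the relevant process here, which I'm not sure about), something like this to see whether it has a backlog or is stuck:

rabbitmqctl eval 'erlang:process_info(erlang:whereis(net_kernel), [message_queue_len, current_function, status]).'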

I included relevant logs and current configuration below.

Thank you very much!
 
Node2 logs

2021-04-13 02:17:36.550 [info] <0.17505.6147> rabbit on node '<Node1>' down
2021-04-13 02:17:36.550 [error] <0.27290.6398> ** Node '<Node1>' not responding **
** Removing (timedout) connection **
2021-04-13 02:17:36.581 [info] <0.17505.6147> Node <Node1> is down, deleting its listeners
2021-04-13 02:17:36.593 [warning] <0.13334.6151> Management delegate query returned errors:
[{<21830.1323.0>,{exit,{nodedown,'<Node1>'},[]}}]
2021-04-13 02:17:36.636 [error] <0.14815.5137> Error in process <0.14815.5137> on node '<Node2>' with exit value:
{{badmatch,{error,closed}},[{inet_tls_dist,accept_loop,4,[{file,"inet_tls_dist.erl"},{line,254}]}]}
2021-04-13 02:17:36.636 [error] emulator Error in process <0.14815.5137> on node '<Node2>' with exit value:
{{badmatch,{error,closed}},[{inet_tls_dist,accept_loop,4,[{file,"inet_tls_dist.erl"},{line,254}]}]}
2021-04-13 02:17:36.720 [error] <0.1908.6147> Mnesia('<Node2>'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, '<Node1>'}
2021-04-13 02:17:39.995 [info] <0.17505.6147> node '<Node1>' down: net_tick_timeout
2021-04-13 02:17:39.995 [info] <0.17505.6147> node '<Node1>' up
2021-04-13 02:17:40.007 [info] <0.17505.6147> Autoheal request sent to '<Node1>'
2021-04-13 02:17:40.008 [error] <0.17505.6147> Partial partition detected:
 * We saw DOWN from <Node1>
 * We can still see <Node3> which can see <Node1>
We will therefore intentionally disconnect from <Node3>
2021-04-13 02:17:41.009 [info] <0.17505.6147> rabbit on node '<Node3>' down
2021-04-13 02:17:41.016 [info] <0.17505.6147> Node <Node3> is down, deleting its listeners
2021-04-13 02:17:41.320 [info] <0.17505.6147> node '<Node3>' down: disconnect
2021-04-13 02:17:41.366 [info] <0.17505.6147> node '<Node3>' up
2021-04-13 02:17:41.367 [error] <0.1908.6147> Mnesia('<Node2>'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, '<Node3>'}
2021-04-13 02:17:41.372 [info] <0.17505.6147> Autoheal request sent to '<Node1>'
2021-04-13 02:17:41.543 [warning] <0.17505.6147> Autoheal: we were selected to restart; winner is '<Node3>'
[...]

Node1 logs

2021-04-13 02:17:36.576 [info] <0.547.0> rabbit on node '<Node2>' down
2021-04-13 02:17:36.589 [warning] <0.1722.0> Management delegate query returned errors:
[{<24932.5977.5834>,{exit,{nodedown,'<Node2>'},[]}}]
2021-04-13 02:17:36.593 [info] <0.547.0> Node <Node2> is down, deleting its listeners
2021-04-13 02:17:36.648 [warning] <0.3333.0> Management delegate query returned errors:
[{<24932.5977.5834>,{exit,{nodedown,'<Node2>'},[]}}]
2021-04-13 02:17:36.706 [error] <0.381.0> Mnesia('<Node1>'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, '<Node2>'}
2021-04-13 02:17:40.554 [info] <0.547.0> node '<Node2>' down: {shutdown,transport_closed}
2021-04-13 02:17:40.554 [info] <0.547.0> node '<Node2>' up
2021-04-13 02:17:40.680 [info] <0.547.0> Autoheal request sent to '<Node1>'
2021-04-13 02:17:40.680 [info] <0.547.0> Autoheal request received from '<Node2>'
2021-04-13 02:17:41.540 [info] <0.547.0> Autoheal request denied: Remote nodes disconnected:
 [{'<Node2>',['<Node3>']}]
2021-04-13 02:17:41.540 [info] <0.547.0> Autoheal request received from '<Node1>'
2021-04-13 02:17:41.545 [info] <0.547.0> Autoheal decision
  * Partitions: [['<Node3>','<Node1>'],['<Node2>']]
  * Winner:     '<Node3>'
  * Losers:     ['<Node2>']

[...]

RabbitMQ config

cluster_formation.consul.acl_token = <REDACTED>
cluster_formation.consul.host = 127.0.0.1
cluster_formation.consul.port = 8500
cluster_formation.consul.scheme = http
cluster_formation.consul.svc = rabbitmq
cluster_formation.consul.svc_addr = <REDACTED>
cluster_formation.consul.svc_addr_auto = false
cluster_formation.consul.svc_port = 5673
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_consul
cluster_partition_handling = autoheal
listeners.ssl.default = 5673
listeners.tcp = none
log.dir = /var/log/rabbitmq
log.file = rabbit.log
net_ticktime = 180
ssl_options.certfile = <REDACTED>
ssl_options.keyfile = <REDACTED>

RabbitMQ Env config

CTL_ERL_ARGS="-pa $ERL_SSL_PATH -proto_dist inet_tls -ssl_dist_opt server_certfile <REDACTED> server_keyfile <REDACTED> -ssl_dist_opt server_secure_renegotiate true client_secure_renegotiate true"
ERL_SSL_PATH="/usr/lib/erlang/lib/ssl-9.5.1/ebin"
RABBITMQ_DISTRIBUTION_BUFFER_SIZE=1024000
RABBITMQ_NODENAME="<Node2>"
RABBITMQ_USE_LONGNAME=true
SERVER_ADDITIONAL_ERL_ARGS="-pa $ERL_SSL_PATH -proto_dist inet_tls -ssl_dist_opt server_certfile <REDACTED> server_keyfile <REDACTED> -ssl_dist_opt server_secure_renegotiate true client_secure_renegotiate true"

Enabled plugins

rabbitmq_federation,rabbitmq_federation_management,rabbitmq_management,rabbitmq_peer_discovery_consul,rabbitmq_prometheus,rabbitmq_shovel,rabbitmq_shovel_management

(note: federation is not used anymore in this cluster)

Wenceslas des Deserts

May 5, 2021, 4:24:54 AM
to rabbitmq-users
Hi,

No ideas? :) I have absolutely no clue what could possibly be going on here...

Regards,
Wenceslas des Déserts

Michal Kuratczyk

May 5, 2021, 5:19:11 AM
to rabbitm...@googlegroups.com
Hi,

It seems like you are using the Management API for monitoring. I've never seen it cause an issue like the one described, but with a large number of objects (queues, connections, channels), returning metrics through the API is pretty expensive. That would match your observation of the Management API seemingly being down just before the issue. I don't know how often you call the API to get the metrics or how often the issue occurs, but perhaps you can try querying it less frequently and see whether the problem still occurs (or at least occurs less often)? Ultimately you should migrate to the Prometheus endpoint anyway, which is much faster and doesn't cause any inter-node traffic (the management API does, because a single call returns data from all of the nodes, so each node has to request the details from the others).
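
Since rabbitmq_prometheus is already in your enabled plugins, each node should expose its own metrics locally (by default on port 15692, assuming you haven't changed prometheus.tcp.port), so you can check them with something like:

curl -s http://localhost:15692/metrics | grep erlang_vm_dist

Datadog can also scrape that endpoint directly with its OpenMetrics/Prometheus integration.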

You can also use Observer and try to see what's at the top, especially when the problem occurs: https://www.rabbitmq.com/monitoring.html#diagnostics-observer
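
Assuming the CLI tools can reach the node (they should, given your CTL_ERL_ARGS setup), that would be something along the lines of:

rabbitmq-diagnostics observer
# or, against a specific node:
rabbitmq-diagnostics -n <node-name> observer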

Relevant video:


--
Michał

Wenceslas des Deserts

May 6, 2021, 4:15:22 PM
to rabbitmq-users
Hi,

Thanks for taking the time to answer! This is an interesting lead; I am going to try changing how often the Management API is called (or even switch to the Prometheus endpoint entirely). I didn't know about Observer, so I will make sure to test it! I will let you know if it helps.

Thanks again!

Joachim Schuler

Jun 10, 2024, 4:00:05 AM
to rabbitmq-users
Hello Wenceslas,

We have recently been seeing similar behavior in our RabbitMQ environment. Although our infrastructure is quite different (we are on on-premise virtual machines, not on GCP), I was wondering whether you ever found a root cause, or at least some configuration that makes your cluster more resilient against such extremely short hiccups.

Any hint would be greatly appreciated!

Best regards,
- Joachim

jo...@cloudamqp.com

Jun 11, 2024, 12:17:01 PM
to rabbitmq-users
Hi,
Note that this thread is from 2021, and Erlang 23.x had some issues with netsplits that are not seen on later versions. I suggest you start a new thread with your exact details and we might be able to help from there.

/Johan

Wenceslas des Deserts

Jun 11, 2024, 4:01:53 PM
to rabbitmq-users
Hi everyone,

Wow, it has been a long time. We investigated a lot, including putting breakpoints in our production RabbitMQ instance, but never managed to find the root cause. We ended up replacing RabbitMQ entirely as it was a nightmare for us.

Very interesting to learn that Erlang 23.x had some netsplit issues; if you have any references about this, I'd like to see them! :D

Sorry I cannot help more,
Wenceslas.

Joachim Schuler

Jun 13, 2024, 12:05:06 PM
to rabbitmq-users
Hello both, and thanks for your feedback!
In our case we are on Erlang 25.3.2.2, so at least we can rule that out.
The status today is that we have stabilized the situation after moving the underlying VM to a different hypervisor.
So if things stay calm from now on, we will probably leave it as is and consider the (virtual) infrastructure to be the root cause.
Thanks again & best regards,
- Joachim