Hi everybody,
I am seeing strange behavior on a RabbitMQ cluster and haven't found an explanation even after a fairly thorough investigation.
This is a 3-node cluster running on GCP, all nodes in the same region, one per zone.
Each node has 8 vCPUs, 50 GB of RAM, and a 200 GB network-attached SSD.
The cluster handles around 1k messages per second (published + delivered). Queues are not currently replicated; when I tried enabling replication, it apparently made the issue worse.
Rabbit 3.8.7, Erlang 23.0.3.
Every few days, one node will decide another one is down because of a timeout/net_ticktime error. The netsplit always resolves immediately, which makes me think this might not be a real network issue. It is annoying, though, because it triggers autoheal, and a lot of connections get terminated and recreated in the process.
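For reference, my understanding from the kernel docs: ticks are exchanged every net_ticktime / 4 seconds, and a peer is only declared down after net_ticktime seconds with no traffic at all. I verified the value in force from a remote shell (started with the same TLS distribution args as the nodes):

%% Standard OTP calls, run from a remote shell on one of the nodes:
net_kernel:get_net_ticktime().
%% => 180  (matches the net_ticktime configured below)
%% It can also be raised at runtime, e.g. (hypothetical value):
%% net_kernel:set_net_ticktime(240).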
Nothing unusual shows up in the monitoring tools I have (RabbitMQ management UI, Datadog, GCP investigation tools): no load spikes, no networking issues for the other components on the VMs (Consul agent and Datadog agent), no dropped packets. CPU and RAM usage are well within acceptable ranges. Connection operations are < 1/s; queue operations are < 10/s (lots of background workers creating temporary queues when they spin up).
The Datadog agent shows that the management API always becomes unresponsive a few seconds before Rabbit itself declares the other node down.
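In case it helps anyone reproduce this, the check is conceptually just a timed HTTP round-trip. A rough, untested sketch from an Erlang shell on the node (assuming the management listener is on the default port 15672; a 401 without credentials is fine since only the latency matters):

%% Time one management API round-trip per second and log the latency.
application:ensure_all_started(inets),
MgmtProbe = fun P() ->
    {Micros, Res} = timer:tc(httpc, request, ["http://127.0.0.1:15672/api/overview"]),
    io:format("mgmt API: ~p in ~b ms~n", [element(1, Res), Micros div 1000]),
    timer:sleep(1000),
    P()
end,
spawn(MgmtProbe).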
The only unusual thing I observed in the logs is somewhat frequent (at least every ~30 minutes) busy_dist_port warnings like this one:
2021-04-27 10:30:05.412 [warning] <0.278.0> rabbit_sysmon_handler busy_dist_port <0.1166.0> [{name,delegate_management_0},{initial_call,{delegate,init,1}},{gen_server2,process_next_msg,1},{message_queue_len,0}] <0.6829.5895>
The erlang_vm_dist_node_queue_size_bytes metric never shows high values (in fact, it is almost always zero).
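As a sanity check, the limit the running VM actually enforces can be read from a remote shell (this confirmed the env setting shown at the bottom is applied):

%% dist_buf_busy_limit is the +zdbbl value, in bytes, that
%% RABBITMQ_DISTRIBUTION_BUFFER_SIZE (in KB) translates to;
%% with 1024000 KB this prints 1048576000 (~1 GB).
erlang:system_info(dist_buf_busy_limit).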
I already tried increasing the distribution buffer size (up to 1 GB) and increasing net_ticktime (up to 180 s). I also tried adding tracing in the cluster to understand what fills the buffer and whether it could be related to the netsplit issue, but didn't really find a good place to observe that.
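The closest I got to an observation point is temporarily taking over the Erlang system monitor from a remote shell. Rough, untested sketch; note that it replaces the monitor registered by rabbit_sysmon_handler, so it is only suitable for short debugging sessions:

%% Receive busy_dist_port system monitor messages and dump info about
%% the process that got suspended by the distribution connection.
Tracer = spawn(fun Loop() ->
    receive
        {monitor, SusPid, busy_dist_port, DistEntity} ->
            io:format("busy_dist_port: ~p suspended by ~p, info: ~p~n",
                      [SusPid, DistEntity,
                       erlang:process_info(SusPid, [registered_name,
                                                    current_function,
                                                    message_queue_len])]),
            Loop()
    end
end),
%% NB: this steals the system monitor from rabbit_sysmon_handler.
erlang:system_monitor(Tracer, [busy_dist_port]).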
Could there be something freezing or delaying the process responsible for net_ticktime monitoring (net_kernel, if I understand correctly), which would explain what I see?
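To test that hypothesis, I am thinking of timing a synchronous request to net_kernel (as far as I can tell, net_kernel:get_net_ticktime/0 goes through a gen_server call to it, so its round-trip time would expose any stall). Untested sketch:

%% Log net_kernel's round-trip latency once per second. Sustained spikes
%% here would mean the ticker itself is being starved, which would make
%% spurious net_tick_timeouts plausible.
Probe = fun P() ->
    {Micros, _} = timer:tc(net_kernel, get_net_ticktime, []),
    io:format("net_kernel round-trip: ~b us~n", [Micros]),
    timer:sleep(1000),
    P()
end,
spawn(Probe).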
I'm really eager to understand what's going on here and hopefully get the problem solved, so any ideas you might have are welcome! :)
I've included the relevant logs and current configuration below.
Thank you very much!
Node2 logs
2021-04-13 02:17:36.550 [info] <0.17505.6147> rabbit on node '<Node1>' down
2021-04-13 02:17:36.550 [error] <0.27290.6398> ** Node '<Node1>' not responding **
** Removing (timedout) connection **
2021-04-13 02:17:36.581 [info] <0.17505.6147> Node <Node1> is down, deleting its listeners
2021-04-13 02:17:36.593 [warning] <0.13334.6151> Management delegate query returned errors:
[{<21830.1323.0>,{exit,{nodedown,'<Node1>'},[]}}]
2021-04-13 02:17:36.636 [error] <0.14815.5137> Error in process <0.14815.5137> on node '<Node2>' with exit value:
{{badmatch,{error,closed}},[{inet_tls_dist,accept_loop,4,[{file,"inet_tls_dist.erl"},{line,254}]}]}
2021-04-13 02:17:36.636 [error] emulator Error in process <0.14815.5137> on node '<Node2>' with exit value:
{{badmatch,{error,closed}},[{inet_tls_dist,accept_loop,4,[{file,"inet_tls_dist.erl"},{line,254}]}]}
2021-04-13 02:17:36.720 [error] <0.1908.6147> Mnesia('<Node2>'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, '<Node1>'}
2021-04-13 02:17:39.995 [info] <0.17505.6147> node '<Node1>' down: net_tick_timeout
2021-04-13 02:17:39.995 [info] <0.17505.6147> node '<Node1>' up
2021-04-13 02:17:40.007 [info] <0.17505.6147> Autoheal request sent to '<Node1>'
2021-04-13 02:17:40.008 [error] <0.17505.6147> Partial partition detected:
* We saw DOWN from <Node1>
* We can still see <Node3> which can see <Node1>
We will therefore intentionally disconnect from <Node3>
2021-04-13 02:17:41.009 [info] <0.17505.6147> rabbit on node '<Node3>' down
2021-04-13 02:17:41.016 [info] <0.17505.6147> Node <Node3> is down, deleting its listeners
2021-04-13 02:17:41.320 [info] <0.17505.6147> node '<Node3>' down: disconnect
2021-04-13 02:17:41.366 [info] <0.17505.6147> node '<Node3>' up
2021-04-13 02:17:41.367 [error] <0.1908.6147> Mnesia('<Node2>'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, '<Node3>'}
2021-04-13 02:17:41.372 [info] <0.17505.6147> Autoheal request sent to '<Node1>'
2021-04-13 02:17:41.543 [warning] <0.17505.6147> Autoheal: we were selected to restart; winner is '<Node3>'
[...]
Node1 logs
2021-04-13 02:17:36.576 [info] <0.547.0> rabbit on node '<Node2>' down
2021-04-13 02:17:36.589 [warning] <0.1722.0> Management delegate query returned errors:
[{<24932.5977.5834>,{exit,{nodedown,'<Node2>'},[]}}]
2021-04-13 02:17:36.593 [info] <0.547.0> Node <Node2> is down, deleting its listeners
2021-04-13 02:17:36.648 [warning] <0.3333.0> Management delegate query returned errors:
[{<24932.5977.5834>,{exit,{nodedown,'<Node2>'},[]}}]
2021-04-13 02:17:36.706 [error] <0.381.0> Mnesia('<Node1>'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, '<Node2>'}
2021-04-13 02:17:40.554 [info] <0.547.0> node '<Node2>' down: {shutdown,transport_closed}
2021-04-13 02:17:40.554 [info] <0.547.0> node '<Node2>' up
2021-04-13 02:17:40.680 [info] <0.547.0> Autoheal request sent to '<Node1>'
2021-04-13 02:17:40.680 [info] <0.547.0> Autoheal request received from '<Node2>'
2021-04-13 02:17:41.540 [info] <0.547.0> Autoheal request denied: Remote nodes disconnected:
[{'<Node2>',['<Node3>']}]
2021-04-13 02:17:41.540 [info] <0.547.0> Autoheal request received from '<Node1>'
2021-04-13 02:17:41.545 [info] <0.547.0> Autoheal decision
* Partitions: [['<Node3>','<Node1>'],['<Node2>']]
* Winner: '<Node3>'
* Losers: ['<Node2>']
[...]
RabbitMQ config
cluster_formation.consul.acl_token = <REDACTED>
cluster_formation.consul.host = 127.0.0.1
cluster_formation.consul.port = 8500
cluster_formation.consul.scheme = http
cluster_formation.consul.svc = rabbitmq
cluster_formation.consul.svc_addr = <REDACTED>
cluster_formation.consul.svc_addr_auto = false
cluster_formation.consul.svc_port = 5673
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_consul
cluster_partition_handling = autoheal
listeners.ssl.default = 5673
listeners.tcp = none
log.dir = /var/log/rabbitmq
log.file = rabbit.log
net_ticktime = 180
ssl_options.certfile = <REDACTED>
ssl_options.keyfile = <REDACTED>
RabbitMQ Env config
CTL_ERL_ARGS="-pa $ERL_SSL_PATH -proto_dist inet_tls -ssl_dist_opt server_certfile <REDACTED> server_keyfile <REDACTED> -ssl_dist_opt server_secure_renegotiate true client_secure_renegotiate true"
ERL_SSL_PATH="/usr/lib/erlang/lib/ssl-9.5.1/ebin"
RABBITMQ_DISTRIBUTION_BUFFER_SIZE=1024000
RABBITMQ_NODENAME="<Node2>"
RABBITMQ_USE_LONGNAME=true
SERVER_ADDITIONAL_ERL_ARGS="-pa $ERL_SSL_PATH -proto_dist inet_tls -ssl_dist_opt server_certfile <REDACTED> server_keyfile <REDACTED> -ssl_dist_opt server_secure_renegotiate true client_secure_renegotiate true"
Enabled plugins
rabbitmq_federation,rabbitmq_federation_management,rabbitmq_management,rabbitmq_peer_discovery_consul,rabbitmq_prometheus,rabbitmq_shovel,rabbitmq_shovel_management
(note: federation is not used anymore in this cluster)