RabbitMQ cluster instability, failure to autoheal on basic timeout situation


Pier Castonguay

Feb 3, 2020, 4:01:36 PM
to rabbitmq-users
We are evaluating RabbitMQ cluster mode on several hosts to improve the stability of our services, and so far it seems to add more instability and problems than it solves.

A bit of context:
Running a 3-node cluster on a local network, on RabbitMQ 3.7.14, Erlang 22 and Windows Server 2016.

Cluster configuration is as follows:

cluster_formation.peer_discovery_backend = rabbit_peer_discovery_classic_config

cluster_formation.classic_config.nodes.1 = rabbit@Developpement
cluster_formation.classic_config.nodes.2 = rabbit@server2
cluster_formation.classic_config.nodes.3 = rabbit@server3

cluster_partition_handling = autoheal

The HA policy is as follows:

Name: HA
Pattern: .*
Apply to: Exchanges and queues

ha-mode: exactly
ha-params: 2
ha-sync-mode: automatic
max-length-bytes: 50000000
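
(For reference, this is roughly the equivalent rabbitmqctl command for the policy above; the exact quoting depends on the shell, and on Windows the command would be rabbitmqctl.bat:)

rabbitmqctl set_policy --apply-to all HA ".*" '{"ha-mode":"exactly","ha-params":2,"ha-sync-mode":"automatic","max-length-bytes":50000000}'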


Each node has 13 clients connected to the RabbitMQ service on the same host. They are Windows services using the Particular.NServiceBus library, which in turn uses the C# RabbitMQ.Client NuGet package.

Under normal usage it seems to work: the management console shows all services, queues get synchronized and messages get handled. Stopping RabbitMQ on one node and starting it again works fine.
One strange behavior I noticed is that if my clients are actively trying to connect, it takes around 3-4 minutes for the RabbitMQ service to restart, but if the clients are stopped the RabbitMQ service restarts in a few seconds.


Now for the problem. Over the weekend, in the middle of the night, the network seems to have had a small hiccup (nothing severe), but RabbitMQ never managed to get back on its feet.
This is exactly the kind of situation we created the cluster for, but instead of helping, it brought the whole thing down (which wouldn't have happened in non-cluster mode).

I tried to analyze the hundreds of megabytes of logs, but as a user I can't say I understand everything about the inner workings of RabbitMQ, and I would like help from an expert to explain why it happened and why it didn't recover.
From what I understand, node server3 timed out and got partitioned. The autoheal mechanism elected node server2 as the winner. Node server3 came back and server2 told it to restart. Node server2 then waited for node server3 to come back, but server3 only stopped and never restarted, so everything froze. Am I right in my reading of the logs, and why didn't server3 restart?
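
(A side note on the net_tick_timeout entries in the logs below: inter-node failure detection is driven by Erlang's net_ticktime, which defaults to 60 seconds. If the network is known to hiccup, the interval can be raised in advanced.config; here is a minimal sketch, where 120 is only an illustrative value, not what we actually run:)

[
  {kernel, [
    {net_ticktime, 120}
  ]}
].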

Here are the cleaned-up log lines I got from each node, keeping only what seemed important and cutting out repetitive lines:

Developpement

2020-02-02 05:56:23.354 [error] <0.14508.12> ** Node rabbit@server3 not responding **
** Removing (timedout) connection **
2020-02-02 05:56:23.354 [info] <0.15721.12> rabbit on node rabbit@server3 down
2020-02-02 05:56:25.882 [info] <0.15721.12> Node rabbit@server3 is down, deleting its listeners
2020-02-02 05:56:25.882 [info] <0.15721.12> node rabbit@server3 down: net_tick_timeout
2020-02-02 05:56:27.114 [info] <0.15721.12> node rabbit@server3 up
2020-02-02 05:56:30.609 [info] <0.25618.22> Mirrored queue 'nsb.delay-level-11' in vhost '/': Promoting slave <rab...@Developpement.3.25618.22> to master
2020-02-02 05:56:30.609 [info] <0.25618.22> Mirrored queue 'nsb.delay-level-11' in vhost '/': Adding mirror on node rabbit@server2: <10809.23762.18>
2020-02-02 05:56:32.840 [info] <0.15721.12> Autoheal request received from rabbit@server3
2020-02-02 05:56:32.840 [info] <0.15721.12> Autoheal decision
  * Partitions: [[rabbit@server3],[rabbit@server2,rabbit@Developpement]]
  * Winner:     rabbit@server2
  * Losers:     [rabbit@server3]
2020-02-02 05:56:32.840 [info] <0.15721.12> Autoheal request received from rabbit@server3 when healing; ignoring

2020-02-02 05:59:44.689 [error] <0.27620.22> Channel error on connection <0.17370.12> ([::1]:57912 -> [::1]:5672, vhost: '/', user: 'guest'), channel 1:
operation queue.declare caused a channel exception not_found: failed to perform operation on queue 'Aggregator_NODE2' in vhost '/' due to timeout

[Last line got repeated non-stop for all queues for the rest of the day]

Server2

2020-02-02 05:56:25.984 [error] <0.11374.10> ** Node rabbit@server3 not responding **
** Removing (timedout) connection **
2020-02-02 05:56:25.984 [info] <0.13995.14> rabbit on node rabbit@server3 down
2020-02-02 05:56:28.236 [info] <0.1272.18> Mirrored queue 'DataLogger_NODE3_Orchestrator' in vhost '/': Master <rab...@server2.3.15590.14> saw deaths of mirrors <rab...@server3.1.7022.10>
[Last line getting repeated for all queues]
2020-02-02 05:56:32.909 [info] <0.13995.14> Autoheal: I am the winner, waiting for [rabbit@server3] to stop

Server3

2020-02-02 05:56:27.158 [info] <0.13678.7> rabbit on node rabbit@server2 down
2020-02-02 05:56:28.277 [error] <0.13479.7> Mnesia(rabbit@server3): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, rabbit@server2}
2020-02-02 05:56:29.367 [info] <0.7233.10> Mirrored queue 'TaskManager-NODE3' in vhost '/': Slave <rab...@server3.1.7233.10> saw deaths of mirrors <rab...@server2.3.15798.14>
[Last line getting repeated for all queues]
2020-02-02 05:56:32.914 [info] <0.13678.7> Keeping rabbit@Developpement listeners: the node is already back
2020-02-02 05:56:32.914 [info] <0.13678.7> node rabbit@Developpement down: connection_closed
2020-02-02 05:56:32.914 [info] <0.13678.7> node rabbit@Developpement up
2020-02-02 05:56:32.914 [info] <0.13678.7> node rabbit@server2 up
2020-02-02 05:56:32.914 [info] <0.13678.7> Autoheal request sent to rabbit@Developpement
2020-02-02 05:56:32.914 [info] <0.13678.7> Autoheal request sent to rabbit@Developpement
2020-02-02 05:56:32.914 [warning] <0.13678.7> Autoheal: we were selected to restart; winner is rabbit@server2
2020-02-02 05:56:32.914 [info] <0.4145.16> RabbitMQ is asked to stop...
2020-02-02 05:56:42.856 [error] <0.9446.10> Error on AMQP connection <0.9446.10> ([::1]:58851 -> [::1]:5672 - Publish, vhost: '/', user: 'guest', state: running), channel 0:
 operation none caused a connection exception connection_forced: "broker forced connection closure with reason 'shutdown'"
2020-02-02 05:56:42.888 [error] <0.10792.10> Supervisor {<0.10792.10>,rabbit_channel_sup_sup} had child channel_sup started with rabbit_channel_sup:start_link() at undefined exit with reason shutdown in context shutdown_error
2020-02-02 05:56:43.374 [warning] <0.6624.10> Mirrored queue 'Particular.ServiceControl.staging' in vhost '/': Stopping all nodes on master shutdown since no synchronised slave is available
2020-02-02 05:57:18.962 [info] <0.4145.16> Successfully stopped RabbitMQ and its dependencies

My client applications
At this point, all my services on nodes Developpement and Server2 were non-functional and their logs were flooded with:
System.Exception: Message could not be routed to MyQueueName: 312 NO_ROUTE

But the services on Server3 had their logs flooded with:
RabbitMQ.Client.Exceptions.AlreadyClosedException: Already closed: The AMQP operation was interrupted: AMQP close-reason, initiated by Peer, code=320, text="CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'", classId=0, methodId=0, cause=


The next day
This went on for the whole day, with only node Developpement logging timeouts in the RabbitMQ logs. Nodes server2 and server3 had nothing going on in their logs. The management console showed an uptime of 10 days on all nodes, as if everything were fine. Then my co-worker came to work and restarted RabbitMQ on Developpement. This logged a whole lot of strange errors on all nodes, but after a few minutes it all came back. My client services stopped logging errors, but they could only send messages and not receive any. After restarting them too, everything was functional again.
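
(For what it's worth, the partition view each node had could have been captured before the restart by running the standard CLI on any node:)

rabbitmqctl cluster_status

(On 3.7 the output includes a partitions section; a healthy cluster reports {partitions,[]}.)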

Here are a few of the errors that were logged when restarting RabbitMQ on Developpement:

Developpement

2020-02-03 14:43:42.131 [error] <0.17859.12> Supervisor {<0.17859.12>,rabbit_channel_sup_sup} had child channel_sup started with rabbit_channel_sup:start_link() at undefined exit with reason shutdown in context shutdown_error
2020-02-03 14:43:45.103 [error] <0.17370.12> CRASH REPORT Process <0.17370.12> with 0 neighbours exited with reason: channel_termination_timeout in rabbit_reader:wait_for_channel_termination/3 line 764
2020-02-03 14:43:45.103 [error] <0.17368.12> Supervisor {<0.17368.12>,rabbit_connection_sup} had child reader started with rabbit_reader:start_link(<0.17369.12>, {acceptor,{0,0,0,0,0,0,0,0},5672}) at <0.17370.12> exit with reason channel_termination_timeout in context shutdown_error
2020-02-03 14:43:47.098 [info] <0.15713.12> Closing all connections in vhost '/' on node 'rabbit@Developpement' because the vhost is stopping
2020-02-03 14:43:47.145 [info] <0.16874.12> Stopping message store for directory 's:/JamLogic3/RabbitMQ/db/rabbit@Developpement-mnesia/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent'
2020-02-03 14:43:47.204 [info] <0.16874.12> Message store for directory 's:/JamLogic3/RabbitMQ/db/rabbit@Developpement-mnesia/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_persistent' is stopped
2020-02-03 14:43:47.204 [info] <0.16871.12> Stopping message store for directory 's:/JamLogic3/RabbitMQ/db/rabbit@Developpement-mnesia/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_transient'
2020-02-03 14:43:47.215 [info] <0.16871.12> Message store for directory 's:/JamLogic3/RabbitMQ/db/rabbit@Developpement-mnesia/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/msg_store_transient' is stopped
2020-02-03 14:43:47.221 [error] <0.7863.26> ** Generic server <0.7863.26> terminating
** Last message in was {'$gen_cast',{method,{'queue.declare',0,<<"Orchestrator_NODE2">>,true,false,false,false,false,[]},none,noflow}}
** When Server state == {ch,running,rabbit_framing_amqp_0_9_1,1,<0.17370.12>,<0.7861.26>,<0.17370.12>,<<"[::1]:57912 -> [::1]:5672">>,rabbit_reader,{lstate,<0.7862.26>,false},none,1,{[],[]},{user,<<"guest">>,[administrator],[{rabbit_auth_backend_internal,none}]},<<"/">>,<<>>,#{},{state,{dict,0,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},erlang},#{},#{},{set,0,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},<0.17475.12>,{state,fine,5000,#Ref<0.1829353633.2447638530.110331>},false,1,{{0,nil},{0,nil}},[],[],{{0,nil},{0,nil}},[{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,bool,true},{<<"consumer_cancel_notify">>,bool,true},{<<"connection.blocked">>,bool,true},{<<"authentication_failure_close">>,bool,true}],none,0,none,flow,[]}
** Reason for termination ==
** {badarg,[{ets,lookup,[rabbit_registry,{ha_mode,exactly}],[]},{rabbit_registry,lookup_module,2,[{file,"src/rabbit_registry.erl"},{line,64}]},{rabbit_mirror_queue_misc,module,1,[{file,"src/rabbit_mirror_queue_misc.erl"},{line,405}]},{rabbit_mirror_queue_misc,is_mirrored,1,[{file,"src/rabbit_mirror_queue_misc.erl"},{line,420}]},{rabbit_amqqueue,retry_wait,4,[{file,"src/rabbit_amqqueue.erl"},{line,508}]},{rabbit_channel,handle_method,6,[{file,"src/rabbit_channel.erl"},{line,2267}]},{rabbit_channel,handle_method,3,[{file,"src/rabbit_channel.erl"},{line,1449}]},{rabbit_channel,handle_cast,2,[{file,"src/rabbit_channel.erl"},{line,563}]}]}
2020-02-03 14:43:47.222 [error] <0.7863.26> CRASH REPORT Process <0.7863.26> with 0 neighbours exited with reason: bad argument in call to ets:lookup(rabbit_registry, {ha_mode,exactly}) in rabbit_registry:lookup_module/2 line 64
2020-02-03 14:43:47.222 [error] <0.7860.26> Supervisor {<0.7860.26>,rabbit_channel_sup} had child channel started with rabbit_channel:start_link(1, <0.17370.12>, <0.7861.26>, <0.17370.12>, <<"[::1]:57912 -> [::1]:5672">>, rabbit_framing_amqp_0_9_1, {user,<<"guest">>,[administrator],[{rabbit_auth_backend_internal,none}]}, <<"/">>, [{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,...},...], <0.17475.12>, <0.7862.26>) at <0.7863.26> exit with reason bad argument in call to ets:lookup(rabbit_registry, {ha_mode,exactly}) in rabbit_registry:lookup_module/2 line 64 in context shutdown_error

Server2

2020-02-03 14:43:47.442 [info] <0.13995.14> rabbit on node rabbit@Developpement down
2020-02-03 14:43:47.473 [info] <0.13995.14> Keeping rabbit@Developpement listeners: the node is already back
2020-02-03 14:43:48.598 [info] <0.13995.14> node rabbit@Developpement down: connection_closed
2020-02-03 14:43:48.598 [info] <0.13995.14> Autoheal: aborting - [rabbit@Developpement] down
2020-02-03 14:44:10.995 [error] <0.30204.18> ** Generic server <0.30204.18> terminating
** Last message in was go
** When Server state == {not_started,{amqqueue,{resource,<<"/">>,queue,<<"TaskScheduler">>},true,false,none,[],<10808.11640.16>,[],[],[rabbit@Developpement],[{vhost,<<"/">>},{name,<<"HA">>},{pattern,<<".*">>},{'apply-to',<<"all">>},{definition,[{<<"ha-mode">>,<<"exactly">>},{<<"ha-params">>,2},{<<"ha-sync-mode">>,<<"automatic">>},{<<"max-length-bytes">>,50000000}]},{priority,0}],undefined,[{<10808.11922.16>,<10808.11640.16>}],[],live,0,[],<<"/">>,#{user => <<"guest">>}}}
** Reason for termination ==
** {{badmatch,{error,{"s:/Jamlogic3/RabbitMQ/db/rabbit@server2-mnesia/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/queues/7KWD7DYVWOVS0ZOSAHWKJAH3W",eexist}}},[{rabbit_variable_queue,delete_crashed,1,[{file,"src/rabbit_variable_queue.erl"},{line,624}]},{rabbit_mirror_queue_slave,handle_go,1,[{file,"src/rabbit_mirror_queue_slave.erl"},{line,123}]},{rabbit_mirror_queue_slave,handle_call,3,[{file,"src/rabbit_mirror_queue_slave.erl"},{line,220}]},{gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1035}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}
2020-02-03 14:44:10.995 [error] <0.30204.18> CRASH REPORT Process <0.30204.18> with 1 neighbours exited with reason: no match of right hand value {error,{"s:/Jamlogic3/RabbitMQ/db/rabbit@server2-mnesia/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/queues/7KWD7DYVWOVS0ZOSAHWKJAH3W",eexist}} in rabbit_variable_queue:delete_crashed/1 line 624 in gen_server2:terminate/3 line 1172
2020-02-03 14:44:10.995 [error] <0.30202.18> Supervisor {<0.30202.18>,rabbit_amqqueue_sup} had child rabbit_amqqueue started with rabbit_prequeue:start_link({amqqueue,{resource,<<"/">>,queue,<<"TaskScheduler">>},true,false,none,[],<10808.11640.16>,[],[],...}, slave, <0.30201.18>) at <0.30204.18> exit with reason no match of right hand value {error,{"s:/Jamlogic3/RabbitMQ/db/rabbit@server2-mnesia/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/queues/7KWD7DYVWOVS0ZOSAHWKJAH3W",eexist}} in rabbit_variable_queue:delete_crashed/1 line 624 in context child_terminated
2020-02-03 14:45:05.221 [info] <0.13995.14> node rabbit@Developpement up
2020-02-03 14:45:19.990 [info] <0.13995.14> rabbit on node rabbit@Developpement up

Server3

2020-02-03 14:43:52.470 [info] <0.4145.16> Stopping application 'syslog'
2020-02-03 14:43:52.470 [info] <0.4145.16> Stopping application 'lager'
2020-02-03 14:43:52.861 [info] <0.4145.16> Log file opened with Lager
2020-02-03 14:43:53.068 [info] <0.43.0> Application mnesia exited with reason: stopped
2020-02-03 14:44:08.448 [error] <0.10174.16> Mnesia(rabbit@server3): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, rabbit@server2}
2020-02-03 14:44:11.018 [error] <0.11640.16> ** Generic server <0.11640.16> terminating
** Last message in was {init,{<0.11498.16>,[[{segments,[{57,2}]},{persistent_ref,<<184,134,49,24,46,214,85,45,147,242,106,31,237,25,252,43>>},{persistent_count,2},{persistent_bytes,1480}]]}}
** When Server state == {q,{amqqueue,{resource,<<"/">>,queue,<<"TaskScheduler">>},true,false,none,[],<0.11640.16>,[],[],[rabbit@Developpement],[{vhost,<<"/">>},{name,<<"HA">>},{pattern,<<".*">>},{'apply-to',<<"all">>},{definition,[{<<"ha-mode">>,<<"exactly">>},{<<"ha-params">>,2},{<<"ha-sync-mode">>,<<"automatic">>},{<<"max-length-bytes">>,50000000}]},{priority,0}],undefined,[],undefined,live,0,[],<<"/">>,#{user => <<"guest">>}},none,false,undefined,undefined,{state,{queue,[],[],0},{active,939318758029,1.0}},undefined,undefined,undefined,undefined,{state,fine,5000,undefined},{0,nil},undefined,undefined,undefined,{state,{dict,0,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},delegate},undefined,undefined,undefined,undefined,'drop-head',0,0,running}
** Reason for termination ==
** {{{badmatch,{error,{"s:/Jamlogic3/RabbitMQ/db/rabbit@server2-mnesia/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/queues/7KWD7DYVWOVS0ZOSAHWKJAH3W",eexist}}},[{rabbit_variable_queue,delete_crashed,1,[{file,"src/rabbit_variable_queue.erl"},{line,624}]},{rabbit_mirror_queue_slave,handle_go,1,[{file,"src/rabbit_mirror_queue_slave.erl"},{line,123}]},{rabbit_mirror_queue_slave,handle_call,3,[{file,"src/rabbit_mirror_queue_slave.erl"},{line,220}]},{gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1035}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]},{gen_server2,call,[<7629.30204.18>,go,infinity]}}
2020-02-03 14:44:11.034 [error] <0.11640.16> CRASH REPORT Process <0.11640.16> with 2 neighbours exited with reason: {{{badmatch,{error,{"s:/Jamlogic3/RabbitMQ/db/rabbit@server2-mnesia/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/queues/7KWD7DYVWOVS0ZOSAHWKJAH3W",eexist}}},[{rabbit_variable_queue,delete_crashed,1,[{file,"src/rabbit_variable_queue.erl"},{line,624}]},{rabbit_mirror_queue_slave,handle_go,1,[{file,"src/rabbit_mirror_queue_slave.erl"},{line,123}]},{rabbit_mirror_queue_slave,handle_call,3,[{file,"src/rabbit_mirror_queue_slave.erl"},{line,220}]},{gen_server2,handle_msg,2,[{file,"src/gen_server2...."},...]},...]},...} in gen_server2:terminate/3 line 1172
2020-02-03 14:44:11.049 [error] <0.11639.16> Supervisor {<0.11639.16>,rabbit_amqqueue_sup} had child rabbit_amqqueue started with rabbit_prequeue:start_link({amqqueue,{resource,<<"/">>,queue,<<"TaskScheduler">>},true,false,none,[],<0.6671.10>,[],[],[rabbit@Developpement],...}, recovery, <0.11638.16>) at <0.11640.16> exit with reason {{{badmatch,{error,{"s:/Jamlogic3/RabbitMQ/db/rabbit@server2-mnesia/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/queues/7KWD7DYVWOVS0ZOSAHWKJAH3W",eexist}}},[{rabbit_variable_queue,delete_crashed,1,[{file,"src/rabbit_variable_queue.erl"},{line,624}]},{rabbit_mirror_queue_slave,handle_go,1,[{file,"src/rabbit_mirror_queue_slave.erl"},{line,123}]},{rabbit_mirror_queue_slave,handle_call,3,[{file,"src/rabbit_mirror_queue_slave.erl"},{line,220}]},{gen_server2,handle_msg,2,[{file,"src/gen_server2...."},...]},...]},...} in context child_terminated
2020-02-03 14:44:11.065 [error] <0.12494.16> Restarting crashed queue 'TaskScheduler' in vhost '/'.
2020-02-03 14:44:11.269 [error] <0.11498.16> Queue <0.11691.16> failed to initialise: {{{badmatch,{error,{"s:/Jamlogic3/RabbitMQ/db/rabbit@server2-mnesia/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/queues/DN0FSSJMEBH4YI0HU9K8EE1RI",eexist}}},[{rabbit_variable_queue,delete_crashed,1,[{file,"src/rabbit_variable_queue.erl"},{line,624}]},{rabbit_mirror_queue_slave,handle_go,1,[{file,"src/rabbit_mirror_queue_slave.erl"},{line,123}]},{rabbit_mirror_queue_slave,handle_call,3,[{file,"src/rabbit_mirror_queue_slave.erl"},{line,220}]},{gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1035}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]},{gen_server2,call,[<7629.30698.18>,go,infinity]}}


Thanks, and sorry for the very long post. I tried to include too much information rather than not enough.
Can anyone please help me make sense of all of this?

Michael Klishin

Feb 5, 2020, 4:36:03 PM
to rabbitmq-users
The sequence of events is not specific enough to conclude much, but it could be [1].

One "strange error" means that an internal table did not exist at a certain point.
Another says that a message store operation failed with eexist, which is a Windows-specific
problem that is years old but the root cause is unknown. I don't recall seeing it on 3.8 with Erlang 22,
although that could be a matter of how long that release has been around compared to, say, 3.6 and 3.7 combined.



--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Pier Castonguay

Feb 6, 2020, 2:39:15 PM
to rabbitmq-users
Thank you Michael for taking the time to reply.

So the pull request you linked is pretty new and was not included in my version. As I understand it, it adds a timeout so that if something goes wrong during the autoheal process, the node continues anyway instead of staying stuck. I will update to the latest version for my future tests.

But from what I understand of your reply, you aren't sure this was the root cause of what happened in my situation, and the problem might still be present.

As you can understand, this was a pre-release test before pushing the same cluster configuration into production. Since it failed to run for more than a few weeks (and we weren't even trying to make it fail by simulating outages), it's not giving management a trustworthy impression, and they are hesitant to approve the deployment. So I'm looking for at least a minimal explanation of why it failed, to reassure management that the technology is still viable, because at this point an unexplained outage that crashed everything is not something that will be approved, and we really need the project to go forward.

Michael Klishin

Feb 6, 2020, 4:46:50 PM
to rabbitmq-users
We don't know the root cause of file system operations failing with "eexist". It's an OS response [1] to a runtime operation that RabbitMQ itself does not implement.

