looks like network partition but AWS says there was no network outage

186 views
Skip to first unread message

Antonio Kang

unread,
Oct 15, 2018, 3:26:49 PM10/15/18
to rabbitmq-users
i had a network partition in a cluster of 3 nodes in aws (all in the same AZ)

I got the following errors on node1 and node2, from what i am seeing, it looks like it was due to a network outage but aws claims that there was no network outage.

Could these error msgs have been caused by something else and if so, what caused this issue?

I have posted the logs from node1 and node2:

Logs from node1:

2018-10-09 13:08:42.510 [error] <0.145.0> ** Node 'rabbit@rab-prod-node2' not responding **
** Removing (timedout) connection **
2018-10-09 13:08:42.510 [info] <0.499.0> rabbit on node 'rabbit@rab-prod-node2' down
2018-10-09 13:08:42.526 [info] <0.12834.994> connection <0.12834.994> (10.0.0.179:10473 -> 10.0.15.54:5672): user 'rabbitUser' authenticated and granted access to vhost '/'
2018-10-09 13:08:42.534 [info] <0.499.0> Node rabbit@rab-prod-node2 is down, deleting its listeners
2018-10-09 13:08:42.549 [info] <0.13417.994> closing AMQP connection <0.13417.994> (10.0.15.57:41642 -> 10.0.15.54:5672, vhost: '/', user: 'rabbitUser')
2018-10-09 13:08:42.555 [info] <0.499.0> node 'rabbit@rab-prod-node2' down: net_tick_timeout
2018-10-09 13:08:42.555 [info] <0.499.0> node 'rabbit@rab-prod-node2' up
2018-10-09 13:08:42.555 [error] <0.231.0> Mnesia('rabbit@rab-prod-node1'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rabbit@rab-prod-node2'}
2018-10-09 13:08:42.556 [info] <0.12362.994> closing AMQP connection <0.12362.994> (10.0.15.57:41648 -> 10.0.15.54:5672, vhost: '/', user: 'rabbitUser')
2018-10-09 13:08:42.564 [error] <0.10257.277> ** Generic server <0.10257.277> terminating
** Last message in was {'$gen_cast',{method,{'queue.declare',0,<<"aliveness-test">>,true,false,false,true,false,[]},none,noflow}}
** When Server state == {ch,running,rabbit_framing_amqp_0_9_1,1,<0.9900.277>,<0.10485.277>,<0.9900.277>,<<"127.0.0.1:41553 -> 127.0.0.1:5672">>,{lstate,<0.10773.277>,false},none,13699,{[{13698,<<"pserver_gd_queue_81efb">>,{<13177.465.0>,2620255}}],[]},{user,<<"rabbitUser">>,[],[{rabbit_auth_backend_internal,none}]},<<"/">>,<<"aliveness-test">>,#{<13177.465.0> => {resource,<<"/">>,queue,<<"external_post.pserver">>},<13177.495.0> => {resource,<<"/">>,queue,<<"transaction_log_api">>},<0.29173.980> => {resource,<<"/">>,queue,<<"external_post.pserver.buffer.3600">>},<13176.11545.1044> => {resource,<<"/">>,queue,<<"external_post.pserver.buffer.21600">>},<13177.30222.7276> => {resource,<<"/">>,queue,<<"external_post.pserver.buffer.1800">>},<13177.21900.7281> => {resource,<<"/">>,queue,<<"external_post.pserver.buffer.60">>},<13177.20623.7287> => {resource,<<"/">>,queue,<<"external_post.pserver.buffer.300">>},<13176.5786.7522> => {resource,<<"/">>,queue,<<"external_post.pserver.buffer.86400">>}},{state,{dict,22,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[[<13176.5786.7522>|#Ref<0.3954076138.2975072257.38959>]],[[<13176.11262.1047>|#Ref<0.3954076138.3440640001.170174>]],[[<0.9031.979>|#Ref<0.3954076138.3437232129.151133>],[<0.7469.993>|#Ref<0.3954076138.3443785729.213884>],[<13176.31041.1054>|#Ref<0.3954076138.3443785729.213885>]],[[<13176.23408.1042>|#Ref<0.3954076138.3439591425.68177>]],[[<13177.19377.5948>|#Ref<0.3954076138.2975072257.38960>],[<13177.30222.7276>|#Ref<0.3954076138.3437232129.151132>]],[],[[<0.26965.7469>|#Ref<0.3954076138.2975072257.38961>]],[[<13177.13164.7279>|#Ref<0.3954076138.3441164289.64232>],[<0.19707.982>|#Ref<0.3954076138.3441164289.64233>]],[[<13177.495.0>|#Ref<0.3954076138.2962227201.85404>],[<13177.3658.7278>|#Ref<0.3954076138.3439591425.68176>]],[[<0.1022.0>|#Ref<0.3954076138.2962227201.85405>],[<13176.1022.0>|#Ref<0.3954076138.2962227201.85406>],...],...}}},...},...}
** Reason for termination ==
** {{badmatch,true},[{rabbit_amqqueue,retry_wait,4,[{file,"src/rabbit_amqqueue.erl"},{line,509}]},{rabbit_channel,handle_method,5,[{file,"src/rabbit_channel.erl"},{line,2189}]},{rabbit_channel,handle_method,3,[{file,"src/rabbit_channel.erl"},{line,1390}]},{rabbit_channel,handle_cast,2,[{file,"src/rabbit_channel.erl"},{line,523}]},{gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1047}]},{proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,257}]}]}
2018-10-09 13:08:42.564 [error] <0.9900.277> Error on AMQP connection <0.9900.277> (127.0.0.1:41553 -> 127.0.0.1:5672, vhost: '/', user: 'rabbitUser', state: running), channel 1:
 {{badmatch,true},
 [{rabbit_amqqueue,retry_wait,4,[{file,"src/rabbit_amqqueue.erl"},{line,509}]},
  {rabbit_channel,handle_method,5,
                  [{file,"src/rabbit_channel.erl"},{line,2189}]},
  {rabbit_channel,handle_method,3,
                  [{file,"src/rabbit_channel.erl"},{line,1390}]},
  {rabbit_channel,handle_cast,2,[{file,"src/rabbit_channel.erl"},{line,523}]},
  {gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1047}]},
  {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,257}]}]}
2018-10-09 13:08:42.565 [error] <0.10257.277> CRASH REPORT Process <0.10257.277> with 0 neighbours exited with reason: no match of right hand value true in rabbit_amqqueue:retry_wait/4 line 509 in gen_server2:terminate/3 line 1161
2018-10-09 13:08:42.565 [error] <0.8734.277> Error on AMQP connection <0.8734.277> (127.0.0.1:41551 -> 127.0.0.1:5672, vhost: '/', user: 'rabbitUser', state: running), channel 1:
 {{badmatch,true},
 [{rabbit_amqqueue,retry_wait,4,[{file,"src/rabbit_amqqueue.erl"},{line,509}]},
  {rabbit_channel,handle_method,5,
                  [{file,"src/rabbit_channel.erl"},{line,2189}]},
  {rabbit_channel,handle_method,3,
                  [{file,"src/rabbit_channel.erl"},{line,1390}]},
  {rabbit_channel,handle_cast,2,[{file,"src/rabbit_channel.erl"},{line,523}]},
  {gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1047}]},
  {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,257}]}]}
2018-10-09 13:08:42.566 [warning] <0.9900.277> Non-AMQP exit reason '{{badmatch,true},[{rabbit_amqqueue,retry_wait,4,[{file,"src/rabbit_amqqueue.erl"},{line,509}]},{rabbit_channel,handle_method,5,[{file,"src/rabbit_channel.erl"},{line,2189}]},{rabbit_channel,handle_method,3,[{file,"src/rabbit_channel.erl"},{line,1390}]},{rabbit_channel,handle_cast,2,[{file,"src/rabbit_channel.erl"},{line,523}]},{gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1047}]},{proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,257}]}]}'
2018-10-09 13:08:42.567 [warning] <0.8734.277> Non-AMQP exit reason '{{badmatch,true},[{rabbit_amqqueue,retry_wait,4,[{file,"src/rabbit_amqqueue.erl"},{line,509}]},{rabbit_channel,handle_method,5,[{file,"src/rabbit_channel.erl"},{line,2189}]},{rabbit_channel,handle_method,3,[{file,"src/rabbit_channel.erl"},{line,1390}]},{rabbit_channel,handle_cast,2,[{file,"src/rabbit_channel.erl"},{line,523}]},{gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1047}]},{proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,257}]}]}'
2018-10-09 13:08:42.567 [error] <0.10450.277> Supervisor {<0.10450.277>,rabbit_channel_sup} had child channel started with rabbit_channel:start_link(1, <0.9900.277>, <0.10485.277>, <0.9900.277>, <<"127.0.0.1:41553 -> 127.0.0.1:5672">>, rabbit_framing_amqp_0_9_1, {user,<<"rabbitUser">>,[],[{rabbit_auth_backend_internal,none}]}, <<"/">>, [{<<"authentication_failure_close">>,bool,true},{<<"publisher_confirms">>,bool,true},{<<"consumer...">>,...},...], <0.9571.277>, <0.10773.277>) at <0.10257.277> exit with reason no match of right hand value true in rabbit_amqqueue:retry_wait/4 line 509 in context child_terminated
2018-10-09 13:08:42.569 [error] <0.10450.277> Supervisor {<0.10450.277>,rabbit_channel_sup} had child channel started with rabbit_channel:start_link(1, <0.9900.277>, <0.10485.277>, <0.9900.277>, <<"127.0.0.1:41553 -> 127.0.0.1:5672">>, rabbit_framing_amqp_0_9_1, {user,<<"rabbitUser">>,[],[{rabbit_auth_backend_internal,none}]}, <<"/">>, [{<<"authentication_failure_close">>,bool,true},{<<"publisher_confirms">>,bool,true},{<<"consumer...">>,...},...], <0.9571.277>, <0.10773.277>) at <0.10257.277> exit with reason reached_max_restart_intensity in context shutdown
2018-10-09 13:08:42.572 [info] <0.8734.277> closing AMQP connection <0.8734.277> (127.0.0.1:41551 -> 127.0.0.1:5672, vhost: '/', user: 'rabbitUser')
2018-10-09 13:08:42.573 [info] <0.9900.277> closing AMQP connection <0.9900.277> (127.0.0.1:41553 -> 127.0.0.1:5672, vhost: '/', user: 'rabbitUser')
2018-10-09 13:08:42.573 [error] <0.22673.266> ** Generic server <0.22673.266> terminating
** Last message in was {'$gen_cast',{method,{'queue.declare',0,<<"aliveness-test">>,true,false,false,true,false,[]},none,noflow}}
** When Server state == {ch,running,rabbit_framing_amqp_0_9_1,1,<0.8734.277>,<0.9976.277>,<0.8734.277>,<<"127.0.0.1:41551 -> 127.0.0.1:5672">>,{lstate,<0.10739.277>,false},none,13679,{[{13678,<<"pserver_gd_queue_83213">>,{<13177.465.0>,2620258}}],[]},{user,<<"rabbitUser">>,[],[{rabbit_auth_backend_internal,none}]},<<"/">>,<<"aliveness-test">>,#{<13177.465.0> => {resource,<<"/">>,queue,<<"external_post.pserver">>},<13177.495.0> => {resource,<<"/">>,queue,<<"transaction_log_api">>},<0.29173.980> => {resource,<<"/">>,queue,<<"external_post.pserver.buffer.3600">>},<13177.30222.7276> => {resource,<<"/">>,queue,<<"external_post.pserver.buffer.1800">>},<13177.21900.7281> => {resource,<<"/">>,queue,<<"external_post.pserver.buffer.60">>},<13176.5786.7522> => {resource,<<"/">>,queue,<<"external_post.pserver.buffer.86400">>}},{state,{dict,16,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[[<13176.5786.7522>|#Ref<0.3954076138.2972975105.183604>]],[[<13176.11262.1047>|#Ref<0.3954076138.3438542849.146852>]],[[<0.9031.979>|#Ref<0.3954076138.3435659265.239296>]],[[<13176.23408.1042>|#Ref<0.3954076138.3435134977.25797>]],[[<13177.19377.5948>|#Ref<0.3954076138.2972975105.183605>],[<13177.30222.7276>|#Ref<0.3954076138.3435659265.239295>]],[],[[<0.26965.7469>|#Ref<0.3954076138.2972975105.183606>]],[],[[<13177.495.0>|#Ref<0.3954076138.2962227201.86726>],[<13177.3658.7278>|#Ref<0.3954076138.3435134977.25796>]],[[<0.1022.0>|#Ref<0.3954076138.2962227201.86727>],[<13176.1022.0>|#Ref<0.3954076138.2962227201.86728>],[<13176.3986.1041>|#Ref<0.3954076138.3435659265.239297>],[<13177.21900.7281>|#Ref<0.3954076138.3438542849.146850>]],[[<13177.465.0>|#Ref<0.3954076138.2962227201.67565>]],[[<0.18945.985>|#Ref<0.3954076138.3438542849.146851>]],[],[],[[<0.29173.980>|#Ref<0.3954076138.3435134977.25795>]],[]}}},erlang},#{<<"pserver_gd_queue_83213">> => {{amqqueue,{resource,<<"/">>,...},...},...}},...}
** Reason for termination ==
** {{badmatch,true},[{rabbit_amqqueue,retry_wait,4,[{file,"src/rabbit_amqqueue.erl"},{line,509}]},{rabbit_channel,handle_method,5,[{file,"src/rabbit_channel.erl"},{line,2189}]},{rabbit_channel,handle_method,3,[{file,"src/rabbit_channel.erl"},{line,1390}]},{rabbit_channel,handle_cast,2,[{file,"src/rabbit_channel.erl"},{line,523}]},{gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1047}]},{proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,257}]}]}
2018-10-09 13:08:42.574 [error] <0.22673.266> CRASH REPORT Process <0.22673.266> with 0 neighbours exited with reason: no match of right hand value true in rabbit_amqqueue:retry_wait/4 line 509 in gen_server2:terminate/3 line 1161
2018-10-09 13:08:42.575 [error] <0.10198.277> Supervisor {<0.10198.277>,rabbit_channel_sup} had child channel started with rabbit_channel:start_link(1, <0.8734.277>, <0.9976.277>, <0.8734.277>, <<"127.0.0.1:41551 -> 127.0.0.1:5672">>, rabbit_framing_amqp_0_9_1, {user,<<"rabbitUser">>,[],[{rabbit_auth_backend_internal,none}]}, <<"/">>, [{<<"authentication_failure_close">>,bool,true},{<<"publisher_confirms">>,bool,true},{<<"consumer...">>,...},...], <0.10170.277>, <0.10739.277>) at <0.22673.266> exit with reason no match of right hand value true in rabbit_amqqueue:retry_wait/4 line 509 in context child_terminated
2018-10-09 13:08:42.575 [error] <0.10198.277> Supervisor {<0.10198.277>,rabbit_channel_sup} had child channel started with rabbit_channel:start_link(1, <0.8734.277>, <0.9976.277>, <0.8734.277>, <<"127.0.0.1:41551 -> 127.0.0.1:5672">>, rabbit_framing_amqp_0_9_1, {user,<<"rabbitUser">>,[],[{rabbit_auth_backend_internal,none}]}, <<"/">>, [{<<"authentication_failure_close">>,bool,true},{<<"publisher_confirms">>,bool,true},{<<"consumer...">>,...},...], <0.10170.277>, <0.10739.277>) at <0.22673.266> exit with reason reached_max_restart_intensity in context shutdown
2018-10-09 13:08:42.653 [error] <0.499.0> Partial partition detected:
 * We saw DOWN from rabbit@rab-prod-node2
 * We can still see rabbit@rab-prod-node3 which can see rabbit@rab-prod-node2
We will therefore intentionally disconnect from rabbit@rab-prod-node3


Logs from node2:

2018-10-09 13:08:42.513 [info] <0.488.0> rabbit on node 'rabbit@rab-prod-node1' down
2018-10-09 13:08:42.547 [info] <0.488.0> Keeping rabbit@rab-prod-node1 listeners: the node is already back
2018-10-09 13:08:42.579 [error] <0.190.0> Mnesia('rabbit@rab-prod-node2'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rabbit@rab-prod-node1'}
2018-10-09 13:08:42.590 [info] <0.488.0> node 'rabbit@rab-prod-node1' down: connection_closed
2018-10-09 13:08:42.598 [info] <0.488.0> node 'rabbit@rab-prod-node1' up
2018-10-09 13:08:42.607 [error] <0.27234.346> Error on AMQP connection <0.27234.346> (127.0.0.1:42246 -> 127.0.0.1:5672, vhost: '/', user: 'rabbitUser', state: running), channel 1:
 {{badmatch,true},
 [{rabbit_amqqueue,retry_wait,4,[{file,"src/rabbit_amqqueue.erl"},{line,509}]},
  {rabbit_channel,handle_method,5,
                  [{file,"src/rabbit_channel.erl"},{line,2129}]},
  {rabbit_channel,handle_method,3,
                  [{file,"src/rabbit_channel.erl"},{line,1390}]},
  {rabbit_channel,handle_cast,2,[{file,"src/rabbit_channel.erl"},{line,523}]},
  {gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1047}]},
  {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,257}]}]}
2018-10-09 13:08:42.608 [error] <0.5771.344> ** Generic server <0.5771.344> terminating
** Last message in was {'$gen_cast',{method,{'queue.declare',0,<<"external_post.setrans.buffer.60">>,false,true,false,false,false,[{<<"x-dead-letter-exchange">>,longstr,<<"external_post">>},{<<"x-expires">>,signedint,1860000},{<<"x-message-ttl">>,signedint,60000}]},none,noflow}}
** When Server state == {ch,running,rabbit_framing_amqp_0_9_1,1,<0.27234.346>,<0.28199.346>,<0.27234.346>,<<"127.0.0.1:42246 -> 127.0.0.1:5672">>,{lstate,<0.27689.346>,false},none,1192,{[{1191,<<"setrans_gd_queue_5ed68">>,{<13347.558.0>,58100}}],[]},{user,<<"rabbitUser">>,[],[{rabbit_auth_backend_internal,none}]},<<"/">>,<<"aliveness-test">>,#{<13347.495.0> => {resource,<<"/">>,queue,<<"transaction_log_api">>},<13347.558.0> => {resource,<<"/">>,queue,<<"external_post.setrans">>},<8176.2970.993> => {resource,<<"/">>,queue,<<"external_post.setrans.buffer.60">>}},{state,{dict,7,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[[<13347.558.0>|#Ref<0.3382570487.3613655041.14490>]],[[<13347.16266.7287>|#Ref<0.3382570487.4175167489.90875>]],[],[],[],[],[],[],[[<13347.495.0>|#Ref<0.3382570487.3613655041.60467>]],[[<8176.1022.0>|#Ref<0.3382570487.3613655041.60468>],[<0.1022.0>|#Ref<0.3382570487.3613655041.60469>]],[],[],[[<0.25338.1054>|#Ref<0.3382570487.4175167489.90876>]],[[<8176.2970.993>|#Ref<0.3382570487.4175167489.90874>]],[],[]}}},erlang},#{<<"setrans_gd_queue_5ed68">> => {{amqqueue,{resource,<<"/">>,queue,<<"external_post.setrans">>},true,false,none,[],<13347.558.0>,[<8176.1220.0>,<0.1261.0>],[<0.1261.0>,<8176.1220.0>],['rabbit@rab-prod-node1','rabbit@rab-prod-node2'],[{vhost,<<"/">>},{name,<<"ha-all">>},{pattern,<<"^">>},{'apply-to',<<"all">>},{definition,[{<<"ha-mode">>,<<"all">>},{<<"ha-sync-mode">>,<<"automatic">>}]},{priority,0}],undefined,[{<0.1262.0>,<0.1261.0>},{<8176.1222.0>,<8176.1220.0>},{<13347.678.0>,<13347.558.0>}],[],live,0,[],<<"/">>,#{user => <<"test">>}},{false,10,false,[]}}},#{<13347.558.0> => {1,{<<"setrans_gd_queue_5ed68">>,nil,nil}}},{set,1,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[<13347.558.0>],[],[],[],[],[],[],[],...}}},...}
** Reason for termination ==
** {{badmatch,true},[{rabbit_amqqueue,retry_wait,4,[{file,"src/rabbit_amqqueue.erl"},{line,509}]},{rabbit_channel,handle_method,5,[{file,"src/rabbit_channel.erl"},{line,2129}]},{rabbit_channel,handle_method,3,[{file,"src/rabbit_channel.erl"},{line,1390}]},{rabbit_channel,handle_cast,2,[{file,"src/rabbit_channel.erl"},{line,523}]},{gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1047}]},{proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,257}]}]}
2018-10-09 13:08:42.609 [warning] <0.27234.346> Non-AMQP exit reason '{{badmatch,true},[{rabbit_amqqueue,retry_wait,4,[{file,"src/rabbit_amqqueue.erl"},{line,509}]},{rabbit_channel,handle_method,5,[{file,"src/rabbit_channel.erl"},{line,2129}]},{rabbit_channel,handle_method,3,[{file,"src/rabbit_channel.erl"},{line,1390}]},{rabbit_channel,handle_cast,2,[{file,"src/rabbit_channel.erl"},{line,523}]},{gen_server2,handle_msg,2,[{file,"src/gen_server2.erl"},{line,1047}]},{proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,257}]}]}'
2018-10-09 13:08:42.609 [info] <0.2376.1056> closing AMQP connection <0.2376.1056> (10.0.15.57:50495 -> 10.0.15.116:5672, vhost: '/', user: 'rabbitUser')
2018-10-09 13:08:42.609 [info] <0.537.1056> closing AMQP connection <0.537.1056> (10.0.15.57:50443 -> 10.0.15.116:5672, vhost: '/', user: 'rabbitUser')
2018-10-09 13:08:42.610 [error] <0.5771.344> CRASH REPORT Process <0.5771.344> with 0 neighbours exited with reason: no match of right hand value true in rabbit_amqqueue:retry_wait/4 line 509 in gen_server2:terminate/3 line 1161
2018-10-09 13:08:42.611 [info] <0.27234.346> closing AMQP connection <0.27234.346> (127.0.0.1:42246 -> 127.0.0.1:5672, vhost: '/', user: 'rabbitUser')
2018-10-09 13:08:42.611 [error] <0.27757.346> Supervisor {<0.27757.346>,rabbit_channel_sup} had child channel started with rabbit_channel:start_link(1, <0.27234.346>, <0.28199.346>, <0.27234.346>, <<"127.0.0.1:42246 -> 127.0.0.1:5672">>, rabbit_framing_amqp_0_9_1, {user,<<"rabbitUser">>,[],[{rabbit_auth_backend_internal,none}]}, <<"/">>, [{<<"authentication_failure_close">>,bool,true},{<<"publisher_confirms">>,bool,true},{<<"consumer...">>,...},...], <0.27034.346>, <0.27689.346>) at <0.5771.344> exit with reason no match of right hand value true in rabbit_amqqueue:retry_wait/4 line 509 in context child_terminated
2018-10-09 13:08:42.612 [error] <0.27757.346> Supervisor {<0.27757.346>,rabbit_channel_sup} had child channel started with rabbit_channel:start_link(1, <0.27234.346>, <0.28199.346>, <0.27234.346>, <<"127.0.0.1:42246 -> 127.0.0.1:5672">>, rabbit_framing_amqp_0_9_1, {user,<<"rabbitUser">>,[],[{rabbit_auth_backend_internal,none}]}, <<"/">>, [{<<"authentication_failure_close">>,bool,true},{<<"publisher_confirms">>,bool,true},{<<"consumer...">>,...},...], <0.27034.346>, <0.27689.346>) at <0.5771.344> exit with reason reached_max_restart_intensity in context shutdown
2018-10-09 13:08:42.691 [error] <0.488.0> Partial partition detected:
 * We saw DOWN from rabbit@rab-prod-node1
 * We can still see rabbit@rab-prod-node3 which can see rabbit@rab-prod-node1
We will therefore intentionally disconnect from rabbit@rab-prod-node3


Luke Bakken

unread,
Oct 15, 2018, 3:41:18 PM10/15/18
to rabbitmq-users
Hi Antonio,

Could you give us a bit more information -

* What version of RabbitMQ, Erlang and on what operating system?

* What partition handling scheme you are using? ignore, autoheal or pause_minority?

* Did you have to manually intervene in this case or did the cluster fix itself?

The {badmatch, true} lines (and associated lines) can be ignored as they are a symptom of the partition and not the cause.

If the load on node2 were sufficient to cause a timeout on the distributed Erlang connection, this could be the result. Do you monitor this cluster?

Thanks,
Luke

Antonio Kang

unread,
Oct 15, 2018, 4:00:30 PM10/15/18
to rabbitmq-users
Hi Luke,

Sorry about that, not giving you the extra information initially.

* What version of RabbitMQ, Erlang and on what operating system?

RabbitMQ: 3.7.3-1
Erlang: 20.2-1
OS: Ubuntu 14.04.5

* What partition handling scheme you are using? ignore, autoheal or pause_minority?

Ignore

* Did you have to manually intervene in this case or did the cluster fix itself?

I had to manually stop/start rabbitmq on the nodes as all 3 nodes were in separate clusters.

* If the load on node2 were sufficient to cause a timeout on the distributed Erlang connection, this could be the result. Do you monitor this cluster?

Yes, we do monitor the cluster and i have attached three screen shots that show the metrics of the 3 nodes.

As you can see in the graphs, the load/usage seem to be fine around ~1PM EST which is when the partition happened.
node3.png
node2.png
node1.png

Luke Bakken

unread,
Oct 15, 2018, 4:40:14 PM10/15/18
to rabbitmq-users
Hi Antonio,

Seems like a network event, then. The load looks fine and I assume metrics that show message rate don't show an increase at the time of the event.

AWS may not consider events that delay TCP data to be "outages", but the effect is the same. One way to keep an eye on this would be to monitor the TCP send queue value for the distributed Erlang connection.

* Most likely, RabbitMQ is using port 25672 for distributed Erlang. You can use the epmd -names command to verify.

* Then, check out /proc/net/tcp - the value you want to monitor is tx_queue:

$ head -n5 /proc/net/tcp
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode                                                                                                                                                                                
   0: 00000000:6448 00000000:0000 0A 00000000:00000000 00:00000000 00000000  1000        0 724590 1 00000000842d3bb4 99 0 0 10 0                     
   1: 00000000:6449 00000000:0000 0A 00000000:00000000 00:00000000 00000000  1000        0 792642 1 00000000e2f1b213 99 0 0 10 0                     
   2: 00000000:14EB 00000000:0000 0A 00000000:00000000 00:00000000 00000000   195        0 16660 1 00000000ee47d7ac 99 0 0 10 0                      
   3: 00000000:1111 00000000:0000 0A 00000000:00000000 00:00000000 00000000  1000        0 728382 1 00000000199e7354 99 0 0 10 0     

* Convert 25672 to hex:

$ perl -e 'printf("0x%X\n", 25672)'                                                                                                                                                                                            0x6448

* Monitor those values however you'd like - the following outputs tx_queue (send) and rx_queue (receive) queue values that could be entered into a monitoring system:

$ awk '/6448/ { print $5 }' /proc/net/tcp
00000000:00000000
00000000:00000000
00000000:00000000

If you ever see tx_queue increase over the span of a minute, you can be pretty sure that a partition will happen soon.

Thanks,
Luke

Antonio Kang

unread,
Oct 17, 2018, 3:09:04 PM10/17/18
to rabbitmq-users
Thanks for the suggestions Luke.

I will give this a try and see what we see from monitoring them.

When we start monitoring and we see tx_queue increase over the span of a minute, do you have any suggestions on what steps we can take to prevent the partition or any loss of mesages?

Antonio Kang

unread,
Dec 3, 2018, 4:47:29 PM12/3/18
to rabbitmq-users
Hi Luke,

I finally got around to work on getting this monitored but i have a couple questions.

You said:

"If you ever see tx_queue increase over the span of a minute, you can be pretty sure that a partition will happen soon."

Could you elaborate a bit more on what you mean by "increase over a span of a minute"? I am not sure how it can be increased but should I monitor for any increase, even if it goes up by 1?

Also, If we do see that it that it has a increase over the span of a minute, can we do anything to prevent the network partition from actually occurring? Or is this strictly a warning to let us know that it will happen and that we should be ready to recover the network partition?

Best,

Michael Klishin

unread,
Dec 3, 2018, 6:05:03 PM12/3/18
to rabbitm...@googlegroups.com
It's a metric, not a value you can increase. If it keeps growing monotonically for a minute
then you can be sure transactions are not coming through and then [1] will kick in.

You can bump the nettick timeout if you have evidence that the system catches up after more than a minute.


--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-user...@googlegroups.com.
To post to this group, send email to rabbitm...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Antonio Kang

unread,
Dec 4, 2018, 5:38:13 PM12/4/18
to rabbitmq-users
Hi Michael,

Thanks for you reply. I was reading through the net tick page you provided and was wondering if you could clarify/explain one thing that the page did not explain.

I was wondering what are some drawbacks of setting the net tick timeout higher than the default value (in our case, 60)? Even if the evidence does not show that the system will catch up, how bad would it be for us to increase it?

Best,

Michael Klishin

unread,
Dec 4, 2018, 6:58:35 PM12/4/18
to rabbitm...@googlegroups.com
It simply increases the window of time in which connection can be disrupted but neither end of it notices.
There are no other downsides.
Reply all
Reply to author
Forward
0 new messages