How to get the reason of net_tick_timeout

44 views
Skip to first unread message

정찬호 (chan)

unread,
Jun 25, 2024, 5:25:07 AM (10 days ago) Jun 25
to rabbitmq-users
Hello! RabbitMQ users

RabbitMQ cluster consisting of three nodes
Version Erlang v22.3.4.21 / RabbitMQ v3.7.20

Two weeks ago and a week ago, net_tick_timeout occurred in the RabbitMQ cluster.
So a partition network occurred and I solved the problem through a rabbitmq restart.
Server3 was divided from (server1, server2)

But I couldn't find the cause of net_tick_timeout.
I know that the tick_time setting is default 60 seconds
I don't think the communication between the servers was cut off for 60 seconds.
Because there was no problem even though there was another cluster process running on the same node.

It happened twice in two weeks
I think net_tick_timeout will happen in the near future
I think we need to find this cause.
But It's hard to get the reason of net_tick_timeout from RabbitMQ logs

I can't use tcp dump because it's an operation server.
And also can't change log level from 'info' to 'debug'
What kind of inspection process should I do?
I will attach the log below before net_tick_timeout.
If you have any doubts, please comment.
Thanks.

  • Server1 logs
2024-06-13 09:04:54.800 [error] <0.5020.163> Supervisor {<0.5020.163>,amqp_channel_sup_sup} had child channel_sup started with amqp_channel_sup:start_link(direct, <0.5015.163>, <<"<rab...@server1.3.5015.163>">>) at undefined exit with reason shutdown in context shutdown_error
2024-06-13 09:46:39.792 [error] <0.26064.163> Supervisor {<0.26064.163>,amqp_channel_sup_sup} had child channel_sup started with amqp_channel_sup:start_link(direct, <0.26060.163>, <<"<rab...@server1.3.26060.163>">>) at undefined exit with reason shutdown in context shutdown_error
2024-06-13 10:28:54.801 [error] <0.14455.164> Supervisor {<0.14455.164>,amqp_channel_sup_sup} had child channel_sup started with amqp_channel_sup:start_link(direct, <0.14452.164>, <<"<rab...@server1.3.14452.164>">>) at undefined exit with reason shutdown in context shutdown_error
2024-06-13 10:48:51.799 [error] <0.210.0> ** Node 'rabbit@server3' not responding **
** Removing (timedout) connection **
2024-06-13 10:48:51.799 [info] <0.420.0> rabbit on node 'rabbit@server3' down
2024-06-13 10:48:51.801 [warning] <0.868.0> Management delegate query returned errors:
[{<41700.670.0>,{exit,{nodedown,'rabbit@server3'},[]}}]
2024-06-13 10:48:51.806 [info] <0.420.0> Keeping rabbit@server3 listeners: the node is already back
2024-06-13 10:48:51.808 [error] <0.190.0> Mnesia('rabbit@server1'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rabbit@server3'}
2024-06-13 10:48:51.820 [info] <0.562.0> Mirrored queue 'QQ' in vhost '/': Slave <rab...@server1.3.562.0> saw deaths of mirrors <rab...@server3.2.559.0>
2024-06-13 10:48:51.821 [info] <0.580.0> Mirrored queue 'QQQ' in vhost '/': Slave <rab...@server1.3.580.0> saw deaths of mirrors <rab...@server3.2.575.0>
2024-06-13 10:48:51.821 [info] <0.602.0> Mirrored queue 'Q' in vhost '/': Slave <rab...@server1.3.602.0> saw deaths of mirrors <rab...@server3.2.595.0>
...
2024-06-13 10:48:51.959 [info] <0.420.0> node 'rabbit@server3' down: net_tick_timeout
2024-06-13 10:48:51.960 [info] <0.420.0> node 'rabbit@server3' up
2024-06-13 10:55:09.811 [error] <0.29854.164> Supervisor {<0.29854.164>,amqp_channel_sup_sup} had child channel_sup started with amqp_channel_sup:start_link(direct, <0.29862.164>, <<"<rab...@server1.3.29862.164>">>) at undefined exit with reason shutdown in context shutdown_error
2024-06-13 11:14:13.098 [info] <0.16581.165> accepting AMQP connection <0.16581.165> (123.456.789.6:7495 -> 22.2.22.51:5672)
2024-06-13 11:14:13.105 [info] <0.16581.165> Connection <0.16581.165> (123.456.789.6:7495 -> 22.2.22.51:5672) has a client-provided name: rabbitConnectionFactory#3ed82c0b:2
2024-06-13 11:14:13.114 [info] <0.16581.165> connection <0.16581.165> (123.456.789.6:7495 -> 22.2.22.51:5672 - rabbitConnectionFactory#3ed82c0b:2): user 'zzzprd.mq' authenticated and granted access to vhost '/'

  • Server2 logs
2024-06-13 10:48:47.077 [warning] <0.5598.92> Received a 'DOWN' message from 'rabbit@server1' but still can communicate with it
2024-06-13 10:48:47.077 [error] <0.418.0> Partial partition disconnect from rabbit@server3
2024-06-13 10:48:47.086 [error] <0.190.0> Mnesia('rabbit@server2'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rabbit@server3'}
2024-06-13 10:48:47.203 [info] <0.626.0> Mirrored queue 'QQQ-WAS4' in vhost '/': Slave <rab...@server2.3.626.0> saw deaths of mirrors <rab...@server3.2.619.0>
2024-06-13 10:48:47.203 [info] <0.626.0> Mirrored queue 'QQQ-WAS4' in vhost '/': Promoting slave <rab...@server2.3.626.0> to master
2024-06-13 10:48:47.208 [info] <0.590.0> Mirrored queue 'QQ-WAS5' in vhost '/': Promoting slave <rab...@server2.3.590.0> to master
2024-06-13 10:48:47.209 [info] <0.616.0> Mirrored queue 'QQ-WAS1' in vhost '/': Slave <rab...@server2.3.616.0> saw deaths of mirrors <rab...@server3.2.611.0>
2024-06-13 10:48:47.209 [info] <0.616.0> Mirrored queue 'QQ-WAS1' in vhost '/': Promoting slave <rab...@server2.3.616.0> to master
2024-06-13 10:48:48.080 [info] <0.418.0> rabbit on node 'rabbit@server3' down
2024-06-13 10:48:48.083 [info] <0.418.0> Keeping rabbit@server3 listeners: the node is already back
2024-06-13 10:48:48.085 [info] <0.418.0> node 'rabbit@server3' down: connection_closed
2024-06-13 10:48:48.085 [info] <0.418.0> node 'rabbit@server3' up
2024-06-13 10:48:51.822 [info] <0.560.0> Mirrored queue 'QQQ-WAS7' in vhost '/': Promoting slave <rab...@server2.3.560.0> to master
2024-06-13 10:48:51.942 [info] <0.526.0> Mirrored queue 'QQ-WAS8' in vhost '/': Promoting slave <rab...@server2.3.526.0> to master
2024-06-13 10:48:51.942 [info] <0.582.0> Mirrored queue 'QQ-WAS6' in vhost '/': Promoting slave <rab...@server2.3.582.0> to master
2024-06-13 11:14:13.128 [info] <0.418.0> rabbit on node 'rabbit@server3' down
2024-06-13 11:14:13.131 [info] <0.418.0> Keeping rabbit@server3 listeners: the node is already back
2024-06-13 11:14:13.131 [info] <0.11312.92> accepting AMQP connection <0.11312.92> (123.456.789.6:7388 -> 22.2.22.52:5672)
2024-06-13 11:14:13.147 [info] <0.11312.92> Connection <0.11312.92> (123.456.789.6:7388 -> 22.2.22.52:5672) has a client-provided name: rabbitConnectionFactory#77a9927c:1
2024-06-13 11:14:13.151 [info] <0.11312.92> connection <0.11312.92> (123.456.789.6:7388 -> 22.2.22.52:5672 - rabbitConnectionFactory#77a9927c:1): user '.prd.mq' authenticated and granted access to vhost '/'

  • Server3 logs
2024-06-13 10:48:47.065 [info] <0.417.0> rabbit on node 'rabbit@server1' down
2024-06-13 10:48:47.065 [error] <0.3241.0> ** Node 'rabbit@server1' not responding **
** Removing (timedout) connection **
2024-06-13 10:48:47.072 [info] <0.417.0> Node rabbit@server1 is down, deleting its listeners
2024-06-13 10:48:47.076 [info] <0.417.0> node 'rabbit@server1' down: net_tick_timeout
2024-06-13 10:48:47.077 [error] <0.417.0> Partial partition detected:
 * We saw DOWN from rabbit@server1
 * We can still see rabbit@server2 which can see rabbit@server1
We will therefore intentionally disconnect from rabbit@server2
2024-06-13 10:48:47.085 [error] <0.190.0> Mnesia('rabbit@server3'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rabbit@server2'}
2024-06-13 10:48:47.099 [error] <0.521.0> ** Generic server <0.521.0> terminating
** Last message in was {'$gen_cast',{ack,[980928],<0.30035.28>}}
** When Server state == {q,{amqqueue,{resource,<<"/">>,queue,<<"QQQINF">>},true,false,none,[],<0.521.0>,[<41700.522.0>,<41699.524.0>],[<41700.522.0>],['rabbit@server2'],[{vhost,<<"/">>},{name,<<"ha-all">>},{pattern,<<"^AAA">>},{'apply-to',<<"all">>},{definition,[{<<"ha-mode">>,<<"all">>}]},{priority,0}],undefined,[{<41699.525.0>,<41699.524.0>},{<41700.523.0>,<41700.522.0>},{<0.522.0>,<0.521.0>}],[],live,1,[],<<"/">>,#{user => <<"sa">>}},none,true,rabbit_mirror_queue_master,{state,{resource,<<"/">>,queue,<<"QQQINF">>},<0.522.0>,<0.2863.0>,rabbit_priority_queue,{passthrough,rabbit_variable_queue,{vqstate,{0,{[],[]}},{0,{[],[]}},{delta,undefined,0,0,undefined},{0,{[],[]}},{0,{[],[]}},981150,{0,nil},{0,nil},{0,nil},{qistate,"/data/rabbitmq/rabbit@server3/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/queues/EUW0SJXU1DJ2VGXME6PYB6QTF",{#{},[{segment,59,"/data/rabbitmq/rabbit@server3/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/queues/EUW0SJXU1DJ2VGXME6PYB6QTF/59.idx",{array,16384,0,undefined,{10000,{1000,1000,1000,1000,{100,{{undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined},{undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined},{undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined},{undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined},{undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined},{undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined},{undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined},{undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined},{undefined,undefined,undefined,undefined,undefined,undefined,undefined,...},...},...},...},...}},...}]},...},...}},...},...}
** Reason for termination ==
** {{badmatch,{vqstate,{0,{[],[]}},{0,{[],[]}},{delta,undefined,0,0,undefined},{0,{[],[]}},{0,{[],[]}},981150,{0,nil},{0,nil},{0,nil},{qistate,"/data/rabbitmq/rabbit@server3/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/queues/EUW0SJXU1DJ2VGXME6PYB6QTF",{#{},[{segment,59,"/data/rabbitmq/rabbit@server3/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/queues/EUW0SJXU1DJ2VGXME6PYB6QTF/59.idx",{array,16384,0,undefined,{10000,{1000,1000,1000,1000,{100,{{undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined},{undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined},{undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined},{undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined},{undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined},{undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined},{undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined},{undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined},{undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined},{undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined},10},{{undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined},{undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined},{undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined,undefined},{undefined,undefined,undefined,undefined,undefined,undefined,undefined,...},...},...},...},...}},...}]},...},...}},...}
2024-06-13 10:48:47.100 [error] <0.521.0> CRASH REPORT Process <0.521.0> with 2 neighbours exited with reason: no match of right hand value {vqstate,{0,{[],[]}},{0,{[],[]}},{delta,undefined,0,0,undefined},{0,{[],[]}},{0,{[],[]}},981150,{0,nil},{0,nil},{0,nil},{qistate,"/data/rabbitmq/rabbit@server3/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/queues/EUW0SJXU1DJ2VGXME6PYB6QTF",{#{},[{segment,59,"/data/rabbitmq/rabbit@server3/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/queues/EUW0SJXU1DJ2VGXME6PYB6QTF/59.idx",{array,16384,0,undefined,{10000,{1000,1000,1000,1000,{100,{{undefined,undefined,undefined,undefined,undefined,undefined,...},...},...},...},...}},...}]},...},...} in rabbit_priority_queue:ack/2 line 319 in gen_server2:terminate/3 line 1183
2024-06-13 10:48:47.100 [error] <0.3493.51> Restarting crashed queue 'QQQINF' in vhost '/'.
2024-06-13 10:48:47.100 [error] <0.520.0> Supervisor {<0.520.0>,rabbit_amqqueue_sup} had child rabbit_amqqueue started with rabbit_prequeue:start_link({amqqueue,{resource,<<"/">>,queue,<<"QQQINF">>},true,false,none,[],<41699.604.0>,[<41700.601.0>],...}, slave, <0.519.0>) at <0.521.0> exit with reason no match of right hand value {vqstate,{0,{[],[]}},{0,{[],[]}},{delta,undefined,0,0,undefined},{0,{[],[]}},{0,{[],[]}},981150,{0,nil},{0,nil},{0,nil},{qistate,"/data/rabbitmq/rabbit@server3/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/queues/EUW0SJXU1DJ2VGXME6PYB6QTF",{#{},[{segment,59,"/data/rabbitmq/rabbit@server3/msg_stores/vhosts/628WB79CIFDYO9LJI6DKMI09L/queues/EUW0SJXU1DJ2VGXME6PYB6QTF/59.idx",{array,16384,0,undefined,{10000,{1000,1000,1000,1000,{100,{{undefined,undefined,undefined,undefined,undefined,undefined,...},...},...},...},...}},...}]},...},...} in rabbit_priority_queue:ack/2 line 319 in context child_terminated
2024-06-13 10:48:47.199 [error] <0.3524.51> Restarting crashed queue 'QQQINF' in vhost '/'.
2024-06-13 10:48:47.199 [error] <0.3493.51> ** Generic server <0.3493.51> terminating
** Last message in was {'$gen_cast',init}
** When Server state == {q,{amqqueue,{resource,<<"/">>,queue,<<"QQQINF">>},true,false,none,[],<0.3493.51>,[<41700.522.0>,<41699.524.0>],[<41700.522.0>],['rabbit@server2'],[{vhost,<<"/">>},{name,<<"ha-all">>},{pattern,<<"^AAA">>},{'apply-to',<<"all">>},{definition,[{<<"ha-mode">>,<<"all">>}]},{priority,0}],undefined,[{<41699.525.0>,<41699.524.0>},{<41700.523.0>,<41700.522.0>},{<0.522.0>,<0.521.0>}],[],crashed,1,[],<<"/">>,#{user => <<"sa">>}},none,false,undefined,undefined,{state,{queue,[],[],0},{active,-575348530943411,1.0}},undefined,undefined,undefined,undefined,{state,fine,5000,undefined},{0,nil},undefined,undefined,undefined,{state,{dict,0,16,16,8,80,48,{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},delegate},undefined,undefined,undefined,undefined,'drop-head',0,0,running}
** Reason for termination ==
** {noproc,{gen_server2,call,[<0.3495.51>,get_gm,infinity]}}
2024-06-13 10:48:47.199 [error] <0.3493.51> CRASH REPORT Process <0.3493.51> with 0 neighbours exited with reason: no such process or port in call to gen_server2:call(<0.3495.51>, get_gm, infinity) in gen_server2:terminate/3 line 1183
2024-06-13 10:48:47.200 [error] <0.520.0> Supervisor {<0.520.0>,rabbit_amqqueue_sup} had child rabbit_amqqueue started with rabbit_prequeue:start_link({amqqueue,{resource,<<"/">>,queue,<<"QQQINF">>},true,false,none,[],<41699.604.0>,[<41700.601.0>],...}, slave, <0.519.0>) at <0.3493.51> exit with reason no such process or port in call to gen_server2:call(<0.3495.51>, get_gm, infinity) in context child_terminated
2024-06-13 10:48:47.270 [info] <0.2836.0> Mirrored queue 'QQQQz-WAS6' in vhost '/': Master <rab...@server3.2.579.0> saw deaths of mirrors <rab...@server1.3.584.0> <rab...@server2.3.582.0>
2024-06-13 10:48:47.273 [info] <0.2873.0> Mirrored queue 'QQQQz-WAS3' in vhost '/': Master <rab...@server3.2.607.0> saw deaths of mirrors <rab...@server1.3.616.0> <rab...@server2.3.612.0>
2024-06-13 10:48:47.274 [info] <0.2901.0> Mirrored queue 'QQQQPOLL-WAS5' in vhost '/': Master <rab...@server3.2.571.0> saw deaths of mirrors <rab...@server1.3.576.0> <rab...@server2.3.574.0>
2024-06-13 10:48:47.274 [info] <0.2838.0> Mirrored queue 'QQQQPOLL-WAS3' in vhost '/': Master <rab...@server3.2.623.0> saw deaths of mirrors <rab...@server1.3.632.0> <rab...@server2.3.630.0>
2024-06-13 10:48:47.274 [info] <0.2822.0> Mirrored queue 'QQQQz-WAS5' in vhost '/': Master <rab...@server3.2.587.0> saw deaths of mirrors <rab...@server1.3.594.0> <rab...@server2.3.590.0>
2024-06-13 10:48:47.287 [warning] <0.3613.51> global: 'rabbit@server3' failed to connect to 'rabbit@server1'
2024-06-13 10:48:48.078 [info] <0.417.0> rabbit on node 'rabbit@server2' down
2024-06-13 10:48:48.080 [info] <0.417.0> Keeping rabbit@server2 listeners: the node is already back
2024-06-13 10:48:48.081 [info] <0.417.0> node 'rabbit@server2' down: disconnect
2024-06-13 10:48:48.081 [info] <0.417.0> node 'rabbit@server2' up
2024-06-13 10:48:51.806 [info] <0.417.0> node 'rabbit@server1' up
2024-06-13 10:48:51.807 [error] <0.190.0> Mnesia('rabbit@server3'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rabbit@server1'}
2024-06-13 10:49:50.190 [error] <0.8639.29> Channel error on connection <0.30027.28> (123.456.789.6:7273 -> 22.2.22.53:5672, vhost: '/', user: 'AAA.prd.mq'), channel 2:
operation queue.declare caused a channel exception not_found: queue 'QQQINF' in vhost '/' process is stopped by supervisor
2024-06-13 10:50:52.255 [error] <0.3737.51> Channel error on connection <0.30027.28> (123.456.789.6:7273 -> 22.2.22.53:5672, vhost: '/', user: 'AAA.prd.mq'), channel 2:
operation queue.declare caused a channel exception not_found: queue 'QQQINF' in vhost '/' process is stopped by supervisor
2024-06-13 10:51:54.325 [error] <0.3843.51> Channel error on connection <0.30027.28> (123.456.789.6:7273 -> 22.2.22.53:5672, vhost: '/', user: 'AAA.prd.mq'), channel 2:
operation queue.declare caused a channel exception not_found: queue 'QQQINF' in vhost '/' process is stopped by supervisor
2024-06-13 10:52:56.379 [error] <0.3938.51> Channel error on connection <0.30027.28> (123.456.789.6:7273 -> 22.2.22.53:5672, vhost: '/', user: 'AAA.prd.mq'), channel 2:
operation queue.declare caused a channel exception not_found: queue 'QQQINF' in vhost '/' process is stopped by supervisor
2024-06-13 11:14:12.995 [info] <0.6304.51> RabbitMQ is asked to stop...
2024-06-13 11:14:13.070 [info] <0.6304.51> Stopping RabbitMQ applications and their dependencies in the following order:
    rabbitmq_management
    amqp_client
    rabbitmq_web_dispatch
    cowboy
    cowlib
    rabbitmq_management_agent
    rabbit
.

jo...@cloudamqp.com

unread,
Jun 25, 2024, 3:34:57 PM (10 days ago) Jun 25
to rabbitmq-users
Hi,

Your best bet is to update to RabbitMQ 3.13.3 and Erlang 26. And start monitoring with Prometheus/Grafana, you can then use the "Erlang-Distribution" dashboard to monitor what is happening on the links between your nodes. https://grafana.com/grafana/dashboards/11352-erlang-distribution/

/Johan

정찬호 (chan)

unread,
Jun 26, 2024, 2:38:35 AM (9 days ago) Jun 26
to rabbitmq-users
Thanks for your opinion, I'm also considering upgrading. 
Then it must be hard to find the cause of net_tick_timeout that has already occurred
2024년 6월 26일 수요일 오전 4시 34분 57초 UTC+9에 jo...@cloudamqp.com님이 작성:
Reply all
Reply to author
Forward
0 new messages