Shovel does not restart after heartbeat timeout

250 views
Skip to first unread message

raphael....@liksi.fr

unread,
Apr 24, 2019, 7:24:11 AM4/24/19
to rabbitmq-users
Hi,

I have dynamic shovel between 2 nodes in cluster to 2 other nodes in an other cluster (10.12.1.1 + 10.12.1.2 are in cluster, and 10.12.1.3 + 10.12.1.4 are in cluster).
From time to time, due to network issue, the link disconnects, manages to reconnect several times, and then disconnect again, but does not retry to reconnect anymore.
When this happens, it seems the Shovel tries to connect from one node to an other node in the same cluster (10.12.1.1 -> 10.12.1.2), but Shovel is not configured like this.

To restart it, we need to shutdown the node where the shovel link appears in "terminated" state.

Any ideas to avoid this ?

Thanks.


2019-04-23 00:20:49.701 [error] <0.3620.14> ** Generic server <0.3620.14> terminating 
** Last message in was heartbeat_timeout
** When Server state == {state,amqp_network_connection,{state,#Port<0.5396885>,<<"client 10.12.1.1:42699 -> 10.12.1.3:5672">>,10,<0.3623.14>,131072,<0.3619.14>,undefined,false},<0.3622.14>,{amqp_params_network,<<"POLUSER">>,<<"***">>,<<"/">>,"POL1",5672,2047,0,10,60000,none,[#Fun<amqp_uri.12.90191702>,#Fun<amqp_uri.12.90191702>],[{<<"connection_name">>,longstr,<<"Shovel POLtoCEN">>}],[]},2047,[{<<"capabilities">>,table,[{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,bool,true},{<<"consumer_cancel_notify">>,bool,true},{<<"connection.blocked">>,bool,true},{<<"consumer_priorities">>,bool,true},{<<"authentication_failure_close">>,bool,true},{<<"per_consumer_qos">>,bool,true},{<<"direct_reply_to">>,bool,true}]},{<<"cluster_name">>,longstr,<<"POLPRD">>},{<<"copyright">>,longstr,<<"Copyright (C) 2007-2018 Pivotal Software, Inc.">>},{<<"information">>,longstr,<<"Licensed under the MPL.  See http://www.rabbitmq.com/">>},{<<"platform">>,longstr,<<"Erlang/OTP 20.3">>},{<<"product">>,longstr,<<"RabbitMQ">>},{<<"version">>,longstr,<<"3.7.5">>}],none,false}
** Reason for termination == 
** heartbeat_timeout
2019-04-23 00:20:49.701 [error] <0.3620.14> CRASH REPORT Process <0.3620.14> with 0 neighbours exited with reason: heartbeat_timeout in gen_server:handle_common_reply/8 line 726
2019-04-23 00:20:49.702 [error] <0.3618.14> Supervisor {<0.3618.14>,amqp_connection_sup} had child connection started with amqp_gen_connection:start_link(<0.3619.14>, {amqp_params_network,<<"POLUSER">>,<<"***">>,<<"/">>,"POL1",5672,2047,...}) at <0.3620.14> exit with reason heartbeat_timeout in context child_terminated
2019-04-23 00:20:49.702 [error] <0.3618.14> Supervisor {<0.3618.14>,amqp_connection_sup} had child connection started with amqp_gen_connection:start_link(<0.3619.14>, {amqp_params_network,<<"POLUSER">>,<<"***">>,<<"/">>,"POL1",5672,2047,...}) at <0.3620.14> exit with reason reached_max_restart_intensity in context shutdown
2019-04-23 00:20:49.702 [info] <0.3617.14> terminating static worker with {shutdown,{gen_server,call,[<0.3629.14>,{subscribe,{'basic.consume',0,<<"central">>,<<>>,false,false,false,false,[]},<0.3617.14>},60000]}} 
2019-04-23 00:20:49.703 [error] <0.3633.14> ** Generic server <0.3633.14> terminating 
** Last message in was {'EXIT',<0.3617.14>,{shutdown,{gen_server,call,[<0.3629.14>,{subscribe,{'basic.consume',0,<<"central">>,<<>>,false,false,false,false,[]},<0.3617.14>},60000]}}}
** When Server state == {state,amqp_network_connection,{state,#Port<0.5397121>,<<"client 10.12.1.1:60993 -> 10.12.1.2:5672">>,10,<0.3636.14>,131072,<0.3632.14>,undefined,false},<0.3635.14>,{amqp_params_network,<<"CENUSER">>,<<"*******">>,<<"/">>,"CEN2",5672,2047,0,10,60000,none,[#Fun<amqp_uri.12.90191702>,#Fun<amqp_uri.12.90191702>],[{<<"connection_name">>,longstr,<<"Shovel POLtoCEN">>}],[]},2047,[{<<"capabilities">>,table,[{<<"publisher_confirms">>,bool,true},{<<"exchange_exchange_bindings">>,bool,true},{<<"basic.nack">>,bool,true},{<<"consumer_cancel_notify">>,bool,true},{<<"connection.blocked">>,bool,true},{<<"consumer_priorities">>,bool,true},{<<"authentication_failure_close">>,bool,true},{<<"per_consumer_qos">>,bool,true},{<<"direct_reply_to">>,bool,true}]},{<<"cluster_name">>,longstr,<<"CENPRD">>},{<<"copyright">>,longstr,<<"Copyright (C) 2007-2018 Pivotal Software, Inc.">>},{<<"information">>,longstr,<<"Licensed under the MPL.  See http://www.rabbitmq.com/">>},{<<"platform">>,longstr,<<"Erlang/OTP 20.3">>},{<<"product">>,longstr,<<"RabbitMQ">>},{<<"version">>,longstr,<<"3.7.5">>}],none,false}
** Reason for termination == 
** "stopping because dependent process <0.3617.14> died: {shutdown,\n                                                      {gen_server,call,\n                                                       [<0.3629.14>,\n                                                        {subscribe,\n                                                         {'basic.consume',0,\n                                                          <<\"central\">>,<<>>,\n                                                          false,false,false,\n                                                          false,[]},\n                                                         <0.3617.14>},\n                                                        60000]}}"
2019-04-23 00:20:49.703 [error] <0.3633.14> CRASH REPORT Process <0.3633.14> with 0 neighbours exited with reason: "stopping because dependent process <0.3617.14> died: {shutdown,\n                                                      {gen_server,call,\n                                                       [<0.3629.14>,\n                                                        {subscribe,\n                                                         {'basic.consume',0,\n                                                          <<\"central\">>,<<>>,\n                                                          false,fal..." in gen_server:handle_common_reply/8 line 726
2019-04-23 00:20:49.704 [error] <0.3631.14> Supervisor {<0.3631.14>,amqp_connection_sup} had child connection started with amqp_gen_connection:start_link(<0.3632.14>, {amqp_params_network,<<"CENUSER">>,<<"*******">>,<<"/">>,"CEN2",5672,2047,...}) at <0.3633.14> exit with reason "stopping because dependent process <0.3617.14> died: {shutdown,\n                                                      {gen_server,call,\n                                                       [<0.3629.14>,\n                                                        {subscribe,\n                                                         {'basic.consume',0,\n                                                          <<\"central\">>,<<>>,\n                                                          false,fal..." in context child_terminated
2019-04-23 00:20:49.704 [error] <0.3631.14> Supervisor {<0.3631.14>,amqp_connection_sup} had child connection started with amqp_gen_connection:start_link(<0.3632.14>, {amqp_params_network,<<"CENUSER">>,<<"*******">>,<<"/">>,"CEN2",5672,2047,...}) at <0.3633.14> exit with reason reached_max_restart_intensity in context shutdown

Michael Klishin

unread,
Apr 24, 2019, 8:01:02 AM4/24/19
to rabbitmq-users
I could not find or recall any Shovel reconnection issues in 3.7.5 or later versions.

This log snippet demonstrates a single missed heartbeat event. Consider posting the entire log as well as your Shovel's settings
(except for the credentials, of course).

--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-user...@googlegroups.com.
To post to this group, send email to rabbitm...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


--
MK

Staff Software Engineer, Pivotal/RabbitMQ
Message has been deleted

raphael....@liksi.fr

unread,
Apr 24, 2019, 8:57:43 AM4/24/19
to rabbitmq-users
Hi,

Please find attached full log of incident.

Please notice that we have 2 shovels links (CENtoPOL and POLtoCEN).
CENtoPOL managed to recover successfully by itself but the POLtoCEN is the one which is causing the issue (last log timestamp about it : 2019-04-23 00:20:49)

Thank you for your help.
Regards,
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitm...@googlegroups.com.

To post to this group, send email to rabbitm...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
rabbitmq_CEN1_ob.log

Michael Klishin

unread,
Apr 24, 2019, 3:18:27 PM4/24/19
to rabbitmq-users
According to this log there's a bunch of Erlang client connections (used by Shovel) that failed due to TCP socket closure and/or missed heartbeats.

I don't see any "giving up" events from Shovel in this snippet, and no failing reconnection attempts either.

We'd need to try to reproduce but when we do it, we will be using 3.7.14 and not 3.7.5.

To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-user...@googlegroups.com.

To post to this group, send email to rabbitm...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

raphael....@liksi.fr

unread,
Apr 25, 2019, 3:24:12 AM4/25/19
to rabbitmq-users
I really don't know how to reproduce since it happens only twice in 2 months, but still.
It's definitely due to network issue between both sites.

Michael Klishin

unread,
Apr 28, 2019, 7:14:08 PM4/28/19
to rabbitmq-users
We will try to reproduce with dropped traffic (iptables). That's all I can promise.

To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-user...@googlegroups.com.

To post to this group, send email to rabbitm...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Raphaël Delaporte

unread,
Jul 10, 2019, 12:14:42 PM7/10/19
to rabbitm...@googlegroups.com
Hi Michael,

We had again the issue but RabbitMQ version has been upgraded in 3.7.15 since last time.
Please find attached the log file.

Please also note that we are sure there is network outage during the issue, but I don't understand why Shovel is "terminating", and not retrying to connect.
Moreover, I have other Shovel links between same servers in the same way, they displayed errors as well during network outage, but managed to reconnect themselves.
Then I don't understand why from time to time, some of them don't reconnect automatically. This is very problematic issue.

Thank you.

---

Raphaël Delaporte • Director
Direct line: +33 6 21 44 04 36
37 boulevard Solférino • 35000 Rennes

Liksi


You received this message because you are subscribed to a topic in the Google Groups "rabbitmq-users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/rabbitmq-users/lPMLUmQstRw/unsubscribe.
To unsubscribe from this group and all its topics, send an email to rabbitmq-user...@googlegroups.com.
crash_shovel.log

Luke Bakken

unread,
Jul 10, 2019, 12:21:47 PM7/10/19
to rabbitmq-users
Hello,

You are probably running into this issue, which was fixed in version 3.7.16 - https://github.com/rabbitmq/rabbitmq-shovel/issues/60

Thanks -
Luke

Raphaël Delaporte

unread,
Jul 10, 2019, 1:12:43 PM7/10/19
to rabbitmq-users
Hi,

OK, seems we are lucky... let's update to 3.7.16.
I'll let you know if issue happens again.

Thanks,
Reply all
Reply to author
Forward
0 new messages