RabbitMQ AWS Peer Discovery timeout issues

Fenil Sevak

Oct 23, 2018, 12:56:26 PM

to rabbitmq-users

I'm trying to automate RabbitMQ 3.7.7 (Erlang 21.0.1) clustering on Windows EC2 instances in us-east-1.

Peer discovery calls to the AWS EC2 and Auto Scaling APIs succeed only intermittently; the rest of the time they fail with timeout errors. The timeouts can occur at either of the two(?) steps that pull from the AWS APIs. The log snippets below show an example of each.

Failure on first call:

2018-10-23 15:37:58.991 [debug] <0.226.0> Started rabbitmq_aws
2018-10-23 15:37:58.991 [debug] <0.226.0> Will use AWS access key of 'undefined'
2018-10-23 15:37:58.991 [debug] <0.226.0> Setting AWS region to "us-east-1"
2018-10-23 15:37:58.991 [debug] <0.226.0> Fetched EC2 instance ID from "http://169.254.169.254/latest/meta-data/instance-id": "i-0d9e6c08369328bca"
2018-10-23 15:38:04.006 [error] <0.225.0> CRASH REPORT Process <0.225.0> with 0 neighbours exited with reason: {timeout,{gen_server,call,[rabbitmq_aws,{request,"autoscaling",get,[],"/?Action=DescribeAutoScalingInstances&Version=2011-01-01",[],[],undefined}]}} in application_master:init/4 line 138
2018-10-23 15:38:04.006 [info] <0.42.0> Application rabbit exited with reason: {timeout,{gen_server,call,[rabbitmq_aws,{request,"autoscaling",get,[],"/?Action=DescribeAutoScalingInstances&Version=2011-01-01",[],[],undefined}]}}
2018-10-23 15:38:08.053 [error] <0.163.0> Supervisor httpc_handler_sup had child undefined started with httpc_handler:start_link() at undefined exit with reason killed in context shutdown_error

Failure on second call:

2018-10-23 15:39:26.069 [debug] <0.226.0> Started rabbitmq_aws
2018-10-23 15:39:26.069 [debug] <0.226.0> Will use AWS access key of 'undefined'
2018-10-23 15:39:26.069 [debug] <0.226.0> Setting AWS region to "us-east-1"
2018-10-23 15:39:26.069 [debug] <0.226.0> Fetched EC2 instance ID from "http://169.254.169.254/latest/meta-data/instance-id": "i-0d9e6c08369328bca"
2018-10-23 15:39:26.303 [debug] <0.226.0> AWS request: /?Action=DescribeAutoScalingInstances&Version=2011-01-01
Response: [{"DescribeAutoScalingInstancesResponse",[{"DescribeAutoScalingInstancesResult",[{"AutoScalingInstances",[{"member",[{"LaunchConfigurationName","Test-Engine-55-EC2-EngineServerLaunchConfig-7P34WGKE6G0D"},{"LifecycleState","InService"},{"AutoScalingGroupName","Test-Engine-55-EC2-EngineServerGroup-10LP8ZK0EVAJY"},{"InstanceId","i-027e1350f37c8d97a"},{"HealthStatus","HEALTHY"},{"ProtectedFromScaleIn","false"},{"AvailabilityZone","us-east-1d"}]},{"member",[{"LaunchConfigurationName","Test-WebApi-00-ASG-AppServerLaunchConfig-14OFLSAVRIO7Q"},{"LifecycleState","InService"},{"AutoScalingGroupName","Test-WebApi-00-ASG-ServerGroup-1LDPVX8WFDECK"},{"InstanceId","i-056a255f977ebac77"},{"HealthStatus","HEALTHY"},{"ProtectedFromScaleIn","false"},{"AvailabilityZone","us-east-1b"}]},{"member",[{"LaunchConfigurationName","rabbit-7.7-nameFix"},{"LifecycleState","InService"},{"AutoScalingGroupName","Test-Rabbit"},{"InstanceId","i-0ad95dd5e5a0e70d8"},{"HealthStatus","HEALTHY"},{"ProtectedFromScaleIn","false"},{"AvailabilityZone","us-east-1b"}]},{"member",[{"LaunchConfigurationName","rabbit-7.7-nameFix"},{"LifecycleState","InService"},{"AutoScalingGroupName","Test-Rabbit"},{"InstanceId","i-0d9e6c08369328bca"},{"HealthStatus","HEALTHY"},{"ProtectedFromScaleIn","false"},{"AvailabilityZone","us-east-1d"}]}]}]},{"ResponseMetadata",[{"RequestId","d2743b16-d6d9-11e8-9e2f-d569972dfe74"}]}]}]
2018-10-23 15:39:26.303 [debug] <0.226.0> Performing autoscaling group discovery, group: "Test-Rabbit"
2018-10-23 15:39:26.303 [debug] <0.226.0> Performing autoscaling group discovery, found instances: ["i-0d9e6c08369328bca","i-0ad95dd5e5a0e70d8"]
2018-10-23 15:39:31.335 [error] <0.225.0> CRASH REPORT Process <0.225.0> with 0 neighbours exited with reason: {timeout,{gen_server,call,[rabbitmq_aws,{request,"ec2",get,[],"/?Action=DescribeInstances&InstanceId.3=i-0d9e6c08369328bca&InstanceId.4=i-0ad95dd5e5a0e70d8&Version=2015-10-01",[],[],undefined}]}} in application_master:init/4 line 138
2018-10-23 15:39:31.335 [info] <0.42.0> Application rabbit exited with reason: {timeout,{gen_server,call,[rabbitmq_aws,{request,"ec2",get,[],"/?Action=DescribeInstances&InstanceId.3=i-0d9e6c08369328bca&InstanceId.4=i-0ad95dd5e5a0e70d8&Version=2015-10-01",[],[],undefined}]}}
2018-10-23 15:39:35.381 [error] <0.163.0> Supervisor httpc_handler_sup had child undefined started with httpc_handler:start_link() at undefined exit with reason killed in context shutdown_error
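
For anyone triaging similar logs, here is a minimal Python sketch for pulling out which AWS call timed out from the crash report lines. The regular expression is inferred only from the two crash reports quoted above and may not match other RabbitMQ versions or log formats. The names `TIMEOUT_RE` and `timed_out_call` are made up for this illustration.

```python
import re

# Pattern inferred from the crash reports above (an assumption, not a documented format).
TIMEOUT_RE = re.compile(
    r'\{timeout,\{gen_server,call,\[rabbitmq_aws,'
    r'\{request,"(?P<service>[^"]+)",\w+,\[\],"/\?Action=(?P<action>\w+)'
)

def timed_out_call(line):
    """Return (service, action) if the line reports a rabbitmq_aws call timeout, else None."""
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    return match.group("service"), match.group("action")

# Usage with an excerpt of the first crash report quoted above.
sample = (
    'exited with reason: {timeout,{gen_server,call,[rabbitmq_aws,'
    '{request,"autoscaling",get,[],'
    '"/?Action=DescribeAutoScalingInstances&Version=2011-01-01",[],[],undefined}]}}'
)
print(timed_out_call(sample))  # ('autoscaling', 'DescribeAutoScalingInstances')
```

Running this over a full log would show whether the Auto Scaling call or the EC2 call is the one that times out more often.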

I've attached a full log extract (the log snippets above are taken from it) from a node that failed for a while and then eventually succeeded in finding an existing node (the other node had been up for nearly 19 hours at that point). There were no firewall or IAM policy changes while this was in progress; I just let the retries run until they succeeded. While it worked in this scenario, there is a case where the retries fail for so long that the temporary IAM credentials expire. That causes a different, critical failure when reaching out to the EC2 APIs and seems to eventually result in the node spinning up a second cluster, with the node assuming it is the master.

I think this is related to a 100 ms timeout that is declared in the rabbitmq_aws dependency (rabbitmq_aws.hrl, line 20), but I am pretty new to Erlang and have a hard time following some of the syntax. It would be great if this timeout were a configurable value that could be set in rabbitmq.conf or the advanced config file!
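
To illustrate the idea, here is a rough Python sketch (not the plugin's Erlang code) of why a hard-coded short timeout fails intermittently when the remote call's latency varies, and what a configurable value would change. Everything in it is hypothetical: `slow_call` stands in for an AWS API request, and `request_timeout_s` is an invented setting name, not an existing rabbitmq.conf key.

```python
import concurrent.futures
import time

def slow_call(latency_s):
    """Stand-in for a remote API request that takes latency_s seconds (hypothetical)."""
    time.sleep(latency_s)
    return "ok"

def call_with_timeout(fn, timeout_s):
    """Run fn in a worker thread; return its result, or None if it exceeds timeout_s."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn)
        try:
            return future.result(timeout=timeout_s)
        except concurrent.futures.TimeoutError:
            return None

# A fixed 100 ms budget fails whenever the call happens to take longer than that.
print(call_with_timeout(lambda: slow_call(0.3), timeout_s=0.1))  # None

# With a configurable (here, larger) budget the same call succeeds.
request_timeout_s = 10.0  # hypothetical setting, chosen only for illustration
print(call_with_timeout(lambda: slow_call(0.3), timeout_s=request_timeout_s))  # ok
```

The point is only that the budget is passed in rather than baked into the code, so an operator could raise it without rebuilding anything.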

Thanks!

CleanedRabbitMQLogs.txt

Michael Klishin

Oct 25, 2018, 2:14:02 PM

to rabbitm...@googlegroups.com

Your observation seems to have merit. The 100 ms timeout is way too low. I will file an issue and we'll increase it for 3.7.9. Thank you very much for looking into this!

--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-user...@googlegroups.com.
To post to this group, send email to rabbitm...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

MK

Staff Software Engineer, Pivotal/RabbitMQ

Michael Klishin

Oct 25, 2018, 2:19:23 PM

to rabbitm...@googlegroups.com

FTR, this is the problematic constant [1].

1. https://github.com/rabbitmq/rabbitmq-aws/blob/master/include/rabbitmq_aws.hrl#L20

Michael Klishin

Oct 25, 2018, 3:51:14 PM

to rabbitmq-users

Fixed [1], the new value is 10s and it will ship in 3.7.9.

Thank you for reporting!

1. https://github.com/rabbitmq/rabbitmq-aws/commit/504532144141f4e7784e93502d625fc6f60dbecc

Fenil Sevak

Oct 25, 2018, 5:07:52 PM

to rabbitmq-users

Thanks for the quick response Michael!

Is there a place where tentative release dates are published?

Michael Klishin

Oct 25, 2018, 8:11:36 PM

to rabbitm...@googlegroups.com

Preview releases are announced on this list. We provide no delivery ETAs.

