In a cluster of 3 instances, our Grafana/Prometheus monitoring system detected all the nodes down for 3-4 minutes, on two distinct occasions. During those intervals, client applications (Spring Boot apps) had issues connecting to RabbitMQ. The cluster was created and launched 26 hours before the error events and worked without problems until then. After each error event the cluster recovered automatically without any manual intervention, but we would like to understand what happened and what the causes were.
The monitoring system reported the following during the downtime:
{"error":"Get http://rabbitmq1:15672/api/nodes?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq1:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:04:36Z"}
{"error":"Get http://rabbitmq3:15672/api/nodes?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq3:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:04:36Z"}
{"error":"Get http://rabbitmq1:15672/api/nodes?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq1:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:05:12Z"}
{"error":"Get http://rabbitmq2:15672/api/overview?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq2:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:05:18Z"}
{"error":"Get http://rabbitmq3:15672/api/exchanges?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq3:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:05:35Z"}
{"error":"Get http://rabbitmq1:15672/api/queues?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq1:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:05:42Z"}
{"error":"Get http://rabbitmq2:15672/api/exchanges?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq2:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:05:48Z"}
{"error":"Get http://rabbitmq3:15672/api/nodes?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq3:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:06:05Z"}
{"error":"Get http://rabbitmq1:15672/api/overview?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq1:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:06:12Z"}
{"error":"Get http://rabbitmq2:15672/api/nodes?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq2:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:06:18Z"}
{"error":"Get http://rabbitmq3:15672/api/queues?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq3:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:06:35Z"}
{"error":"Get http://rabbitmq1:15672/api/exchanges?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq1:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:06:42Z"}
{"error":"Get http://rabbitmq2:15672/api/queues?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq2:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:06:48Z"}
{"error":"Get http://rabbitmq3:15672/api/overview?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq3:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:26:09Z"}
{"error":"Get http://rabbitmq1:15672/api/overview?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq1:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:26:10Z"}
{"error":"Get http://rabbitmq2:15672/api/overview?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq2:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:26:18Z"}
{"error":"Get http://rabbitmq3:15672/api/exchanges?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq3:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:26:39Z"}
{"error":"Get http://rabbitmq1:15672/api/exchanges?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq1:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:26:40Z"}
{"error":"Get http://rabbitmq2:15672/api/exchanges?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq2:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:26:48Z"}
{"error":"Get http://rabbitmq3:15672/api/nodes?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq3:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:27:09Z"}
{"error":"Get http://rabbitmq1:15672/api/nodes?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq1:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:27:10Z"}
{"error":"Get http://rabbitmq2:15672/api/nodes?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq2:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:27:18Z"}
{"error":"Get http://rabbitmq3:15672/api/queues?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq3:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:27:39Z"}
{"error":"Get http://rabbitmq1:15672/api/queues?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq1:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:27:40Z"}
{"error":"Get http://rabbitmq2:15672/api/queues?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq2:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:27:48Z"}
{"error":"Get http://rabbitmq3:15672/api/overview?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq3:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:28:09Z"}
{"error":"Get http://rabbitmq1:15672/api/overview?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq1:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:28:10Z"}
{"error":"Get http://rabbitmq2:15672/api/overview?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq2:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:28:18Z"}
{"error":"Get http://rabbitmq3:15672/api/nodes?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq3:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:28:39Z"}
{"error":"Get http://rabbitmq1:15672/api/exchanges?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq1:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:28:40Z"}
{"error":"Get http://rabbitmq2:15672/api/nodes?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq2:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:28:48Z"}
{"error":"Get http://rabbitmq3:15672/api/queues?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq3:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:29:09Z"}
The logs of nodes 1 and 2 for the entire day were empty, whereas node 3 had the following entries:
2019-10-02 21:05:52.587 [error] <0.1301.0> ** Generic server rabbit_epmd_monitor terminating
** Last message in was check
** When Server state == {state,{erlang,#Ref<0.933050920.3712483331.253384>},erl_epmd,"rabbitmq","rabbitmq3",25672}
** Reason for termination ==
** {{case_clause,{error,nxdomain}},[{rabbit_epmd_monitor,check_epmd,1,[{file,"src/rabbit_epmd_monitor.erl"},{line,100}]},{rabbit_epmd_monitor,handle_info,2,[{file,"src/rabbit_epmd_monitor.erl"},{line,79}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,637}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,711}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}
2019-10-02 21:05:52.587 [error] <0.1301.0> CRASH REPORT Process rabbit_epmd_monitor with 0 neighbours crashed with reason: no case clause matching {error,nxdomain} in rabbit_epmd_monitor:check_epmd/1 line 100
2019-10-02 21:05:52.587 [error] <0.1300.0> Supervisor rabbit_epmd_monitor_sup had child rabbit_epmd_monitor started with rabbit_epmd_monitor:start_link() at <0.1301.0> exit with reason no case clause matching {error,nxdomain} in rabbit_epmd_monitor:check_epmd/1 line 100 in context child_terminated
2019-10-02 21:06:00.588 [error] <0.15921.1> CRASH REPORT Process <0.15921.1> with 0 neighbours crashed with reason: no function clause matching rabbit_epmd_monitor:init_handle_port_please({error,nxdomain}, erl_epmd, "rabbitmq", "rabbitmq3") line 59
2019-10-02 21:06:00.589 [error] <0.1300.0> Supervisor rabbit_epmd_monitor_sup had child rabbit_epmd_monitor started with rabbit_epmd_monitor:start_link() at <0.1301.0> exit with reason no function clause matching rabbit_epmd_monitor:init_handle_port_please({error,nxdomain}, erl_epmd, "rabbitmq", "rabbitmq3") line 59 in context start_error
2019-10-02 23:29:03.236 [error] <0.15923.1> ** Generic server rabbit_epmd_monitor terminating
** Last message in was check
** When Server state == {state,{erlang,#Ref<0.933050920.3710386178.8766>},erl_epmd,"rabbitmq","rabbitmq3",25672}
** Reason for termination ==
** {{case_clause,{error,nxdomain}},[{rabbit_epmd_monitor,check_epmd,1,[{file,"src/rabbit_epmd_monitor.erl"},{line,100}]},{rabbit_epmd_monitor,handle_info,2,[{file,"src/rabbit_epmd_monitor.erl"},{line,79}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,637}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,711}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}
2019-10-02 23:29:03.236 [error] <0.15923.1> CRASH REPORT Process rabbit_epmd_monitor with 0 neighbours crashed with reason: no case clause matching {error,nxdomain} in rabbit_epmd_monitor:check_epmd/1 line 100
2019-10-02 23:29:03.237 [error] <0.1300.0> Supervisor rabbit_epmd_monitor_sup had child rabbit_epmd_monitor started with rabbit_epmd_monitor:start_link() at <0.15923.1> exit with reason no case clause matching {error,nxdomain} in rabbit_epmd_monitor:check_epmd/1 line 100 in context child_terminated
Nodes are running in a Docker Swarm using the latest official RabbitMQ Docker image (3.7.18 at the time of this writing).
Any idea about the error?
{"error":"Get http://rabbitmq3:15672/api/nodes?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq3:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:04:36Z"}
{"error":"Get http://rabbitmq1:15672/api/nodes?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq1:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:05:12Z"}
{"error":"Get http://rabbitmq2:15672/api/overview?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq2:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:05:18Z"}
At the end of the 3-4 minutes the exporters stopped complaining, node 3 logged its "rabbit_epmd_monitor:check_epmd/1 fails with {error,nxdomain}" error, and everything went back to working correctly. Two hours later the problem reoccurred and again resolved itself within 3-4 minutes, with the same symptoms.
Considering that the Docker Swarm on each physical server runs 20+ containers, of which RabbitMQ and its Prometheus exporter are just a fraction, it seems very unlikely to me that a network issue on the swarm affected exactly the 3 RabbitMQ nodes at the same time (preventing their exporters from contacting them, hence the Client.Timeout errors) while not affecting any of the other containers (we have Prometheus exporters for everything and nothing was signaled for the other containers). I also checked the Docker Swarm logs, which are set to debug, and nothing happened on the swarm network at that time. Moreover, in the case of a system-wide network or DNS resolution failure I would have expected an "unknown hostname" error or similar from the Prometheus exporters, not a Client.Timeout.
By the way, I have now stored the rabbitmq-collect-env script on the three nodes, so I will be able to run it during the next error event as soon as I receive the "down" alert from Grafana/Prometheus.
Thanks again for your time.
[Wed Oct 23 19:03:42 2019] IPVS: rr: FWM 330 0x0000014A - no destination available
About the specific IPVS-related lines: I checked a few of them and they all seem legitimate. They were produced exactly when we performed upgrades on swarm services, including when we switched from RabbitMQ 3.7.18 to 3.8.0, from 3.8.0 to 3.8.1-beta.1 and, yesterday, when we restarted 3.8.1-beta.1 with the Alpine-based Docker images instead of the usual Ubuntu-based ones. "IPVS: no destination available" seems to be a common log message whenever you kill or restart a Docker service.
As I explained at the beginning of the thread, we run dozens of services in the same swarm, a few of them clustered and a few scaled as replicas across the 3 nodes, and unfortunately only RabbitMQ is showing this "strange" networking issue.
I'm pretty sure the problem is in the swarm networking and not in RabbitMQ itself - we are preparing a full migration to a new Docker Swarm version,
but in the meantime we are attempting any "quick" fix that might "break" or reduce the problem.
2. Node URLs are specified using the Spring AMQP config property, where all the nodes are listed:
spring.data.rabbitmq.addresses=rabbitmq1:5672,rabbitmq2:5672,rabbitmq3:5672
Ok, good spot: I just discovered that the list order is important. I need to shuffle the list (see the sketch after point 3).
3. Ok, I tried a rebalance: on node 1 I executed
rabbitmq-queues rebalance "all" --vhost-pattern "/" --queue-pattern ".*"
I hope that this can reduce the pressure on the nodes. I will also look into quorum queues (a declaration sketch follows below).
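On the shuffle point, this is a minimal sketch of randomizing the node order on the client side, assuming a plain Spring configuration class. The class name, constant and credentials handling are made up, and declaring this bean bypasses Spring Boot's auto-configured connection factory; newer Spring AMQP releases also offer a built-in option to shuffle the addresses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

import org.springframework.amqp.rabbit.connection.CachingConnectionFactory;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class RabbitAddressShuffleConfig {

    // Hypothetical constant standing in for the value normally read from the
    // addresses property: rabbitmq1:5672,rabbitmq2:5672,rabbitmq3:5672
    private static final String NODE_ADDRESSES =
            "rabbitmq1:5672,rabbitmq2:5672,rabbitmq3:5672";

    @Bean
    public CachingConnectionFactory connectionFactory() {
        // Randomize the node order so that every application instance does
        // not always try rabbitmq1 first.
        List<String> addresses = new ArrayList<>(Arrays.asList(NODE_ADDRESSES.split(",")));
        Collections.shuffle(addresses);

        CachingConnectionFactory factory = new CachingConnectionFactory();
        factory.setAddresses(String.join(",", addresses));
        // Credentials, virtual host, etc. would still need to be set here,
        // since this bean replaces the auto-configured one.
        return factory;
    }
}
```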
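On the quorum queues point, declaring one from Spring AMQP mainly means setting the x-queue-type argument when declaring the queue. A sketch only, with a made-up queue name:

```java
import org.springframework.amqp.core.Queue;
import org.springframework.amqp.core.QueueBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class QuorumQueueConfig {

    @Bean
    public Queue ordersQueue() {
        // Quorum queues must be durable; the x-queue-type argument asks the
        // broker to create a quorum queue instead of a classic one.
        return QueueBuilder.durable("orders") // hypothetical queue name
                .withArgument("x-queue-type", "quorum")
                .build();
    }
}
```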
Hi,

> As I explained at the beginning of the thread, we run dozens of services in the same swarm, a few of them clustered and a few scaled as replicas across the 3 nodes, and unfortunately only RabbitMQ is showing this "strange" networking issue.

Are you using Prometheus to periodically pull metrics from other services running in the same Swarm? If yes, is the polling interval the same as for RabbitMQ? If both are true, then I would expect you to see gaps in the metrics for the other services as well.

> I'm pretty sure the problem is in the swarm networking and not in RabbitMQ itself - we are preparing a full migration to a new Docker Swarm version,

Are there any metrics from the Swarm networking that you can load into Prometheus? It would be especially interesting to see the DNS resolution success/error rate.

> but in the meantime we are attempting any "quick" fix that might "break" or reduce the problem.

I like your thinking. I would suggest forcing all RabbitMQ nodes onto the same Swarm node and seeing whether that makes the errors go away. It would also be interesting to try placing each RabbitMQ node on its own dedicated Swarm node. This is a long shot, but it's entirely possible that when there is networking pressure in the system - maybe routing table updates, maybe just a peak in packet routing - networking instability is introduced which makes the RabbitMQ nodes misbehave. Metrics and logs from the networking layer would make this more than a hunch.
We finally migrated the platform to new servers and, more importantly, to the latest stable Docker (Swarm) engine. As expected, the networking issue resolved itself: after 5 days of uptime, not a single 'epmd monitor failed to retrieve our port from epmd: nxdomain' error has been produced. Moreover, we haven't experienced any network partition.