rabbit_epmd_monitor:check_epmd/1 fails with {error,nxdomain}


Aldo Bongio

Oct 3, 2019, 6:38:34 AM
to rabbitmq-users

In a cluster of 3 instances, our Grafana/Prometheus monitoring system detected all the nodes as down for 3-4 minutes, on two distinct occasions. During those intervals client applications (Spring Boot apps) had issues connecting to RabbitMQ. The cluster was created and launched 26 hours before the error events and worked without problems until then. After each error event the cluster recovered automatically without any manual intervention, but we would like to understand what happened and what caused it.


The monitoring system reported the following during the downtime:


{"error":"Get http://rabbitmq1:15672/api/nodes?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq1:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:04:36Z"}

{"error":"Get http://rabbitmq3:15672/api/nodes?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq3:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:04:36Z"}

{"error":"Get http://rabbitmq1:15672/api/nodes?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq1:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:05:12Z"}

{"error":"Get http://rabbitmq2:15672/api/overview?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq2:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:05:18Z"}

{"error":"Get http://rabbitmq3:15672/api/exchanges?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq3:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:05:35Z"}

{"error":"Get http://rabbitmq1:15672/api/queues?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq1:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:05:42Z"}

{"error":"Get http://rabbitmq2:15672/api/exchanges?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq2:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:05:48Z"}

{"error":"Get http://rabbitmq3:15672/api/nodes?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq3:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:06:05Z"}

{"error":"Get http://rabbitmq1:15672/api/overview?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq1:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:06:12Z"}

{"error":"Get http://rabbitmq2:15672/api/nodes?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq2:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:06:18Z"}

{"error":"Get http://rabbitmq3:15672/api/queues?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq3:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:06:35Z"}

{"error":"Get http://rabbitmq1:15672/api/exchanges?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq1:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:06:42Z"}

{"error":"Get http://rabbitmq2:15672/api/queues?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq2:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:06:48Z"}


{"error":"Get http://rabbitmq3:15672/api/overview?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq3:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:26:09Z"}

{"error":"Get http://rabbitmq1:15672/api/overview?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq1:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:26:10Z"}

{"error":"Get http://rabbitmq2:15672/api/overview?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq2:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:26:18Z"}

{"error":"Get http://rabbitmq3:15672/api/exchanges?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq3:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:26:39Z"}

{"error":"Get http://rabbitmq1:15672/api/exchanges?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq1:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:26:40Z"}

{"error":"Get http://rabbitmq2:15672/api/exchanges?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq2:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:26:48Z"}

{"error":"Get http://rabbitmq3:15672/api/nodes?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq3:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:27:09Z"}

{"error":"Get http://rabbitmq1:15672/api/nodes?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq1:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:27:10Z"}

{"error":"Get http://rabbitmq2:15672/api/nodes?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq2:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:27:18Z"}

{"error":"Get http://rabbitmq3:15672/api/queues?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq3:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:27:39Z"}

{"error":"Get http://rabbitmq1:15672/api/queues?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq1:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:27:40Z"}

{"error":"Get http://rabbitmq2:15672/api/queues?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq2:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:27:48Z"}

{"error":"Get http://rabbitmq3:15672/api/overview?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq3:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:28:09Z"}

{"error":"Get http://rabbitmq1:15672/api/overview?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq1:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:28:10Z"}

{"error":"Get http://rabbitmq2:15672/api/overview?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq2:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:28:18Z"}

{"error":"Get http://rabbitmq3:15672/api/nodes?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq3:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:28:39Z"}

{"error":"Get http://rabbitmq1:15672/api/exchanges?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq1:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:28:40Z"}

{"error":"Get http://rabbitmq2:15672/api/nodes?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq2:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:28:48Z"}

{"error":"Get http://rabbitmq3:15672/api/queues?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq3:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T23:29:09Z"}


The logs of nodes 1 and 2 for the entire day were empty, whereas node 3 had the following entries:


2019-10-02 21:05:52.587 [error] <0.1301.0> ** Generic server rabbit_epmd_monitor terminating

** Last message in was check

** When Server state == {state,{erlang,#Ref<0.933050920.3712483331.253384>},erl_epmd,"rabbitmq","rabbitmq3",25672}

** Reason for termination ==

** {{case_clause,{error,nxdomain}},[{rabbit_epmd_monitor,check_epmd,1,[{file,"src/rabbit_epmd_monitor.erl"},{line,100}]},{rabbit_epmd_monitor,handle_info,2,[{file,"src/rabbit_epmd_monitor.erl"},{line,79}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,637}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,711}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}

2019-10-02 21:05:52.587 [error] <0.1301.0> CRASH REPORT Process rabbit_epmd_monitor with 0 neighbours crashed with reason: no case clause matching {error,nxdomain} in rabbit_epmd_monitor:check_epmd/1 line 100

2019-10-02 21:05:52.587 [error] <0.1300.0> Supervisor rabbit_epmd_monitor_sup had child rabbit_epmd_monitor started with rabbit_epmd_monitor:start_link() at <0.1301.0> exit with reason no case clause matching {error,nxdomain} in rabbit_epmd_monitor:check_epmd/1 line 100 in context child_terminated

2019-10-02 21:06:00.588 [error] <0.15921.1> CRASH REPORT Process <0.15921.1> with 0 neighbours crashed with reason: no function clause matching rabbit_epmd_monitor:init_handle_port_please({error,nxdomain}, erl_epmd, "rabbitmq", "rabbitmq3") line 59

2019-10-02 21:06:00.589 [error] <0.1300.0> Supervisor rabbit_epmd_monitor_sup had child rabbit_epmd_monitor started with rabbit_epmd_monitor:start_link() at <0.1301.0> exit with reason no function clause matching rabbit_epmd_monitor:init_handle_port_please({error,nxdomain}, erl_epmd, "rabbitmq", "rabbitmq3") line 59 in context start_error


2019-10-02 23:29:03.236 [error] <0.15923.1> ** Generic server rabbit_epmd_monitor terminating

** Last message in was check

** When Server state == {state,{erlang,#Ref<0.933050920.3710386178.8766>},erl_epmd,"rabbitmq","rabbitmq3",25672}

** Reason for termination ==

** {{case_clause,{error,nxdomain}},[{rabbit_epmd_monitor,check_epmd,1,[{file,"src/rabbit_epmd_monitor.erl"},{line,100}]},{rabbit_epmd_monitor,handle_info,2,[{file,"src/rabbit_epmd_monitor.erl"},{line,79}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,637}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,711}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}

2019-10-02 23:29:03.236 [error] <0.15923.1> CRASH REPORT Process rabbit_epmd_monitor with 0 neighbours crashed with reason: no case clause matching {error,nxdomain} in rabbit_epmd_monitor:check_epmd/1 line 100

2019-10-02 23:29:03.237 [error] <0.1300.0> Supervisor rabbit_epmd_monitor_sup had child rabbit_epmd_monitor started with rabbit_epmd_monitor:start_link() at <0.15923.1> exit with reason no case clause matching {error,nxdomain} in rabbit_epmd_monitor:check_epmd/1 line 100 in context child_terminated


Nodes are running in a Docker Swarm using the latest official RabbitMQ Docker image (3.7.18 at the time of this writing).


Any idea about the error?


Michael Klishin

Oct 3, 2019, 6:51:11 AM
to rabbitmq-users
According to this stack trace an internal component responsible for continuous re-registration of the node with epmd
fails with nxdomain. nxdomain is UNIX speak for "domain cannot be resolved".
Something on this host prevents the node from resolving its own domain.

I don't see any evidence that this exception could affect your applications. You are welcome to share more logs on the mailing list.
This component performs a periodic background operation that is important for CLI tools and inter-node communication, but is not at all related to client connections.

Most likely hostname resolution failed for the entire system, both epmd and clients, which is why they could not connect.
Full logs of all nodes likely can confirm this hypothesis or at least provide extra clues.
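When it happens again, it may be worth checking hostname resolution from inside the affected container; a rough sketch (the container name is a placeholder, since Swarm generates task container names, and nslookup may not be present in the image):

# Check how the node's own hostname and its peers resolve, using the container's resolver setup
docker exec -it <rabbitmq3-container> getent hosts rabbitmq3
docker exec -it <rabbitmq3-container> getent hosts rabbitmq1 rabbitmq2
# If nslookup or dig is available, query Docker's embedded DNS server (127.0.0.11) directly
docker exec -it <rabbitmq3-container> nslookup rabbitmq3 127.0.0.11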

I find it hard to believe that on a production system you would see no logs on two nodes for an entire day. That would mean there were no new connections,
no connection closures, no client errors of any kind, no new queues with mirroring declared.
Are you sure you haven't set the log level to [critical] or similar, and that the logs are correctly collected?
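One way to double-check the node's effective log configuration (a rough sketch; the exact output format varies between versions):

# Dump the node's application environment and look at the log section
rabbitmqctl environment | grep -A 20 '{log,'
# Or query just the log settings of the rabbit application directly
rabbitmqctl eval 'application:get_env(rabbit, log).'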




--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Aldo Bongio

Oct 3, 2019, 8:04:38 AM
to rabbitmq-users
Hi Michael, thanks for your help.

I checked the logs and they are correctly configured, with "info" level. Attached is logs.zip, where you can see the logs of the 3 nodes on day 1 (October 1st), when the cluster is established, (pool) connections are opened, queues are mirrored, etc. On day 2 there is just a small log from node 1 and the crash logs we are discussing from node 3.

The other log I have is from the monitoring system. There is just one exporter container (based on this Docker image) for each RabbitMQ instance. As stated in my first post, for 3-4 minutes each exporter reported a Timeout connecting to its monitored instance:

{"error":"Get http://rabbitmq1:15672/api/nodes?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq1:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:04:36Z"}

{"error":"Get http://rabbitmq3:15672/api/nodes?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq3:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:04:36Z"}

{"error":"Get http://rabbitmq1:15672/api/nodes?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq1:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:05:12Z"}

{"error":"Get http://rabbitmq2:15672/api/overview?sort=: net/http: request canceled (Client.Timeout exceeded while awaiting headers)","host":"http://rabbitmq2:15672","level":"error","msg":"Error while retrieving data from rabbitHost","statusCode":0,"time":"2019-10-02T21:05:18Z"}

At the end of the 3-4 minutes all the exporters stopped complaining, node 3 logged its "rabbit_epmd_monitor:check_epmd/1 fails with {error,nxdomain}" error and everything went back to working correctly. 2 hours later the problem reoccurred and resolved itself in 3-4 minutes, with the same symptoms.

Considering that the Docker Swarm on each physical server runs 20+ containers, of which RabbitMQ and its Prometheus exporter are just a fraction, it seems very unlikely to me that a network issue on the swarm affected exactly the 3 RabbitMQ nodes at the same time (preventing their exporters from contacting them, hence the Client.Timeout error) while not affecting any of the other containers (we have Prometheus exporters for everything and nothing was signaled for the other containers). I also checked the Docker Swarm logs, which are set to debug level, and nothing happened on the swarm network at that time. Moreover, in the case of a system-wide network/domain resolution failure I would have expected to see an Unknown Hostname error or similar from the Prometheus exporters, not a Client.Timeout.

BTW, I have now stored the rabbitmq-collect-env script on the three nodes and I will be able to run it during the next error event, as soon as I receive the "down" alert from Grafana/Prometheus.

Thanks again for your time.


logs.zip

Michael Klishin

Oct 3, 2019, 8:18:28 AM
to rabbitmq-users
> Client.Timeout exceeded while awaiting headers

from an HTTP API client can happen due to a wide variety of reasons, hostname resolution being one of them [1].
Perhaps you want to enable HTTP API request logging for the future [2].
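If useful, a sketch of what that could look like, assuming the management.http_log_dir setting described in [2] (the directory path is an arbitrary example, and the node needs a restart to pick it up):

# Append the HTTP access log setting to rabbitmq.conf, then restart the node
echo 'management.http_log_dir = /var/log/rabbitmq/management-http-access' >> /etc/rabbitmq/rabbitmq.conf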

I tried killing the epmd monitor in various ways and it is quickly restarted.
After inspecting the code I do not see how it might affect client connectivity,
the management plugin or the node as a whole.

1. https://www.rabbitmq.com/troubleshooting-networking.html
2. https://www.rabbitmq.com/management.html#http-logging


Aldo Bongio

Oct 17, 2019, 5:37:48 AM
to rabbitmq-users
The cluster was fully recreated yesterday, and today we had 3 distinct problematic events, each with 2-4 minutes of "pause". In all of them the exporters signaled the usual Timeout and the client Spring Boot apps connected to RabbitMQ "paused" their queue reads/writes for 2-3 minutes without any error. We also had, a few hours after the restart, a network partition error, but that was correctly handled in 150 ms and it is well documented by the RabbitMQ logs and by the Spring Boot apps' logs (with connection factories recreated and consumers restarted). So the network partition is not an issue and there is nothing strange about it.

During the last "pause event" (the 08:20..08:24 UTC event) I was able to get the output of rabbitmq-collect-env from the 3 cluster nodes. I see that:
  • node 1 and node 3 indicate {rabbitmq@rabbitmq2,{badrpc,nodedown}} in cluster_status
  • half of the rabbitmq-collect-env commands for node 2 indicated "Error: unable to perform an operation on node 'rabbitmq@rabbitmq2'. Please see diagnostics information and suggestions below." And the diagnostics indicate "TCP connection succeeded but Erlang distribution failed, connection to remote host has timed out"
  • there are a few suspicious processes from maybe_stuck on nodes 1 and 3, and they reference rabbitmq2 (delegate_management_XXX,rabbitmq@rabbitmq2)
  • the crash.log of node 2, starting at 08:20:55, has 2 CRASH reports and 2 SUPERVISOR reports
  • the crash.log of nodes 1 and 3 is empty during the 08:20..08:24 problematic event
The output of rabbitmq-collect-env is attached, together with the Prometheus exporter log. I hope someone can help me understand what is happening.
I'm not concerned about node 2 having a failure (that's the reason we have a cluster...), and I hope the crash.log can indicate exactly why the error happened so we can do something about it. I'm concerned about nodes 1 and 3 not correctly handling the problem on node 2, and instead silently blocking themselves and recovering 3 minutes later when node 2 stops having issues. "Blocking themselves", again, means that they are not even able to answer the simple inspection made by the Prometheus exporter.
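For the next event, each node's view of the cluster can also be captured by hand, along these lines (a rough sketch, run from a shell inside each container; the peer node name comes from our logs):

# Ask the local node for its view of the cluster
rabbitmqctl cluster_status
# Basic local health check of the node itself
rabbitmqctl node_health_check
# Check whether this node can reach a given peer over Erlang distribution (pong = reachable)
rabbitmqctl eval 'net_adm:ping(rabbitmq@rabbitmq2).'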



rabbitmq-collect-env.zip

Aldo Bongio

Oct 22, 2019, 5:30:36 PM
to rabbitmq-users
Cluster nodes were upgraded from RabbitMQ 3.7.18 to 3.8.0 using a rolling upgrade.
But after 1.5 hours the "pause" problem manifested again. We got only this error from the node 2 logs (and nothing from the other nodes):

rabbitmq2.1.lpkuveliyqy4wehwh6us ec2-34-250-97-160.eu-west-1.compute.amazonaws.com  2019-10-22 19:58:07.665 [error] <0.475.0> ** Generic server rabbit_epmd_monitor terminating
rabbitmq2.1.lpkuveliyqy4wehwh6us ec2-34-250-97-160.eu-west-1.compute.amazonaws.com  ** Last message in was check
rabbitmq2.1.lpkuveliyqy4wehwh6us ec2-34-250-97-160.eu-west-1.compute.amazonaws.com  ** When Server state == {state,{erlang,#Ref<0.980773890.2479620097.199138>},erl_epmd,"rabbitmq","rabbitmq2",25672}
rabbitmq2.1.lpkuveliyqy4wehwh6us ec2-34-250-97-160.eu-west-1.compute.amazonaws.com  ** Reason for termination ==
rabbitmq2.1.lpkuveliyqy4wehwh6us ec2-34-250-97-160.eu-west-1.compute.amazonaws.com  ** {{case_clause,{error,nxdomain}},[{rabbit_epmd_monitor,check_epmd,1,[{file,"src/rabbit_epmd_monitor.erl"},{line,98}]},{rabbit_epmd_monitor,handle_info,2,[{file,"src/rabbit_epmd_monitor.erl"},{line,77}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,637}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,711}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,249}]}]}
rabbitmq2.1.lpkuveliyqy4wehwh6us ec2-34-250-97-160.eu-west-1.compute.amazonaws.com  2019-10-22 19:58:07.665 [error] <0.475.0> CRASH REPORT Process rabbit_epmd_monitor with 0 neighbours crashed with reason: no case clause matching {error,nxdomain} in rabbit_epmd_monitor:check_epmd/1 line 98
rabbitmq2.1.lpkuveliyqy4wehwh6us ec2-34-250-97-160.eu-west-1.compute.amazonaws.com  2019-10-22 19:58:07.798 [error] <0.474.0> Supervisor rabbit_epmd_monitor_sup had child rabbit_epmd_monitor started with rabbit_epmd_monitor:start_link() at <0.475.0> exit with reason no case clause matching {error,nxdomain} in rabbit_epmd_monitor:check_epmd/1 line 98 in context child_terminated

We now have the brand new "Erlang Distribution" Grafana dashboard fed by the new internal Prometheus plugin (so no more HTTP connections to an external Prometheus exporter). The interval surrounding the 2-3 minutes of pause is depicted in the attached grafana.png.
Basically there are 2 "holes" in the distribution links outgoing from node 2 (what does this mean?), and a peak at the end of the "pause", when the Spring Boot clients consuming the queues started receiving data again.

The pause is also visible in the Grafana CPU graph (grafana2.png) of the 3 containers that consume the queues the most: the CPU dropped almost to 0 at 21:56:30 and resumed at 21:59:00. But again, we had no logs from the clients, no errors ...

I will try to take the usual rabbitmq-collect-env output when a new "pause" event happens; maybe with RabbitMQ 3.8 there is some diagnostic that can help...

grafana.png
grafana2.png

Michael Klishin

Oct 23, 2019, 5:54:51 AM
to rabbitmq-users
According to the chart, node number 2 did not send any data to its peers for several minutes.

As for this specific exception, it will be handled better in 3.8.1 [1] (and you can try a beta) but I somewhat doubt that's the root cause.



Aldo Bongio

Oct 23, 2019, 6:10:17 AM
to rabbitmq-users
Hi Michael, thanks for the answer!

I agree with you that this probably is not the root cause but rather an effect of an (invisible) issue with Docker Swarm networking and/or DNS.
I've read that Erlang is somewhat special in how it handles DNS resolution compared to other "frameworks", which instead rely heavily on native lookup methods, and maybe this explains why the RabbitMQ nodes are the only services in the swarm disrupted by this supposed networking/DNS issue. If that is the case I hope it will benefit from the Docker Swarm upgrade that we are going to perform in the next few days...

About RabbitMQ 3.8.1, is there an ETA for the final release?
Thanks again for the help.

Michael Klishin

Oct 23, 2019, 11:45:07 AM
to rabbitmq-users
Erlang is not special if you don't have an ERL_INETRC file configured.
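(For context, such a file, if one were used, is pointed at via the ERL_INETRC environment variable; a purely illustrative sketch, not something taken from this deployment:)

# Illustrative only: a minimal inetrc consulted by the Erlang VM when ERL_INETRC is set
cat > /etc/rabbitmq/erl_inetrc <<'EOF'
%% consult the hosts file first, then fall back to the OS resolver
{lookup, [file, native]}.
EOF
export ERL_INETRC=/etc/rabbitmq/erl_inetrc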

The ETA for 3.8.1 is "next couple of weeks" but there's a preview build available already and we quite likely will release
another one this week.

Can you please post a more complete node 2 log around the time of the network throughput drop?


Aldo Bongio

Oct 23, 2019, 12:18:29 PM
to rabbitmq-users
Hi Michael, please find attached the logs of the 3 nodes (rabbitmq.zip -> subfolder YESTERDAY) 
You can see the startup of nodes during the rolling upgrade from 3.7.18 to 3.8.0, starting at 18:29 (UTC)
At 19:58:07 node 2 logged the "Generic server rabbit_epmd_monitor terminating", in the middle of the 2-3 minutes of "freeze" displayed by the Grafana dashboard I posted before (Grafana is displaying time GMT+2)

Today we had a few freezes and 1 network partition. I enclosed the log of node 3 (subfolder TODAY), where you can see:
  • a freeze between 13:12 and 13:15
  • a network partition starting at 13:23, correctly handled by the pause_minority mode


rabbitmq.zip

Luke Bakken

Oct 23, 2019, 12:26:13 PM
to rabbitmq-users
Hi Aldo,

I spent some time looking at the output you provided. The combination of partitions with "freezes" made me suspect an issue that would be logged by the Linux kernel.

If you look at the dmesg_t files, you can see network interface events with timestamps close to your "freezes". I'm going to keep looking, but it appears that this may be related to the virtual network interfaces in your containerized environment.

Thanks,
Luke

Michael Klishin

Oct 23, 2019, 1:22:39 PM
to rabbitmq-users
There's a suspicious silence in the log around

> 2019-10-22 19:58:07.665 [error] <0.475.0> ** Generic server rabbit_epmd_monitor terminating

for minutes. No evidence of the node being killed, though.

So maybe the exception does prevent the node from re-registering with epmd (which could have been killed/restarted/failed, although
we cannot find any evidence of that in the RabbitMQ logs), in which case the change in 3.8.1 would help.


Michael Klishin

Oct 23, 2019, 1:28:50 PM
to rabbitmq-users
My current working theory is this:

 * The component that re-registers with epmd fails because it cannot resolve its own (well, local epmd's) host
 * The node for some reason is not registered with epmd
 * Peers see it as down, no traffic is possible
 * The node still reports metrics to Prometheus
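One way to check the second point above from the affected host (a rough sketch; both tools are already present on a RabbitMQ node, and the name/host values come from the logs):

# List the names currently registered with the local epmd; "rabbitmq" should appear here
epmd -names
# Ask epmd for the node's distribution port, roughly the way the epmd monitor does
rabbitmqctl eval 'erl_epmd:port_please("rabbitmq", "rabbitmq2").'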

This is a plausible scenario but other nodes do not detect node 2 as being down. However, there's also dead air around that time, e.g. in
node 1 logs:

> 2019-10-22 18:36:25.168 [info] <0.5536.0> Feature flags: all running remote nodes support `[drop_unroutable_metric]`
> 2019-10-22 20:07:52.496 [info] <0.7342.1> Feature flags: all running remote nodes support `[virtual_host_metadata]`

There is evidence of client connections being lost:

> 2019-10-22 20:17:08.256 [warning] <0.2576.0> closing AMQP connection <0.2576.0> (10.0.0.47:41850 -> 10.0.0.51:5672 - rabbitConnectionFactory#54dcfa5a:5, vhost: '/', user: 'guest'):
> rabbitmq1.1.whijp95w7cag922u7zoj ec2-34-249-223-175.eu-west-1.compute.amazonaws.com  client unexpectedly closed TCP connection
> 2019-10-22 20:17:29.741 [info] <0.11014.1> accepting AMQP connection <0.11014.1> (10.0.0.72:36615 -> 10.0.0.51:5672)
> 2019-10-22 20:17:29.894 [info] <0.11014.1> Connection <0.11014.1> (10.0.0.72:36615 -> 10.0.0.51:5672) has a client-provided name: rabbitConnectionFactory#1817f1eb:0
> 2019-10-22 20:17:29.904 [info] <0.11014.1> connection <0.11014.1> (10.0.0.72:36615 -> 10.0.0.51:5672 - rabbitConnectionFactory#1817f1eb:0): user 'guest' authenticated and granted access to vhost '/'
> 2019-10-22 20:25:11.148 [warning] <0.2862.0> closing AMQP connection <0.2862.0> (10.0.0.41:45051 -> 10.0.0.51:5672 - rabbitConnectionFactory#2938127d:5, vhost: '/', user: 'guest'):
> rabbitmq1.1.whijp95w7cag922u7zoj ec2-34-249-223-175.eu-west-1.compute.amazonaws.com  client unexpectedly closed TCP connection
> 2019-10-22 20:26:06.498 [info] <0.13450.1> accepting AMQP connection <0.13450.1> (10.0.0.75:44418 -> 10.0.0.51:5672)
> 2019-10-22 20:26:06.568 [info] <0.13450.1> Connection <0.13450.1> (10.0.0.75:44418 -> 10.0.0.51:5672) has a client-provided name: rabbitConnectionFactory#58ba5b30:0
> 2019-10-22 20:26:06.576 [info] <0.13450.1> connection <0.13450.1> (10.0.0.75:44418 -> 10.0.0.51:5672 - rabbitConnectionFactory#58ba5b30:0): user 'guest' authenticated and granted access to vhost '/'

I'm not sure how a consumer restart would help with node 2 not being registered with epmd. Something doesn't add up here.

Aldo Bongio

Oct 23, 2019, 3:42:51 PM
to rabbitmq-users
Hi Luke, thanks for the help.
I exported the output of 'dmesg -T' for the 3 servers (attached here) but unfortunately:
  • on Oct 22 around the 19:58 interval (freeze) there is nothing
  • same for Oct 23 around 13:12 (freeze) and 13:23 (freeze) 
Everything in the dmesg output related to "docker_gwbridge" and "veth..." corresponds to start/stop or scale up/down of Docker containers within the swarm, performed yesterday and today. Unfortunately the time intervals involved are very different from those of the freeze events.

BTW I agree with you that the issue is related to the network interfaces of the environment. I will continue to investigate.

Aldo Bongio

Oct 23, 2019, 3:48:46 PM
to rabbitmq-users
Ok, 3.8.1 seems promising. I've just performed a smooth rolling upgrade of the cluster from 3.8.0 to 3.8.1-beta1.
The official Docker image I used was built 5 days ago, so I think the pull request you mentioned is included.

Let's see how this perform.

Thank you for your time and help.

Aldo Bongio

Oct 23, 2019, 5:15:16 PM
to rabbitmq-users
Ok, a freeze happened just now.
It resolved itself in 2-3 minutes.
Attached are the Grafana Overview and Erlang Distribution dashboards, where the freeze is evident: there are a lot of holes in the Prometheus metrics and, around 22:37, the outgoing message rate jumps from zero to 3x the usual rate.

The logs are very simple (and consistent with your pull request): just one line for node 2 and nothing for nodes 1 and 3.
2019-10-23 20:34:56.156 [error] <0.479.0> epmd monitor failed to retrieve our port from epmd: nxdomain
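Next time I will also try hitting the node's built-in Prometheus endpoint directly during the freeze, to tell a scraping problem apart from the node itself not answering; roughly (the guest credentials are the ones visible in our logs, adjust as needed):

# Query the rabbitmq_prometheus endpoint directly (default port 15692) with a short timeout
curl -sS -m 5 http://rabbitmq2:15692/metrics | head -n 20
# And the management API that the old external exporter used, for comparison
curl -sS -m 5 -u guest:guest http://rabbitmq2:15672/api/overview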
grafana1.png
grafana2.png
logs.zip

Aldo Bongio

Oct 24, 2019, 3:24:37 AM
to rabbitmq-users
If this can help, I have new logs.
  • At 00:03 we got a freeze, with only node 2 logging "epmd monitor failed to retrieve our port from epmd: nxdomain"
  • At 01:42, 01:44, 03:32 we got a freeze, with only node 3 logging "epmd monitor failed to retrieve our port from epmd: nxdomain"
  • At 02:29:40 there is a network partition, and all server logs seem consistent with it. One minute before the partition (02:28:38) node 2 logged "epmd monitor failed to retrieve our port from epmd: nxdomain"

logs2.zip

Aldo Bongio

Oct 24, 2019, 6:03:25 AM
to rabbitmq-users
The missing attachment
dmesg.zip

Gerhard Lazu

Oct 25, 2019, 10:18:17 AM
to rabbitmq-users
Hi,

I would like to take a closer look at your metrics. Could you please share dashboard snapshots? This is how: https://grafana.com/docs/reference/sharing/#dashboard-snapshot A local snapshot is fine if it's publicly accessible; otherwise sharing via snapshot.raintank.com is fine.

Taking a quick look at your logs now.

Aldo Bongio

Oct 25, 2019, 11:04:35 AM
to rabbitmq-users
Hi Gerhard, I sent you a private message with the snapshots. Please let me know if you received them ... I can't see any feedback about the private message in the Google Groups history, so I want to be sure you received it.

Thanks!
Aldo

Gerhard Lazu

Oct 25, 2019, 11:22:13 AM
to rabbitmq-users
This dmesg.zip contained the most helpful logs, specifically the IPVS-related lines, e.g. 

[Wed Oct 23 19:03:42 2019] IPVS: rr: FWM 330 0x0000014A - no destination available

This suggests to me that there is a genuine networking issue in your Swarm cluster, related to IPVS [1] and DNS resolution [2]. Until you understand how/why this fails, this will continue happening. This article may come in handy [3] - I am not making assumptions.

When dealing with unreliable networks, mirrored queues and pause_minority are not the best combination. I would recommend looking into Quorum Queues [4] and sticking with ignore, the default network partition handling strategy [5].

Aldo Bongio

Oct 25, 2019, 12:24:37 PM
to rabbitmq-users
Hi Gerhard, I'm quite sure that we have networking issues, and we hope to solve them when we migrate to the latest stable version of Docker Swarm, which is planned in a few days.
About the specific IPVS-related lines, I checked a few of them and they all seem legitimate: they were produced just when we performed upgrades of swarm services, including when we switched from RabbitMQ 3.7.18 to 3.8.0, from 3.8.0 to 3.8.1-beta.1 and, yesterday, when we restarted 3.8.1-beta.1 with the Alpine-based Docker image instead of the usual Ubuntu-based one. "IPVS: no destination available" seems to be a usual log message when you kill/restart a Docker service.

Thanks

Gerhard Lazu

Oct 25, 2019, 12:46:39 PM
to rabbitmq-users
Thanks for sharing the dashboard snapshots.

They helped me notice a couple of things:

1. rabbitmq2 is intermittently unavailable during freeze "blackout" periods, specifically from=1571883700123&to=1571884565812 in the dashboard snapshots

This means that even Prometheus couldn't get metrics from rabbitmq2 during this period

2. All connections were open on rabbitmq2 before the blackout, and stuck with this node until eventually shifting to rabbitmq1

Running all connections through a single node doesn't help with service availability, since all clients will stop working if that one node goes away. I would recommend picking a node URL at random when connecting to RabbitMQ, so that connections are evenly distributed across the entire cluster.

3. All queue masters end up on a single node

Before the blackout, all queue masters were running on rabbitmq2, and after the blackout they all migrated to rabbitmq1. If you are using the default queue placement strategy, client-local, and all connections end up on a single node, then all queue masters will end up on the same node as well. This means that when that node goes away, all connections stop working and all queues will also stop working until mirrors (or followers in the case of quorum queues) get promoted to masters (leaders).


Implementing the recommendations above will help with service availability, especially if you decide to combine them with ignore & Quorum Queues instead of pause_minority & Mirrored Queues. I believe that this will result in a more robust RabbitMQ service for you, in spite of the underlying networking issues.

Let us know how it goes!

Gerhard Lazu

Oct 25, 2019, 1:03:11 PM
to rabbitmq-users
Hi,

About the specific IPVS related lines I checked few of them and they seems  all legit: they were produced just when we performed upgrades on swarm services, including when we switched from RabbitMQ 3.7.18 to 3.8.0, from 3.8.0 to 3.8.1-beta.1 and, yesterday, when we restarted 3.8.1-beta.1 with the alpine based Docker images instead of the usual Ubuntu based. "IPVS No destination available" seems an usual log message when you kill/restart some docker service.

That makes sense. What about the nxdomain errors? Does this help? https://github.com/moby/moby/issues/30134

Any reason why you chose the Alpine-based image? I added these notes some time ago - https://github.com/docker-library/rabbitmq/blob/master/Dockerfile-ubuntu.template#L1-L2 - and we touched on this subject with Tianon: https://github.com/docker-library/rabbitmq/issues/277#issuecomment-446668232 .

Aldo Bongio

Oct 25, 2019, 2:00:37 PM
to rabbitmq-users
Hi, moby issue 30134 seems to indicate that Docker Swarm networking is really "complex". BTW we are not using dnsrr mode, which seems to be the cause of the moby issue. I'm going to re-read the issue and the related issues to see if there is some other hint.
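To double-check, the endpoint mode can be read straight from the service definition, something like this (the service names are placeholders for ours; the value may be empty when the default, vip, is in use):

# Confirm that each RabbitMQ service uses VIP endpoints rather than dnsrr
for svc in rabbitmq1 rabbitmq2 rabbitmq3; do
  docker service inspect --format '{{.Spec.Name}}: {{.Spec.EndpointSpec.Mode}}' "$svc"
done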

Alpine was just an experiment. We used the Ubuntu-based image for 3.7.18, 3.8.0 and 3.8.1-beta1. Hoping to "change" something in the underlying infrastructure we tried the Alpine-based Docker image, without luck. As I explained at the beginning of the thread, we run dozens of services in the same swarm, some clustered and some scaled as replicas across the 3 nodes, and unfortunately only RabbitMQ is showing such a "strange" networking issue. I'm pretty sure the problem is in the swarm networking and not in RabbitMQ itself - we are preparing to perform a full migration to a new Docker Swarm version, but in the meanwhile we are trying any "quick" change that might remove or reduce the problem. Alpine was just one such attempt. BTW I have just rolled back to the Ubuntu image, so we are now running 3.8.1-beta1 on Ubuntu.

Aldo Bongio

Oct 25, 2019, 3:03:22 PM
to rabbitmq-users
1. Yes, it was the same with the old external Prometheus exporter and it didn't change with the new embedded exporter

2. Node URLs are specified using the Spring AMQP config property, where all the nodes are listed:
spring.data.rabbitmq.addresses=rabbitmq1:5672,rabbitmq2:5672,rabbitmq3:5672
Ok, good spot: I just discovered that the list order matters. I need to shuffle it.

3. Ok, I tried a rebalance: on node 1 I executed
rabbitmq-queues rebalance "all" --vhost-pattern "/" --queue-pattern ".*"
the output was
Rebalancing all queues...
┌────────────────────┬─────────────────────────┬──────────────────────────┐
│ Node name          │ Number of quorum queues │ Number of classic queues │
├────────────────────┼─────────────────────────┼──────────────────────────┤
│ rabbitmq@rabbitmq3 │ 0                       │ 19                       │
├────────────────────┼─────────────────────────┼──────────────────────────┤
│ rabbitmq@rabbitmq2 │ 0                       │ 11                       │
├────────────────────┼─────────────────────────┼──────────────────────────┤
│ rabbitmq@rabbitmq1 │ 0                       │ 27                       │
└────────────────────┴─────────────────────────┴──────────────────────────┘
I hope this can reduce the pressure on the nodes. I will look into quorum queues.

Thanks !

Gerhard Lazu

Oct 27, 2019, 6:31:17 PM
to rabbitmq-users
Hi,


As I explained at the begin of the thread we run dozens of services in the same swarm, few with clustering and few scaled as replica in the 3 nodes and unfortunately only RabbitMQ is showing such "strange" networking issue.

Are you using Prometheus to periodically pull metrics from other services running in the same Swarm? If yes, is the polling interval the same as for RabbitMQ? If both are true, then I would expect you to see gaps in metrics for other services.
 
I'm pretty sure the problem is on the swarm networking and not on RabbitMQ itself - we are preparing to perform a full migration to a new Docker Swarm version,

Are there any metrics from the Swarm networking that you can load into Prometheus? It would be especially interesting to see the DNS resolution success/error rate.
 
but in the meanwhile we are performing any "quick" attempt in order to "break" or reduce the problem.

I like your thinking. I would suggest forcing all RabbitMQ nodes onto the same Swarm node and seeing if that makes the errors go away. It would also be interesting to try placing each RabbitMQ node on its own dedicated Swarm node. This is a long shot, but it's entirely possible that when there is networking pressure in the system - maybe routing table updates, maybe just a peak in packet routing - networking instability is introduced which makes RabbitMQ nodes misbehave. Metrics and logs from the networking layer will be able to make this more than a hunch.
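As a rough sketch (the service and hostname values below are placeholders, not your actual names), pinning a RabbitMQ service to a specific Swarm node could look like:

# Pin a RabbitMQ service to one Swarm node via a placement constraint
docker service update --constraint-add 'node.hostname == swarm-host-1' rabbitmq1
# And remove the constraint once the experiment is over
docker service update --constraint-rm 'node.hostname == swarm-host-1' rabbitmq1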

Gerhard Lazu

Oct 27, 2019, 6:51:26 PM
to rabbitmq-users
Hi again.
 
2. Node URLs are specified using the Spring AMQP config property, where all the nodes are listed:
spring.data.rabbitmq.addresses=rabbitmq1:5672,rabbitmq2:5672,rabbitmq3:5672
Ok good spot: I just discovered that the list order is important. I need to use shuffle.

This is a good practice that will keep the majority of clients working when a node goes away. It's worth continuing even after this issue is solved.
 
3. Ok I tried a rebalance: on node 1 i executed
rabbitmq-queues rebalance "all" --vhost-pattern "/" --queue-pattern ".*"


I think rabbitmq-queues rebalance all should be sufficient. It's worth running this after every node gets restarted, otherwise your cluster will eventually end up with all queue masters/leaders on a single node. When this node eventually goes away, all queues will be unavailable for longer than if they were spread evenly across all nodes.
 
I hope that this can reduce the pressure on nodes. I will check quorum queues.

This is only true if the queues are putting pressure on the cluster. Based on your traffic, I don't see this being the case. I mentioned this command from the perspective of minimising disruption when a node goes away, or experiences networking issues. Since the queues are spread, only the queue masters/leaders on the affected node will end up migrating.

Quorum queues fail over predictably, and transfer less data when replicating messages than mirrored queues do. If this is important to your system, I would encourage using them in lieu of mirrored queues. This section sums it up nicely: https://www.rabbitmq.com/quorum-queues.html#use-cases
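For reference, a quick sketch of declaring one over the HTTP API with rabbitmqadmin (the queue name, host and credentials are placeholders; clients can equally pass the same x-queue-type argument when declaring):

# Declare a durable quorum queue; rabbitmqadmin ships with the management plugin
rabbitmqadmin --host rabbitmq1 --username guest --password guest \
  declare queue name=orders durable=true arguments='{"x-queue-type": "quorum"}'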

Aldo Bongio

Oct 28, 2019, 5:06:11 AM
to rabbitmq-users
Hi


On Sunday, 27 October 2019 at 23:31:17 UTC+1, Gerhard Lazu wrote:
Hi,

As I explained at the begin of the thread we run dozens of services in the same swarm, few with clustering and few scaled as replica in the 3 nodes and unfortunately only RabbitMQ is showing such "strange" networking issue.

Are you using Prometheus to periodically pull metrics from other services running in the same Swarm? If yes, is the polling interval the same as for RabbitMQ? If both are true, then I would expect you to see gaps in metrics for other services.
Yes, everything is under Prometheus metric scraping. The interval is the same (scrape_interval: 15s, scrape_timeout: 15s). But again, there are no gaps: I can zoom into each Grafana dashboard and I can't see any gaps except the RabbitMQ ones.
 
 
I'm pretty sure the problem is on the swarm networking and not on RabbitMQ itself - we are preparing to perform a full migration to a new Docker Swarm version,

Are there any metrics from the Swarm networking that you can load into Prometheus? It would be especially interesting to see the DNS resolution success/error rate.
 
Ok I will check on this.
 
but in the meanwhile we are performing any "quick" attempt in order to "break" or reduce the problem.

I like your thinking. I would suggest to force all RabbitMQ nodes on the same Swarm node, and see if that makes the errors go away. It would be also interesting to try and place RabbitMQ nodes on their dedicated Swarm node. This is a long-shot, but it's entirely possible that when there is networking pressure in the system - maybe routing table updates, maybe just a peak in packet routing - then networking instability is introduced which makes RabbitMQ nodes misbehave. Metrics and logs from the networking layer will be able to make this more than a hunch.
Indeed, after the swarm upgrade, and in case the problem persists, we can try both of the strategies you suggested.

Thanks
 


Aldo Bongio

Nov 8, 2019, 10:12:29 AM
to rabbitmq-users
Hi all, 
  an update on the issue described in this thread for everyone concerned, i.e. for all the people (Michael, Luke, Gerhard) who helped me track down the problem.

We finally migrated the platform to new servers and, more importantly, to the latest stable Docker (Swarm) engine.
As expected, the networking issue resolved itself: after 5 days of uptime not a single 'epmd monitor failed to retrieve our port from epmd: nxdomain' error was produced. Moreover, we haven't experienced any network partitions.

Thanks for all the help and keep up the good work!

Gerhard Lazu

Nov 8, 2019, 12:08:51 PM
to rabbitmq-users
Hi Aldo, 
 
We finally migrated the platform to new servers and, more importantly, to the latest stable Docker (Swarm) engine.
As expected, the networking issue resolved itself: after 5 days of uptime not a single 'epmd monitor failed to retrieve our port from epmd: nxdomain' error was produced. Moreover, we haven't experienced any network partitions.

That is great news and an amazing way to start the weekend.

Thanks for the update, happy RabbitMQ-ing!