Stream replication floods logs with stack traces


Maciej Harczuk

Aug 24, 2022, 1:26:28 PM
to rabbitmq-users
Hi,
I have a 3-node cluster running inside Docker (each container runs on a different VM) using --network=host. After upgrading to 3.10.7, the RabbitMQ logs are getting spammed with the following message/stack trace:

2022-08-24 14:28:21.590397+00:00 [info] <0.3134.0> STREAM_NAME_1661200157105845916 [osiris_replica:init/1] next offset 0, tail info {0,empty}
2022-08-24 14:28:21.596951+00:00 [error] <0.3134.0>   crasher:
2022-08-24 14:28:21.596951+00:00 [error] <0.3134.0>     initial call: osiris_replica:init/1
2022-08-24 14:28:21.596951+00:00 [error] <0.3134.0>     pid: <0.3134.0>
2022-08-24 14:28:21.596951+00:00 [error] <0.3134.0>     registered_name: []
2022-08-24 14:28:21.596951+00:00 [error] <0.3134.0>     exception error: no case clause matching
2022-08-24 14:28:21.596951+00:00 [error] <0.3134.0>                      {error,
2022-08-24 14:28:21.596951+00:00 [error] <0.3134.0>                       {connection_refused,
2022-08-24 14:28:21.596951+00:00 [error] <0.3134.0>                        {child,undefined,#Ref<0.2446360502.61865985.142836>,
2022-08-24 14:28:21.596951+00:00 [error] <0.3134.0>                         {osiris_replica_reader,start_link,
2022-08-24 14:28:21.596951+00:00 [error] <0.3134.0>                          [#{connection_token =>
2022-08-24 14:28:21.596951+00:00 [error] <0.3134.0>                              <<15,62,59,206,36,84,92,90,173,46,8,157,65,39,90,
2022-08-24 14:28:21.596951+00:00 [error] <0.3134.0>                                234,81,254,166,85,32,226,234,76,27,52,247,215,
2022-08-24 14:28:21.596951+00:00 [error] <0.3134.0>                                82,95,175,240>>,
2022-08-24 14:28:21.596951+00:00 [error] <0.3134.0>                             hosts =>
2022-08-24 14:28:21.596951+00:00 [error] <0.3134.0>                              ["VM_NAME",
2022-08-24 14:28:21.596951+00:00 [error] <0.3134.0>                               "DNS_NAME_ASSIGNED_TO_VM",
2022-08-24 14:28:21.596951+00:00 [error] <0.3134.0>                               {VM_IP},
2022-08-24 14:28:21.596951+00:00 [error] <0.3134.0>                               {DOCKER_CONTAINER_IP}],
2022-08-24 14:28:21.596951+00:00 [error] <0.3134.0>                             leader_pid => <12910.31741.0>,
2022-08-24 14:28:21.596951+00:00 [error] <0.3134.0>                             name =>
2022-08-24 14:28:21.596951+00:00 [error] <0.3134.0>                              "STREAM_NAME_REDACTED",
2022-08-24 14:28:21.596951+00:00 [error] <0.3134.0>                             port => 6114,
2022-08-24 14:28:21.596951+00:00 [error] <0.3134.0>                             reference =>
2022-08-24 14:28:21.596951+00:00 [error] <0.3134.0>                              {resource,<<"VHOST_NAME">>,queue,
2022-08-24 14:28:21.596951+00:00 [error] <0.3134.0>                               <<"management.sync">>},
2022-08-24 14:28:21.596951+00:00 [error] <0.3134.0>                             replica_pid => <0.3134.0>,
2022-08-24 14:28:21.596951+00:00 [error] <0.3134.0>                             start_offset => {0,empty},
2022-08-24 14:28:21.596951+00:00 [error] <0.3134.0>                             transport => ssl}]},
2022-08-24 14:28:21.596951+00:00 [error] <0.3134.0>                         temporary,false,5000,worker,
2022-08-24 14:28:21.596951+00:00 [error] <0.3134.0>                         [osiris_replica_reader]}}}
2022-08-24 14:28:21.596951+00:00 [error] <0.3134.0>       in function  osiris_replica_reader:start/2 (src/osiris_replica_reader.erl, line 108)
2022-08-24 14:28:21.596951+00:00 [error] <0.3134.0>       in call from osiris_replica:init/1 (src/osiris_replica.erl, line 234)
2022-08-24 14:28:21.596951+00:00 [error] <0.3134.0>       in call from gen_server:init_it/2 (gen_server.erl, line 848)
2022-08-24 14:28:21.596951+00:00 [error] <0.3134.0>       in call from gen_server:init_it/6 (gen_server.erl, line 811)

It looks like the node is trying to connect to its own replication port, randomly chosen from the 6000-6500 range, but fails and instantly makes another attempt with a different port. I've verified that a connection to a port from that range can be established from within the container using both VM_NAME and DNS_NAME_ASSIGNED_TO_VM. Netstat shows that RabbitMQ starts listening on one of the 6000-6500 ports for a short while, then unbinds and tries another one (this happens more than once per second).
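
(For reference, the connectivity check was roughly the following, run from inside the container; VM_NAME / DNS_NAME_ASSIGNED_TO_VM are the same placeholders as above, and 6114 is just the example port from the stack trace:)

# plain TCP reachability check against a replication port
nc -vz VM_NAME 6114
nc -vz DNS_NAME_ASSIGNED_TO_VM 6114

Both report the port as open.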

Kind Regards,
Maciej

kjnilsson

Aug 31, 2022, 4:37:37 AM
to rabbitmq-users
Something must be preventing the connection from being established; the system is doing what it is designed to do (trying a different port if one fails). Can you enable debug logging and see if it logs any more details about the connection sequence?
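
A quick sketch of how to turn it on (either variant should do):

# bump the log level at runtime, no restart needed
rabbitmqctl set_log_level debug

# or persistently via rabbitmq.conf (requires a restart):
# log.file.level = debug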

Maciej Harczuk

Sep 6, 2022, 6:28:03 AM
to rabbitm...@googlegroups.com
Sure. I was able to reproduce the issue in a different environment and gather debug logs. Here's what I think is the interesting part:
2022-09-06 10:14:23.072705+00:00 [debug] <0.21105.2> osiris replica TLS connection refused, host:"rabbitmq-vm-1" - port: 6349
2022-09-06 10:14:23.072774+00:00 [debug] <0.21105.2> error while trying to establish TLS connection econnrefused
2022-09-06 10:14:23.072808+00:00 [warning] <0.21105.2> could not connect osiris to replica at ["rabbitmq-vm-1",
2022-09-06 10:14:23.072808+00:00 [warning] <0.21105.2>                                         "rabbitmq-1-redacated.domain.name",
2022-09-06 10:14:23.072808+00:00 [warning] <0.21105.2>                                         {eth0 IP},
2022-09-06 10:14:23.072808+00:00 [warning] <0.21105.2>                                         {docker0 IP}]

We have a bit of an unorthodox setup where the VM names don't match resolvable DNS names, and it looks like this might be causing the problem. However, streams in RabbitMQ 3.9.x work perfectly fine on the same set of VMs, so clearly something must have changed in 3.10.
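
(To illustrate the mismatch, this is roughly what it looks like on one of the VMs; the names below are the ones from the debug log above:)

hostname                                        # -> rabbitmq-vm-1, the short VM name
getent hosts rabbitmq-vm-1                      # does not resolve
getent hosts rabbitmq-1-redacated.domain.name   # resolves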

Arnaud Cogoluègnes

Sep 6, 2022, 8:09:57 AM
to rabbitmq-users
Are you using inter-node TLS [1]? If so, this will trigger stream replication over TLS in 3.10 [2]. That could explain the different behavior compared to 3.9, which has no TLS stream replication.
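
(By inter-node TLS I mean Erlang distribution over TLS, typically configured with something along these lines in rabbitmq-env.conf; the file paths are just placeholders:)

# rabbitmq-env.conf is sourced as a shell script; inter_node_tls.config
# is an Erlang term file holding the TLS options for the distribution
SERVER_ADDITIONAL_ERL_ARGS="-proto_dist inet_tls -ssl_dist_optfile /etc/rabbitmq/inter_node_tls.config"
CTL_ERL_ARGS="-proto_dist inet_tls -ssl_dist_optfile /etc/rabbitmq/inter_node_tls.config"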

We would need to know more about your setup, or even better, a way to reproduce it with e.g. tls-gen [3].

Maciej Harczuk

Sep 6, 2022, 8:22:13 AM
to rabbitmq-users
Yes, I am using inter-node TLS. 
I'll try to come up with a docker-compose setup. As a matter of fact, I already tried, but everything worked fine there. However, I didn't have the hostname vs. resolvable DNS name mismatch in there, which seems to be the key to reproducing the problem.
BTW, once I made the VM names resolvable by adding them to /etc/hosts, replication started working again.
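
Concretely, I added entries along these lines on every VM (the IPs, and the names of the 2nd and 3rd node, are placeholders):

# map the short VM names the nodes use to reachable addresses
cat >> /etc/hosts <<'EOF'
10.0.0.11 rabbitmq-vm-1
10.0.0.12 rabbitmq-vm-2
10.0.0.13 rabbitmq-vm-3
EOF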

Maciej Harczuk

Sep 7, 2022, 6:29:41 AM
to rabbitm...@googlegroups.com
Here's a docker-compose setup that reproduces the problem (after truncating /etc/resolv.conf in the containers): https://github.com/G-r-F/rabbimq-tls-stream-replication-issue
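
(The truncation itself is nothing special, roughly the following for each container; the container name is a placeholder:)

# empty /etc/resolv.conf inside a running container
docker exec rabbitmq1 sh -c 'truncate -s 0 /etc/resolv.conf'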

Arnaud Cogoluègnes

Sep 8, 2022, 10:18:50 AM
to rabbitmq-users
Thanks for the docker-compose setup. I managed to reproduce the problem and pushed a fix [1].

We built an image that contains the fix; you can try it: pivotalrabbitmq/rabbitmq:rabbitmq-20216ef-osiris-d601e94-otp-max.

Note that it's based on the latest RabbitMQ code base (roughly 3.11), so a couple of things have been deprecated (you can no longer use environment variables to seed a user). I've attached the files I changed.
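
(Seeding a user now goes through the configuration file instead; the idea is roughly the following, not necessarily the exact content of the attached rabbitmq.conf:)

# replaces the old RABBITMQ_DEFAULT_USER / RABBITMQ_DEFAULT_PASS
# environment variables of the Docker image
cat >> /etc/rabbitmq/rabbitmq.conf <<'EOF'
default_user = myuser
default_pass = mypassword
EOF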

Let me know if it also fixes the problem in your environment.
rabbitmq.conf
docker-compose.yml

Maciej Harczuk

Sep 9, 2022, 6:25:56 AM
to rabbitm...@googlegroups.com
I can confirm that the problem no longer exists in my test environment when using the image you provided.

Arnaud Cogoluègnes

Sep 9, 2022, 7:08:36 AM
to rabbitmq-users
Thanks for the feedback. We'll merge this and see if we can squeeze it into 3.10.8 and 3.11.0.