Possible Regression in 3.8.4 when restoring existing exchanges of type x-consistent-hash


Philipp Böhm

May 29, 2020, 10:42:48 AM
to rabbitmq-users
Hello,

I've encountered a possible regression introduced in RabbitMQ 3.8.4: previously created exchanges of type x-consistent-hash could not be restored, apparently because the enabled plugins seem to be loaded later in the startup process than in 3.8.3. I'm currently running a single-node RabbitMQ in a developer setup using docker-compose with the following docker-compose.yml file:

version: "3"
services:
  rabbitmq:
    image: "rabbitmq:3.8.3-management-alpine"
    hostname: "rabbitmq"
    environment:
      RABBITMQ_ERLANG_COOKIE: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
      RABBITMQ_DEFAULT_USER: "rabbitmq"
      RABBITMQ_DEFAULT_PASS: "rabbitmq"
      RABBITMQ_DEFAULT_VHOST: "/"
    labels:
      NAME: "rabbitmq"
    volumes:
      - "${PWD}/.rabbitmq_mount:/var/lib/rabbitmq"
      - "${PWD}/rabbitmq_enabled_plugins:/etc/rabbitmq/enabled_plugins"

The rabbitmq_enabled_plugins file contains the following:

[rabbitmq_management,rabbitmq_consistent_hash_exchange,rabbitmq_prometheus].

With the rabbitmq:3.8.3-management-alpine Docker image and pre-existing durable consistent hash exchanges, RabbitMQ starts up normally (logs attached in 3_8_3_no_errors_when_restoring_existing_consistent_hash_exchanges.txt). But when I change the image to rabbitmq:3.8.4-management-alpine, RabbitMQ can't restore the pre-existing durable consistent hash exchanges and prints warnings like [warning] <0.574.0> Could not find exchange type x-consistent-hash (logs attached in 3_8_4_error_restoring_existing_consistent_hash_exchanges.txt). 3.8.4 also seems to open the TCP port before all plugins have finished loading (the logs show connections being accepted before the plugins are loaded), which produces errors for clients that try to declare consistent hash exchanges.

Can anybody reproduce this or relate it to a change introduced in 3.8.4?

Thank you
3_8_3_no_errors_when_restoring_existing_consistent_hash_exchanges.txt
3_8_4_error_restoring_existing_consistent_hash_exchanges.txt

Michael Klishin

May 29, 2020, 1:16:15 PM
to rabbitm...@googlegroups.com

The message says that the exchange type was not registered at the time a client attempted to use it, not that an exchange could not be recovered. In the log, the client attempts to perform an operation even before feature flags are settled, which is early during boot.

While plugin activation and a lot of other node startup bits were heavily refactored in 3.8.4, it has always been possible for a client to try to do something in parallel with plugin activation.

Can you confirm that the plugin is enabled and becomes functional at some point?


Michael Klishin

May 29, 2020, 2:18:25 PM
to rabbitm...@googlegroups.com

Assuming that your development environment starts a node then immediately connects to it, simply introducing a 2-3 second delay before clients start any activity will help, as the plugin will start and register its exchange type before the node services its first connection.
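
For example, if your clients are also started via docker-compose, something along these lines would do (the "app" service name and its command are only placeholders for your actual client service):

services:
  app:
    # hypothetical client service: wait a few seconds so the
    # rabbitmq_consistent_hash_exchange plugin can register its exchange
    # type before the application connects
    command: ["sh", "-c", "sleep 3 && python run.py"]

Any equivalent delay in the application's own startup logic works just as well.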

Philipp Böhm

May 29, 2020, 3:51:41 PM
to rabbitmq-users
To me it looks like these warnings are created well before the first client connection happens. Based on the 3.8.4 log file, the first warning is logged at 2020-05-29 13:58:44.421, but the first client connection only happens at 2020-05-29 13:58:45.693. So it looks like the broker tries to restore all durable exchanges and queues but can't restore the x-consistent-hash exchange (I only have one declared) because the plugin is not loaded yet.

But I can confirm that after the broker logs the message "Server startup complete; 5 plugins started." the plugin is active and the consistent hash exchange also exists. If the delayed/parallel initialization of plugins is expected behaviour in 3.8.4, I have to improve the startup of my client applications that rely on exchanges from plugins.


Philipp Böhm

May 29, 2020, 3:58:09 PM
to rabbitmq-users
I'm currently using this script https://github.com/eficode/wait-for in my client applications' Docker containers:

command: ["sh", "-c", "wait-for -t 600 rabbitmq:5672 -- python run.py"]

So for clients that depend on custom exchange types, I have to add a sleep 5 or sleep 10 before starting the actual application.
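
Something like this, roughly (the sleep value is arbitrary, and the inner sh -c is only there so wait-for can hand off to a compound command):

command: ["sh", "-c", "wait-for -t 600 rabbitmq:5672 -- sh -c 'sleep 10 && python run.py'"]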


Michael Klishin

May 29, 2020, 4:15:15 PM
to rabbitm...@googlegroups.com

This is a reasonable hypothesis. We will find out next week. Debug logging should reveal a fair amount of information, so even tracing the node likely won't be necessary. However, if data recovery failed for the plugin I'd actually expect this to be a lasting condition. It is very likely that a single client operation fails, the channel is closed, and the rest of the system doesn't even notice. Then the following connections succeed. We have seen this in different scenarios before, e.g. when a virtual host was not started on some nodes but a client already tried to use it on others.

It's not that we tried to parallelize plugin activation. RabbitMQ nodes do quite a bit on boot (any data service does), and simply by virtue of being an Erlang-based system, given enough CPU cores some of these operations will naturally run in parallel. Most of the time this is fine. Other things are made explicitly sequential. Data recovery and plugin activation have a bit of a chicken-and-egg problem (and we have many more during node boot).

However, we do not control client connections or CLI tool activity during boot. We can, and do try to, enable client listeners at a very late point in the boot. It looks like the recent refactoring of configuration and environment variable setup/discovery has affected boot step execution. But the problem has always existed in one way or another: unless we synchronise absolutely everything to finish before we start client connection listeners, there is always a probability that a client connection and operation happen "too early". Improvements in this area can require changes too extensive to justify for our small team.

So I'd assume that a node cannot be used immediately after it can accept client TCP connections. Usually a delay of a second or two would not make any practical difference.


Michael Klishin

May 29, 2020, 4:17:38 PM
to rabbitm...@googlegroups.com

See

rabbitmqctl help await_startup

It may or may not be sufficiently different, but at the very least it is aware of the node's boot process (it returns when an internal "I am booting" marker on the node is removed), unlike something that only tests for a TCP acceptor to be up.

Listening on a port does not mean the service is fully ready to serve traffic on it (in general, not just in the case of RabbitMQ).
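
As a rough sketch (the health check values are examples, and whether depends_on conditions are honoured depends on your Compose version), the broker container could expose this via a healthcheck so that client containers only start once the boot has actually completed:

services:
  rabbitmq:
    healthcheck:
      # await_startup returns only once the node's internal "I am booting"
      # marker is removed, i.e. after plugins have been activated
      test: ["CMD", "rabbitmqctl", "await_startup"]
      interval: 10s
      timeout: 30s
      retries: 10
  app:
    depends_on:
      rabbitmq:
        condition: service_healthy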


Michael Klishin

Jun 12, 2020, 2:05:09 AM
to rabbitmq-users
While investigating something else, we did discover that indeed a major node startup refactoring in 3.8.4 affects the consistent hashing
exchange in a specific scenario [1]. This plugin is unique in that it is stateful and its state is repopulated when topology is recovered
in the core. So in environments where applications do not redeclare the entire topology and exchanges are durable, there is
a chicken-and-egg kind of dependency between data recovery and plugin activation.

