RabbitMQ HA Pacemaker OCF RA no longer seems to work with rabbitmq > 3.7 and Erlang > 21

Bohdan Dobrelia

Jun 30, 2021, 8:13:21 AM
to rabbitmq-users
Hi,
As you may know, changes submitted against the OCF RA [0] have been smoke-tested in Travis CI, like [1]. That job uses docker images based on Debian Buster, built by [2], with:

rabbitmq-server (3.7.8-4)
pacemaker amd64 2.0.1-5+deb10u1
corosync (3.0.1-2+deb10u1)

...which is unfortunately hopelessly outdated.
So I wanted to update rabbitmq to 3.8.18 and Erlang to 23 or 24. So far I have had no luck: the Pacemaker resource fails to start, logging this in /var/log/corosync/corosync.log:

lrmd:    INFO: p_rabbitmq-server[998]: get_status(): failed with code 69. Command output: Error: unable to perform an operation on node 'rabbit@n1'...
...attempted to contact: [rabbit@n1]
rabbit@n1:
  * connected to epmd (port 4369) on n1
  * epmd reports: node 'rabbit' not running at all
                  no other nodes on n1
  * suggestion: start the node
Current node details:
 * node name: 'rabbitmqcli-947-rabbit@n1'

But it starts just fine via the systemd unit shipped with the package.
I *hope* I can fix that so the OCF RA gets CI-tested against rabbitmq 3.8.18, but I am not certain.
So any help is greatly appreciated. You can bootstrap a local setup with Vagrant and [3], using the testing image set in vagrant-settings.yaml:

docker_image: bogdando/rabbitmq-cluster-ocf:testing

There is also a README with more details. Thank you for reading to the end.

Bohdan Dobrelia

Jun 30, 2021, 9:47:22 AM
to rabbitmq-users
I think I found two related problems:
1) beam.smp no longer writes a pidfile after it starts, while the OCF RA's start_beam_process() expects one at '/var/run/rabbitmq/pid'.
2) The command-line parameters passed to beam.smp no longer contain '-sname rabbit@<nodename>' or '-mnesia dir "/var/lib/rabbitmq/mnesia/rabbit@<nodename>"', while the OCF RA looks for a match of the regex 'beam.*rabbit@<nodename>' in order to implement the stop action.
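
To illustrate the second problem, here is a minimal sketch of the kind of match described above. The function name and both command lines are made up for illustration (they are heavily abbreviated and not copied from a real process list or from the OCF RA); only the 'beam.*rabbit@<nodename>' pattern comes from this thread.

```shell
#!/bin/sh
# Hypothetical helper: returns 0 when a process command line matches
# the pattern 'beam.*rabbit@<nodename>' mentioned in this thread.
matches_rabbit_beam() {
    cmdline=$1
    nodename=$2
    printf '%s\n' "$cmdline" | grep -q "beam.*rabbit@${nodename}"
}

# Abbreviated, invented command lines, for illustration only.
# "old" carries the node name on the command line, "new" does not.
old_cmdline='/usr/lib/erlang/erts/bin/beam.smp -sname rabbit@n1 -mnesia dir "/var/lib/rabbitmq/mnesia/rabbit@n1"'
new_cmdline='/usr/lib/erlang/erts/bin/beam.smp -noinput -s rabbit boot'

for c in "$old_cmdline" "$new_cmdline"; do
    if matches_rabbit_beam "$c" n1; then
        echo "match"
    else
        echo "no match"
    fi
done
```

With the node name missing from the command line, the pattern finds nothing, so a stop action built on it has no process to act on.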

Bohdan Dobrelia

Jun 30, 2021, 10:54:56 AM
to rabbitmq-users
Another problem is that 'RABBITMQ_NODE_ONLY=true /usr/sbin/rabbitmq-server' starts the server differently than it did on older versions. The OCF RA starts the node only, then starts the application. This used to work, but no longer does. What is the way today to start the node without start_app being performed automatically?

Bohdan Dobrelia

Jun 30, 2021, 11:02:26 AM
to rabbitmq-users
OK, I think I found the solution and now know how to fix the OCF RA. The command that works is (n1 is the node name here):

RABBITMQ_SERVER_START_ARGS="-mnesia dir \"/var/lib/rabbitmq/mnesia/rabbit\@n1\" -sname rabbit\@n1" RABBITMQ_NODE_ONLY=1 ... /usr/sbin/rabbitmq-server

After that I can do start_app. The pid file location has also changed, to "/var/lib/rabbitmq/mnesia/rabbit\@n1.pid", so the OCF RA should be adjusted for that as well.
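
A rough sketch of how the per-node values from the command above could be derived from a node name. The helper names and the MNESIA_BASE variable are hypothetical and not taken from the OCF RA. The backslash before '@' is dropped here on the assumption that it was only there to escape the address-like string for the mailing list. Whatever the '...' in the command stands for is left out.

```shell
#!/bin/sh
# Hypothetical helpers; names are illustrative, not from the OCF RA.

# Mnesia base directory used in the command from this thread.
MNESIA_BASE=/var/lib/rabbitmq/mnesia

# Print the value for RABBITMQ_SERVER_START_ARGS for the given node name.
start_args_for() {
    printf '%s\n' "-mnesia dir \"$MNESIA_BASE/rabbit@$1\" -sname rabbit@$1"
}

# Print the pid file path reported above for the given node name.
pidfile_for() {
    printf '%s\n' "$MNESIA_BASE/rabbit@$1.pid"
}

# Dry run: only print the values, do not start anything.
node=n1
echo "RABBITMQ_SERVER_START_ARGS=$(start_args_for "$node")"
echo "pidfile=$(pidfile_for "$node")"
```

The script only prints the values. Actually starting the node would mean exporting them together with RABBITMQ_NODE_ONLY=1, running /usr/sbin/rabbitmq-server, and then running start_app (for example with 'rabbitmqctl start_app') once the node is up.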

Stay tuned folks :)

Bohdan Dobrelia

Jul 2, 2021, 5:22:02 AM
to rabbitmq-users
The good news is that https://github.com/rabbitmq/rabbitmq-server/pull/3166 fixes the issue. I tested it with rabbitmq-server 3.8.18-1 and Erlang 24. It also worked for v3.7.8.
The less exciting news is that Travis CI no longer triggers. I'll look into moving the OCF RA testing to Bazel and BuildBuddy, which test PRs against rabbitmq-server today.

Bohdan Dobrelia

Oct 1, 2021, 7:10:56 AM
to rabbitmq-users
Hello to everyone who still uses Pacemaker!

I've proposed to move the OCF resource agent script into a new home:

Even though I fixed it to work with rabbitmq-server 3.8.18-1 and Erlang 24, as well as older versions, it no longer seems possible to auto-test it in the existing CI system. I would deeply appreciate any assistance with integrating it there and having OCF RA changes tested against local builds of RabbitMQ borrowed from the neighboring CI jobs, but I have no idea how I would do that myself.
So let's hope for a bright future in its new home!
