rabbitmq-server-ha.ocf RA fails to start the beam process and gets stuck in a loop

Oleksandr N
Jan 28, 2021, 9:41:12 AM
to rabbitmq-users
Hi everyone, 

I've run into an issue with the rabbitmq-server-ha.ocf script on Debian Buster.
rabbitmq-server  3.7.8-4
erlang-base      1:21.2.6+dfsg-1

After some investigation, the issue appears to be that the pause between starting the beam process and checking its status in rabbitmq-server-ha.ocf is too short.
The beam process does not have time to start, so get_status() fails with code 69. The beam process is then killed because of the bad status, so 'rabbitmqctl start_app' cannot run, and the resource agent falls into a loop.

Here are the RA logs:
Jan 27 11:12:41 cs010 lrmd[28031]: INFO: HaRabbitMQ[27971]: start: action begin.
Jan 27 11:12:42 cs010 lrmd[28325]: INFO: HaRabbitMQ[27971]: get_status(): failed with code 69. Command output: Error: unable to perform an operation on node 'rabbit@cs010'. Please see diagnostics information and suggestions below.
Jan 27 11:12:42 cs010 lrmd[28336]: INFO: HaRabbitMQ[27971]: start: Setting phase 1 one start time to 1611745962
Jan 27 11:12:42 cs010 lrmd[28341]: INFO: HaRabbitMQ[27971]: start: Deleting start time attribute
Jan 27 11:12:42 cs010 lrmd[28346]: INFO: HaRabbitMQ[27971]: start: Deleting master attribute
Jan 27 11:12:42 cs010 lrmd[28351]: INFO: HaRabbitMQ[27971]: start: RMQ going to start.
Jan 27 11:12:42 cs010 lrmd[28355]: INFO: HaRabbitMQ[27971]: start_rmq_server_app(): begin.
Jan 27 11:12:43 cs010 lrmd[28385]: INFO: HaRabbitMQ[27971]: start_rmq_server_app(): blocked access to RMQ port
Jan 27 11:12:43 cs010 lrmd[28684]: INFO: HaRabbitMQ[27971]: get_status(): failed with code 69. Command output: Error: unable to perform an operation on node 'rabbit@cs010'. Please see diagnostics information and suggestions below.
Jan 27 11:12:43 cs010 lrmd[28688]: INFO: HaRabbitMQ[27971]: start_rmq_server_app(): RMQ-runtime (beam) not started, starting...
Jan 27 11:12:44 cs010 lrmd[29406]: INFO: HaRabbitMQ[27971]: start_rmq_server_app(): RMQ-server app not started, starting...
Jan 27 11:12:45 cs010 lrmd[30512]: INFO: HaRabbitMQ[27971]: get_status(): found the beam process running but failed with code 69. Command output: Error: unable to perform an operation on node 'rabbit@cs010'. Please see diagnostics information and suggestions below.
Jan 27 11:12:45 cs010 lrmd[30516]: INFO: HaRabbitMQ[27971]: try_to_start_rmq_app(): RMQ-runtime (beam) not started, starting...
Jan 27 11:12:45 cs010 lrmd[30520]: WARNING: HaRabbitMQ[27971]: start_beam_process(): found old PID-file '/var/run/rabbitmq/pid'.
Jan 27 11:12:45 cs010 lrmd[30535]: ERROR: HaRabbitMQ[27971]: start_beam_process(): found unknown process with PID=28708 from '/var/run/rabbitmq/pid'.
Jan 27 11:12:45 cs010 lrmd[30539]: ERROR: HaRabbitMQ[27971]: try_to_start_rmq_app(): Failed to start beam - returning from the function
Jan 27 11:12:45 cs010 lrmd[30543]: WARNING: HaRabbitMQ[27971]: start_rmq_server_app(): RMQ-server app can't start without Mnesia cleaning.
Jan 27 11:12:47 cs010 lrmd[31164]: INFO: HaRabbitMQ[27971]: get_status(): app rabbit was not found in command output: [{sasl,"SASL CXC 138 11","3.3"},
Jan 27 11:12:47 cs010 lrmd[31168]: INFO: HaRabbitMQ[27971]: reset_mnesia(): Execute reset with timeout: 60
Jan 27 11:12:47 cs010 lrmd[31466]: INFO: HaRabbitMQ[27971]: su_rabbit_cmd(): the invoked command exited 70: /usr/sbin/rabbitmqctl reset
Jan 27 11:12:47 cs010 lrmd[31470]: INFO: HaRabbitMQ[27971]: reset_mnesia(): Execute force_reset with timeout: 60
Jan 27 11:12:48 cs010 lrmd[31783]: INFO: HaRabbitMQ[27971]: su_rabbit_cmd(): the invoked command exited 70: /usr/sbin/rabbitmqctl force_reset
Jan 27 11:12:48 cs010 lrmd[31787]: WARNING: HaRabbitMQ[27971]: reset_mnesia(): Mnesia couldn't cleaned, even by force-reset command.
Jan 27 11:12:48 cs010 lrmd[31797]: INFO: HaRabbitMQ[27971]: proc_stop(): Stopping beam.*rabbit@cs010 by PID 28708
Jan 27 11:12:50 cs010 lrmd[32118]: INFO: HaRabbitMQ[27971]: proc_stop(): Stopped beam.*rabbit@cs010
Jan 27 11:12:50 cs010 lrmd[32122]: INFO: HaRabbitMQ[27971]: proc_stop(): Stopping beam.*rabbit@cs010 by PID none
Jan 27 11:12:50 cs010 lrmd[32130]: INFO: HaRabbitMQ[27971]: proc_kill(): cannot find any processes matching the beam.*rabbit@cs010, considering target process to be already dead
Jan 27 11:12:50 cs010 lrmd[32134]: INFO: HaRabbitMQ[27971]: proc_stop(): Stopped beam.*rabbit@cs010
Jan 27 11:12:50 cs010 lrmd[32146]: WARNING: HaRabbitMQ[27971]: reset_mnesia(): Mnesia files appear corrupted and have been removed from /var/lib/rabbitmq/mnesia/rabbit@cs010 and /var/lib/rabbitmq/Mnesia.rabbit@cs010
Jan 27 11:12:51 cs010 lrmd[32445]: INFO: HaRabbitMQ[27971]: get_status(): failed with code 69. Command output: Error: unable to perform an operation on node 'rabbit@cs010'. Please see diagnostics information and suggestions below.
Jan 27 11:12:51 cs010 lrmd[32449]: INFO: HaRabbitMQ[27971]: try_to_start_rmq_app(): RMQ-runtime (beam) not started, starting...
Jan 27 11:12:52 cs010 lrmd[32976]: INFO: HaRabbitMQ[27971]: try_to_start_rmq_app(): begin.
Jan 27 11:12:52 cs010 lrmd[32982]: INFO: HaRabbitMQ[27971]: try_to_start_rmq_app(): Execute start_app with timeout: 60
Jan 27 11:12:53 cs010 lrmd[34269]: INFO: HaRabbitMQ[27971]: su_rabbit_cmd(): the invoked command exited 69: /usr/sbin/rabbitmqctl start_app >>/var/log/rabbitmq/startup_log 2>&1
Jan 27 11:12:53 cs010 lrmd[34273]: INFO: HaRabbitMQ[27971]: try_to_start_rmq_app(): start_app failed.
Jan 27 11:12:54 cs010 lrmd[34873]: INFO: HaRabbitMQ[27971]: get_status(): app rabbit was not found in command output: [{sasl,"SASL CXC 138 11","3.3"},
Jan 27 11:12:54 cs010 lrmd[34877]: INFO: HaRabbitMQ[27971]: reset_mnesia(): Execute reset with timeout: 60
Jan 27 11:12:55 cs010 lrmd[35175]: INFO: HaRabbitMQ[27971]: su_rabbit_cmd(): the invoked command exited 70: /usr/sbin/rabbitmqctl reset
Jan 27 11:12:55 cs010 lrmd[35179]: INFO: HaRabbitMQ[27971]: reset_mnesia(): Execute force_reset with timeout: 60
Jan 27 11:12:56 cs010 lrmd[35472]: INFO: HaRabbitMQ[27971]: su_rabbit_cmd(): the invoked command exited 70: /usr/sbin/rabbitmqctl force_reset
Jan 27 11:12:56 cs010 lrmd[35476]: WARNING: HaRabbitMQ[27971]: reset_mnesia(): Mnesia couldn't cleaned, even by force-reset command.
Jan 27 11:12:56 cs010 lrmd[35486]: INFO: HaRabbitMQ[27971]: proc_stop(): Stopping beam.*rabbit@cs010 by PID 32469
Jan 27 11:12:58 cs010 lrmd[35788]: INFO: HaRabbitMQ[27971]: proc_stop(): Stopped beam.*rabbit@cs010
Jan 27 11:12:58 cs010 lrmd[35792]: INFO: HaRabbitMQ[27971]: proc_stop(): Stopping beam.*rabbit@cs010 by PID none
Jan 27 11:12:58 cs010 lrmd[35800]: INFO: HaRabbitMQ[27971]: proc_kill(): cannot find any processes matching the beam.*rabbit@cs010, considering target process to be already dead
Jan 27 11:12:58 cs010 lrmd[35804]: INFO: HaRabbitMQ[27971]: proc_stop(): Stopped beam.*rabbit@cs010
Jan 27 11:12:58 cs010 lrmd[35816]: WARNING: HaRabbitMQ[27971]: reset_mnesia(): Mnesia files appear corrupted and have been removed from /var/lib/rabbitmq/mnesia/rabbit@cs010 and /var/lib/rabbitmq/Mnesia.rabbit@cs010

I added 'sleep 3' after starting the beam process and the issue went away.
Upgrading rabbitmq-server and Erlang to newer versions does not help.
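
A fixed 'sleep 3' only helps as long as beam comes up within three seconds. Polling with a timeout may be more robust. Below is a minimal sketch of that idea, not the actual code of rabbitmq-server-ha.ocf: the helper name wait_for_cmd and its arguments are hypothetical, and it assumes some probe command exists that exits zero once the node is reachable.

```shell
#!/bin/sh
# Hypothetical helper, NOT part of rabbitmq-server-ha.ocf.
# Re-run a probe command until it succeeds or the timeout elapses.
# Usage: wait_for_cmd TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as COMMAND exits zero, 1 if the timeout is reached first.
wait_for_cmd() {
    timeout_s=$1
    interval_s=$2
    shift 2
    elapsed=0
    # Probe output is discarded; only the exit status matters here.
    while ! "$@" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$timeout_s" ]; then
            return 1
        fi
        sleep "$interval_s"
        elapsed=$((elapsed + interval_s))
    done
    return 0
}
```

In the RA this could replace the fixed sleep right after the beam process is started, for example 'wait_for_cmd 30 1 rabbitmqctl status', assuming 'rabbitmqctl status' exits zero once the runtime answers. Elapsed time counts only the sleeps, so a slow probe makes the real wait longer than the nominal timeout.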

Has anyone else run into this problem?
Should I open a bug on GitHub?

BR,
Alex


John Pfersich
Jan 29, 2021, 12:48:29 AM
to rabbitm...@googlegroups.com
3.7.8 isn't supported, AFAICT. Upgrade your system.
