RabbitMQ Windows Service Recovery not happening when erlang process killed


Karthikeyan M

Apr 12, 2017, 6:40:04 AM
to rabbitmq-users

The process corresponding to the RabbitMQ Windows service is erlsrv.exe (Windows Task Manager -> Services -> right-click “RabbitMQ” -> Go to Process).
The image below shows the default configuration of the “Recovery” tab of the RabbitMQ Windows service:
 

1. When I killed the erlsrv.exe process, the Windows service did not restart after the configured time, and it also failed to start when I tried to start it manually.
I could start it manually only after killing another related process, erl.exe.
2. With the RabbitMQ service running properly, when I killed erl.exe, erlsrv.exe was also terminated automatically. Even after waiting for 1 minute, the RabbitMQ service did not start again.

How can I configure recovery of the RabbitMQ Windows service for the case where the Erlang process fails?

[Inline image 1: default "Recovery" tab settings of the RabbitMQ Windows service]

Michael Klishin

Apr 12, 2017, 6:53:43 AM
to rabbitm...@googlegroups.com
The RabbitMQ Windows service likely monitors only RabbitMQ itself,
not the Erlang/OTP Windows services. Arguably, those should be monitored separately.

--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-users+unsubscribe@googlegroups.com.
To post to this group, send email to rabbitmq-users@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.



--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Maayan Hanin

Apr 15, 2017, 8:14:00 AM
to rabbitmq-users
The Service Control Manager (SCM) is the Windows mechanism responsible for the lifecycle of Windows services.
It launches them on startup while accounting for dependencies, and it handles the recovery logic when a service terminates abnormally.

When you install RabbitMQ as a Windows service, you get a Windows service named RabbitMQ which actually runs erlsrv.exe without any arguments.
This is misleading, since erlsrv is a generic service that runs Erlang emulators (it isn't specific to RabbitMQ).

erlsrv is to Erlang services what the SCM is to Windows services: it manages the lifecycle of its child Erlang services (emulators) by launching them as part of its startup sequence, monitoring them, and handling their abnormal termination (by restarting them, rebooting the machine, or ignoring the failure).


If you check the actual process tree (using Process Explorer), you'll see erlsrv.exe as the parent process of erl.exe, which is the Erlang emulator (running RabbitMQ).





erlsrv.exe runs erl.exe, passing the arguments required for running RabbitMQ, such as the config file path.



From the command line, you can use erlsrv.exe to list the Erlang services it manages:
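
The screenshot that originally followed is not available, so here is a minimal sketch of the same listing step, driven from Python. It relies on the `erlsrv list [<service-name>]` subcommand; the helper names are hypothetical, and it assumes erlsrv.exe is reachable on PATH (on many installs it lives under the Erlang `erts-*\bin` directory and is not).

```python
import shutil
import subprocess


def erlsrv_list_args(service_name=None):
    """Build the argument list for `erlsrv list [<service-name>]`."""
    args = ["erlsrv", "list"]
    if service_name:
        # With a name, erlsrv prints the settings of that one service.
        args.append(service_name)
    return args


def list_erlang_services(service_name=None):
    """Run erlsrv and return its output, or None if erlsrv is not on PATH."""
    if shutil.which("erlsrv") is None:
        return None
    result = subprocess.run(
        erlsrv_list_args(service_name),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


if __name__ == "__main__":
    # "RabbitMQ" is the service name mentioned in this thread.
    print(list_erlang_services("RabbitMQ"))
```

Calling `list_erlang_services()` without a name should list every service erlsrv manages; passing `"RabbitMQ"` should show that service's registered options.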



Now, about the recovery behavior:

If you kill the erl.exe process running RabbitMQ, erlsrv.exe will immediately restart it (since this is the OnFail action defined during installation).
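
For reference, the OnFail action is an erlsrv service option (`-onfail`, taking `ignore`, `restart`, `restart_always`, or `reboot`) that can be changed with `erlsrv set`. The sketch below only builds that command line; the helper name is hypothetical, and whether changing the option is appropriate for a given RabbitMQ install is an assumption to verify locally.

```python
# Values accepted by erlsrv's -onfail option.
ONFAIL_ACTIONS = ("ignore", "restart", "restart_always", "reboot")


def erlsrv_set_onfail_args(service_name, action):
    """Build the arguments for `erlsrv set <service-name> -onfail <action>`."""
    if action not in ONFAIL_ACTIONS:
        raise ValueError(f"unknown -onfail action: {action!r}")
    return ["erlsrv", "set", service_name, "-onfail", action]


if __name__ == "__main__":
    # "RabbitMQ" is the service name used in this thread.
    print(" ".join(erlsrv_set_onfail_args("RabbitMQ", "restart")))
```

The resulting command would need to be run from an elevated prompt, with erlsrv.exe on PATH or invoked by its full path.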


If you kill erlsrv.exe, then you'll notice two things:


First, RabbitMQ is still running: the erl.exe process is up and performs all of RabbitMQ's functionality. It is now an orphan (a process whose parent process was terminated).


Second, starting the "RabbitMQ Windows Service" (which we now know is just erlsrv.exe) fails, as long as RabbitMQ's emulator (erl.exe) is running. Once erl.exe is killed, erlsrv.exe can be started successfully.

This behavior seems like a (minor) bug, a fault-tolerance issue:

Either erlsrv.exe should start successfully in the face of already-running erl.exe processes (it should "reattach" to them), or the termination of erlsrv.exe should also terminate all its child erl.exe processes.

The first option depends entirely on the programming of erlsrv.exe.

The second option is achievable with Windows Job Objects, without altering the erlsrv.exe code (though it would be nicer if erlsrv supported this behavior as a built-in feature). Once a process is part of a job, all its child processes are, by default, also part of the same job. A job can be configured with the "kill on job close" limit flag (JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE), which terminates the entire process tree when the root process is terminated (thus preventing orphaning).
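
The Job Object idea could be sketched roughly as follows, using Python's ctypes against the Win32 API. This is a minimal illustration under stated assumptions, not something erlsrv or RabbitMQ provides: the function name `put_process_in_kill_on_close_job` is hypothetical, and the caller is assumed to keep the returned job handle open. Windows terminates the processes in the job when the last handle to the job is closed, which also happens when the process holding that handle dies.

```python
import ctypes
import sys

JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE = 0x2000
JOB_OBJECT_EXTENDED_LIMIT_INFORMATION_CLASS = 9  # JobObjectExtendedLimitInformation
PROCESS_TERMINATE = 0x0001
PROCESS_SET_QUOTA = 0x0100


class JOBOBJECT_BASIC_LIMIT_INFORMATION(ctypes.Structure):
    _fields_ = [
        ("PerProcessUserTimeLimit", ctypes.c_int64),
        ("PerJobUserTimeLimit", ctypes.c_int64),
        ("LimitFlags", ctypes.c_uint32),
        ("MinimumWorkingSetSize", ctypes.c_size_t),
        ("MaximumWorkingSetSize", ctypes.c_size_t),
        ("ActiveProcessLimit", ctypes.c_uint32),
        ("Affinity", ctypes.c_size_t),
        ("PriorityClass", ctypes.c_uint32),
        ("SchedulingClass", ctypes.c_uint32),
    ]


class IO_COUNTERS(ctypes.Structure):
    _fields_ = [
        ("ReadOperationCount", ctypes.c_uint64),
        ("WriteOperationCount", ctypes.c_uint64),
        ("OtherOperationCount", ctypes.c_uint64),
        ("ReadTransferCount", ctypes.c_uint64),
        ("WriteTransferCount", ctypes.c_uint64),
        ("OtherTransferCount", ctypes.c_uint64),
    ]


class JOBOBJECT_EXTENDED_LIMIT_INFORMATION(ctypes.Structure):
    _fields_ = [
        ("BasicLimitInformation", JOBOBJECT_BASIC_LIMIT_INFORMATION),
        ("IoInfo", IO_COUNTERS),
        ("ProcessMemoryLimit", ctypes.c_size_t),
        ("JobMemoryLimit", ctypes.c_size_t),
        ("PeakProcessMemoryUsed", ctypes.c_size_t),
        ("PeakJobMemoryUsed", ctypes.c_size_t),
    ]


def put_process_in_kill_on_close_job(pid):
    """Create a kill-on-close job, assign `pid` to it, return the job handle."""
    if sys.platform != "win32":
        raise OSError("Job Objects are a Windows-only facility")
    k32 = ctypes.WinDLL("kernel32", use_last_error=True)
    k32.CreateJobObjectW.restype = ctypes.c_void_p
    k32.CreateJobObjectW.argtypes = [ctypes.c_void_p, ctypes.c_wchar_p]
    k32.SetInformationJobObject.argtypes = [
        ctypes.c_void_p, ctypes.c_int,
        ctypes.POINTER(JOBOBJECT_EXTENDED_LIMIT_INFORMATION), ctypes.c_uint32,
    ]
    k32.OpenProcess.restype = ctypes.c_void_p
    k32.OpenProcess.argtypes = [ctypes.c_uint32, ctypes.c_int, ctypes.c_uint32]
    k32.AssignProcessToJobObject.argtypes = [ctypes.c_void_p, ctypes.c_void_p]
    k32.CloseHandle.argtypes = [ctypes.c_void_p]

    job = k32.CreateJobObjectW(None, None)  # anonymous job object
    if not job:
        raise ctypes.WinError(ctypes.get_last_error())

    info = JOBOBJECT_EXTENDED_LIMIT_INFORMATION()
    info.BasicLimitInformation.LimitFlags = JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE
    if not k32.SetInformationJobObject(
        job, JOB_OBJECT_EXTENDED_LIMIT_INFORMATION_CLASS,
        ctypes.byref(info), ctypes.sizeof(info),
    ):
        raise ctypes.WinError(ctypes.get_last_error())

    proc = k32.OpenProcess(PROCESS_SET_QUOTA | PROCESS_TERMINATE, False, pid)
    if not proc:
        raise ctypes.WinError(ctypes.get_last_error())
    try:
        if not k32.AssignProcessToJobObject(job, proc):
            raise ctypes.WinError(ctypes.get_last_error())
    finally:
        k32.CloseHandle(proc)
    return job  # closing this handle terminates every process in the job
```

Note that children started before the assignment are not pulled into the job, so in practice the root process would have to be placed in the job before it launches erl.exe.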


Maayan


