Hi,
We are experiencing a strange issue on a server where RabbitMQ/Erlang is stuck in a reboot loop. RabbitMQ appears to crash on start, and then Erlang is restarted. This happens endlessly in a loop (in the Windows event log, there are warnings from ErlSrv every 10 seconds: "Erlang service restarted" and "Restarted erlang machine").
Details of our setup:
- This is a Windows server
- Only a single RabbitMQ node
- RabbitMQ is natively installed, using a custom install process. The Erlang and RabbitMQ/Erlang files are essentially just copied onto the machine.
- We run RabbitMQ as a Windows Service
- Install data (Erlang, RabbitMQ) is stored in one folder. Runtime data (i.e. the RabbitMQ database) is stored in a second folder.
We use RabbitMQ as part of a larger software suite. The problem is triggered when we upgrade our software. What happens:
- The RabbitMQ service is stopped and uninstalled
- (new) We kill any remaining erl/epmd/erlsrv processes
- All installation data is removed
(runtime data is not removed)
- The installation data from the new version is copied onto the machine (in this case, there is actually no change to the RabbitMQ files, as we haven't upgraded the RabbitMQ version)
- The RabbitMQ service is installed and started
At this point we get stuck in the reboot loop. We are able to fix it by manually stopping the RabbitMQ service, uninstalling it, removing all runtime data (it's transient anyway), and then reinstalling and starting the RabbitMQ service.
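For reference, the manual recovery above is roughly equivalent to the following Python sketch. The folder paths and the helper names (recovery_plan, execute) are placeholders made up for illustration, not our real layout or tooling. It assumes the stock rabbitmq-service.bat script with its stop/remove/install/start commands sits in the sbin folder of the install. By default it only prints the steps.

```python
import shutil
import subprocess
from pathlib import PureWindowsPath

# Hypothetical locations, for illustration only.
SBIN_DIR = r"C:\example\rabbitmq\sbin"
RUNTIME_DIR = r"C:\example\rabbitmq-data"


def recovery_plan(sbin_dir, runtime_dir):
    """Return the ordered recovery steps as (kind, argument) pairs."""
    service = str(PureWindowsPath(sbin_dir) / "rabbitmq-service.bat")
    return [
        ("run", [service, "stop"]),     # stop the Windows service
        ("run", [service, "remove"]),   # uninstall the service
        ("wipe", str(runtime_dir)),     # delete the (transient) runtime data
        ("run", [service, "install"]),  # register the service again
        ("run", [service, "start"]),    # start it
    ]


def execute(plan, dry_run=True):
    """Print the plan, or carry it out when dry_run is False."""
    for kind, arg in plan:
        if dry_run:
            print(kind, arg)
        elif kind == "run":
            # A failing step is reported but does not abort the sequence.
            result = subprocess.run(arg)
            print(arg, "exited with", result.returncode)
        else:
            shutil.rmtree(arg, ignore_errors=True)


if __name__ == "__main__":
    execute(recovery_plan(SBIN_DIR, RUNTIME_DIR), dry_run=True)
```

The only difference from our upgrade procedure is the "wipe" step, which is why we suspect something in the runtime data folder.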
Based on the logs (attached), it appears that a file somewhere is corrupted, but I'm not sure which one, and I'm stuck on how to fix this issue permanently.
If anyone has any suggestions, or can identify a clear cause from the logs, it would be greatly appreciated.