RabbitMQ database corruption


Jeff Outlaw

Feb 2, 2015, 3:32:06 PM
to rabbitm...@googlegroups.com
I am attempting to troubleshoot a database corruption issue. Attached is a dump from procmon in which I can see the service accessing the various databases. The service repeatedly starts and then crashes shortly afterwards. When I delete the 0.rdg file in C:\Windows\System32\config\system profile\AppData\Roaming\RabbitMQ\db\rabbit@MachineName-mnesia\msg_store_persistent and restart the service, RMQ comes back up and stays running.

This behavior can be reproduced by going into Task Manager and ending the erlsrv process. I realize this is a bad thing to do, but I am simulating a hard shutdown without actually powering off the PC. Side note: I have seen several instances of the following message in Windows Event Viewer: RabbitMQ: Erlang machine voluntarily stopped. The service is not restarted as OnFail is set to ignore. This message shows up periodically on a client PC. They are NOT performing the procedure I described at the start of this paragraph, but the result is the same: erlsrv.exe is shown as the source in Event Viewer and is somehow crashing.

I am running the latest RMQ server, 3.4.3, on a Windows 7 x64 WORKSTATION with the latest Erlang (I stress workstation because I suspect, and have empirical evidence, that the PC is either not being shut down correctly, going into hibernation mode, or both). We don't see these issues on a Windows Server 2012 machine under our control.

Comments and thoughts appreciated.

Thanks,
Jeff


Michael Klishin

Feb 2, 2015, 9:12:04 PM
to rabbitm...@googlegroups.com, Jeff Outlaw
On 2 February 2015 at 23:32:08, Jeff Outlaw (jeffrey...@gmail.com) wrote:
> I am attempting to troubleshoot a database corruption issue.
> Attached is a dump from procmon where I see it accessing the various
> databases. The service continues to start then crash shortly
> after start. When I delete the 0.rdg file in C:\Windows\System32\config\system
> profile\AppData\Roaming\RabbitMQ\db\rabbit@MachineName-mnesia\msg_store_persistent
> and restart the service RMQ successfully comes back up again
> and stays running.

What was in the logs?

> This behavior can be reproduced/simulated by going into task
> manager and ending the erlsrv process. I realize this is a bad
> thing to do but I am simulating a hard shutdown here without actually
> powering off the PC. Side note I have seen several instances of
> the following message inside of Windows Event Viewer: RabbitMQ:
> Erlang machine voluntarily stopped. The service is not restarted
> as OnFail is set to ignore. this message shows up periodically
> and is on a client PC and they are NOT doing to the procedure I stated
> at the start of this paragraph but the result is the same the erlsrv.exe
> is shown as the source in eventviewer and is somehow crashing.
>
> I am running the latest RMQ 3.4.3 server on a Windows 7 x64 WORKSTATION
> with latest ERLang ( I stressed workstation because I am suspicious
> and have empirical evidence pointing to the fact that the PC is
> either not being shut down correctly or is going into hibernation
> mode or both ). Don't see these issues on a Windows Server 2012
> under our control.
>
> Comments and thoughts appreciated.

If you delete some of the files, that database directory can no longer be used,
and it's only a matter of time before the message store discovers inconsistencies. Since
RabbitMQ cannot magically restore missing data, it shuts down. If this is not acceptable,
you should use multiple nodes and mirror some or even all queues.
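
For illustration only, here is a minimal Python sketch of what "mirror all queues" could look like in practice. It assumes the nodes are already clustered and that mirroring is enabled through an `ha-mode` policy set with `rabbitmqctl set_policy`. The helper name, the policy name `ha-all`, and the catch-all pattern `^` are hypothetical choices, not anything taken from this thread.

```python
import json


def mirror_policy_command(name="ha-all", pattern="^", ha_mode="all"):
    """Build the argument list for a `rabbitmqctl set_policy` call.

    Assumptions (not from the thread): the policy is called "ha-all"
    and the pattern "^" is meant to match every queue name.
    """
    definition = json.dumps({"ha-mode": ha_mode})
    return ["rabbitmqctl", "set_policy", name, pattern, definition]


if __name__ == "__main__":
    # Print the command instead of running it, so nothing changes on the
    # broker until the arguments have been reviewed.
    print(mirror_policy_command())
```

The list can be passed to `subprocess.run` once it looks right. On Windows the executable would presumably be `rabbitmqctl.bat` from the server's `sbin` directory.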

Note that deleting a random message store file does not really simulate a power failure. In case of
a power failure, chances are that the message store will have some data only in RAM (writes to disk happen
at short intervals or when RabbitMQ decides it is idle enough to do so), but that won't necessarily result
in inconsistencies.
--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Simon MacMullen

Feb 3, 2015, 6:00:45 AM
to Michael Klishin, rabbitm...@googlegroups.com, Jeff Outlaw
On 03/02/15 02:11, Michael Klishin wrote:
> On 2 February 2015 at 23:32:08, Jeff Outlaw (jeffrey...@gmail.com) wrote:
>> I am attempting to troubleshoot a database corruption issue.
>> Attached is a dump from procmon where I see it accessing the various
>> databases. The service continues to start then crash shortly
>> after start. When I delete the 0.rdg file in C:\Windows\System32\config\system
>> profile\AppData\Roaming\RabbitMQ\db\rabbit@MachineName-mnesia\msg_store_persistent
>> and restart the service RMQ successfully comes back up again
>> and stays running.
>
> What was in the logs?

To emphasise this: we would be *very* interested to see what's in the
log files if you can get a RabbitMQ server to refuse to start after an
unplanned shutdown. That really should work, and I would consider any
failure to do so to be a bug. We might need to see the contents of
msg_store_persistent too. erl_crash.dump is unlikely to be helpful.
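
One way to gather that material for a report is sketched below, as a rough Python example rather than an official procedure. The function name and the archive layout are hypothetical. The log and message store directories are passed in as arguments because their locations vary between installations. It is probably safest to run this while the service is stopped, and to write the archive somewhere outside both directories.

```python
import zipfile
from pathlib import Path


def bundle_diagnostics(log_dir, store_dir, out_zip):
    """Zip every regular file under log_dir and store_dir.

    Files are stored under "log/" and "msg_store_persistent/" inside
    the archive. Returns the list of archive member names.
    """
    names = []
    sources = (("log", Path(log_dir)), ("msg_store_persistent", Path(store_dir)))
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for label, root in sources:
            for path in sorted(root.rglob("*")):
                if path.is_file():
                    arcname = f"{label}/{path.relative_to(root).as_posix()}"
                    zf.write(path, arcname)
                    names.append(arcname)
    return names
```

Bear in mind that the message store holds message bodies, so the archive may contain application data that should not be posted to a public list.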

> If you delete some of the files, that database directory can no longer be used,
> and it's a matter of time for the message store to discover inconsistencies.

Note that OP said he needed to delete files to get it to start, not that
he did so to simulate a power failure.

Cheers, Simon



