Issue with rabbitmq not starting due to corrupted WAL file

778 views
Skip to first unread message

Tim Ireland

unread,
Feb 10, 2020, 10:57:57 AM2/10/20
to rabbitmq-users
I triaged a recent event which involved a server fault which cased the server to power off.  When the server was restarted, rabbitmq-server could not start.  Unfortunately, I did not have access to the filesystem of the server at the time (and the system was re-imaged), but the rabbitmq log files captured in a support bundle from the system showed the following trace:

2020-02-01 06:01:53.069 [error] <0.326.0> Supervisor ra_sup had child ra_system_sup started with ra_system_sup:start_link() at undefined exit with reason {shutdown,{failed_to_start_child,ra_log_sup,{shutdown,{failed_to_start_child,ra_log_wal_sup,{shutdown,{failed_to_start_child,ra_log_wal,{{case_clause,{ok,<<>>}},[{ra_log_wal,open_existing,1,[{file,"src/ra_log_wal.erl"},{line,646}]},{ra_log_wal,'-recover_wal/2-lc$^0/1-0-',1,[{file,"src/ra_log_wal.erl"},{line,265}]},{ra_log_wal,recover_wal,2,[{file,"src/ra_log_wal.erl"},{line,268}]},{ra_log_wal,init,1,[{file,"src/ra_log_wal.erl"},{line,214}]},{gen_batch_server,init_it,6,[{file,"src/gen_batch_..."},...]},...]}}}}}}} in context start_error 2020-02-01 06:01:53.069 [error] <0.324.0> CRASH REPORT Process <0.324.0> with 0 neighbours exited with reason: {{shutdown,{failed_to_start_child,ra_system_sup,{shutdown,{failed_to_start_child,ra_log_sup,{shutdown,{failed_to_start_child,ra_log_wal_sup,{shutdown,{failed_to_start_child,ra_log_wal,{{case_clause,{ok,<<>>}},[{ra_log_wal,open_existing,1,[{file,"src/ra_log_wal.erl"},{line,646}]},{ra_log_wal,'-recover_wal/2-lc$^0/1-0-',1,[{file,"src/ra_log_wal.erl"},{line,265}]},{ra_log_wal,recover_wal,2,[{file,"src/ra_log_wal.erl"},{line,268}]},{ra_log_wal,init,1,[{file,"src/ra_log_wal.erl"},{line,214}]},{gen_batch_server,...},...]}}}}}}}}},...} in application_master:init/4 line 138 2020-02-01 06:01:53.091 [error] <0.8.0> Error description: init:do_boot/3 init:start_em/1 rabbit:start_it/1 line 474 rabbit:broker_start/1 line 350 rabbit:start_loaded_apps/2 line 600 app_utils:manage_applications/6 line 126 lists:foldl/3 line 1263 rabbit:'-handle_app_error/1-fun-0-'/3 line 723 throw:{could_not_start,ra, {ra, {{shutdown, {failed_to_start_child,ra_system_sup, {shutdown, {failed_to_start_child,ra_log_sup, {shutdown, {failed_to_start_child,ra_log_wal_sup, {shutdown, {failed_to_start_child,ra_log_wal, {{case_clause,{ok,<<>>}}, [{ra_log_wal,open_existing,1, [{file,"src/ra_log_wal.erl"},{line,646}]}, {ra_log_wal,'-recover_wal/2-lc$^0/1-0-',1, [{file,"src/ra_log_wal.erl"},{line,265}]}, {ra_log_wal,recover_wal,2, [{file,"src/ra_log_wal.erl"},{line,268}]}, {ra_log_wal,init,1, [{file,"src/ra_log_wal.erl"},{line,214}]}, {gen_batch_server,init_it,6, [{file,"src/gen_batch_server.erl"},{line,133}]}, {proc_lib,init_p_do_apply,3, [{file,"proc_lib.erl"},{line,249}]}]}}}}... /var/log/rabbitmq/rabb...@localhost.log /var/log/rabbitmq/rabbitmq@localhost_upgrade.log


After browsing the code, I realized that it's trying to open a wal file, and appears to be unable to open due to possible file corruption?  Perhaps the file was actively being written during the power off event?

I then reproduced this behavior by taking a look at my rabbitmq files.  I copied the existing 00000001.wal file and then cat that into a new file 00000002.wal but with some garbage at the end.
 
[root@d53c81baa7f1 rabbitmq@localhost]# ls 00000001.wal 00000002.wal meta.dets names.dets [root@d53c81baa7f1 rabbitmq@localhost]# pwd /var/lib/rabbitmq/mnesia/rabbitmq@localhost/quorum/rabbitmq@localhost [root@d53c81baa7f1 rabbitmq@localhost]# cat 00000002.wal asxCCC...RAWA


I then restarted the rabbitmq-server service and it failed to start, with the same trace:

2020-02-05 16:11:36.944 [error] <0.8.0> Error description: init:do_boot/3 init:start_em/1 rabbit:start_it/1 line 474 rabbit:broker_start/1 line 350 rabbit:start_loaded_apps/2 line 600 app_utils:manage_applications/6 line 126 lists:foldl/3 line 1263 rabbit:'-handle_app_error/1-fun-0-'/3 line 723 throw:{could_not_start,ra, {ra, {{shutdown, {failed_to_start_child,ra_system_sup, {shutdown, {failed_to_start_child,ra_log_sup, {shutdown, {failed_to_start_child,ra_log_wal_sup, {shutdown, {failed_to_start_child,ra_log_wal, {{case_clause, {ok,<<97,115,120,67,67,67,46,46,46,82,65,87,65,1,10>>}}, [{ra_log_wal,open_existing,1, [{file,"src/ra_log_wal.erl"},{line,646}]}, {ra_log_wal,'-recover_wal/2-lc$^0/1-0-',1, [{file,"src/ra_log_wal.erl"},{line,265}]}, {ra_log_wal,'-recover_wal/2-lc$^0/1-0-',1, [{file,"src/ra_log_wal.erl"},{line,268}]}, {ra_log_wal,recover_wal,2, [{file,"src/ra_log_wal.erl"},{line,268}]}, {ra_log_wal,init,1, [{file,"src/ra_log_wal.erl"},{line,214}]}, {gen_batch_serve... Log file(s) (may contain more information): /var/log/rabbitmq/rabbitmq@localhost.log /var/log/rabbitmq/rabbitmq@localhost_upgrade.log

I then deleted the second file, and the server was able to start.

Perhaps there is a way for rabbitmq-server to recover from this situation, if the wal file is corrupted/not readable, can it be discarded safely ?  Is this more desirable than not running at all?

Thanks,

-Tim

Karl Nilsson

unread,
Feb 10, 2020, 12:01:46 PM2/10/20
to rabbitm...@googlegroups.com
Hi, 

Which version of RabbitMQ did this happen in?
Automatic recovery of data in systems designed for data safety can be a bit tricky, which is why there is no automatic deletion of WAL files with invalid headers. If some set of actions that can be considered safe emerge then we'd consider implementing them but for now data corruption needs to be manually handled.

The WAL fsyncs immediately after writing the header but of course there is still a chance a file with an empty header could be corrupted. How we'd distinguish that from a scenario where the file is empty due to a disk corruption I am not sure. Perhaps if it is the latest existing WAL file we could assume the system exited before fsync operation completed and just re-write the header at that point.

I've created a new issue in the Raft library Ra to consider this: https://github.com/rabbitmq/ra/issues/165

Cheers
Karl




--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-user...@googlegroups.com.
To view this discussion on the web, visit https://groups.google.com/d/msgid/rabbitmq-users/9b9a348e-4d07-4410-b704-355b4ae4ea9e%40googlegroups.com.


--
Karl Nilsson

Pivotal/RabbitMQ

Tim Ireland

unread,
Feb 10, 2020, 4:49:32 PM2/10/20
to rabbitm...@googlegroups.com

Thanks Karl,

 

This was observed in RabbitMQ 3.8.2.  It may very well be difficult to determine file corruption specifically.  I had considered a systemd approach where by if the unit could not start, some automated cleanup script could try to remove the latest wal file .  I’m not sure the ramifications of that – I assume some data loss would occur in that scenario?

 

Thanks for considering

 

-Tim

 

From: "rabbitm...@googlegroups.com" <rabbitm...@googlegroups.com> on behalf of Karl Nilsson <knil...@pivotal.io>
Reply-To: "rabbitm...@googlegroups.com" <rabbitm...@googlegroups.com>
Date: Monday, February 10, 2020 at 12:01 PM
To: "rabbitm...@googlegroups.com" <rabbitm...@googlegroups.com>
Subject: Re: [rabbitmq-users] Issue with rabbitmq not starting due to corrupted WAL file

 

Hi, 

 

Which version of RabbitMQ did this happen in?

Automatic recovery of data in systems designed for data safety can be a bit tricky, which is why there is no automatic deletion of WAL files with invalid headers. If some set of actions that can be considered safe emerge then we'd consider implementing them but for now data corruption needs to be manually handled.

 

The WAL fsyncs immediately after writing the header but of course there is still a chance a file with an empty header could be corrupted. How we'd distinguish that from a scenario where the file is empty due to a disk corruption I am not sure. Perhaps if it is the latest existing WAL file we could assume the system exited before fsync operation completed and just re-write the header at that point.

 

I've created a new issue in the Raft library Ra to consider this: https://github.com/rabbitmq/ra/issues/165

 

Cheers

Karl

 

 

 

 

On Mon, 10 Feb 2020 at 15:58, Tim Ireland <tim.i...@plexxi.com> wrote:

I triaged a recent event which involved a server fault which cased the server to power off.  When the server was restarted, rabbitmq-server could not start.  Unfortunately, I did not have access to the filesystem of the server at the time (and the system was re-imaged), but the rabbitmq log files captured in a support bundle from the system showed the following trace:

 

2020-02-01 06:01:53.069 [error] <0.326.0> Supervisor ra_sup had child ra_system_sup started with ra_system_sup:start_link() at undefined exit with reason {shutdown,{failed_to_start_child,ra_log_sup,{shutdown,{failed_to_start_child,ra_log_wal_sup,{shutdown,{failed_to_start_child,ra_log_wal,{{case_clause,{ok,<<>>}},[{ra_log_wal,open_existing,1,[{file,"src/ra_log_wal.erl"},{line,646}]},{ra_log_wal,'-recover_wal/2-lc$^0/1-0-',1,[{file,"src/ra_log_wal.erl"},{line,265}]},{ra_log_wal,recover_wal,2,[{file,"src/ra_log_wal.erl"},{line,268}]},{ra_log_wal,init,1,[{file,"src/ra_log_wal.erl"},{line,214}]},{gen_batch_server,init_it,6,[{file,"src/gen_batch_..."},...]},...]}}}}}}} in context start_error 2020-02-01 06:01:53.069 [error] <0.324.0> CRASH REPORT Process <0.324.0> with 0 neighbours exited with reason: {{shutdown,{failed_to_start_child,ra_system_sup,{shutdown,{failed_to_start_child,ra_log_sup,{shutdown,{failed_to_start_child,ra_log_wal_sup,{shutdown,{failed_to_start_child,ra_log_wal,{{case_clause,{ok,<<>>}},[{ra_log_wal,open_existing,1,[{file,"src/ra_log_wal.erl"},{line,646}]},{ra_log_wal,'-recover_wal/2-lc$^0/1-0-',1,[{file,"src/ra_log_wal.erl"},{line,265}]},{ra_log_wal,recover_wal,2,[{file,"src/ra_log_wal.erl"},{line,268}]},{ra_log_wal,init,1,[{file,"src/ra_log_wal.erl"},{line,214}]},{gen_batch_server,...},...]}}}}}}}}},...} in application_master:init/4 line 138 2020-02-01 06:01:53.091 [error] <0.8.0> Error description: init:do_boot/3 init:start_em/1 rabbit:start_it/1 line 474 rabbit:broker_start/1 line 350 rabbit:start_loaded_apps/2 line 600 app_utils:manage_applications/6 line 126 lists:foldl/3 line 1263 rabbit:'-handle_app_error/1-fun-0-'/3 line 723 throw:{could_not_start,ra, {ra, {{shutdown, {failed_to_start_child,ra_system_sup, {shutdown, {failed_to_start_child,ra_log_sup, {shutdown, {failed_to_start_child,ra_log_wal_sup, {shutdown, {failed_to_start_child,ra_log_wal, {{case_clause,{ok,<<>>}}, [{ra_log_wal,open_existing,1, [{file,"src/ra_log_wal.erl"},{line,646}]}, {ra_log_wal,'-recover_wal/2-lc$^0/1-0-',1, [{file,"src/ra_log_wal.erl"},{line,265}]}, {ra_log_wal,recover_wal,2, [{file,"src/ra_log_wal.erl"},{line,268}]}, {ra_log_wal,init,1, [{file,"src/ra_log_wal.erl"},{line,214}]}, {gen_batch_server,init_it,6, [{file,"src/gen_batch_server.erl"},{line,133}]}, {proc_lib,init_p_do_apply,3, [{file,"proc_lib.erl"},{line,249}]}]}}}}... /var/log/rabbitmq/rabb...@localhost.log /var/log/rabbitmq/rabbitmq@localhost_upgrade.log





After browsing the code, I realized that it's trying to open a wal file, and appears to be unable to open due to possible file corruption?  Perhaps the file was actively being written during the power off event?

 

I then reproduced this behavior by taking a look at my rabbitmq files.  I copied the existing 00000001.wal file and then cat that into a new file 00000002.wal but with some garbage at the end.

 

[root@d53c81baa7f1 rabbitmq@localhost]# ls 00000001.wal 00000002.wal meta.dets names.dets [root@d53c81baa7f1 rabbitmq@localhost]# pwd /var/lib/rabbitmq/mnesia/rabbitmq@localhost/quorum/rabbitmq@localhost [root@d53c81baa7f1 rabbitmq@localhost]# cat 00000002.wal asxCCC...RAWA

 

 

I then restarted the rabbitmq-server service and it failed to start, with the same trace:

 

2020-02-05 16:11:36.944 [error] <0.8.0> Error description: init:do_boot/3 init:start_em/1 rabbit:start_it/1 line 474 rabbit:broker_start/1 line 350 rabbit:start_loaded_apps/2 line 600 app_utils:manage_applications/6 line 126 lists:foldl/3 line 1263 rabbit:'-handle_app_error/1-fun-0-'/3 line 723 throw:{could_not_start,ra, {ra, {{shutdown, {failed_to_start_child,ra_system_sup, {shutdown, {failed_to_start_child,ra_log_sup, {shutdown, {failed_to_start_child,ra_log_wal_sup, {shutdown, {failed_to_start_child,ra_log_wal, {{case_clause, {ok,<<97,115,120,67,67,67,46,46,46,82,65,87,65,1,10>>}}, [{ra_log_wal,open_existing,1, [{file,"src/ra_log_wal.erl"},{line,646}]}, {ra_log_wal,'-recover_wal/2-lc$^0/1-0-',1, [{file,"src/ra_log_wal.erl"},{line,265}]}, {ra_log_wal,'-recover_wal/2-lc$^0/1-0-',1, [{file,"src/ra_log_wal.erl"},{line,268}]}, {ra_log_wal,recover_wal,2, [{file,"src/ra_log_wal.erl"},{line,268}]}, {ra_log_wal,init,1, [{file,"src/ra_log_wal.erl"},{line,214}]}, {gen_batch_serve... Log file(s) (may contain more information): /var/log/rabbitmq/rabb...@localhost.log /var/log/rabbitmq/rabbitmq@localhost_upgrade.log



I then deleted the second file, and the server was able to start.

 

Perhaps there is a way for rabbitmq-server to recover from this situation, if the wal file is corrupted/not readable, can it be discarded safely ?  Is this more desirable than not running at all?

 

Thanks,

 

-Tim

--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-user...@googlegroups.com.
To view this discussion on the web, visit https://groups.google.com/d/msgid/rabbitmq-users/9b9a348e-4d07-4410-b704-355b4ae4ea9e%40googlegroups.com.


 

--

Karl Nilsson

 

Pivotal/RabbitMQ

--
You received this message because you are subscribed to the Google Groups "rabbitmq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rabbitmq-user...@googlegroups.com.

Michael Klishin

unread,
Feb 28, 2020, 7:23:05 PM2/28/20
to rabbitmq-users
Deleting WAL files could result in data loss. For 3.8.3 we have changed the way WAL files are "allocated" and written,
hopefully it should be more atomic than the current one ("should" because the behavior can be filesystem-dependent, for example)



--
MK

Staff Software Engineer, Pivotal/RabbitMQ
Reply all
Reply to author
Forward
0 new messages