"schema_integrity_check_failed" on reboot, can't recover


Ezra Cooper

Apr 1, 2020, 4:51:04 PM
to rabbitm...@googlegroups.com
Dear RabbitMQ gurus,

My team had an incident today where RabbitMQ 3.8.2 failed to come back up after a reboot, crashing with the error "schema_integrity_check_failed" shown below.

We'd like to know if this is a known issue, whether there's anything we can do to prevent it or, barring that, any way to recover from it without blowing away the database. The database was around 72 GB and held about 114,000 messages.

The crucial log message was:

2020-04-01 18:50:13.415 [error] <0.337.0> CRASH REPORT Process <0.337.0> with 0 neighbours exited with reason: {{schema_integrity_check_failed,[{table_content_invalid,rabbit_durable_queue,{amqqueue,{resource,'_',queue,'_'},'_','_','_','_','_','_','_','_','_','_','_','_','_','_','_','_','_'},[{amqqueue,{resource,<<"/">>,queue,<<"mq_facts">>},true,false,none,[],<0.16806.0>,[],[],[],[{vhost,<<"/">>},{name,<<"queue-size-limit">>},{pattern,<<"^mq_facts$">>},{'apply-to',<<"all">>},{definition,[{<<"max-length-bytes">>,72000000000}]},{priority,0}],undefined,[],undefined,stopped,0,[],<<"/">>,#{user => <<"ro...">>},...}]}]},...} in application_master:init/4 line 138


It looped in that state, unable to start, until we moved its data files aside and started it with a fresh database.
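For the record, the recovery amounted to roughly the following sketch (node name and paths taken from our logs; adjust for your host). It abandons the old database, which is exactly the outcome we'd like to avoid in future:

```shell
# Stop the broker before touching its files.
systemctl stop rabbitmq-server

# Move the corrupt database aside rather than deleting it, so it can
# still be inspected later.
MNESIA_DIR=/var/lib/rabbitmq/mnesia/rabbit@missionq-queueserver-experiment
mv "$MNESIA_DIR" "${MNESIA_DIR}.corrupt.$(date +%Y%m%d-%H%M%S)"

# RabbitMQ creates a fresh, empty database on the next start.
systemctl start rabbitmq-server
```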


The reboot was triggered by an ordinary "shutdown" command and the syslog shows systemd shutting down services in an orderly manner, including Rabbit. Rabbit logs around the time of the shutdown follow:


2020-04-01 18:49:42.719 [error] <0.28535.0> ** Connection attempt from disallowed node 'rabbitmqcli-19535-rabbit@missionq-queueserver-experiment' ** 

2020-04-01 18:49:47.218 [error] <0.28541.0> ** Connection attempt from disallowed node 'rabbitmqcli-19735-rabbit@missionq-queueserver-experiment' ** 

2020-04-01 18:49:47.227 [error] <0.28543.0> ** Connection attempt from disallowed node 'rabbitmqcli-19735-rabbit@missionq-queueserver-experiment' ** 

2020-04-01 18:49:47.240 [info] <0.65.0> SIGTERM received - shutting down

2020-04-01 18:49:47.243 [warning] <0.620.0> RabbitMQ HTTP listener registry could not find context rabbitmq_management_tls

2020-04-01 18:49:47.248 [info] <0.272.0> Peer discovery backend rabbit_peer_discovery_classic_config does not support registration, skipping unregistration.

2020-04-01 18:49:47.249 [info] <0.614.0> stopped TCP listener on [::]:5672

2020-04-01 18:49:47.249 [error] <0.16481.0> Error on AMQP connection <0.16481.0> (172.20.0.135:22526 -> 172.20.0.237:5672, vhost: '/', user: 'roger', state: running), channel 0:

 operation none caused a connection exception connection_forced: "broker forced connection closure with reason 'shutdown'"


[lots of those repeat as connections close]


2020-04-01 18:50:11.505 [info] <0.8.0> Log file opened with Lager

2020-04-01 18:50:13.305 [info] <0.8.0> Feature flags: list of feature flags found:

2020-04-01 18:50:13.305 [info] <0.8.0> Feature flags:   [ ] drop_unroutable_metric

2020-04-01 18:50:13.305 [info] <0.8.0> Feature flags:   [ ] empty_basic_get_metric

2020-04-01 18:50:13.305 [info] <0.8.0> Feature flags:   [ ] implicit_default_bindings

2020-04-01 18:50:13.305 [info] <0.8.0> Feature flags:   [ ] quorum_queue

2020-04-01 18:50:13.305 [info] <0.8.0> Feature flags:   [ ] virtual_host_metadata

2020-04-01 18:50:13.305 [info] <0.8.0> Feature flags: feature flag states written to disk: yes

2020-04-01 18:50:13.390 [info] <0.327.0> ra: meta data store initialised. 0 record(s) recovered

2020-04-01 18:50:13.391 [info] <0.334.0> WAL: recovering ["/var/lib/rabbitmq/mnesia/rabbit@missionq-queueserver-experiment/quorum/rabbit@missionq-queueserver-experiment/00000005.wal"]

2020-04-01 18:50:13.392 [info] <0.338.0> 

 Starting RabbitMQ 3.8.2 on Erlang 21.3.8.11

 Copyright (c) 2007-2019 Pivotal Software, Inc.

 Licensed under the MPL 1.1. Website: https://rabbitmq.com

2020-04-01 18:50:13.393 [info] <0.338.0> 

 node           : rabbit@missionq-queueserver-experiment

 home dir       : /var/lib/rabbitmq

 config file(s) : /etc/rabbitmq/rabbitmq.conf

 cookie hash    : hXNHjnSRJHlc6uC0fTYImQ==

 log(s)         : /var/log/rabbitmq/rab...@missionq-queueserver-experiment.log

                : /var/log/rabbitmq/rabbit@missionq-queueserver-experiment_upgrade.log

 database dir   : /var/lib/rabbitmq/mnesia/rabbit@missionq-queueserver-experiment

2020-04-01 18:50:13.404 [info] <0.338.0> Running boot step pre_boot defined by app rabbit

2020-04-01 18:50:13.404 [info] <0.338.0> Running boot step rabbit_core_metrics defined by app rabbit

2020-04-01 18:50:13.404 [info] <0.338.0> Running boot step rabbit_alarm defined by app rabbit

2020-04-01 18:50:13.407 [info] <0.344.0> Memory high watermark set to 2861 MiB (3000000000 bytes) of 7680 MiB (8053833728 bytes) total

2020-04-01 18:50:13.409 [info] <0.346.0> Enabling free disk space monitoring

2020-04-01 18:50:13.409 [info] <0.346.0> Disk free limit set to 1000MB

2020-04-01 18:50:13.412 [info] <0.338.0> Running boot step code_server_cache defined by app rabbit

2020-04-01 18:50:13.412 [info] <0.338.0> Running boot step file_handle_cache defined by app rabbit

2020-04-01 18:50:13.412 [info] <0.349.0> Limiting to approx 32671 file handles (29401 sockets)

2020-04-01 18:50:13.412 [info] <0.350.0> FHC read buffering:  OFF

2020-04-01 18:50:13.412 [info] <0.350.0> FHC write buffering: ON

2020-04-01 18:50:13.412 [info] <0.338.0> Running boot step worker_pool defined by app rabbit

2020-04-01 18:50:13.413 [info] <0.339.0> Will use 4 processes for default worker pool

2020-04-01 18:50:13.413 [info] <0.339.0> Starting worker pool 'worker_pool' with 4 processes in it

2020-04-01 18:50:13.413 [info] <0.338.0> Running boot step database defined by app rabbit

2020-04-01 18:50:13.414 [info] <0.338.0> Waiting for Mnesia tables for 30000 ms, 9 retries left

2020-04-01 18:50:13.415 [info] <0.338.0> Waiting for Mnesia tables for 30000 ms, 9 retries left

2020-04-01 18:50:13.415 [error] <0.337.0> CRASH REPORT Process <0.337.0> with 0 neighbours exited with reason: {{schema_integrity_check_failed,[{table_content_invalid,rabbit_durable_queue,{amqqueue,{resource,'_',queue,'_'},'_','_','_','_','_','_','_','_','_','_','_','_','_','_','_','_','_'},[{amqqueue,{resource,<<"/">>,queue,<<"mq_facts">>},true,false,none,[],<0.16806.0>,[],[],[],[{vhost,<<"/">>},{name,<<"queue-size-limit">>},{pattern,<<"^mq_facts$">>},{'apply-to',<<"all">>},{definition,[{<<"max-length-bytes">>,72000000000}]},{priority,0}],undefined,[],undefined,stopped,0,[],<<"/">>,#{user => <<"ro...">>},...}]}]},...} in application_master:init/4 line 138

2020-04-01 18:50:13.416 [info] <0.43.0> Application rabbit exited with reason: {{schema_integrity_check_failed,[{table_content_invalid,rabbit_durable_queue,{amqqueue,{resource,'_',queue,'_'},'_','_','_','_','_','_','_','_','_','_','_','_','_','_','_','_','_'},[{amqqueue,{resource,<<"/">>,queue,<<"mq_facts">>},true,false,none,[],<0.16806.0>,[],[],[],[{vhost,<<"/">>},{name,<<"queue-size-limit">>},{pattern,<<"^mq_facts$">>},{'apply-to',<<"all">>},{definition,[{<<"max-length-bytes">>,72000000000}]},{priority,0}],undefined,[],undefined,stopped,0,[],<<"/">>,#{user => <<"ro...">>},...}]}]},...}



Thanks for any help you can offer!

Ezra


Ezra Cooper

Apr 14, 2020, 4:21:04 PM
to rabbitm...@googlegroups.com
A short follow-up: we have since seen the schema_integrity_check_failed error again, also on a reboot. We're not doing anything strange, as far as I can tell: just running a Linux-level `reboot` and expecting RabbitMQ to be able to read its database when it comes up.

Searching the internet for the error, I see lots of pages advising you to delete the database if this happens. Of course that's not an attractive solution for us (we're persisting the messages for a reason!). It seems there have historically been various causes for this corruption.

If anyone knows special procedures for preventing this corruption, I'd be interested to hear them.

Cheers,
Ezra

Luke Bakken

Apr 14, 2020, 7:39:27 PM
to rabbitmq-users
Hi Ezra,

You need to be sure that RabbitMQ is stopped gracefully before rebooting your server.
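For example (assuming the service name installed by the official package), something like this before the reboot makes the stop explicit and verifies it completed:

```shell
# Stop the broker explicitly and let systemd wait for a clean exit,
# rather than relying on the reboot's shutdown sequence alone.
systemctl stop rabbitmq-server

# Belt and braces: rabbitmqctl status exits non-zero once the node is
# down, so abort the reboot if the broker is somehow still running.
if rabbitmqctl status >/dev/null 2>&1; then
    echo "RabbitMQ still running; not rebooting" >&2
    exit 1
fi

reboot
```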

What operating system and version are you using?

Thanks -
Luke

Ezra Cooper

Apr 15, 2020, 12:48:52 PM
to rabbitm...@googlegroups.com
On Tue, Apr 14, 2020 at 4:39 PM Luke Bakken <lba...@pivotal.io> wrote:
> Hi Ezra,
>
> You need to be sure that RabbitMQ is stopped gracefully before rebooting your server.

That's a big limitation, isn't it? I'm expecting the queue to persist across a power outage, or Amazon deciding to fail the instance (we're running on EC2 in production).

If the answer here is "RabbitMQ doesn't really make an effort to keep its writes crash-consistent," then I can accept that, though it's a bit of a disappointment. It would actually help us a lot to know that for sure.
 
We're using Ubuntu Xenial.

Thanks for your follow-up,
Ezra

Luke Bakken

Apr 15, 2020, 1:26:18 PM
to rabbitmq-users
Hi Ezra,

No matter how "write consistent" software tries to be, there is always a chance that a non-graceful stop results in corrupt data. This isn't unique to RabbitMQ, and yes, RabbitMQ writes data as consistently as possible.

schema_integrity_check_failed is an unusual error, and one that I would only expect to see if something happened during an upgrade.

Was your reboot part of an upgrade process?

Thanks -
Luke

Ezra Cooper

Apr 15, 2020, 7:08:17 PM
to rabbitm...@googlegroups.com
Luke,

Thanks. The reboot here was part of a routine redeployment, not an upgrade. We have a set of deployment scripts that do things like deploying new configuration automatically, and in some cases they perform a system reboot to ensure everything comes up clean on the new configuration. The error seems to be somewhat repeatable in my development deployment at this point, so I can experiment with it.

If schema_integrity_check_failed is something you wouldn't expect to happen on an ordinary reboot—outside of an upgrade process—then that makes me think our deployment process may be doing something a bit funky. I'll take a look to see if I can find any way it might be messing with things behind Rabbit's back.

Cheers,
Ezra



Luke Bakken

Apr 15, 2020, 7:41:15 PM
to rabbitmq-users
Hi Ezra,

A reboot should gracefully shut down all the services running on your system, including RabbitMQ. I'm assuming you've got RabbitMQ running via systemd since you probably installed it via the official package, correct?
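If your deploy scripts reboot quickly, one thing worth checking is whether systemd actually gives the broker enough time to stop before the shutdown proceeds. A sketch of a unit drop-in that raises the stop timeout (the path and the 600-second value here are illustrative, not a recommendation):

```shell
# Extend the packaged rabbitmq-server unit with a drop-in that gives
# the broker more time to stop cleanly on shutdown/reboot.
mkdir -p /etc/systemd/system/rabbitmq-server.service.d
cat > /etc/systemd/system/rabbitmq-server.service.d/stop-timeout.conf <<'EOF'
[Service]
TimeoutStopSec=600
EOF

# Pick up the new drop-in.
systemctl daemon-reload
```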