Redis instance fails to restart due to AOF corruption

7,273 views
Skip to first unread message

Jay Rolette

unread,
Sep 1, 2014, 1:41:50 PM9/1/14
to redi...@googlegroups.com
I'm periodically running into issues where, after a system crash, one of my Redis instances fails to restart. The message in the log is consistent:

[2156] 01 Sep 17:11:28.042 # Bad file format reading the append only file: make a backup of your AOF file, then use ./redis-check-aof --fix <filename>
Running the suggested command always fixes it, but I'm trying to figure out why this happens as often as it does.

Running Redis 2.8.4 on Ubuntu 14.04 LTS. On this particular instance, we are just storing system stats so some of the options we are using might be different than typical. In particular:

save 900 1
save 60 10000

appendonly yes
appendfsync no
no-appendfsync-on-rewrite no
aof-rewrite-incremental-fsync yes

It's not a surprise that I could lose the last few minutes of my stats data since we aren't forcing fsync, but I wasn't expecting Redis to not restart cleanly in that situation.

Any suggestions or ideas about why I'm seeing AOF file corruption?

Thanks,
Jay Rolette

Matt Palmer

unread,
Sep 1, 2014, 8:59:32 PM9/1/14
to redi...@googlegroups.com
On Mon, Sep 01, 2014 at 12:41:45PM -0500, Jay Rolette wrote:
> I'm periodically running into issues where, after a system crash, one of my
> Redis instances fails to restart. The message in the log is consistent:
>
> [2156] 01 Sep 17:11:28.042 # Bad file format reading the append only
> file: make a backup of your AOF file, then use ./redis-check-aof --fix
> <filename>

[...]

> It's not a surprise that I could lose the last few minutes of my stats data
> since we aren't forcing fsync, but I wasn't expecting Redis to not restart
> cleanly in that situation.
>
> Any suggestions or ideas about why I'm seeing AOF file corruption?

You're seeing AOF file corruption because your systems are crashing. In
that case, all bets are off as to whether (or what) data is actually making
it to disk, especially given the complex nature of modern filesystems.
Forcing fsync would *probably* improve matters, but there's no "atomic unit
write" syscall that will guarantee either all of an arbitrary sized chunk of
data will end up on disk, or none of it will, so you will always have the
risk that only part of an AOF command will be written to disk.

Your best solution is to stop your systems crashing. That's an extremely
abnormal behaviour. Redis (or some other userland process) crashing? Yep,
that can happen for a variety of reasons. A kernel-level crash? I'd be
taking a deeper look at that, because it *should* be a never event.

- Matt

Jay Rolette

unread,
Sep 2, 2014, 9:15:24 AM9/2/14
to redi...@googlegroups.com

On Mon, Sep 1, 2014 at 7:59 PM, Matt Palmer <mpa...@hezmatt.org> wrote:

Your best solution is to stop your systems crashing.  That's an extremely
abnormal behaviour.  Redis (or some other userland process) crashing?  Yep,
that can happen for a variety of reasons.  A kernel-level crash?  I'd be
taking a deeper look at that, because it *should* be a never event.

It's a pre-alpha embedded product, so it isn't near as rare as it might be otherwise, but in truth, it doesn't matter. Production systems do crash and I need to make sure the appliance comes back up without requiring operator intervention.

After some additional testing, it looks like I'm seeing this even after a controlled shutdown/reboot of the system. It sounds like I need to dig into the upstart script to make sure it is doing the right thing on shutdown + add explicit checking for AOF corruption during startup.

Thanks,
Jay

Matt Palmer

unread,
Sep 2, 2014, 6:37:23 PM9/2/14
to redi...@googlegroups.com
On Tue, Sep 02, 2014 at 08:15:18AM -0500, Jay Rolette wrote:
> On Mon, Sep 1, 2014 at 7:59 PM, Matt Palmer <mpa...@hezmatt.org> wrote:
> > Your best solution is to stop your systems crashing. That's an extremely
> > abnormal behaviour. Redis (or some other userland process) crashing? Yep,
> > that can happen for a variety of reasons. A kernel-level crash? I'd be
> > taking a deeper look at that, because it *should* be a never event.
>
> It's a pre-alpha embedded product, so it isn't near as rare as it might be
> otherwise, but in truth, it doesn't matter. Production systems do crash and
> I need to make sure the appliance comes back up without requiring operator
> intervention.

In that case, you might be best to patch Redis itself to stop processing the
AOF and startup when it discovers corruption. All the the corruption-fix
program does is to truncate the AOF to the end of the last-good command; the
reason why Redis doesn't automatically startup like this is to avoid the
possibility of corruption in the middle of a file from losing a huge amount
of otherwise recoverable data; if you're just automatically truncating the
AOF without examining it, Redis itself may as well do that, and save you the
extended delay of Redis trying to load the AOF, failing, running the
corruption fix (which scans the whole AOF again), and then Redis starting up
(reading the AOF a *third* time).

I've dug around in that corner of the codebase in the past, and from memory
it's a relatively simple thing to change.

> After some additional testing, it looks like I'm seeing this even after a
> controlled shutdown/reboot of the system.

Uhm... that's a pretty serious bug somewhere. Either Redis is stopping in
the middle of a write (which I didn't think was possible -- Redis issues the
write to the AOF pretty much immediately), or the OS isn't flushing to disk
before it shuts down. I'd consider that a fairly nasty bug, and one that
could potentially cause problems elsewhere.

- Matt

--
Just because we work at a University doesn't mean we're surrounded by smart
people.
-- Brian Kantor, in the monastery

Jay Rolette

unread,
Sep 2, 2014, 7:33:19 PM9/2/14
to redi...@googlegroups.com
On Tue, Sep 2, 2014 at 5:37 PM, Matt Palmer <mpa...@hezmatt.org> wrote:
On Tue, Sep 02, 2014 at 08:15:18AM -0500, Jay Rolette wrote:
> It's a pre-alpha embedded product, so it isn't near as rare as it might be
> otherwise, but in truth, it doesn't matter. Production systems do crash and
> I need to make sure the appliance comes back up without requiring operator
> intervention.

In that case, you might be best to patch Redis itself to stop processing the
AOF and startup when it discovers corruption.  All the the corruption-fix
program does is to truncate the AOF to the end of the last-good command; the
reason why Redis doesn't automatically startup like this is to avoid the
possibility of corruption in the middle of a file from losing a huge amount
of otherwise recoverable data; if you're just automatically truncating the
AOF without examining it, Redis itself may as well do that, and save you the
extended delay of Redis trying to load the AOF, failing, running the
corruption fix (which scans the whole AOF again), and then Redis starting up
(reading the AOF a *third* time).

I've dug around in that corner of the codebase in the past, and from memory
it's a relatively simple thing to change.
 
Good point on the startup performance implications. I'll look at adding a command line option that won't abort starting if the AOF file is bad. Instead it would clone the corrupted file and then truncate/fix and startup.

> After some additional testing, it looks like I'm seeing this even after a
> controlled shutdown/reboot of the system.

Uhm... that's a pretty serious bug somewhere.  Either Redis is stopping in
the middle of a write (which I didn't think was possible -- Redis issues the
write to the AOF pretty much immediately), or the OS isn't flushing to disk
before it shuts down.  I'd consider that a fairly nasty bug, and one that
could potentially cause problems elsewhere.

Yep. My guess is it is a problem with the shutdown script not giving Redis time to cleanup. It probably sends SIGTERM and then immediately bounces the box. It's on my list to chase that one down in the next couple of days.

Appreciate your thoughts Matt.

Regards,
Jay

Matt Palmer

unread,
Sep 2, 2014, 8:50:09 PM9/2/14
to redi...@googlegroups.com
Why would you clone the corrupted file? Just in case it was a mid-file
corruption and the remainder of the data could be recovered at a later time?
That'll end up consuming a lot of disk space, especially in the event of
multiple successive corruptions that aren't quickly examined and cleaned up
by an operator.

I wouldn't make a copy, in any event. I'd change the startup logic to be:

Load AOF
If corruption found:
Immediately trigger a rewrite, to get a clean base AOF
when AOF rewrite complete, move corrupted AOF out of the way instead of
overwriting it

Moving the corrupted AOF at the end of the rewrite avoids an extended period
when there's no AOF to load if Redis restarts, while avoiding the need to
chew disk I/O copying a file for which two copies aren't needed.

> > After some additional testing, it looks like I'm seeing this even after a
> > > controlled shutdown/reboot of the system.
> >
> > Uhm... that's a pretty serious bug somewhere. Either Redis is stopping in
> > the middle of a write (which I didn't think was possible -- Redis issues
> > the
> > write to the AOF pretty much immediately), or the OS isn't flushing to disk
> > before it shuts down. I'd consider that a fairly nasty bug, and one that
> > could potentially cause problems elsewhere.
>
> Yep. My guess is it is a problem with the shutdown script not giving Redis
> time to cleanup. It probably sends SIGTERM and then immediately bounces the
> box. It's on my list to chase that one down in the next couple of days.

By "bouncing the box", you mean you're just hard-resetting the CPU, rather
than giving the kernel time to shutdown cleanly? That'll end badly for all
*sorts* of things. But yeah, you should definitely wait for Redis to
shutdown before letting the machine reboot.

- Matt

Jay Rolette

unread,
Sep 3, 2014, 10:29:10 AM9/3/14
to redi...@googlegroups.com
On Tue, Sep 2, 2014 at 7:49 PM, Matt Palmer <mpa...@hezmatt.org> wrote:
On Tue, Sep 02, 2014 at 06:33:11PM -0500, Jay Rolette wrote:
> Good point on the startup performance implications. I'll look at adding a
> command line option that won't abort starting if the AOF file is bad.
> Instead it would clone the corrupted file and then truncate/fix and startup.

Why would you clone the corrupted file?  Just in case it was a mid-file
corruption and the remainder of the data could be recovered at a later time?
That'll end up consuming a lot of disk space, especially in the event of
multiple successive corruptions that aren't quickly examined and cleaned up
by an operator.

Couple of reasons to clone it:

1) For diagnostics. This should/better be a rare event in production and the partial data at the end of the AOF file may give me clues about what is going on in the system at the time.

2) In the cases I've seen so far, it has always been very little data truncated from the AOF file. In those cases, the data loss won't matter much (it's just historical stats from the system). However, I don't know that it won't ever be significantly worse than that. I don't expect it to be based on how Redis manages the AOF file, but bugs and hardware failures happen. If I get a support call from a customer that suddenly loses 3 months of stats, having a possible avenue to lessen the pain is "A Good Thing(tm)".
 
I wouldn't make a copy, in any event.  I'd change the startup logic to be:

Load AOF
If corruption found:
  Immediately trigger a rewrite, to get a clean base AOF
  when AOF rewrite complete, move corrupted AOF out of the way instead of
      overwriting it

Moving the corrupted AOF at the end of the rewrite avoids an extended period
when there's no AOF to load if Redis restarts, while avoiding the need to
chew disk I/O copying a file for which two copies aren't needed.

Not sure I'm seeing the difference in the startup logic you are suggesting. Maybe I'm missing what you mean by "trigger a rewrite"? Are you suggesting creating a new empty AOF file to run from? I'm assuming you meant run "redis-check-aof --fix". Also, "move the corrupted AOF out of the way" sounds no different than cloning the AOF file.

> Yep. My guess is it is a problem with the shutdown script not giving Redis
> time to cleanup. It probably sends SIGTERM and then immediately bounces the
> box. It's on my list to chase that one down in the next couple of days.

By "bouncing the box", you mean you're just hard-resetting the CPU, rather
than giving the kernel time to shutdown cleanly?  That'll end badly for all
*sorts* of things.  But yeah, you should definitely wait for Redis to
shutdown before letting the machine reboot.

No, "bouncing the box" means "sudo reboot", so the kernel does its normal shutdown processing.

Matt Palmer

unread,
Sep 3, 2014, 5:43:26 PM9/3/14
to redi...@googlegroups.com
On Wed, Sep 03, 2014 at 09:29:02AM -0500, Jay Rolette wrote:
> On Tue, Sep 2, 2014 at 7:49 PM, Matt Palmer <mpa...@hezmatt.org> wrote:
> > I wouldn't make a copy, in any event. I'd change the startup logic to be:
> >
> > Load AOF
> > If corruption found:
> > Immediately trigger a rewrite, to get a clean base AOF
> > when AOF rewrite complete, move corrupted AOF out of the way instead of
> > overwriting it
> >
> > Moving the corrupted AOF at the end of the rewrite avoids an extended
> > period
> > when there's no AOF to load if Redis restarts, while avoiding the need to
> > chew disk I/O copying a file for which two copies aren't needed.
> >
>
> Not sure I'm seeing the difference in the startup logic you are suggesting.
> Maybe I'm missing what you mean by "trigger a rewrite"? Are you suggesting
> creating a new empty AOF file to run from?

No, I'm saying get Redis to run the same process as is done by sending a
BGREWRITEAOF command (fork, start writing a new AOF, queueing changes, when
the write is done, dump the cache of changes to the new AOF, then rename
that new AOF over the top of the existing AOF), *except* that you move the
known-corrupt AOF out of the way, for later diagnosis.

> I'm assuming you meant run "redis-check-aof --fix".

No. My plan would *never* run redis-check-aof. It would all be done
internally to redis-server.

> Also, "move the corrupted AOF out of the way" sounds no different than
> cloning the AOF file.

It's very different. Cloning an AOF means, to me, copying it. That
requires reading and writing the entire AOF, during which time Redis isn't
doing anything. Perhaps you're only going to be storing a couple of MB of
AOF, in which case who cares, but if it's any non-trivial amount of data
you'll be waiting a *long* time (comparatively) for that copy to finish.
Perhaps I'm just scarred by my days of dealing with 100GB+ AOFs and the
"fun" involved in dealing with those when they corrupted.

By moving the corrupt AOF out of the way, you're performing a single write
(a rename) instead of a large copy, which will happen nearly
instantaneously, and get your Redis up and running a lot more quickly.

- Matt

Salvatore Sanfilippo

unread,
Sep 4, 2014, 9:01:43 AM9/4/14
to Redis DB
Hello, replying here to the original poster but trying to capture
other infos / ideas from the reply.

The problem here is the following: Redis AOF is append only, so
technically speaking there is no corruption possible because of a
crash, if there are no filesystem issues or bugs in the code base.
However there is another problem, that is, an AOF file that has the
last command truncated. Redis takes care to use a single write(2) call
in order to append a full command into the AOF, but is this enough?
Not really. With ext4, for example, especially if the filesystem is
mounted without data=ordered, it happens often that after a crash your
AOF contains a last write which is not a valid command.

Now, Redis currently stops while processing such an AOF with a final
half-written command. This results into the user being required to do
the following:

1) Run the AOF fix utility.
2) Acknowledge a given fix.
3) Restart the server.

However, what should the user do if not that? Anyway the last command
will be hard to recover.
On the other side, if we don't do that, is it good that Redis starts
silently without warning the user about the condition? After all the
AOF has something wrong.

I think that this is a rare case where a subtle behavior should be
user configurable: by default I think Redis should just start if the
last command is corrupted, and instead stop if the corruption is in
the middle.
However with a configuration directive the user should be able to
specify to always stop even if the corruption is at the end.

It's quite some time I would like to have this change in Redis, and
with Redis Cluster going RC in a few weeks, and AOF being the default
for Redis Cluster, this is starting to be really important to have.

Cheers,
Salvatore
> --
> You received this message because you are subscribed to the Google Groups
> "Redis DB" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to redis-db+u...@googlegroups.com.
> To post to this group, send email to redi...@googlegroups.com.
> Visit this group at http://groups.google.com/group/redis-db.
> For more options, visit https://groups.google.com/d/optout.



--
Salvatore 'antirez' Sanfilippo
open source developer - GoPivotal
http://invece.org

"One would never undertake such a thing if one were not driven on by
some demon whom one can neither resist nor understand."
— George Orwell

Jay Rolette

unread,
Sep 4, 2014, 9:27:34 AM9/4/14
to redi...@googlegroups.com
> I think that this is a rare case where a subtle behavior should be
> user configurable: by default I think Redis should just start if the
> last command is corrupted, and instead stop if the corruption is in
> the middle.
> However with a configuration directive the user should be able to
> specify to always stop even if the corruption is at the end.

That would fit my needs perfectly, although with me using Redis in an embedded system instead of the more typical web app, I'm probably not representative of the rest of the user-base.

Can you reliably differentiate between it being only the last command that is corrupted and one in the middle? Once you hit a corrupted or partial command, are you able to trust anything in the file beyond that? Do you just look for what appears to be valid commands after the corruption and assume it wasn't the last command based on that?

Regards,
Jay

Itamar Haber

unread,
Sep 4, 2014, 9:29:09 AM9/4/14
to redi...@googlegroups.com

For completeness, how about also adding a way to configure a forced start - I.e. even if the AOF corruption is in the middle? I assume this will be helpful in cases where availability is mandatory and data is secondary - e.g. avoid cache warning as much as possible.

Jay Rolette

unread,
Sep 4, 2014, 9:55:21 AM9/4/14
to redi...@googlegroups.com
On Wed, Sep 3, 2014 at 4:25 PM, Matt Palmer <mpa...@hezmatt.org> wrote:
On Wed, Sep 03, 2014 at 09:29:02AM -0500, Jay Rolette wrote:
> On Tue, Sep 2, 2014 at 7:49 PM, Matt Palmer <mpa...@hezmatt.org> wrote:
> > I wouldn't make a copy, in any event.  I'd change the startup logic to be:
> >
> > Load AOF
> > If corruption found:
> >   Immediately trigger a rewrite, to get a clean base AOF
> >   when AOF rewrite complete, move corrupted AOF out of the way instead of
> >       overwriting it
> >
> > Moving the corrupted AOF at the end of the rewrite avoids an extended
> > period
> > when there's no AOF to load if Redis restarts, while avoiding the need to
> > chew disk I/O copying a file for which two copies aren't needed.
> >
>
> Not sure I'm seeing the difference in the startup logic you are suggesting.
> Maybe I'm missing what you mean by "trigger a rewrite"? Are you suggesting
> creating a new empty AOF file to run from?

No, I'm saying get Redis to run the same process as is done by sending a
BGREWRITEAOF command (fork, start writing a new AOF, queueing changes, when
the write is done, dump the cache of changes to the new AOF, then rename
that new AOF over the top of the existing AOF), *except* that you move the
known-corrupt AOF out of the way, for later diagnosis.

Gotcha. I wasn't familiar with BGREWRITEAOF and when you said "startup logic", I was interpreting that as something to be done in my startup scripts, not Redis itself. 
 
> Also, "move the corrupted AOF out of the way" sounds no different than
> cloning the AOF file.

It's very different.  Cloning an AOF means, to me, copying it.  That
requires reading and writing the entire AOF, during which time Redis isn't
doing anything.  Perhaps you're only going to be storing a couple of MB of
AOF, in which case who cares, but if it's any non-trivial amount of data
you'll be waiting a *long* time (comparatively) for that copy to finish.
Perhaps I'm just scarred by my days of dealing with 100GB+ AOFs and the
"fun" involved in dealing with those when they corrupted.

By moving the corrupt AOF out of the way, you're performing a single write
(a rename) instead of a large copy, which will happen nearly
instantaneously, and get your Redis up and running a lot more quickly.

Understood. I misunderstood where you were proposing do this.

In my case, I'm not worried about "fun" with 100GB+ files. While we have 256GB RAM on the system, most of that is dedicated elsewhere. The size of this Redis database is bounded... known set of stats sampled at a known frequency with EXPIRY used on all keys.

As an aside, have you ever seen corruption in the AOF files other than the last command?

Jay

Salvatore Sanfilippo

unread,
Sep 4, 2014, 12:09:48 PM9/4/14
to Redis DB
On Thu, Sep 4, 2014 at 3:27 PM, Jay Rolette <rol...@infiniteio.com> wrote:
> Can you reliably differentiate between it being only the last command that
> is corrupted and one in the middle?

Fair question, but the reply is a complex matter.

Basically what it is possible to do is to differentiate between a
corrupted format, and an unexpected end of file.

For example, every new command starts with an Array of strings, so
with the "*" character. If instead of "*" we find "Z" then the file is
corrupted, and we don't recover.
However files can get corrupted in a way that we detect it like an
unexpected end of file. Example:

*1\r\n
$4\r\n
PING\r\n

This is valid protocol, assume you find this in the middle of an AOF
file (actually not possible since PING is read-only).

It is possible that because of a corruption, the chars "\r\nPING" are
replaced with "123456", so the file content is turned into:

*1\r\n
$4123456\r\n

This is a corruption, but will be detected as an EOF condition if
there is not enough room in the file.

However we can't detect this kind of corruptions in a reliable way
generally. For what is worth the command may turn into FLUSHALL, so to
pretend we can protect from corruption here is not wise.
Basically I would say that to make the option work only when an
unexpected EOF is detected is safe enough, or at least, the safety
because of odd corruptions is not a big price to pay compared to the
availability concerns of the current setup.

I've already started a new branch that distinguish the two errors, now
to implement the new option is a breeze. I'll do it in the next 24h
and report here.

Salvatore

Jay Rolette

unread,
Sep 4, 2014, 12:24:53 PM9/4/14
to redi...@googlegroups.com
Excellent! That seems completely reasonable, Salvatore. Any chance we'll be able to get this in 2.8? :-)



Matt Palmer

unread,
Sep 4, 2014, 4:19:32 PM9/4/14
to redi...@googlegroups.com
On Thu, Sep 04, 2014 at 08:55:13AM -0500, Jay Rolette wrote:
> As an aside, have you ever seen corruption in the AOF files other than the
> last command?

No, I haven't, but I've always run systems that were *very* heavily
engineered against the sorts of things that tend to cause file corruption.

- Matt

Matt Stancliff

unread,
Sep 4, 2014, 5:43:48 PM9/4/14
to redi...@googlegroups.com

On Sep 4, 2014, at 4:19 PM, Matt Palmer <mpa...@hezmatt.org> wrote:

> No, I haven't, but I've always run systems that were *very* heavily
> engineered against the sorts of things that tend to cause file corruption.

Generic file corruption is a good and often overlooked point too. The AOF file doesn’t have any checksum or internal consistency guarantees, so if your disk corrupts the journal data in a transparent way, Redis will happily reload and serve corrupt data (assuming corruption is in the data portions and not the inline protocol specification).

There’s an open GitHub issue to address persistence improvements (improve RDB format by adding K/V metadata), so maybe after that happens we can try to improve AOF integrity. I think there’s also a plan to enable RDB+AOF reload operations so we can eventually have something resembling a traditional “big binary compact checksum’d database file” with smaller real time differential binlog/journal.


-Matt

Salvatore Sanfilippo

unread,
Sep 8, 2014, 5:06:30 AM9/8/14
to Redis DB
Checksumming AOF files could be interesting, but is hard to get a
perfect solution. What it is possible to do is to add a command with
the current checksum, from time to time, but the last part of the file
will not be protected, unless we want to emit a new checksum for every
new command appended which is a very big overhead. Let's call this
system "Checkpointed CRCs", so the file would look like:

command
command
command
... more commands ...
CRC f8d2a83b
... more commands...
CRC 67bb8489
... more commands ...


What's bad with "checkpointed CRCs" is that there is no way to tell if
the file was hardly truncated. It can only protect all the existing
parts of the file minus the last block. It is still better than
nothing probably...

Salvatore

Salvatore Sanfilippo

unread,
Sep 8, 2014, 5:08:39 AM9/8/14
to Redis DB
I just merged this stuff into unstable, 3.0 and 2.8 branch, so
definitely yes, the next 2.8 patch release will include it since for
experience I can tell *most* users want to be able to load truncated
AOFs instead of having the server aborting. After all what 99.99% of
the users will do after a truncated AOF is detected by the server is
to run the aof fix utility accepting whatever the utility will suggest
to fix the file, and restart the server. Same safety more or less, a
lot of downtime. However the setting is now a configuration directive
so users have the ability to pick the behavior they like more.

Salvatore

Jay Rolette

unread,
Sep 8, 2014, 8:44:35 AM9/8/14
to redi...@googlegroups.com
Excellent! Thanks for knocking this out so quickly.

Jay
Reply all
Reply to author
Forward
0 new messages