Why is AOF used as the "base" file, rather than RDB?


Matthew Palmer

Jun 9, 2012, 9:01:50 PM
to redi...@googlegroups.com
I'm interested in knowing why, when using AOF, the "minimal form" of the
database is still AOF, rather than RDB. It seems to me that a much better
model would be to use RDB to store the point-in-time "minimal set" with an
AOF to just store the changes since the RDB dump was made. I'm interested
in knowing why this is a bad idea, before I sit down and modify Redis to
work this way.

We've recently been reviewing the way we configure Redis internally, as
we're doing a lot more of it, and we decided to improve persistence by using
AOF. It's great and dandy, except it goes through disk space like a crazy
thing. We've been bitten a couple of times by a disk filling up too quickly
and Redis crashing due to not being able to write the AOF. That's not what
I'd like to discuss, though -- fixing that problem is just a matter of
provisioning lots and lots of disk, which we can do.

As part of our analysis of the situation, though, the question was asked
"why the hell does Redis use an AOF for the base dump?" (actually, the
language was somewhat more colourful, as we were recovering a large
production database that had just done itself in). It was a question I
couldn't answer, since it does actually make a lot of sense to use RDB, a very
space-efficient format, rather than AOF, for the base dumps, given that they
are different representations of the same data.

The way I would expect the process to happen would be as follows:

1) When an AOF rewrite is triggered (automatically or manually), a child is
spawned to dump the point-in-time to an RDB file.

2) Immediately, the parent starts writing out the changes to a new AOF,
which avoids the need to cache anything in memory. The name of this file
must be known to the child which is dumping the RDB.

3) At the conclusion of the RDB dump, a final record (a new type) is written
indicating the name of the AOF which contains the delta between the
point-in-time and current reality. (Hence why the child needs to know
the name of the AOF the parent is now appending to).

4) The new RDB is renamed on top of the old snapshot, and the AOF associated
with that previous snapshot is no longer written to and is removed.
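In rough Python form, the four steps could look like this (the JSON stand-in for RDB and the file/trailer names are invented for illustration; only the control flow follows the steps above):

```python
import json
import os

def rewrite_snapshot(dataset, dirpath, generation):
    """Simulate the proposed rewrite. 1) dump a point-in-time snapshot,
    2) start a fresh delta AOF, 3) record that AOF's name in the
    snapshot's trailer, 4) rename the snapshot into place atomically."""
    delta_aof = "delta-%d.aof" % generation          # step 2: parent appends new writes here
    open(os.path.join(dirpath, delta_aof), "w").close()
    tmp = os.path.join(dirpath, "dump.rdb.tmp")
    with open(tmp, "w") as f:
        json.dump(dataset, f)                        # step 1: point-in-time dump
        f.write("\nAOF:" + delta_aof + "\n")         # step 3: trailer names the delta AOF
    os.rename(tmp, os.path.join(dirpath, "dump.rdb"))  # step 4: atomic replace
    return delta_aof
```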

On startup, Redis would then always look for an RDB. If it sees a record in
the RDB that says "AOF file is over there ---->" it reads that file for the
incremental changes to bring itself completely up to date. This would
*simplify* the start time logic and reduce the chances of spectacular fail,
because you wouldn't have the current situation where a perfectly valid
dump.rdb would be ignored just because appendonly=true and no appendonly.aof
was present (an unpleasant trap for new players).
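A minimal sketch of that startup path, assuming the same kind of invented trailer format (an `AOF:<name>` line after the snapshot) and simple `SET key value` records:

```python
import json
import os

def load_on_startup(dirpath):
    """Simulated startup: always load dump.rdb; if its trailer points at
    a delta AOF that exists, replay it to bring the dataset up to date.
    The file formats here are invented stand-ins, not real Redis formats."""
    with open(os.path.join(dirpath, "dump.rdb")) as f:
        lines = f.read().splitlines()
    dataset = json.loads(lines[0])
    trailers = [l[4:] for l in lines[1:] if l.startswith("AOF:")]
    if trailers and os.path.exists(os.path.join(dirpath, trailers[0])):
        with open(os.path.join(dirpath, trailers[0])) as f:
            for record in f.read().splitlines():     # replay incremental changes
                _, key, value = record.split(" ", 2)
                dataset[key] = value
    return dataset
```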

The only problem with this mechanism we could come up with is the
possibility that disk I/O rates might be higher with this method, however we
came to the conclusion that it probably wouldn't be a problem in practice,
because the RDB is so much smaller. During rewrite at present, you've got
two things writing to disk:

* The existing AOF to capture changes still going on; and
* The AOF being rewritten.

With this new method, you would have three writes going on:

* The existing AOF, as before;
* The new AOF, which is recording the changes since the dump began; and
* The RDB.

Given the huge size difference between a minimal-changes AOF and an RDB of
the same data (most of an order of magnitude, in our experience), my gut
feeling is that there would actually be a *reduction* in I/O if this
mechanism were used, because rather than writing a 54GiB minimal AOF, you're
writing a 7.2GiB RDB (to take numbers off one of our servers currently in
use).
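Plugging in the numbers above (and assuming the delta traffic is roughly equal under both schemes, so it cancels out):

```python
# Figures from the post: a 54 GiB minimal AOF vs a 7.2 GiB RDB of the
# same dataset. The concurrent delta writes happen under either scheme,
# so only the bulk rewrite differs between the two.
current_bulk_gib = 54.0    # minimal AOF written by today's rewrite
proposed_bulk_gib = 7.2    # RDB snapshot under the proposed scheme
ratio = current_bulk_gib / proposed_bulk_gib
assert 7 < ratio < 8       # roughly 7.5x less bulk data written per rewrite
```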

If disk I/O *were* a concern, though, the existing in-memory change cache
could still be utilised, and the new delta AOF only written once the dump
was complete, but before renaming the RDB and stopping the old AOF write, to
ensure we always have a complete dataset available.

Thus, I have two questions:

1) Was this method of persistence considered during the development of AOF,
and if so, what have I not considered that makes it such a bad idea?

2) If it wasn't considered, and others think it's a decent idea, would a
patch to modify the AOF persistence mechanism to use this method rather
than the current AOF rewrite method be considered for inclusion
(presumably for Redis 2.8)?

Unless there's a glaring hole in my scheme I can't see, I'll almost
certainly implement this method for our internal use; I'm asking #2 to gauge
whether it's worth submitting the change upstream, or just leave it in our
internal fork.

Thanks for Redis, it's a great system.

- Matt

Hampus Wessman

Jun 10, 2012, 1:19:58 PM
to redi...@googlegroups.com
Hello Matthew,

Interesting e-mail. I think this is a very good idea (and I even
suggested something similar in a comment here:
https://github.com/antirez/redis/issues/89).

As far as I know, it simply wasn't done that way to begin with and then
nobody has changed it. The only potential challenge that I see with this
change is to make sure it's easy to upgrade. I think it would be worth
it. Salvatore or Pieter will have to comment on whether a patch might
get included, but +1 from me!

I have a few small comments and ideas below.

When doing an AOF rewrite right now, the total disk I/O consists of 1.
rewriting the AOF 2. writing all concurrent changes to the old AOF and
3. writing all concurrent changes to the new AOF (although it's cached
until after the base rewrite). The total amount of disk I/O would in
other words definitely be less with the new system (which may be the
most important in the end), because 2 and 3 would be the same and 1
would require writing less data. I think we can do even better, though.

Writing all changes that take place during the "rewrite" (maybe we
should call it a "snapshot" or "checkpoint" now) twice is a waste. If we
added support for segmented AOF files, where we could start a new AOF
file whenever we wanted and Redis then would read them all in order when
loading the AOF (so we still don't risk losing any changes), we could
get rid of that too. In that case we could simply start a background RDB
dump and at the same time start a new AOF segment. The new RDB would
point to the new AOF segment, but until the dump was finished we could
just load an old RDB dump from before and start reading at the AOF
segment pointed to by that RDB (and continue reading all the newer AOF
segments). No need to write any AOF data twice then. Old AOF segments
could be deleted once they weren't needed anymore.
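The recovery rule could be sketched like this (all data structures invented for illustration: each snapshot records which segment index it points at, and an unfinished dump simply isn't in the snapshot list yet, so recovery falls back to the previous one automatically):

```python
def recover(snapshots, segments):
    """snapshots: list of (data_dict, first_segment_index), oldest first.
    segments: list of write lists, where each write is a (key, value) pair.
    Load the newest finished snapshot, then replay every AOF segment from
    the one it points at onward, in order, so no changes are lost."""
    data, start = snapshots[-1]
    data = dict(data)                    # don't mutate the snapshot itself
    for segment in segments[start:]:     # replay newer segments in order
        for key, value in segment:
            data[key] = value
    return data
```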

With that small modification, we would only write changes to the AOF
once (as always) and write the RDB snapshot (which is fairly compact).
It doesn't get much better than that!

My suggestion is to make this change as you described it, but do it with
AOF segments like above. I hope it's fairly clear what I mean. There may
be other interesting things we could do with AOF segments later on (like
incremental backup/archiving). I think it would be quite easy to
implement too. Maybe I'm missing some problem with this modified
approach, though? What do the rest of you think? I think this would be a
good change exactly as Matthew describes it too.

Cheers,
Hampus

Matthew Palmer

Jun 10, 2012, 7:43:47 PM
to redi...@googlegroups.com
On Sun, Jun 10, 2012 at 07:19:58PM +0200, Hampus Wessman wrote:
> Interesting e-mail. I think this is a very good idea (and I even
> suggested something similar in a comment here:
> https://github.com/antirez/redis/issues/89).

Oh, neat. Glad we're not the only ones to have come up with this idea --
makes me think it's not completely crazy. It's a pity nobody's done this
before, but meh.

> As far as I know, it simply wasn't done that way to begin with and then
> nobody has changed it. The only potential challenge that I see with this
> change is to make sure it's easy to upgrade.

That's not a huge issue, IMO -- we could easily support the fallback
position of "if appendonly=yes and appendonly.aof exists, load that instead
of dump.rdb then immediately trigger a bgrewriteaof to persist in the new
format", and have bgrewriteaof delete appendonly.aof when it's done (if it
exists). It's not backwards compatible, of course, but the 2.2->2.4 upgrade
wasn't either (RDB file format changes... aiee!).

Replication will probably need to change, too, now that I think about that.
I'm not particularly familiar with how it works, but I believe it takes a
client dump and caches subsequent changes in memory to replay to the slave
when the dump's loaded. Presumably we could skip the cache-in-memory part
and just rerun the delta AOF on the client once the RDB is across and
loaded.

> Writing all changes that take place during the "rewrite" (maybe we
> should call it a "snapshot" or "checkpoint" now) twice is a waste. If we
> added support for segmented AOF files, where we could start a new AOF
> file whenever we wanted and Redis then would read them all in order when
> loading the AOF (so we still don't risk losing any changes), we could
> get rid of that too.

Hot *damn* that's a good idea. It'd only require a new command, perhaps
"loadaof", which took a filename (relative to dir) to run. Then the last
command of the AOF we're finished writing to would be "loadaof <newaof>".
Once the new RDB dump is complete, we just delete the previous AOF when we
nuke the dump.rdb, and there's no muss, no fuss. Disk I/O reduction ftw!
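A sketch of how that chain could be replayed at load time (the "loadaof" record and the file names are hypothetical, taken straight from the paragraph above):

```python
import os

def replay_chain(dirpath, first_aof, apply_cmd, state):
    """Replay an AOF; when its final record is 'loadaof <name>', continue
    with that file. apply_cmd(state, record) applies one logged command."""
    name = first_aof
    while name is not None:
        nxt = None
        with open(os.path.join(dirpath, name)) as f:
            for record in f.read().splitlines():
                if record.startswith("loadaof "):
                    nxt = record.split(" ", 1)[1]   # hand off to the next segment
                else:
                    apply_cmd(state, record)
        name = nxt
    return state
```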

> My suggestion is to make this change as you described it, but do it with
> AOF segments like above. I hope it's fairly clear how I mean. There may
> be other interesting things we could do with AOF segments later on (like
> incremental backup/archiving).

That'd be trivially implementable with a command "checkpointaof", which
simply opened a new AOF, wrote "loadaof <newaof>" to the old one, and
continued on its merry way.

- Matt

--
If only more employers realized that people join companies, but leave
bosses. A boss should be an insulator, not a conductor or an amplifier.
-- Geoff Kinnel, in the Monastery

Pieter Noordhuis

Jun 11, 2012, 12:52:14 PM
to redi...@googlegroups.com
Hi Matt, Hampus,

This has never been taken care of because there was no pressing need
(or no pressing need that we know/knew of). I see that a 7x bigger AOF
poses a problem both in terms of I/O on save/rewrite/load and in terms
of capacity. We can definitely do something to improve this.

I think that the idea that is put forward is very good. It allows us
to use a combination of the compact and point-in-time RDB and the more
verbose and incremental AOF. There is another augmentation to the idea
that, if implemented, can serve yet another goal. When a background
save is kicked off, the proposal is to start appending to a new AOF
segment. The RDB can then point to this segment in its trailing bytes
and Redis can continue applying writes where the snapshot left off.

Something very similar happens in replication. When a slave connects
to a master, the master triggers an RDB dump and starts collecting all
writes in memory. When the RDB is saved and transferred to the slave,
the master continues sending the writes it has collected since the RDB
dump was created. Instead of keeping these writes in memory
specifically for this slave, we can just keep a pointer to an offset
in the AOF that is already being written, and stream off of that.

Another thing we have been contemplating is resistance against short
master/slave link outages. The slave should be able to tell the master
it needs all writes since some time T, which can both reduce the load
on the master, and reduce the time until the slave is in sync again.
To allow this on top of the proposed persistence scheme, we can tag
every write with a sequence number. The RDB filename can contain the
sequence number of the last write it included, and the AOF filename
can contain the sequence number where it starts. The two can then be
combined by matching this sequence number. In addition, a slave can
use this sequence number as a pointer to the "last write received"
(provided it also knows the run ID of the master to prevent mismatches
between sequence IDs). On an outage, it can issue a SYNC SINCE <X> to
only resync the piece that is missing instead of triggering a full
RESYNC. Unless writes for sequence number X are no longer stored on
disk, the master only needs to stream all AOF writes since X to this
slave for it to be up to date again.
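The partial-resync decision could be sketched in a few lines (the names and the list-based stand-in for the on-disk AOF are illustrative only):

```python
def writes_since(log, first_seq, run_id, slave_run_id, slave_seq):
    """Return the writes a slave is missing after having seen sequence
    number slave_seq, or None when only a full resync is safe: either
    the run IDs mismatch (a different master lifetime) or the requested
    writes are no longer stored on disk. log[0] has number first_seq."""
    if run_id != slave_run_id:
        return None                       # sequence numbers aren't comparable
    if slave_seq + 1 < first_seq:
        return None                       # requested writes already deleted
    return log[slave_seq + 1 - first_seq:]
```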

All in all, this would greatly reduce the complexity of choice between
AOF/RDB because they can complement in this scheme instead of being
totally isolated.

Thanks for bringing this up.

Cheers,
Pieter
> --
> You received this message because you are subscribed to the Google Groups "Redis DB" group.
> To post to this group, send email to redi...@googlegroups.com.
> To unsubscribe from this group, send email to redis-db+u...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/redis-db?hl=en.
>

Matthew Palmer

Jun 11, 2012, 5:40:55 PM
to redi...@googlegroups.com
On Mon, Jun 11, 2012 at 09:52:14AM -0700, Pieter Noordhuis wrote:
> Another thing we have been contemplating is resistance against short
> master/slave link outages. The slave should be able to tell the master
> it needs all writes since some time T, which can both reduce the load
> on the master, and reduce the time until the slave is in sync again.
> To allow this on top of the proposed persistence scheme, we can tag
> every write with a sequence number. The RDB filename can contain the
> sequence number of the last write it included, and the AOF filename
> can contain the sequence number where it starts. The two can then be
> combined by matching this sequence number. In addition, a slave can
> use this sequence number as a pointer to the "last write received"
> (provided it also knows the run ID of the master to prevent mismatches
> between sequence IDs). On an outage, it can issue a SYNC SINCE <X> to
> only resync the piece that is missing instead of triggering a full
> RESYNC. Unless writes for sequence number X are no longer stored on
> disk, the master only needs to stream all AOF writes since X to this
> slave for it to be up to date again.

I'd be worried about having to rescan an entire (large) AOF to find which
command corresponded to the sequence number the client provided. My
recommendation would be to make the "sequence number" actually a combination
of some sort of identifier of the AOF file, and a byte offset in that file.
No tagging required in the file, the slave says "SYNC SINCE 12345-6789", and
the master goes and opens segment12345.aof, seeks to byte offset 6789, and
fires away.
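In sketch form (the "segment<N>.aof" naming is an assumption; the point is that the master just seeks, with no per-record tagging):

```python
import os

def stream_from(dirpath, cursor):
    """Cursor scheme sketch: '12345-6789' names AOF segment 12345 and a
    byte offset 6789 within it. The master opens the segment, seeks, and
    streams everything from there."""
    seg_id, offset = (int(part) for part in cursor.split("-"))
    with open(os.path.join(dirpath, "segment%d.aof" % seg_id), "rb") as f:
        f.seek(offset)          # jump straight to the first missing write
        return f.read()
```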

> All in all, this would greatly reduce the complexity of choice between
> AOF/RDB because they can complement in this scheme instead of being
> totally isolated.
>
> Thanks for bringing this up.

So is that an encouragement to submit a hefty patch (or 8 or 9), or have you
folded it into the todo list and it'll be done by the time this e-mail hits
the list? <grin> (I've heard the stories of lightning fast bugfixes...)

- Matt

--
Microsoft: We took the "perfect" out of "Wordperfect"

Hampus Wessman

Jun 12, 2012, 2:26:26 PM
to redi...@googlegroups.com
I agree. This would definitely be useful. We could just keep old AOF
segments around for a while as needed for this (maybe even keep track of
how far behind slaves are and try not to delete those segments).

>> All in all, this would greatly reduce the complexity of choice between
>> AOF/RDB because they can complement in this scheme instead of being
>> totally isolated.
>>
>> Thanks for bringing this up.
>
> So is that an encouragement to submit a hefty patch (or 8 or 9), or have you
> folded it into the todo list and it'll be done by the time this e-mail hits
> the list? <grin> (I've heard the stories of lightning fast bugfixes...)
>

I'm interested in helping out with this, if it would be helpful. I
could write some (or all) code and/or do some code reviewing and
testing. I haven't had time before, but now I have more time again. I
will look into it more quite soon, but feel free to do so too. I'll
probably create an experimental branch on my GitHub later...

Cheers,
Hampus

Hampus Wessman

Jun 12, 2012, 9:57:02 PM
to redi...@googlegroups.com
I created a quick implementation of this over here:
https://github.com/hampus/redis/commits/betteraof

Comments and suggestions are welcome. It actually does all the basic
things that we discussed here already (replication and the like haven't
been touched, though). I introduced two new commands (one that is only
used in the AOFs and one to create new AOF segments). The code style
needs a little fixing, because it doesn't match the rest of Redis
completely. I will have a more careful look at all this tomorrow.

Cheers,
Hampus

Salvatore Sanfilippo

Jun 13, 2012, 5:47:30 AM
to redi...@googlegroups.com
Hello,

I'm replying to the original message opening this interesting
discussion, but actually it's a reply for all the ideas that were
submitted to this thread. Splitting the reply into sections for easy
parsing.

# 1) Why is the RDB not used as the base type in AOF rewrites?

It's a pretty straightforward idea. Even if one wants to retain the
idea that a Redis dump, be it RDB or AOF, is a single file (which is
an advantage in terms of operations), it is possible to do something
like this in the AOF file:

RDB:<bytes>\r\n
... RDB dump payload ...
... Normal commands logged ...

So it's still one file, with identical semantics, but just the initial
base type is binary.

There are only two disadvantages with this approach (and many advantages):

1) You lose the ability to easily process the AOF with a script or the like.
2) When RDB and AOF are enabled you lose the advantage of having data
encoded in two formats by two different code paths (resistance to bugs
corrupting the DB).

Compatibility is not an issue in the above case, because old AOFs
would not start with "RDB" so Redis will simply process the file
sequentially as a standard AOF file.
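A sketch of that detection logic (the header syntax is taken from the example above; everything else, including the stand-in payloads, is invented):

```python
def split_dump(blob):
    """Single-file layout sketch: a leading 'RDB:<bytes>\r\n' header means
    the next <bytes> bytes are an RDB payload and the remainder is a normal
    command log; anything else is treated as a legacy AOF and returned
    unchanged for sequential processing."""
    if not blob.startswith(b"RDB:"):
        return None, blob                    # old-style AOF: play it sequentially
    header, rest = blob.split(b"\r\n", 1)
    size = int(header[len(b"RDB:"):])
    return rest[:size], rest[size:]          # (rdb_payload, appended_commands)
```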

The only reason this was not done before is other priorities in the
development, and that this requires very careful design and very
careful implementation.

# 2) About 2.6 AOF.

The good news is that in 2.6 the AOF rewrite uses variadic commands to
reduce the space used to generate the AOF *a lot*.
It's not going to be as small as the RDB but to output a 10 element
list a single RPUSH command will be used instead of 10 RPUSH commands.

Not the final solution, as we can indeed improve on this with the RDB format.

# 3) Two files, one database.

In this thread it was proposed that we can have RDB dumps *and* the
AOF file expressing what changed since the RDB file, so not just a
single file. So you could have:

dump.rdb
appendonly.aof

Combined together, they would form the final dataset. I think that
while this allows some optimization it is a somewhat dangerous path,
practically speaking, for "operations" guys. Either we need to make
things more complex and reference each file inside the other in some
way, like taking checksums, or, if by mistake a server is restarted
with a mismatching set of files, something really bad can happen.

IMHO a better approach is to have just a single file as a first step,
probably an extension of the current RDB file that can optionally get
commands in the protocol format (like the AOF itself) logged at the
end, with a single utility to check if the file is sane, with a single
file extension for all the kind of Redis dumps (.rdb), and so forth.

So the rewrite would work like this:

1) Start writing the RDB file in background.
2) Append changes in memory (or inside a file, see next sections).
3) Flush AOF-format changes at the end of the RDB file.
4) Rename into the final place.

Of course the command line tools should be able to tell us easily how
big is the RDB section, how big is the appended part, and so forth.

# 4) About AOF and replication link: format is different.

It's important to remember that the AOF format and the replication
link format are not compatible. We rewrite certain commands for the
replication link, so I think that mixing the two is not ok. Maybe in
the future, but given the sensitive nature of these changes it is
better to move forward step by step :)

# 5) Accumulate AOF differences while rewriting: Memory or Disk?

Currently we accumulate writes during the rewrite using an in-memory
buffer. This uses memory indeed, but especially if the base format
will be the RDB, it is generated in a decent amount of time so not too
much memory is used.
It is still possible to write the difference into a file that is later
appended in the RDB file by the parent, but my feeling is that the
current approach using a single write(2) call is a lot less time
consuming to perform in the main thread, leading to less latency.

Btw the important thing here is: we can do that in the future, but
just start trying to make RDB persistence as fast as possible. Maybe
we'll realize that the in-memory buffer is good because it takes
memory proportional to the memory already used for data (since RDB
write time is proportional to data size), and this may often be a
small percentage of the memory currently used.

If we instead find that it's cool to provide an option to use a file
instead of memory, we can do it later.

# 6) On segmented AOF.

Segmented AOF is a mess compared to a single file from an operational
point of view, however it offers many advantages like the ones in this
context, but also the ability to offline compact pieces of the AOF
file. If the benefits are huge then it's worth it, but at this stage
it's probably not a good idea just to save the in-memory AOF rewrite
buffer.

So... this is how I see this issue:

1) Move forward slowly starting with just the replacement of RDB in
the AOF rewrite.
2) Keep the single-file persistence.
3) Use only a single file format, the RDB one, that will always start
with an RDB dump plus an optional AOF section.
4) Still make the new Redis version able to read old AOF files; we'll
no longer have 'appendonly.aof' files in the future, but just RDB, so
this will be easy.
5) Optimize the RDB generation in order to make it as fast as possible.

Later:

6) Switch the AOF section to a binary format, so that it will be much
better to both store and transfer it to the replication link.

Conceptually it will still be command arg arg arg, but instead of
*3\r\n...$2342... it will be a binary encoding, with something like a
16-bit command opcode, the number of arguments and the lengths as
32-bit numbers, followed by the data composing the arguments.
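As a sketch of such a record (field widths as described above; the exact layout, endianness included, is hypothetical):

```python
import struct

def encode_record(opcode, args):
    """Hypothetical binary log record: a 16-bit command opcode, then the
    argument count and each argument's length as 32-bit numbers, each
    length followed by the raw argument bytes."""
    out = struct.pack("<HI", opcode, len(args))
    for arg in args:
        out += struct.pack("<I", len(arg)) + arg
    return out

def decode_record(buf):
    """Inverse of encode_record, for a single record."""
    opcode, nargs = struct.unpack_from("<HI", buf, 0)
    pos, args = 6, []                    # 2-byte opcode + 4-byte count
    for _ in range(nargs):
        (size,) = struct.unpack_from("<I", buf, pos)
        args.append(buf[pos + 4:pos + 4 + size])
        pos += 4 + size
    return opcode, args
```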

# User interface

How do we expose this to the user? Along with a single persistence
format we also want to tell a single persistence story to our users
inside redis.conf.

I think something like this will work:

1) We can retain the "save" thing, that will simply force an AOF
rewrite when the save point is triggered, generating the new RDB file.
2) a new option like 'rdb-append-changes yes/no' will be added to
select if the user just want snapshots or snapshots with logs of
commands.
3) Kill the BGREWRITEAOF command; it will just be BGSAVE. If
rdb-append-changes is set to yes it will work as a rewrite.
4) Add a new redis.conf option so that on rewrites the RDB file will
also be copied into dump.rdb.base, which is just the base file without
any appending performed on it, so that people can still have a
single-file point-in-time snapshot that they can copy around.
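As a concrete picture, a hypothetical redis.conf fragment (options 1-3 use the names given above; the option name for point 4 is invented here, since none is given):

```
# save points still work; they now trigger a background rewrite
save 900 1

# log commands after the RDB payload (no = plain snapshots only)
rdb-append-changes yes

# hypothetical name for option 4: also keep a pristine copy of the
# base snapshot as dump.rdb.base, with nothing appended to it
rdb-keep-base-copy yes
```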

Feedback? Does this make sense?

I think that in this simple form we can have this stuff in 2.8 with
little effort and, hopefully, few bugs.

We can then iterate again to improve it for 3.0.

Cheers,
Salvatore



--
Salvatore 'antirez' Sanfilippo
open source developer - VMware
http://invece.org

Beauty is more important in computing than anywhere else in technology
because software is so complicated. Beauty is the ultimate defence
against complexity.
       — David Gelernter

Pedro Melo

Jun 13, 2012, 9:48:05 AM
to redi...@googlegroups.com
Hi,

On Wed, Jun 13, 2012 at 10:47 AM, Salvatore Sanfilippo
<ant...@gmail.com> wrote:
> # 3) Two files, one database.
>
> In this thread it was proposed that we can have RDB dumps *and* the
> AOF file expressing what changed since the RDB file, so not just a
> single file. So you could have:
>
> dump.rdb
> appendonly.aof
>
> That combined together would form the final dataset. I think that
> while this allows some optimization it is a somewhat dangerous path
> practically speaking and for "operations" guys. Either we need to make
> things more complex and reference each file inside each other in some
> way, like taking checksums, or if for an error a server is restarted
> with a mismatching set of files something really bad can happen.

Actually, operations teams for SQL databases have this concept
already. While you are backing up a SQL DB's data files, you usually
accumulate operations in a journal.

Restoring means copying the data files into the proper place and
replaying the journal.

So yes, maybe it is more complex, but it is a complexity that
operations people are already fluent with.

Bye,
--
Pedro Melo
@pedromelo
http://www.simplicidade.org/
http://about.me/melo
xmpp:me...@simplicidade.org
mailto:me...@simplicidade.org

Salvatore Sanfilippo

Jun 13, 2012, 9:51:11 AM
to redi...@googlegroups.com
On Wed, Jun 13, 2012 at 3:48 PM, Pedro Melo <me...@simplicidade.org> wrote:
> So yes, maybe it is more complex, but it is a complexity that
> operations people are already fluent with.

That's true but the real 10x improvement here is using the RDB file
instead of the AOF as the base, so maybe at this stage at least it's
not worth it to split the persistence into multiple physical files,
but just into different logical sections.

Cheers,
Salvatore

Hampus Wessman

Jun 13, 2012, 3:25:30 PM
to redi...@googlegroups.com
This is a very interesting discussion. My comments below.

Salvatore Sanfilippo wrote 2012-06-13 11:47:
>
> # 1) Why the RDB is not used as base type in AOF rewrites?

> So it's still one file, with identical semantics, but just the initial
> base type is binary.
>
> There are only two disadvantages with this approach (and many advantages):
>
> 1) You lost the ability to easily process the AOF with a script or alike.
> 2) When RDB and AOF are enabled you lost the advantage of having data
> encoded in two formats by two different code paths (resistance to bugs
> corrupting the DB).
>

Agree. In particular 2 is a potential disadvantage. It's also an
advantage, though. By using the same code path for AOF rewrites and RDB
saves, we can focus more on optimizing and testing that single
implementation. There will be less code that can go wrong.

>
> The only reason this was not done before is other priorities in the
> development, and that this requires very careful design and very
> careful implementation.
>

Indeed. This needs to be done really carefully before we can trust it.

> # 2) About 2.6 AOF.
>
> The good news is that in 2.6 the AOF rewrite uses variadic commands to
> reduce the space used to generate the AOF *a lot*.
> It's not going to be as small as the RDB but to output a 10 element
> list a single RPUSH command will be used instead of 10 RPUSH commands.
>
> Not the final solution as we can improve this with RDB format indeed.
>

This is really good =)

> # 3) Two files, one database.
>
> In this thread it was proposed that we can have RDB dumps *and* the
> AOF file expressing what changed since the RDB file, so not just a
> single file. So you could have:
>
> dump.rdb
> appendonly.aof
>
> That combined together would form the final dataset. I think that
> while this allows some optimization it is a somewhat dangerous path
> practically speaking and for "operations" guys. Either we need to make
> things more complex and reference each file inside each other in some
> way, like taking checksums, or if for an error a server is restarted
> with a mismatching set of files something really bad can happen.
>

Sorry for the long comment here, but I hope it adds something to the
discussion.

Having separate files is not necessarily more complicated IMHO. With
this system, the RDB snapshots would work (essentially) the same
regardless of whether AOF is used or not and as long as you have your
RDB-file you will always be able to restore a snapshot of your data (so
this is an excellent candidate for backups). If you always shut down
Redis cleanly, then you actually *never* need anything else.

If the AOF files happen to be around, then they will be replayed
automatically after loading the RDB (the RDB references the AOF, which I
think can be made robust enough easily). The AOF will only add crash
recovery on top of the RDB-snapshots, though, which is a very clean
separation IMO. The RDB snapshots would be the primary persistence
method and AOF would only do some optional logging in addition to that
(if enabled). The sysadmin would usually only need to deal with the RDBs
(and there will be no difference between taking a snapshot by doing a
BGSAVE or a BGREWRITEAOF, which simplifies the concepts a little). Redis
can then take care of the AOF more or less automatically, similar to how
redo logs and WALs are dealt with in other databases. The AOF is
completely optional and will be much smaller when they are separate
file(s) that only contain the differences from the last snapshot, so
they will be less of a "problem". In fact, after a clean shutdown (with
a save) the AOF could be safely deleted.

When it comes to implementation, I think it's actually very easy to
implement this using separate AOF files (see my experimental branch in
another e-mail). Especially with a segmented AOF, it's possible to
remove a lot of code. Most of the RDB-saving code is reused exactly as
is and all the AOF rewrite code can be simply deleted. No changes need
to be buffered in memory or similar either and the new code (mostly for
loading and creating new segments) is quite short and simple.

One small problem with this approach would definitely be to make sure
that Redis doesn't try to load a set of mismatching files together. One
way to add some extra safety here would be to give each AOF a random id
(stored inside the file) and mention it in each reference (so we can
check that it's not another AOF with the same filename). Mismatches
could happen if someone restores a backup of an RDB and keeps some old
AOFs around, but something like this should catch even that.
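That check could be as simple as (a sketch; the structure and field names are invented):

```python
import random

def new_segment():
    """Each AOF segment stores a random id inside the file; any reference
    to it (from an RDB or an older segment) records filename *and* id."""
    return {"id": random.getrandbits(64), "commands": []}

def safe_to_replay(reference, segment):
    # A same-named file left over from an older run won't share the id.
    return reference["aof_id"] == segment["id"]
```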

> IMHO a better approach is to have just a single file as a first step,
> probably an extension of the current RDB file that can optionally get
> commands in the protocol format (like the AOF itself) logged at the
> end, with a single utility to check if the file is same, with a single
> file extension for all the kind of Redis dumps (.rdb), and so forth.
>
> So the rewrite would work like that:
>
> 1) Start writing the RDB file in background.
> 2) Append changes in memory (or inside a file, see next sections).
> 3) Flush AOF-format changes at the end of the RDB file.
> 4) Rename into the final place.
>
> Of course the command line tools should be able to tell us easily how
> big is the RDB section, how big is the appended part, and so forth.
>

I think it's very useful to have the pure RDB-snapshot as a separate
file, so that it's easy to make backups of that (without having to
include the AOF changes). If we store both kinds, then we get some data
duplication, which means more disk space and disk I/O required. Having
to write more to the disk could affect performance a lot in some cases,
I think. It would still be an improvement, though.


> # 4) About AOF and replication link: format is different.
>
> It's important to remember that the AOF format and the replication
> link format are not compatible. We rewrite certain commands for the
> replication link, so I think that mixing the two is not ok. Maybe in
> the future, but given the sensitive nature of these changes it's better to
> move forward step by step :)
>

Agreed. Let's try to unify replication and persistence later, if that
would be useful.

> # 5) Accumulate AOF differences while rewriting: Memory or Disk?
>
> Currently we accumulate writes during the rewrite using an in-memory
> buffer. This uses memory indeed, but especially if the base format
> will be RDB, it will be generated quickly enough that not too much
> memory is used.
> It is still possible to write the difference into a file that is later
> appended to the RDB file by the parent, but my feeling is that the
> current approach using a single write(2) call is a lot less time
> consuming to perform in the main thread, leading to less latency.
>
> Btw the important thing here is: we can do that in the future, but
> for now just try to make RDB persistence as fast as possible. Maybe
> we'll realize that the in-memory buffer is good because it takes an
> amount of memory proportional to the memory already used for data
> (since RDB write time is proportional to data size), and this may
> often be a small percentage of the memory currently used.
>
> If we instead find that it's cool to provide an option to use a file
> instead of memory, we can do it later.

I don't think the memory use is the main problem. We need extra code to
manage the buffer (more code is always bad), and the buffered data ends
up being written to disk twice (disk I/O is limited, and especially
when writing a lot at once we could saturate the disk for a short
while). Overall, it's probably not a huge problem, though ;)

>
> # 6) On segmented AOF.
>
> Segmented AOF is a mess compared to a single file from an operational
> point of view. However, it offers many advantages, like the ones in
> this context, but also the ability to compact pieces of the AOF file
> offline. If the benefits are huge then it's worth it, but at this
> stage it's probably not a good idea just to save the in-memory AOF
> rewrite buffer.

Agreed, it's probably not worth it just to save the memory and the
double write. IMHO, it also simplifies the implementation and
operations, though. It does require a shift in view, however: the AOF
will no longer be used on its own, but will simply be a complement to
the RDB snapshots.

I think we can make this kind of segmented AOF work in a very automatic
and invisible way (so that nobody really needs to care much about it
being enabled or not, except for performance and durability) and without
complicating the source code needlessly. Overall, I think this would
benefit both users and the implementation.

That's just my opinion, though. I like this approach, as you probably
have noticed by now =)

> So... this is how I see this issue:
>
> 1) Move forward slowly starting with just the replacement of RDB in
> the AOF rewrite.
> 2) Keep the single-file persistence.
> 3) Use only a single file format, the RDB one, that will start always
> with an RDB dump plus an optional AOF section.
> 4) Still make the new Redis version able to read old AOF files; we'll
> no longer have 'appendonly.aof' files in the future but just RDB, so
> this will be easy.
> 5) Optimize the RDB generation in order to make it as fast as possible.

Personally, I think it would be easier and better to solve this
directly with my approach (which may need some elaboration). This is
definitely a viable alternative, however.

> Later:
>
> 6) Switch the AOF section to a binary format, so that it will be much
> better both to store and to transfer over the replication link.
>
> Conceptually it will still be command arg arg arg, but instead of
> *3\r\n...$2342.... it will be a binary encoding with a 16-bit command
> opcode, the number of arguments and the lengths as 32-bit numbers,
> followed by the data composing the arguments.
>

That's a good idea!

> # User interface
>
> How to expose this to the user? With just one persistence format we
> also want to tell a single persistence story to our users inside
> redis.conf.
>
> I think something like that will work:
>
> 1) We can retain the "save" option, which will simply force an AOF
> rewrite when the save point is triggered, generating the new RDB file.
> 2) A new option like 'rdb-append-changes yes/no' will be added to
> select whether the user wants just snapshots or snapshots with logs
> of commands.
> 3) Kill the BGREWRITEAOF command; it will just be BGSAVE. If
> rdb-append-changes is set to yes it will work as a rewrite.
> 4) Add a new redis.conf option so that on rewrites the RDB file will
> also be copied into dump.rdb.base, which is just the base file without
> any appending performed on it, so that people can still have a
> single-file point-in-time snapshot that they can copy around.

Sounds reasonable.
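Put together, the quoted proposal might look something like this in
redis.conf. Only 'rdb-append-changes' is named above; the option name
for point 4 is my invention, purely for illustration:

```
# Trigger a background rewrite (the new-style BGSAVE) after 60 seconds
# if at least 1000 keys changed, generating a fresh RDB base.
save 60 1000

# Log commands after the snapshot section ("yes"), or keep pure
# point-in-time snapshots only ("no").
rdb-append-changes yes

# Hypothetical name for point 4: also keep a copy of the bare snapshot
# as dump.rdb.base, without any appended changes, for easy single-file
# backups.
rdb-keep-base-copy yes
```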

>
> Feedbacks? Makes sense?
>
> I think that in this simple form we can have this stuff into 2.8 with
> little efforts and little bugs hopefully.
>
> We can then iterate again to improve it for 3.0.
>

It does make sense. We clearly have two candidate solutions here. I
feel that the main question is whether or not to continue saving all
related data in a single file. I think it might be time for Redis to
take the step and move to multi-file persistence (where classic RDB
snapshots and AOF logs cooperate more tightly).

I personally think it would be best to solve this with a "segmented AOF"
and separate RDB-snapshots that have references to the AOF, but either
solution would be a step forward. Maybe it would be useful to try out
these solutions outside the main branch for a while before making a
final decision? I will probably maintain my git branch for a while and
I'm interested in seeing how well that can be made to work. Feel free to
try it out, everyone (although it's highly experimental right now!).
It's completely ok if it's never included in the main Redis branch, of
course.

Those are my comments. It would be cool to hear what more people think.
Let me know if there's anything more I could do here that would be
helpful. I think improvements in this area are very interesting...

Cheers,
Hampus
