This is a very interesting discussion. My comments below.
Salvatore Sanfilippo wrote 2012-06-13 11:47:
>
> # 1) Why the RDB is not used as base type in AOF rewrites?
> So it's still one file, with identical semantics, but just the initial
> base type is binary.
>
> There are only two disadvantages with this approach (and many advantages):
>
> 1) You lost the ability to easily process the AOF with a script or alike.
> 2) When RDB and AOF are enabled you lost the advantage of having data
> encoded in two formats by two different code paths (resistance to bugs
> corrupting the DB).
>
Agreed. In particular, 2) is a potential disadvantage. It's also an
advantage, though: by using the same code path for AOF rewrites and RDB
saves, we can focus on optimizing and testing that single
implementation. There will be less code that can go wrong.
>
> The only reason this was not done before is other priorities in the
> development, and that this requires very careful design and very
> careful implementation.
>
Indeed. This needs to be done really carefully before we can trust it.
> # 2) About 2.6 AOF.
>
> The good news is that in 2.6 the AOF rewrite uses variadic commands to
> reduce the space used to generate the AOF *a lot*.
> It's not going to be as small as the RDB but to output a 10 element
> list a single RPUSH command will be used instead of 10 RPUSH commands.
>
> Not the final solution as we can improve this with RDB format indeed.
>
This is really good =)
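To illustrate the space saving, here is a rough Python sketch (not Redis
code). It encodes commands in the protocol format used by the AOF and
compares ten single-element RPUSH commands with one variadic RPUSH. The
helper name encode_command and the key "mylist" are made up for the
example.

```python
def encode_command(*args):
    """Encode one command in the Redis protocol format (as logged in the AOF)."""
    parts = [b"*%d\r\n" % len(args)]
    for arg in args:
        data = arg if isinstance(arg, bytes) else str(arg).encode()
        parts.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(parts)


items = [b"item-%d" % i for i in range(10)]

# One RPUSH per element: the command name and key are repeated ten times.
one_per_element = b"".join(
    encode_command(b"RPUSH", b"mylist", item) for item in items
)

# A single variadic RPUSH: the command name and key appear once.
variadic = encode_command(b"RPUSH", b"mylist", *items)

print(len(one_per_element), len(variadic))
```

The variadic form is shorter because the per-command overhead (argument
count, command name, key) is paid only once per list.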
> # 3) Two files, one database.
>
> In this thread it was proposed that we can have RDB dumps *and* the
> AOF file expressing what changed since the RDB file, so not just a
> single file. So you could have:
>
> dump.rdb
> appendonly.aof
>
> That combined together would form the final dataset. I think that
> while this allows some optimization it is a somewhat dangerous path
> practically speaking and for "operations" guys. Either we need to make
> things more complex and reference each file inside each other in some
> way, like taking checksums, or if for an error a server is restarted
> with a mismatching set of files something really bad can happen.
>
Sorry for the long comment here, but I hope it adds something to the 
discussion.
Having separate files is not necessarily more complicated IMHO. With 
this system, the RDB snapshots would work (essentially) the same 
regardless of whether AOF is used or not and as long as you have your 
RDB-file you will always be able to restore a snapshot of your data (so 
this is an excellent candidate for backups). If you always shut down
Redis cleanly, then you actually *never* need anything else.
If the AOF files happen to be around, then they will be replayed 
automatically after loading the RDB (the RDB references the AOF, which I 
think can be made robust enough easily). The AOF will only add crash 
recovery on top of the RDB-snapshots, though, which is a very clean 
separation IMO. The RDB snapshots would be the primary persistence 
method and AOF would only do some optional logging in addition to that 
(if enabled). The sysadmin would usually only need to deal with the RDBs 
(and there will be no difference between taking a snapshot by doing a 
BGSAVE or a BGREWRITEAOF, which simplifies the concepts a little). Redis 
can then take care of the AOF more or less automatically, similar to how 
redo logs and WALs are dealt with in other databases. The AOF is
completely optional and will be much smaller when it is kept in separate
file(s) that only contain the differences from the last snapshot, so
it will be less of a "problem". In fact, after a clean shutdown (with
a save) the AOF could be safely deleted.
When it comes to implementation, I think it's actually very easy to 
implement this using separate AOF files (see my experimental branch in 
another e-mail). Especially with a segmented AOF, it's possible to 
remove a lot of code. Most of the RDB-saving code is reused exactly as 
is and all the AOF rewrite code can be simply deleted. No changes need 
to be buffered in memory or similar either and the new code (mostly for 
loading and creating new segments) is quite short and simple.
One small problem with this approach is making sure
that Redis doesn't try to load a set of mismatching files together. One
way to add some extra safety here would be to give each AOF a random id 
(stored inside the file) and mention it in each reference (so we can 
check that it's not another AOF with the same filename). Mismatches 
could happen if someone restores a backup of an RDB and keeps some old 
AOFs around, but something like this should catch even that.
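A minimal Python sketch of that id check, just to show the idea (this is
not the code in my branch). The 16-byte id length, the function names
and the decision to ignore a mismatching AOF rather than refuse to start
are all assumptions made for the example.

```python
import os
import secrets
import tempfile

ID_LEN = 16  # assumption: a 16-byte random id stored at the start of each AOF


def create_aof(path):
    """Create an empty AOF whose header is a fresh random id; return the id."""
    aof_id = secrets.token_bytes(ID_LEN)
    with open(path, "wb") as f:
        f.write(aof_id)
    return aof_id


def aof_matches(path, expected_id):
    """True if the AOF exists and carries the id that the RDB references."""
    try:
        with open(path, "rb") as f:
            return f.read(ID_LEN) == expected_id
    except FileNotFoundError:
        return False


def choose_load_plan(referenced_id, aof_path):
    """Decide what to load, given the AOF id recorded inside the RDB."""
    if aof_matches(aof_path, referenced_id):
        return "rdb+aof"  # replay the AOF after loading the snapshot
    return "rdb-only"     # missing or mismatching AOF: do not replay it


with tempfile.TemporaryDirectory() as d:
    aof_path = os.path.join(d, "appendonly.aof")
    ref = create_aof(aof_path)  # the id the RDB would record
    matching = choose_load_plan(ref, aof_path)
    create_aof(aof_path)        # another AOF with the same filename
    mismatching = choose_load_plan(ref, aof_path)
    print(matching, mismatching)
```

The filename alone is never trusted here; only the id stored in the file
decides whether the AOF belongs to the snapshot.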
> IMHO a better approach is to have just a single file as a first step,
> probably an extension of the current RDB file that can optionally get
> commands in the protocol format (like the AOF itself) logged at the
> end, with a single utility to check if the file is same, with a single
> file extension for all the kind of Redis dumps (.rdb), and so forth.
>
> So the rewrite would work like that:
>
> 1) Start writing the RDB file in background.
> 2) Append changes in memory (or inside a file, see next sections).
> 3) Flush AOF-format changes at the end of the RDB file.
> 4) Rename into the final place.
>
> Of course the command line tools should be able to tell us easily how
> big is the RDB section, how big is the appended part, and so forth.
>
I think it's very useful to have the pure RDB-snapshot as a separate 
file, so that it's easy to make backups of that (without having to 
include the AOF changes). If we store both kinds, then we get some data 
duplication, which means more disk space and disk I/O required. Having 
to write more to the disk could affect performance a lot in some cases, 
I think. It would still be an improvement, though.
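For reference, the four rewrite steps quoted above could be sketched in
Python roughly like this. It is only an illustration under simplifying
assumptions: the snapshot is passed in as ready-made bytes (in Redis it
would be written by a background child), the accumulated changes are a
list of already encoded commands, and the function name and temp file
prefix are made up.

```python
import os
import tempfile


def rewrite(final_path, snapshot_bytes, buffered_changes):
    """Write snapshot + accumulated changes to a temp file, then rename it.

    Returns the size of the RDB section so that tools can tell where the
    appended part starts.
    """
    dir_name = os.path.dirname(os.path.abspath(final_path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, prefix="temp-rewrite-")
    with os.fdopen(fd, "wb") as f:
        f.write(snapshot_bytes)              # 1) the RDB section
        rdb_len = f.tell()
        f.write(b"".join(buffered_changes))  # 3) changes accumulated in 2)
        f.flush()
        os.fsync(f.fileno())                 # make sure data is on disk
    os.replace(tmp_path, final_path)         # 4) rename into the final place
    return rdb_len


with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "dump.rdb")
    changes = [b"*1\r\n$4\r\nPING\r\n"]      # stand-in for logged commands
    rdb_len = rewrite(target, b"FAKE-RDB-PAYLOAD", changes)
    with open(target, "rb") as f:
        written = f.read()
    print(rdb_len, len(written) - rdb_len)
```

Note that the accumulated changes end up inside the same file as the
snapshot, which is why a pure snapshot for backups would need a separate
copy.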
> # 4) About AOF and replication link: format is different.
>
> It's important to remember that the AOF format and the replication
> link is not a compatible format. We rewrite certain commands for the
> replication link, so I think that mixing the two is not ok, maybe in
> the future, but given the sensible nature of this changes is better to
> move forward step by step :)
>
Agreed. Let's try to unify replication and persistence later, if that
would be useful.
> # 5) Accumulate AOF differences while rewriting: Memory or Disk?
>
> Currently we accumulate writes during the rewrite using an in-memory
> buffer. This uses memory indeed, but especially if the base format
> will be the RDB, it is generated in a decent amount of time so not too
> much memory is used.
> It is still possible to write the difference into a file that is later
> appended in the RDB file by the parent, but my feeling is that the
> current approach using a single write(2) call is a lot less time
> consuming to perform in the main thread, leading to less latency.
>
> Btw the important thing here is: we can do that in the future, but
> just start trying to make RDB persistence as fast as possible. Maybe
> we'll realize that the in memory buffer is good because for sure it
> takes a memory that is proportional to the memory already used for
> data (since RDB write time is proportional to data size), and this may
> often be a small percentage of the memory currently used.
>
> If we instead find that it's cool to provide an option to use a file
> instead of memory, we can do it later.
I don't think the memory use is the main problem. We need source code to 
manage the buffer (more code is always bad) and the buffered data needs 
to be written twice to disk here (disk I/O is limited and especially 
when writing a lot at once we could saturate the disk for a short 
while). Overall, it's probably not a huge problem, though ;)
>
> # 6) On segmented AOF.
>
> Segmented AOF is a mess compared to a single file from an operational
> point of view, however it offers many advantages like the ones in this
> context, but also the ability to offline compact pieces of the AOF
> file. If the benefits are huge then it's worth it, but at this stage
> it's not going to be a good idea probably, just to save the in-memory
> AOF rewrite buffer.
Agreed, it's probably not worth it just to save the memory and
double-write. IMHO, it also simplifies the implementation and
operations, though. It does require a shift in perspective, however. The AOF
will no longer be used on its own, but will simply be a complement to
the RDB-snapshots.
I think we can make this kind of segmented AOF work in a very automatic 
and invisible way (so that nobody really needs to care much about it 
being enabled or not, except for performance and durability) and without 
complicating the source code needlessly. Overall, I think this would 
benefit both users and the implementation.
That's just my opinion, though. I like this approach, as you probably 
have noticed by now =)
> So... this is how I see this issue:
>
> 1) Move forward slowly starting with just the replacement of RDB in
> the AOF rewrite.
> 2) Keep the single-file persistence.
> 3) Use only a single file format, the RDB one, that will start always
> with an RDB dump plus an optional AOF section.
> 4) Still make the new Redis version able to read old AOF files, we'll
> no longer have 'appendonly.aof' files in the future but just RDB so
> this will be easy.
> 5) Optimize the RDB generation in order to make it as fast as possible.
Personally, I think it would be easier and better to directly solve this 
with my approach (which may need some elaboration). This is definitely a 
viable alternative, however.
> Later:
>
> 6) Switch the AOF section to a binary format, so that it will be much
> better to both store and transfer it to the replication link.
>
> Conceptually it will be still command arg arg arg, but instead of
> *3\r\n...$2342.... will a binary thing with like 16-bit command
> opcode, number of arguments and lengths as 32 bit numbers, followed by
> data composing the arguments.
>
That's a good idea!
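Just to make the idea concrete, one possible layout could look like the
Python sketch below. Everything specific in it is an assumption: the
opcode table, little-endian byte order, and the choice to store all
argument lengths before the argument data.

```python
import struct

OPCODES = {b"SET": 1, b"RPUSH": 2}  # hypothetical opcode table
NAMES = {code: name for name, code in OPCODES.items()}


def encode(cmd, *args):
    """16-bit opcode, 32-bit argument count, 32-bit lengths, then the data."""
    lengths = [len(a) for a in args]
    header = struct.pack("<HI%dI" % len(args), OPCODES[cmd], len(args), *lengths)
    return header + b"".join(args)


def decode(buf):
    """Inverse of encode(); returns (command name, list of arguments)."""
    opcode, argc = struct.unpack_from("<HI", buf, 0)
    pos = struct.calcsize("<HI")
    lengths = struct.unpack_from("<%dI" % argc, buf, pos)
    pos += 4 * argc
    args = []
    for n in lengths:
        args.append(buf[pos:pos + n])
        pos += n
    return NAMES[opcode], args


record = encode(b"RPUSH", b"mylist", b"a", b"b")
print(decode(record))
```

With fixed-width lengths the loader never has to scan for "\r\n", which
is where I would expect most of the parsing speedup to come from.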
> # User interface
>
> How to expose this to the user? With just a persistence format we want
> to also tell a single persistence story to our users inside
> redis.conf.
>
> I think something like that will work:
>
> 1) We can retain the "save" thing, that will simply force an AOF
> rewrite when the save point is triggered, generating the new RDB file.
> 2) a new option like 'rdb-append-changes yes/no' will be added to
> select if the user just want snapshots or snapshots with logs of
> commands.
> 3) kill the command BGREWRITEAOF that will just be BGSAVE. If
> rdb-append-changes is set to yes it will work as a rewrite.
> 4) Add a new redis.conf option so that on rewrites the RDB file will
> also be copied into dump.rdb.base that is just the base file without
> any appending performed on it, so that people still can have a
> single-file point-in-time stuff that you can copy around.
Sounds reasonable.
>
> Feedbacks? Makes sense?
>
> I think that in this simple form we can have this stuff into 2.8 with
> little efforts and little bugs hopefully.
>
> We can then iterate again to improve it for 3.0.
>
It does make sense. We clearly have two candidate solutions here. I feel 
that the main question here is whether to continue saving all related 
data in a single file or not. I think it might be time for Redis to take 
the step and move to multi-file persistence (where classic RDB snapshots 
and AOF logs cooperate more tightly).
I personally think it would be best to solve this with a "segmented AOF" 
and separate RDB-snapshots that have references to the AOF, but either 
solution would be a step forward. Maybe it would be useful to try out 
these solutions outside the main branch for a while before making a 
final decision? I will probably maintain my git branch for a while and 
I'm interested in seeing how well that can be made to work. Feel free to 
try it out, everyone (although, it's highly experimental right now!). 
It's completely ok if it's never included in the main Redis branch, of 
course.
Those are my comments. It would be cool to hear what more people think. Let
me know if there's anything more I could do here that would be helpful. 
I think improvements related to this are very interesting...
Cheers,
Hampus