One of the problems we had with AOF rewrite required closing a file
in a different thread to avoid blocking the main thread.
This fixes one instance of a wider class of bugs: latency spikes
caused by slow I/O syscalls, something you need to take into
account when writing a single-threaded server doing disk I/O.
The introduction of this abstraction (you can find its most
up-to-date implementation in the file "bio.c" in the "bg-aof" GitHub
branch) made me wonder whether it was the right time to also improve
the latency of the 'every second' fsync policy for the AOF.
As you may know there are currently three different AOF fsync policies.
One is "always", which means maximum guarantees: the command is written
to the AOF before the client receives the OK status. There is not
much room for improvement here: we already cluster the writes into a
single one performed before returning to the event loop after
executing all the pending commands. This way we are sure that no
client gets a reply before the AOF is written, but at the same time
we avoid one fsync() per command when there are many parallel
clients.
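As a rough sketch of that batching (the buffer names, sizes and the flush hook are illustrative stand-ins, not the real Redis ones):

```c
#include <string.h>

/* Hypothetical sketch of the "always" policy batching: commands executed
 * during one event-loop iteration are appended to a buffer, then flushed
 * with a single write+fsync before any reply goes out. */
#define AOF_BUF_MAX 4096

static char aof_buf[AOF_BUF_MAX];
static size_t aof_buf_len = 0;
static int fsync_calls = 0;   /* counts fsyncs, to show the batching */

/* Called once per executed command: just accumulate in memory. */
void aof_append(const char *data, size_t len) {
    if (aof_buf_len + len <= AOF_BUF_MAX) {
        memcpy(aof_buf + aof_buf_len, data, len);
        aof_buf_len += len;
    }
}

/* Called once before re-entering the event loop: one write, one fsync,
 * no matter how many commands were batched in this iteration. */
void aof_flush_before_replies(void) {
    if (aof_buf_len == 0) return;
    /* In the real server: write(fd, aof_buf, aof_buf_len); fsync(fd); */
    fsync_calls++;
    aof_buf_len = 0;
}
```

So with N parallel clients served in one iteration, the cost is one fsync, not N.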
The other extreme case is to never fsync. By default on Linux this
means that the kernel will flush the buffers to disk every 30 seconds
or so. This is a good pick for many users, but many other users can't
afford to lose up to 30 seconds of AOF in case of an OS crash,
blackout, and so forth.
For fsync none there is nothing to optimize, and it works pretty well already.
The middle option between the above two is to fsync every second.
This is a sensible choice, as one second is short compared to
30 seconds, and the performance of this fsync mode is good enough for
most users. However, when there is a disk I/O spike (which can be
caused by Redis itself doing an RDB BGSAVE, or by other Redis
instances running on the same host), the fsync may start to be
pretty slow, and Redis may become unresponsive.
What can be done? The obvious idea is to move the fsync() to a different thread.
Unfortunately this alone is not a good idea either: write(2) will block
anyway if an fsync is in progress against the same file.
The only alternative is to move both the write and the fsync to a different thread.
Thanks to our new bio.c abstraction, moving tasks to a different thread
is now pretty trivial after all; however, we need to deeply understand
the implications of moving the write+fsync stage to a different
thread.
One obvious problem is that if the disk is slow, the background thread
may simply not be able to keep up with the rate at which we send it
write+fsync jobs. The queue of jobs may become longer and longer. This
is however an extreme case; there is also the case where the fsync
takes no more than 100 or 200 ms every second, so after all we are
just avoiding being unresponsive in the main thread, and we are still
honoring the contract with our user who wants Redis to fsync every
second.
However, if we want to guarantee the fsync every second, we must be
ready to completely block our main thread when we detect that the
background thread is not keeping up.
Another approach when moving the write+fsync to another thread is to
just tell Redis: please fsync as much as you can, in a best-effort way.
This maximizes the durability our hardware can provide; however,
after reflecting on this option, I don't like it too much:
unfortunately, when planning how to configure a database you need to
make some tradeoffs, and this tradeoff can't be left open-ended. Some
guarantee is needed...
It is still possible to say: try to fsync every second, and if you can't
fsync for 5 seconds then block the main thread. So you have a soft and
a hard limit.
This is probably a bit too much of a tweak for our users... and after
all, if we wait 5 seconds before blocking the main thread, the reality
is that our design must tolerate 5 seconds of missing AOF. So what is
the point?
For all these reasons... everything considered, my idea is as follows:
from the point of view of our users everything stays as it is; there
are still the old three policies.
The only difference is that we create two new background job types,
REDIS_BIO_WRITE_AOF and REDIS_BIO_FSYNC_AOF, which are used instead of
the usual write and fsync when the policy is "everysec".
However, since we now have ways to query bio.c for the number of
pending jobs of each type, we block if, when issuing our next fsync,
the previous one has still not been processed, and we wait until it
finishes.
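In pseudo-C, that guarantee could look something like this (bio_pending_jobs and the in-loop "worker finishes" step are simplified stand-ins for the real bio.c machinery, not actual Redis identifiers):

```c
/* Hypothetical sketch: before queueing the next background fsync job,
 * block until the previous one has completed, so at most one fsync is
 * ever outstanding and the every-second guarantee is preserved. */
static int pending_fsync_jobs = 0;

/* Stand-in for the bio.c query on pending jobs of a given type. */
int bio_pending_jobs(void) { return pending_fsync_jobs; }

void queue_bg_fsync(void) {
    /* Block (busy-wait here for simplicity; the real server would use a
     * blocking wait) while the previous fsync job is still unprocessed. */
    while (bio_pending_jobs() > 0) {
        pending_fsync_jobs--;   /* simulate the worker thread finishing */
    }
    pending_fsync_jobs++;       /* submit the new fsync job */
}
```

The key invariant: on return there is exactly one pending fsync job, never a growing backlog.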
This way we provide the same hard guarantee of an fsync every second,
but in all the cases where the fsync can be performed in less than a
second we avoid blocking the server. My feeling is that users who can
relax this further should just configure the AOF to never fsync, since
the 30-second limit is reasonable and there is no data loss even in
the case of a Redis crash (the OS buffers will be flushed to the file
regardless of whether the process is still alive); problems are only
possible in case of a power failure or an OS crash.
Comments? :)
Salvatore
--
Salvatore 'antirez' Sanfilippo
open source developer - VMware
http://invece.org
"We are what we repeatedly do. Excellence, therefore, is not an act,
but a habit." -- Aristotle
The main thing I'm missing is some way to have Redis block only writes
when the disk can't keep up. In theory we could just as well continue
serving read requests from memory at full speed. That would be really
nice, but it's probably best to wait a bit with that regardless. It's
not necessarily linked to this. I'm skipping all the details here, but I
think you get the idea (basically you would block only clients that are
waiting on disk I/O). What do you think? Perhaps something for the
(distant) future?
BR,
Hampus
Hello Hampus,
now I have a reference implementation of the idea I exposed, in the
aof-bg branch. It is the simplest implementation possible and
currently some bits are missing (probably I need to wait for thread
jobs in the CONFIG SET case too), but the essence is there.
About letting read clients work when the disk can't cope: it is
probably not possible, as even GETs may result in a DEL being written
to the AOF in case of an expire. But in general, in real applications,
the same connection will often perform a number of write and read
operations to accomplish its work, so if writes are unavailable this
is likely to be almost as big a problem as both writes and reads being
unavailable.
Btw, at the current stage I'm not even sure I want to merge my aof-bg
branch in the future... the reason is that I still think it would be
much better to avoid this complexity. I'm thinking about alternatives,
and I'm starting to have a few ideas...
The first idea is the following: we can perform just the fsync(2) in
the background thread.
We know write(2) would block if an fsync is in progress, but using the
bio.c API we can check whether an AOF fsync is currently in progress;
if it is, we simply append to the buffer instead of writing, for up to
two seconds of delay (the bio.c API allows querying the creation time
of the pending job).
If two seconds have elapsed, we block.
This way we have a lot less complexity in the code compared to the
current patch, and far fewer jobs to process in the background thread
(only the fsync).
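The decision could be sketched roughly like this (sync_in_progress and the job-start timestamp are stand-ins for the real bio.c queries; all names are illustrative):

```c
#include <time.h>

/* Hypothetical sketch of the "defer the write while a background fsync
 * is running" logic: postpone (buffer) the write for up to two seconds,
 * then give up and write anyway, even if that means blocking. */
static time_t fsync_job_start = 0;   /* 0 = no background fsync running */

/* Stand-in for the bio.c "is an AOF fsync in progress?" query. */
int sync_in_progress(void) { return fsync_job_start != 0; }

/* Returns 1 if the write should be postponed (append to the buffer),
 * 0 if we must write now, blocking behind the fsync if necessary. */
int should_postpone_write(time_t now) {
    if (!sync_in_progress()) return 0;        /* nothing running: write */
    if (now - fsync_job_start < 2) return 1;  /* under 2s late: buffer  */
    return 0;   /* over two seconds behind: fall through and write */
}
```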
I'll probably try this approach today.
Thanks for your feedback!
And it should be logged, or at least exposed as a stat in INFO.
Bye,
--
Pedro Melo
@pedromelo
http://www.simplicidade.org/
http://about.me/melo
xmpp:me...@simplicidade.org
mailto:me...@simplicidade.org
However a warning is a good idea indeed.
Salvatore
On 09/15/2011 04:46 PM, Salvatore Sanfilippo wrote:
> About letting read clients work when the disk can't cope: it is
> probably not possible, as even GETs may result in a DEL being written
> to the AOF in case of an expire. But in general, in real applications,
> the same connection will often perform a number of write and read
> operations to accomplish its work, so if writes are unavailable this
> is likely to be almost as big a problem as both writes and reads being
> unavailable.
I think it would be possible, as all requests could carry on until they
have to write to the AOF and then they would get blocked until that
finishes (even some reads, but not most). It quickly gets complicated,
though.
I agree that the difference won't be large in general. It would be nice
(at least in theory), but it most likely costs more in terms of code
complexity than it's worth... Thanks for replying!
> I'll probably try this approach today.
That sounds like a great idea! It would be really nice if it turns out
to work well. With such a simple solution to this, it would be even
less worth it to add my suggestion above too. I think this would be
good enough.
Thanks for working on this!
Regards,
Hampus
Perhaps some general documentation about Redis and performance would
be useful. It could explain some of the internal details that may be
useful for everyone to know. Some potential problems and best
practices could be included too. There are a lot of questions about
these things. I'm not sure I'll have time to write something like
that, but perhaps... Would it be useful? Is there something like that
already that I have missed?
Hampus
If you find some time to do this, be sure that we'll publish it on
the Redis.io site in bold ;)
Salvatore
--
Thanks for the feedback!
That could turn out to be simple enough to enter 2.4, which is a very
good point, as this problem is afflicting many of our users: it is not
a show stopper, but from time to time you see these latency spikes
that are not cool.
I'll have a test branch soon.
Cheers,
no-appendfsync-on-rewrite yes/no
It's a redis.conf flag.
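For reference, a minimal redis.conf sketch of this flag (the comment wording here is mine, not the shipped file's):

```
# When yes, skip fsyncing the AOF while a background save or AOF
# rewrite is in progress, trading durability for lower latency.
no-appendfsync-on-rewrite yes
```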
However, even if it helps, this is not enough, as in many setups disk
load is created by external entities such as other Redis instances.
Thanks for the good (even if already implemented) suggestion!
Salvatore
Finally we know how it is working :)
The bg-aof-2 branch implements this second idea of just moving the
fsync to a different thread and deferring the write for up to two
seconds if a background fsync is still in progress.
I tested this branch in the following setup:
- The box was my real Linux box kindly provided by VMware: a real
Linux system with a non-SSD disk and an ext4 file system.
- Redis was loaded with 5 million keys.
- A script was running in the background, continuously writing against
the server.
- redis-cli --latency was running to measure latency. An strace was
also measuring the duration of fsync calls, with "strace -f -p $(pidof
redis-server) -T -e trace=fdatasync".
- Redis was configured to use AOF, with fsync policy "everysec".
Without the bg-aof-2 patch the max latency ranged from 800 ms to 1.5 seconds.
With the patch the max latency I obtained was 200 ms, and often
operations completed with a latency of 40 ms.
I'll keep working to tune this implementation and to make sure it is
error free, but it is reasonably simple and short, so I guess it can
be audited efficiently enough to enter 2.4; otherwise we'll carry this
AOF fsync latency issue in production environments for another year.
Note that with the current implementation a race is possible where we
fsync() a closed file descriptor, but this is by design, as it is
harmless and much better than blocking while waiting for the
background thread to finish its work.
I'll look for other possible races on Monday.
Cheers,
> Since writing is still done in the main thread in all cases, it should
> be made clear to the users (in the doc) that write can block for plenty
> of reasons (and not only because a fsync is on-going on the same file).
> For instance a system wide sync, or simply pressure on the filesystem
> cache, will likely block write as well ...
Good idea Didier, I just started this document:
https://github.com/antirez/redis-doc/blob/master/topics/latency.md
I think you have an unmatched skill set regarding real-world kernel
behavior and semantics here, so any help on this document is very
welcome!
> By the way, is there any plan to relieve the pressure on the filesystem
> cache by calling posix_fadvise in the bio thread just after the fsync? I'm
It seems a good idea to me, but I would hope that a file continuously
accessed by append-only writes is enough evidence for the kernel to
expect such a pattern. But well, a hint is not bad :)
> not so sure it is a good idea to evict the AOF from the cache, because it
> will have to be read again at rewrite time. However, I think there would
> be a benefit to do it for the rdb fsync.
Fortunately the RDB fsync is performed in the child process, so at
least in this case we don't have to block the main thread.
With RDB persistence we are in the fortunate condition of never
writing nor fsyncing in the main thread... in other words, if it were
not for fork(2), Redis would be a pretty real-time system when using a
subset of fast commands (no SORT, intersections, ...).
Thanks!
Salvatore
> Best regards,
> Didier.
I have looked through all the code and tested it on my system too now. I
think it looks good and it seems to work very well!
I'm also running Linux on a physical computer with ext4 on a
traditional HDD. I ran a Python script in the background that was
writing heavily to the disk, while also putting some load on Redis.
The whole system actually got a bit slow and unresponsive because of
all this... Redis still continued to work well, though! As long as
fsyncs finish within two seconds it works great, and you need to put a
lot of load on your disks to make them slower than that (I managed to
do so a few times). Blocking is the logical thing to do in those
cases, so that's just good.
I also ran it with valgrind (primarily drd and helgrind) a few times and
couldn't find any problems with the threading.
The only thing I see as a potential problem (but it's nothing new,
it's quite hard to get this perfectly right, and even harder to test)
is that some filesystems reorder writes very aggressively (e.g. ext4,
even though it has some hacks to avoid some problems people were
having). It is possible that a filesystem writes the metadata for a
rename to disk before it writes all the pending data for the renamed
file. I believe that, in theory, we could end up losing some data (the
changes made during a rewrite) when renaming the new AOF because of
that. There were some heated discussions in many places about these
things when ext4 was new.
Not much has changed about this, as I said. It's like that in unstable
too, and it's unlikely to cause any serious problems. It could perhaps
be good to investigate this more in the future. I'm just pointing it
out because it's somewhat related to this change. Actually, I think I
have a nice and fairly simple idea related to AOF persistence that
seems to fix this as a bonus and would work well together with this
change. I'll probably post something about it some other day...
With some changes I think bioWaitPendingJobsLE could be implemented
more efficiently by using the condition variable here too. That
function is not yet used anywhere (right?), so it doesn't really
matter, though.
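A sketch of what that could look like (the names and structure here are illustrative, not the actual bio.c identifiers):

```c
#include <pthread.h>

/* Hypothetical sketch: instead of polling the pending-job counter, the
 * main thread sleeps on a condition variable that the worker signals
 * after completing each job. */
static pthread_mutex_t bio_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t bio_step_cond = PTHREAD_COND_INITIALIZER;
static unsigned long bio_pending = 0;

/* Worker thread calls this after completing each job. */
void bio_job_done(void) {
    pthread_mutex_lock(&bio_mutex);
    bio_pending--;
    pthread_cond_broadcast(&bio_step_cond);  /* wake any waiter */
    pthread_mutex_unlock(&bio_mutex);
}

/* Main thread: sleep until no more than 'max' jobs are pending. */
void bio_wait_pending_le(unsigned long max) {
    pthread_mutex_lock(&bio_mutex);
    while (bio_pending > max)
        pthread_cond_wait(&bio_step_cond, &bio_mutex);
    pthread_mutex_unlock(&bio_mutex);
}
```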
It would be nice to log when the disk can't keep up (as someone else
also requested). Perhaps something like:
--- a/src/aof.c
+++ b/src/aof.c
@@ -101,6 +101,7 @@ void flushAppendOnlyFile(int force) {
         }
         /* Otherwise fall trough, and go write since we can't wait
          * over two seconds. */
+        redisLog(REDIS_NOTICE, "Asynchronous disk operations falling too far behind. Writing from main thread instead.");
     }
 }
 /* If you are following this code path, then we are going to write so
I also think this should be safe enough for 2.4. I haven't found any
real problems in the current implementation.
That's all from me :) Hope you don't mind the long reply. I like the
solution. Very Redis-style.
Cheers,
Hampus
Great! Thanks
Also thanks for this alternative testing; it's cool to have a few data
points. For some reason the performance of ext4 + fsync tends to vary
a lot across different Linux configurations and hardware.
> I also ran it with valgrind (primarily drd and helgrind) a few times and
> couldn't find any problems with the threading.
Great!
Interesting. So, in short, we can't consider renaming a newly
generated file over an old one completely safe. But is this also true
if we fsync() the newly created file before calling rename? We do
that in rewriteAppendOnlyFile().
> With some changes I think bioWaitPendingJobsLE could be implemented more
> efficiently by using the condition variable here too. That function is
> not yet used anywhere (right?), so it doesn't really matter though.
Absolutely, we are actually not using this function anymore...
removing it for now.
> It would be nice to log when the disk can't keep up (as someone else
> also requested). Perhaps something like:
Done! Thanks for everything :)
As far as I know, fsync()ing all the data first and then rename()ing
the file is the standard solution for this, so that should be safe
(while doing it the other way around isn't entirely safe).
It's probably safe to do it without an fsync on ext4 by now, because
it detects common patterns like that. There may be other filesystems
that don't, on the other hand.
This is probably far more than you want to know, but here are some more
related thoughts:
The rename() isn't synced to disk immediately either (that's why it
doesn't block), so it may not be safe to assume that the rename is
durable without calling fsync() on the directory. See e.g.
http://stackoverflow.com/questions/3764822/how-to-durably-rename-a-file-in-posix
and
http://postgresql.1045698.n5.nabble.com/fsync-reliability-td4330289.html
(and the POSIX standard). I'm a little unsure about exactly how this
works in practice, though. Most applications seem to ignore it, and
perhaps that's the best thing to do... In very rare cases it could
lead to both the new and old AOF existing after a crash, with the new
one having newer data in it (but not being loaded).
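The full pattern could be sketched like this (durable_rename is a hypothetical helper for illustration, not Redis code; error handling is minimal):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Sketch of the "durable rename" pattern discussed above: fsync the new
 * file, rename() it over the old one, then fsync the containing
 * directory so that the rename itself reaches disk too. */
int durable_rename(const char *tmp, const char *dst, const char *dir) {
    int fd = open(tmp, O_RDONLY);
    if (fd == -1) return -1;
    if (fsync(fd) == -1) { close(fd); return -1; }  /* data first */
    close(fd);

    if (rename(tmp, dst) == -1) return -1;          /* atomic swap */

    int dfd = open(dir, O_RDONLY);                  /* then the dir entry */
    if (dfd == -1) return -1;
    if (fsync(dfd) == -1) { close(dfd); return -1; }
    close(dfd);
    return 0;
}
```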
Then there's the problem that many disks cache data internally, so
that an fsync() isn't sufficient for crash safety. On many systems
it's best to disable those caches manually (e.g. on Linux). On Mac
OS X I think you should instead use a special command from the
application. See
http://www.postgresql.org/docs/9.0/static/wal-reliability.html and
http://developer.apple.com/library/mac/#documentation/Darwin/Reference/ManPages/man2/fsync.2.html
(in particular about fcntl and F_FULLFSYNC). Perhaps we should
document this too (like PostgreSQL does) and possibly change what we
do on OS X (at some point)? PostgreSQL seems to use fcntl with
F_FULLFSYNC on OS X. Does anyone know more about these things? It's
quite interesting and could be useful. It's perfectly possible that
I'm wrong about something here, but this is how I think it works.
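What that could look like, sketched (full_fsync is a hypothetical wrapper, not an existing Redis function, and the fallback-to-fsync behavior is my assumption in the PostgreSQL style):

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical "full sync" wrapper: on OS X, fcntl(fd, F_FULLFSYNC)
 * asks the drive to flush its own internal cache, which a plain
 * fsync() does not guarantee there; elsewhere (or if F_FULLFSYNC
 * fails on a filesystem that doesn't support it) fall back to fsync(). */
int full_fsync(int fd) {
#ifdef F_FULLFSYNC
    if (fcntl(fd, F_FULLFSYNC) != -1) return 0;
    /* F_FULLFSYNC can fail on some filesystems: fall back below. */
#endif
    return fsync(fd) == -1 ? -1 : 0;
}
```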
>> It would be nice to log when the disk can't keep up (as someone else
>> also requested). Perhaps something like:
> Done! Thanks for everything :)
>
> Salvatore
No problem :)
Cheers,
Hampus