AOF write+fsync in a different thread


Salvatore Sanfilippo

Sep 15, 2011, 7:06:29 AM
to Redis DB
Hi all!

one of the problems we had with AOF rewrite required closing a file
in a different thread to avoid blocking the main thread.
This fixed one instance of a wider class of bugs: latency spikes
caused by slow I/O syscalls, something you need to take into account
when writing a single-threaded server doing disk I/O.

The introduction of this abstraction, which you can find in its most
updated implementation in the file "bio.c" in the "bg-aof" github
branch, made me wonder if it was the right time to also improve the
latency of the 'every second' fsync policy for the AOF.

As you may know there are currently three different AOF fsync policies.

One is "always", which means maximum guarantees: the command is
written to the AOF before the client receives the OK status. There is
not much room for improvement here: we already cluster the writes into
a single one, performed before returning to the event loop after
executing all the pending commands. This way we are sure that no
client gets a reply before the AOF is written, but at the same time we
avoid an fsync() for every single command when there are many parallel
clients.
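
In sketch form the clustering looks more or less like this (the helper
names are invented for illustration, they are not the actual Redis
functions):

/* Sketch of the write clustering described above. Helper names are
 * invented for illustration; not the actual Redis functions. */
#include <unistd.h>

int clientHasPendingCommand(void);                            /* assumed */
size_t executeNextCommandIntoBuffer(char *dst, size_t avail); /* assumed */
void sendPendingReplies(void);                                /* assumed */

void handleEventLoopIteration(int aof_fd) {
    char buf[65536];
    size_t len = 0;

    /* 1. Execute every ready command, appending its AOF representation
     *    to an in-memory buffer; replies are buffered, not sent yet. */
    while (clientHasPendingCommand())
        len += executeNextCommandIntoBuffer(buf + len, sizeof(buf) - len);

    /* 2. One write(2) + one fsync(2) for the whole batch. */
    if (len > 0) {
        write(aof_fd, buf, len);
        fsync(aof_fd);
    }

    /* 3. Only now flush the replies: no client sees its OK before its
     *    command is on disk, yet N parallel clients share one fsync. */
    sendPendingReplies();
}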

The other extreme is to never fsync. On Linux this means by default
that the kernel will flush the buffers to disk every 30 seconds. This
is a good pick for many users, but many other users can't afford to
lose 30 seconds of AOF in case of an OS crash, blackout, and so forth.
For the "no fsync" policy there is nothing to optimize; it works pretty
well already.

The option in the middle between the above two is to fsync every
second. This is a sensible choice, as one second is a short period
compared to 30 seconds, and the performance of this fsync mode is good
enough for most users. However, when there is a disk I/O spike (which
can be caused by Redis itself, doing an .rdb BGSAVE, or by other
instances of Redis running on the same host), the fsync may become
pretty slow, and Redis may become unresponsive.

What can be done? The obvious idea is to move the fsync() to a
different thread. Unfortunately this alone is not a good idea either:
write(2) will block anyway if an fsync is in progress against the same
file. The only alternative is to move both the write and the fsync to
a different thread.

Thanks to our new bio.c abstraction, moving tasks to a different
thread is now pretty trivial after all; however, we need to deeply
understand the implications of moving the write+fsync stage to a
different thread.
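
For context, the essence of such an abstraction is just a job queue
protected by a mutex and a condition variable. A minimal sketch, much
simplified compared to the real bio.c:

/* Minimal sketch of a bio.c-style background job queue. A simplified
 * illustration, not the actual bio.c implementation. */
#include <pthread.h>
#include <stdlib.h>
#include <time.h>

typedef struct bio_job {
    time_t ctime;              /* creation time, to measure job age */
    void (*fn)(void *);        /* the slow work (write, fsync, close...) */
    void *arg;
    struct bio_job *next;
} bio_job;

static pthread_mutex_t bio_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t bio_newjob = PTHREAD_COND_INITIALIZER;
static bio_job *bio_head = NULL, *bio_tail = NULL;
static unsigned long bio_pending = 0;  /* queryable from the main thread */

void bioSubmitJob(void (*fn)(void *), void *arg) {
    bio_job *job = malloc(sizeof(*job));   /* error handling omitted */
    job->ctime = time(NULL);
    job->fn = fn;
    job->arg = arg;
    job->next = NULL;
    pthread_mutex_lock(&bio_mutex);
    if (bio_tail) bio_tail->next = job; else bio_head = job;
    bio_tail = job;
    bio_pending++;
    pthread_cond_signal(&bio_newjob);
    pthread_mutex_unlock(&bio_mutex);
}

/* Body of the single background thread: pop and run jobs forever. */
void *bioProcessBackgroundJobs(void *unused) {
    (void)unused;
    for (;;) {
        pthread_mutex_lock(&bio_mutex);
        while (bio_head == NULL)
            pthread_cond_wait(&bio_newjob, &bio_mutex);
        bio_job *job = bio_head;
        bio_head = job->next;
        if (bio_head == NULL) bio_tail = NULL;
        pthread_mutex_unlock(&bio_mutex);

        job->fn(job->arg);     /* the blocking syscall happens here,
                                * off the main thread */

        pthread_mutex_lock(&bio_mutex);
        bio_pending--;
        pthread_mutex_unlock(&bio_mutex);
        free(job);
    }
    return NULL;
}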

One obvious problem is that if the disk is slow, the background thread
may simply not be able to keep up with the rate at which we send it
write+fsync jobs: the queue of jobs may become longer and longer. This
is however an extreme case; there is also the case where the fsync
takes no more than 100 or 200 ms every second, so we are just avoiding
being unresponsive in the main thread, and we are still honoring the
contract with our users who want Redis to fsync every second.

However if we want to guarantee the fsync every second we must be sure
to completely block our main thread if we start feeling that the
background thread is not working fast enough.

Another approach, when moving the write+fsync to another thread, is
to just tell Redis: please fsync as much as you can, in a best-effort
way. This maximizes the durability that our hardware can provide, but
after reflecting on this option I don't like it too much:
unfortunately, when planning how to configure a database you need to
make tradeoffs, and a tradeoff this open-ended gives you nothing to
plan around. Some guarantee is needed...
It is still possible to say: try to fsync every second, and if you
can't fsync for 5 seconds then block the main thread. So you have a
soft and a hard limit.

This is probably a bit too much of a tweak for our users... and after
all, if we wait 5 seconds before blocking the main thread, the reality
is that our design must tolerate a 5-second hole in the AOF. So what
is the point of that?

For all these reasons... everything considered, my idea is as follows:

we keep everything as it is from the point of view of our users: the
three old policies remain.
The only difference is that we create two new background job types,
REDIS_BIO_WRITE_AOF and REDIS_BIO_FSYNC_AOF, that are used instead of
the usual write and fsync when the policy is "one second".

However, since we now have ways to ask bio.c how many pending jobs
there are for every job type, when we are about to issue the next
fsync and the previous one has not been processed yet, we block until
it is finished.
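
In code, the flush path for this policy would look more or less like
the following sketch (the job type names are the ones above, while the
helper signatures are illustrative assumptions about the bio.c API):

/* Sketch of the everysec flush with background jobs. Job type names
 * follow the description above; helper signatures are assumptions. */
enum { REDIS_BIO_WRITE_AOF, REDIS_BIO_FSYNC_AOF };
unsigned long bioPendingJobsOfType(int type);            /* assumed */
void bioWaitPendingJobsLE(int type, unsigned long n);    /* assumed */
void bioCreateBackgroundJob(int type, void *arg);        /* assumed */

void flushAppendOnlyFileEverysec(void *aof_buffer) {
    /* If the previous second's jobs were not processed yet, the disk is
     * not keeping up: block until the queue drains, so the "fsync every
     * second" contract still holds. */
    if (bioPendingJobsOfType(REDIS_BIO_FSYNC_AOF) > 0)
        bioWaitPendingJobsLE(REDIS_BIO_FSYNC_AOF, 0);    /* wait: none left */

    /* Otherwise hand both steps to the background thread and return to
     * the event loop immediately. */
    bioCreateBackgroundJob(REDIS_BIO_WRITE_AOF, aof_buffer);
    bioCreateBackgroundJob(REDIS_BIO_FSYNC_AOF, NULL);
}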

This way we provide the same hard guarantee of an fsync every second,
but in all the cases where the fsync can be performed in less than a
second we avoid blocking the server. My feeling is that users who can
relax this guarantee further should just configure the AOF to never
fsync, since the 30-second limit is reasonable and there is no data
loss even in the case of a Redis crash (the OS will flush the buffers
to the file regardless of whether the process is still alive or not);
problems are only possible in case of a power failure or an OS crash.

Comments? :)

Salvatore

--
Salvatore 'antirez' Sanfilippo
open source developer - VMware

http://invece.org
"We are what we repeatedly do. Excellence, therefore, is not an act,
but a habit." -- Aristotele

Gabriel Welsche

Sep 15, 2011, 7:34:58 AM
to redi...@googlegroups.com
Hi,

However if we want to guarantee the fsync every second we must be sure
to completely block our main thread if we start feeling that the
background thread is not working fast enough.

THIS has to be documented.

we keep everything as it is from the point of view of our users: the
three old policies remain.
I also like this "simple" approach. 

since the 30-second limit is reasonable and there is no data loss
even in the case of a Redis crash (the OS will flush the buffers to
the file regardless of whether the process is still alive or not);
problems are only possible in case of a power failure or an OS crash.
This should also be documented (http://redis.io/topics/persistence), because many users don't know!

I really like the way you develop redis. For the future: Keep it simple - Thanks!

Gabriel

Hampus Wessman

Sep 15, 2011, 10:36:46 AM
to redi...@googlegroups.com
I think this is a good and simple solution to an important problem! It
will always be possible to add more complexity later if the need arises :)

The main thing I'm missing is some way to have Redis block only writes
when the disk can't keep up. In theory we could just as well continue
serving read requests from memory at full speed. That would be really
nice, but it's probably best to wait a bit with that regardless. It's
not necessarily linked to this. I'm skipping all the details here, but I
think you get the idea (basically you would block only clients that are
waiting on disk I/O). What do you think? Perhaps something for the
(distant) future?

BR,
Hampus

Salvatore Sanfilippo

Sep 15, 2011, 10:46:37 AM
to redi...@googlegroups.com
On Thu, Sep 15, 2011 at 4:36 PM, Hampus Wessman
<hampus....@gmail.com> wrote:
> The main thing I'm missing is some way to have Redis block only writes
> when the disk can't keep up. In theory we could just as well continue
> serving read requests from memory at full speed. That would be really
> nice, but it's probably best to wait a bit with that regardless. It's
> not necessarily linked to this. I'm skipping all the details here, but I
> think you get the idea (basically you would block only clients that are
> waiting on disk I/O). What do you think? Perhaps something for the
> (distant) future?

Hello Hampus,

now I have a reference implementation of the idea I exposed, in the
aof-bg branch. It is the simplest implementation possible and
currently some bits are missing (I probably need to wait for thread
jobs in the CONFIG SET case too), but that's the essence of it.

About letting read clients work when the disk can't cope: it is
probably not possible, as even GETs can result in a DEL being written
to the AOF when a key expires. But in general, in real applications,
the same connection will often perform a number of write and read
operations to accomplish its work, so if writes are unavailable this
is likely to be a problem almost as big as both writes and reads being
unavailable.

Btw at the current stage I'm not even sure I want to merge my aof-bg
branch in the future... the reason is that I still think it would be
so much better to avoid this complexity. I'm thinking about
alternatives, and I think I'm starting to have a few ideas...

The first idea is the following: we can perform just the fsync(2) in
the background thread.
But we know write(2) would block while the fsync is in progress.
However, using the bio.c API we can check whether an AOF fsync is
currently in progress; if it is, we can simply append to the buffer
instead of writing, for up to two seconds of delay (the bio.c API
allows querying the creation time of a pending job).

If two seconds have elapsed, we block.
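
A sketch of this deferral logic (with assumed names and globals, not
the final patch):

/* Sketch of the deferred-write idea above (assumed names and globals,
 * not the final patch). Called periodically from the event loop. */
#include <time.h>
#include <unistd.h>

enum { REDIS_BIO_FSYNC_AOF };                        /* job type as above */
unsigned long bioPendingJobsOfType(int type);        /* assumed bio.c API */
void bioCreateBackgroundJob(int type, void *arg);    /* assumed bio.c API */

static char aof_buffer[1024*1024];                   /* assumed AOF buffer */
static size_t aof_buffer_len = 0;
static int aof_fd = -1;                              /* assumed open AOF fd */
static time_t aof_flush_postponed_start = 0;

void flushAppendOnlyFile(void) {
    if (aof_buffer_len == 0) return;

    if (bioPendingJobsOfType(REDIS_BIO_FSYNC_AOF) != 0) {
        /* A background fsync is in progress: write(2) would block on it.
         * Keep appending to the buffer instead, unless we have already
         * postponed the write for two seconds. */
        if (aof_flush_postponed_start == 0) {
            aof_flush_postponed_start = time(NULL);  /* start the clock */
            return;
        }
        if (time(NULL) - aof_flush_postponed_start < 2)
            return;                                  /* still within budget */
        /* Two seconds elapsed: fall through and write anyway, probably
         * blocking the main thread on the disk. */
    }
    aof_flush_postponed_start = 0;
    write(aof_fd, aof_buffer, aof_buffer_len);       /* error checks omitted */
    aof_buffer_len = 0;
    bioCreateBackgroundJob(REDIS_BIO_FSYNC_AOF, NULL); /* next bg fsync */
}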

This way we have a lot less complexity in the code compared to the
current patch, and far fewer jobs to process in the background thread
(only the fsync).

I'll probably try this approach today.

Thanks for your feedback!

Pedro Melo

Sep 15, 2011, 11:15:05 AM
to redi...@googlegroups.com
On Thu, Sep 15, 2011 at 12:34 PM, Gabriel Welsche
<gabriel...@googlemail.com> wrote:
> Hi,
>>
>> However if we want to guarantee the fsync every second we must be sure
>> to completely block our main thread if we start feeling that the
>> background thread is not working fast enough.
>>
> THIS has to be documented.

And logged, or at least stat'ed in INFO.

Bye,
--
Pedro Melo
@pedromelo
http://www.simplicidade.org/
http://about.me/melo
xmpp:me...@simplicidade.org
mailto:me...@simplicidade.org

Salvatore Sanfilippo

Sep 15, 2011, 11:17:37 AM
to redi...@googlegroups.com
Good idea for logging. As for documenting: yes, it makes sense, but
this is what already happens in Redis today, and in every other app
trying to write to the disk faster than the disk's I/O can accept.

However a warning is a good idea indeed.

Salvatore


Hampus Wessman

Sep 15, 2011, 11:22:00 AM
to redi...@googlegroups.com
Hi Salvatore,

On 09/15/2011 04:46 PM, Salvatore Sanfilippo wrote:
> On Thu, Sep 15, 2011 at 4:36 PM, Hampus Wessman
> <hampus....@gmail.com> wrote:
>> The main thing I'm missing is some way to have Redis block only writes
>> when the disk can't keep up. In theory we could just as well continue
>> serving read requests from memory at full speed. That would be really
>> nice, but it's probably best to wait a bit with that regardless. It's
>> not necessarily linked to this. I'm skipping all the details here, but I
>> think you get the idea (basically you would block only clients that are
>> waiting on disk I/O). What do you think? Perhaps something for the
>> (distant) future?
> Hello Hampus,
>
> now I have a reference implementation of the idea I exposed, in the
> aof-bg branch. It is the simplest implementation possible and
> currently some bits are missing (I probably need to wait for thread
> jobs in the CONFIG SET case too), but that's the essence of it.
>
> About letting read clients work when the disk can't cope: it is
> probably not possible, as even GETs can result in a DEL being written
> to the AOF when a key expires. But in general, in real applications,
> the same connection will often perform a number of write and read
> operations to accomplish its work, so if writes are unavailable this
> is likely to be a problem almost as big as both writes and reads
> being unavailable.

I think it would be possible, as all requests could carry on until they
have to write to the AOF and then they would get blocked until that
finishes (even some reads, but not most). It quickly gets complicated,
though.

I agree that the difference won't be large in general. It would be nice
(at least in theory), but it most likely costs more in terms of code
complexity than it's worth... Thanks for replying!

> Btw at the current stage I'm not even sure I want to merge my aof-bg
> branch in the future... the reason is that I still think it would be
> so much better to avoid this complexity. I'm thinking about
> alternatives, and I think I'm starting to have a few ideas...
>
> The first idea is the following: we can perform just the fsync(2) in
> the background thread.
> But we know write(2) would block while the fsync is in progress.
> However, using the bio.c API we can check whether an AOF fsync is
> currently in progress; if it is, we can simply append to the buffer
> instead of writing, for up to two seconds of delay (the bio.c API
> allows querying the creation time of a pending job).
>
> If two seconds have elapsed, we block.
>
> This way we have a lot less complexity in the code compared to the
> current patch, and far fewer jobs to process in the background thread
> (only the fsync).
>
> I'll probably try this approach today.

That sounds like a great idea! It would be really nice if that turns
out to work well. With such a simple solution to this, it would be even
less worth it to add my suggestion above too. I think this would be
good enough.

>
> Thanks for your feedback!
> Salvatore

Thanks for working on this!

Regards,
Hampus

Hampus Wessman

Sep 15, 2011, 11:46:23 AM
to redi...@googlegroups.com
(Slightly off topic)

Perhaps it would be useful to have some general documentation about
Redis and performance. It could explain some of the internal details
that are useful for everyone to know. Some potential problems and best
practices could be included too. There are a lot of questions about
those things. Not sure I'll have time to write something like that, but
perhaps... Would it be useful? Is there something like that already,
that I have missed?

Hampus

Salvatore Sanfilippo

Sep 15, 2011, 12:04:09 PM
to redi...@googlegroups.com
Hey Hampus, that would be hugely useful! This is a weak spot in our
doc for sure.

On the chance you'll find some time to do this, be sure that we'll
publish it on the Redis.io site in bold ;)

Salvatore


Salvatore Sanfilippo

Sep 15, 2011, 12:06:41 PM
to redi...@googlegroups.com

Thanks for the feedback!

That could turn out to be simple enough to enter 2.4, which is a very
good point, as this problem afflicts many of our users: it is not a
show stopper, but from time to time you see these latency spikes, and
they are not cool.

I'll have a test branch soon.

Cheers,

CT Radu

Sep 16, 2011, 5:01:55 AM
to Redis DB
Hello Salvatore,

What about the following idea:
- save in the background to the AOF as often as needed (1 sec, 3 sec, etc.)
- when a big .rdb BGSAVE is done (in some setups once every 30
seconds, for example), do the BGSAVE and queue the AOF save to be done
afterwards

The idea is to have a flag, or something similar, in order to avoid
two disk operations at the same time, which at the OS level get queued
anyway.
The problem with the proposed solution might be that some of the
commands in the AOF will also end up in the rdb.

Regards,
Costin Radu


Salvatore Sanfilippo

Sep 16, 2011, 10:09:04 AM
to redi...@googlegroups.com
Hello Costin, what you suggest has been implemented in Redis for a
long time and is called:

no-appendfsync-on-rewrite yes/no

It's a redis.conf flag.
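
For reference, the relevant redis.conf fragment looks like this:

appendonly yes
appendfsync everysec

# Don't fsync the AOF while a BGSAVE or AOF rewrite is in progress;
# this trades some durability for latency while the child is saving.
no-appendfsync-on-rewrite yes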

However, even if it helps, this is not enough, as in many setups disk
load is created by external entities such as other Redis instances.

Thanks for the good (even if already implemented) suggestion!
Salvatore



Salvatore Sanfilippo

Sep 16, 2011, 10:16:56 AM
to redi...@googlegroups.com

Finally we know how it is working :)

The bg-aof-2 branch implements this second idea of just moving the
fsync to a different thread and deferring the write for up to two
seconds if a background fsync is still in progress.

I tested this branch in the following setup:

- The box was my real Linux box kindly provided by VMware: a real
Linux system with a non-SSD disk and an ext4 file system.
- Redis was loaded with 5 million keys.
- A script was running in the background, continuously writing against
the server.
- redis-cli --latency was running to measure latency. An strace was
also measuring the duration of fsync calls, with "strace -f -p $(pidof
redis-server) -T -e trace=fdatasync".
- Redis was configured to use AOF, with fsync policy "everysec".

Without the bg-aof-2 patch the max latency was from 800 ms to 1.5 seconds.

With the patch the max latency I obtained was 200 ms, but often the
operation was performed with a 40 ms latency.

I'll keep working to tune this implementation and to make sure it is
error free, but it is reasonably simple and short, so I guess it can
be audited efficiently enough to enter 2.4; otherwise we'll carry this
AOF fsync latency issue in production environments for another year.

Note that with the current implementation a race is possible where we
fsync() a closed file descriptor, but this is by design, as it is
harmless and much better than blocking while waiting for the
background thread to finish its work.

I'll look for other possible races Monday.

Cheers,

Didier Spezia

Sep 17, 2011, 6:08:16 AM
to redi...@googlegroups.com
Hi,

I also like this pragmatic approach. It is definitely an improvement
over the current behavior, and safe enough material for 2.4,
so I'm all in favor of it.

Since writing is still done in the main thread in all cases, it should
be made clear to the users (in the doc) that a write can block for
plenty of reasons (and not only because an fsync is ongoing on the same
file). For instance a system-wide sync, or simply pressure on the
filesystem cache, will likely block writes as well...

By the way, is there any plan to relieve the pressure on the filesystem
cache by calling posix_fadvise in the bio thread just after the fsync? I'm
not so sure it is a good idea to evict the AOF from the cache, because it
will have to be read again at rewrite time. However, I think there would
be a benefit to doing it for the rdb fsync.

Best regards,
Didier.

Salvatore Sanfilippo

Sep 17, 2011, 8:54:25 AM
to redi...@googlegroups.com
On Sat, Sep 17, 2011 at 12:08 PM, Didier Spezia <didi...@gmail.com> wrote:

> Since writing is still done in the main thread in all cases, it should
> be made clear to the users (in the doc) that a write can block for
> plenty of reasons (and not only because an fsync is ongoing on the same
> file). For instance a system-wide sync, or simply pressure on the
> filesystem cache, will likely block writes as well...

Good idea Didier, I just started this document:

https://github.com/antirez/redis-doc/blob/master/topics/latency.md

I think you have an unmatched skill set about real world kernel
behavior and semantics here, any help on this document is very
welcome!

> By the way, is there any plan to relieve the pressure on the filesystem
> cache by calling posix_fadvise in the bio thread just after the fsync? I'm

It seems a good idea to me, but I would hope that a file continuously
accessed by writes in append-only mode is enough evidence for the
kernel to expect such a pattern. But well, a hint is not bad :)

> not so sure it is a good idea to evict the AOF from the cache, because it
> will have to be read again at rewrite time. However, I think there would
> be a benefit to doing it for the rdb fsync.

fortunately the rdb fsync is performed in the child process, so at
least in this case we don't have to block the main thread.
With rdb persistence we are in the fortunate condition of never
writing nor fsyncing in the main thread... in other words, if it were
not for fork(2), Redis would be a pretty real-time system when using a
subset of fast commands (no SORT, intersections, ...).

Thanks!
Salvatore

> Best regards,
> Didier.




Greg Andrews

Sep 17, 2011, 3:30:29 PM
to redi...@googlegroups.com
Hi Salvatore,

The changes you described to AOF make sense.  As I recall, the process of a slave attaching to a master and requesting the database via the SYNC command triggers actions in the master that duplicate (or closely imitate) the process of rewriting the AOF file.

Would you please comment on how this new method for managing AOF rewrites will affect slave SYNCs?

Many thanks to you and Pieter for your great work with Redis!

  -Greg



Hampus Wessman

Sep 18, 2011, 3:25:01 AM
to redi...@googlegroups.com
Slave syncs actually use background saves, so they won't be affected by the change. They are very similar to AOF rewrites, but they do all the disk writing in the child process. After a rewrite we also need to do some final disk operations in the main process and those are the ones being moved to a background thread (to some degree), in addition to ordinary AOF fsyncs.

Cheers,
Hampus

Didier Spezia

Sep 18, 2011, 7:43:31 AM
to redi...@googlegroups.com

Hi Salvatore,

>I think you have an unmatched skill set about real world kernel
>behavior and semantics here, any help on this document is very
>welcome!

You give me way too much credit. I hope there are not too many
real kernel hackers on this list, since they will have a good laugh ;-)

I've sent a pull request for this document.

>With rdb persistence we are in the fortunate condition of never
>writing nor fsyncing in the main thread...

Yes, but my point is that this writing and fsyncing activity done
in the background process may impact AOF writing activity
done in the foreground thread of all Redis instances running on
the same box.

From the system point of view rdb saving involves two phases.
The first phase generates plenty of writes (using default
buffering). The second phase is the fsync.

The first phase will put pressure on the filesystem cache, it
will force the system to find free memory, potentially swapping
out some memory used by the processes.

The second phase will put pressure on the I/O subsystem
since a big peak of sequential I/Os will be generated.

IMO rdb saving would benefit from larger stdio buffers (setvbuf) to
decrease the number of system calls (interesting on a VM). Calling
fflush, fsync and posix_fadvise at regular intervals at writing time
(for instance every 8 MB of written data) should reduce the pressure
on the filesystem cache (first phase) and smooth the I/O peak
(second phase).

I believe it may help to reduce latency for write operations in
other threads/processes on the system (including Redis AOF
write operations).
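
A sketch of what I mean (illustrative code with assumed constants,
not a patch):

/* Sketch of the incremental flush idea above (illustrative, not a
 * patch). Assumes rdb saving writes through stdio as described. */
#define _POSIX_C_SOURCE 200112L   /* for posix_fadvise and fdatasync */
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

#define SYNC_CHUNK (8*1024*1024)  /* sync every 8 MB of written data */

static size_t written_since_sync = 0;

void rdbWriteChunk(FILE *fp, const void *buf, size_t len) {
    fwrite(buf, 1, len, fp);      /* error handling omitted */
    written_since_sync += len;
    if (written_since_sync >= SYNC_CHUNK) {
        int fd = fileno(fp);
        fflush(fp);               /* stdio buffer -> kernel */
        fdatasync(fd);            /* kernel buffer -> disk, in small doses */
        /* Hint that we won't read this back, so the kernel can drop it
         * from the page cache instead of evicting something else. */
        posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
        written_since_sync = 0;
    }
}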

I did not really think about it, but this kind of strategy could also
be interesting for the AOF rewrite job.

Regards,
Didier.

Hampus Wessman

Sep 18, 2011, 9:10:44 AM
to redi...@googlegroups.com
Hi Salvatore,

I have looked through all the code and tested it on my system too now. I
think it looks good and it seems to work very well!

I'm also running Linux on a physical computer with ext4 on a traditional
HDD. I ran a python script in the background that was writing heavily to
the disk, while also putting some load on Redis. The whole system actually
got a bit slow and unresponsive because of all this... Redis still
continued to work well, though! As long as fsyncs finish within two
seconds it works great and you need to put a lot of load on your disks
to get it slower than that (managed to do that a few times). Blocking is
the logical thing to do in those cases, so that's just good.

I also ran it with valgrind (primarily drd and helgrind) a few times and
couldn't find any problems with the threading.

The only thing I see as a potential problem (but it's nothing new and
it's quite hard to get this perfectly right - even harder to test it) is
that some filesystems reorder writes very aggressively (e.g. ext4, even
though it has some hacks to avoid some problems people were having). It
is possible that a filesystem would write the metadata for a rename to
disk before it writes all pending data for the renamed file to disk. I
believe that we, in theory, could end up losing some data (the changes
made during a rewrite) when renaming the new AOF, because of that. There
were some heated discussions in many places about these things when ext4
was new.

Not much changed about this, as said. It's like that in unstable too and
it's unlikely to cause any serious problems. It could perhaps be good to
investigate this more in the future. Just pointing it out, because it's
somewhat related to this change. Actually, I think I have a nice and
fairly simple idea related to AOF persistence that seems to fix this as
a bonus and would work well together with this change. I'll probably
post something about that some other day...

With some changes I think bioWaitPendingJobsLE could be implemented more
efficiently by using the condition variable here too. That function is
not yet used anywhere (right?), so it doesn't really matter though.

It would be nice to log when the disk can't keep up (as someone else
also requested). Perhaps something like:
--- a/src/aof.c
+++ b/src/aof.c
@@ -101,6 +101,7 @@ void flushAppendOnlyFile(int force) {
             }
             /* Otherwise fall trough, and go write since we can't wait
              * over two seconds. */
+            redisLog(REDIS_NOTICE, "Asynchronous disk operations falling too far behind. Writing from main thread instead.");
         }
     }
     /* If you are following this code path, then we are going to write so

I also think this should be safe enough for 2.4. I haven't found any
real problems in the current implementation.

That's all from me :) Hope you don't mind the long reply. I like the
solution. Very redis-style.

Cheers,
Hampus

Salvatore Sanfilippo

Sep 19, 2011, 11:04:18 AM
to redi...@googlegroups.com
On Sun, Sep 18, 2011 at 3:10 PM, Hampus Wessman
<hampus....@gmail.com> wrote:
> Hi Salvatore,
>
> I have looked through all the code and tested it on my system too now. I
> think it looks good and it seems to work very well!

Great! Thanks

> I'm also running Linux on a physical computer with ext4 on a traditional
> HDD. I ran a python script in the background that was writing heavily to
> the disk, while also putting some load on Redis. The whole system actually
> got a bit slow and unresponsive because of all this... Redis still
> continued to work well, though! As long as fsyncs finish within two
> seconds it works great and you need to put a lot of load on your disks
> to get it slower than that (managed to do that a few times). Blocking is
> the logical thing to do in those cases, so that's just good.

Also thanks for this alternative testing, it's cool to have a few data
points; for some reason the performance of ext4 + fsync tends to vary
a lot across different Linux configurations and hardware.

> I also ran it with valgrind (primarily drd and helgrind) a few times and
> couldn't find any problems with the threading.

Great!

> The only thing I see as a potential problem (but it's nothing new and
> it's quite hard to get this perfectly right - even harder to test it) is
> that some filesystems reorder writes very aggressively (e.g. ext4, even
> though it has some hacks to avoid some problems people were having). It
> is possible that a filesystem would write the metadata for a rename to
> disk before it writes all pending data for the renamed file to disk. I
> believe that we, in theory, could end up losing some data (the changes
> made during a rewrite) when renaming the new AOF, because of that. There
> were some heated discussions in many places about these things when ext4
> was new.

Interesting, so in short we can't consider renaming a just-generated
file over an old one completely safe. But is this also true in the
case where we fsync() the newly created file before calling rename?
We do that in rewriteAppendOnlyFile().

> With some changes I think bioWaitPendingJobsLE could be implemented more
> efficiently by using the condition variable here too. That function is
> not yet used anywhere (right?), so it doesn't really matter though.

Absolutely, we are actually not using this function anymore...
removing it for now.

> It would be nice to log when the disk can't keep up (as someone else
> also requested). Perhaps something like:

Done! Thanks for everything :)

Hampus Wessman

Sep 19, 2011, 1:23:39 PM
to redi...@googlegroups.com
On 09/19/2011 05:04 PM, Salvatore Sanfilippo wrote:
> On Sun, Sep 18, 2011 at 3:10 PM, Hampus Wessman
> <hampus....@gmail.com> wrote:
>> The only thing I see as a potential problem (but it's nothing new and
>> it's quite hard to get this perfectly right - even harder to test it) is
>> that some filesystems reorder writes very aggressively (e.g. ext4, even
>> though it has some hacks to avoid some problems people were having). It
>> is possible that a filesystem would write the metadata for a rename to
>> disk before it writes all pending data for the renamed file to disk. I
>> believe that we, in theory, could end up losing some data (the changes
>> made during a rewrite) when renaming the new AOF, because of that. There
>> were some heated discussions in many places about these things when ext4
>> was new.
> Interesting, so in short we can't consider renaming a just-generated
> file over an old one completely safe. But is this also true in the
> case where we fsync() the newly created file before calling rename?
> We do that in rewriteAppendOnlyFile().

As far as I know, to first fsync() all the data and then rename() the
file is the standard solution for this, so that should be safe (and
doing it the other way around isn't completely so).

It's probably safe to do it without an fsync on ext4 by now, because
they detect common patterns like that. There may be other filesystems
not doing that, on the other hand.


This is probably far more than you want to know, but here are some more
related thoughts:

The rename() isn't synced to disk immediately either (that's why it
doesn't block) so it may not be safe to assume that the rename is
durable, without calling fsync() on the directory. See e.g.
http://stackoverflow.com/questions/3764822/how-to-durably-rename-a-file-in-posix
and
http://postgresql.1045698.n5.nabble.com/fsync-reliability-td4330289.html
(and the POSIX standard). I'm a little unsure about exactly how this
works in practice, though. Most applications seem to ignore it and
perhaps that's the best thing to do... It could perhaps, in very rare
cases, lead to both the new and old AOF existing after a crash and the
new having newer data in it (but not being loaded).
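
The full sequence from those discussions looks something like this
sketch (error handling abbreviated):

/* Sketch of the "durable replace" sequence from the links above:
 * fsync the new file before rename(2), then fsync the containing
 * directory so the rename itself survives a crash. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int replaceFileDurably(const char *tmppath, const char *path,
                       const char *dirpath) {
    int fd = open(tmppath, O_WRONLY);
    if (fd == -1) return -1;
    if (fsync(fd) == -1) { close(fd); return -1; }  /* data before rename */
    close(fd);

    if (rename(tmppath, path) == -1) return -1;     /* atomic replace */

    int dirfd = open(dirpath, O_RDONLY);            /* make rename durable */
    if (dirfd == -1) return -1;
    fsync(dirfd);
    close(dirfd);
    return 0;
}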

Then there's the problem of many disks caching data internally, so
that an fsync() isn't sufficient for crash safety. On many systems it's
best to manually disable those caches (e.g. Linux). On Mac OS X I think
you should instead use a special command in the application. See
http://www.postgresql.org/docs/9.0/static/wal-reliability.html and
http://developer.apple.com/library/mac/#documentation/Darwin/Reference/ManPages/man2/fsync.2.html
(in particular about fcntl and F_FULLFSYNC). Perhaps we should document
this too (like PostgreSQL) and possibly change what we do on OSX (at
some point)? PostgreSQL seems to use fcntl with F_FULLFSYNC on OSX.
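
A sketch of the PostgreSQL-style wrapper (F_FULLFSYNC is the fcntl
flag on OS X; elsewhere we fall back to plain fsync):

/* Sketch of an fsync wrapper in the PostgreSQL style: on OS X ask the
 * drive to flush its internal cache too, elsewhere plain fsync(2). */
#include <fcntl.h>
#include <unistd.h>

int fullFsync(int fd) {
#ifdef F_FULLFSYNC
    /* F_FULLFSYNC may fail on filesystems that don't support it;
     * fall back to fsync(2) in that case. */
    if (fcntl(fd, F_FULLFSYNC) != -1) return 0;
#endif
    return fsync(fd);
}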

Does anyone know more about these things? It's quite interesting and
could be useful. It's perfectly possible that I'm wrong about something
here, but this is how I think it works anyway.

>> It would be nice to log when the disk can't keep up (as someone else
>> also requested). Perhaps something like:
> Done! Thanks for everything :)
>
> Salvatore

No problem :)

Cheers,
Hampus

Hampus Wessman

Sep 19, 2011, 4:12:25 PM
to redi...@googlegroups.com


2011/9/19 Hampus Wessman <hampus....@gmail.com>

I have to admit, though, that those last two paragraphs would mostly affect people using 'appendfsync always', and most people don't use that, as far as I know. Others already risk losing a little data and are fine with that. The original issue that I pointed out could still be something of a problem after a rewrite that takes a lot of time (even for those running 'appendfsync everysec', as they could potentially lose a bit more data than expected). We can always try to improve these things when we find the time, but for most people I would guess that it already works more than well enough.

Regards,
Hampus
 