Omission of fsync in APoSD and RAMCloud


Varun Gandhi

Mar 18, 2025, 11:24:58 AM
to software-design-book
I'm re-reading APoSD as part of writing a book review and one thing that has struck me is how the POSIX file I/O API is praised as a beautiful example of a deep API.

Per APoSD's definition of "deep", I fully agree that the 5 file I/O APIs are a valid example of a "deep" interface. What is unclear to me is whether the POSIX file I/O API is supposed to be considered "good" overall in terms of reducing complexity. (The book makes the point that if, on average, a developer doesn't need to worry about something, then the complexity of that thing is low; I'm not sure whether that is meant to apply here.)

Notably, the 5 APIs presented are open, read, write, lseek and close. fsync is not one of them.
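
(As a quick refresher, here is a minimal sketch of those five calls working together to copy a file; error handling is abbreviated, and this is just an illustration, not production code.)

    /* Copy a file using only open, read, write, lseek, and close. */
    #include <fcntl.h>
    #include <unistd.h>

    int copy_file(const char *src, const char *dst) {
        int in = open(src, O_RDONLY);
        int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (in < 0 || out < 0) return -1;

        lseek(in, 0, SEEK_SET);      /* start at the beginning (a no-op here) */
        char buf[4096];
        ssize_t n;
        while ((n = read(in, buf, sizeof buf)) > 0)
            write(out, buf, n);      /* note: no fsync anywhere */

        close(in);
        close(out);                  /* data may still sit in the OS page cache */
        return n < 0 ? -1 : 0;
    }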

Without fsync, it is unclear how the listed APIs are sufficient, because they make no guarantees about durability. And fsync is notoriously hard to understand, even for people who work on databases and file systems! Some related discussions and writing:

Out of curiosity, I looked at the RAMCloud code and saw that there are 0 calls to either fsync or aio_fsync (search query). That feels at least a bit surprising, given that RAMCloud is supposed to guarantee durability (though understandable, given the complexity of using fsync correctly). Perhaps I'm missing something else in the RAMCloud code for buffer management.

I'm curious what people think about the omission of fsync, and whether its presence (and complexity) affects the assessment of the other 5 file I/O APIs. Or should it not be weighted too heavily, given that outside of database development, the average developer probably does not spend much time worrying about proper usage of fsync?

Varun

Peter Ludemann

Mar 18, 2025, 1:41:24 PM
to software-design-book
On Tuesday, March 18, 2025 at 8:24:58 AM UTC-7 varung...@gmail.com wrote:
I'm re-reading APoSD as part of writing a book review and one thing that has struck me is how the POSIX file I/O API is praised as a beautiful example of a deep API.

As soon as you use the standard C library for I/O (which is essentially a more portable version of POSIX) on top of POSIX, life becomes more interesting. :)
E.g.: setvbuf(), fflush().
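
(A sketch of those two layers, using a hypothetical helper; fflush() moves data from the stdio buffer into the kernel, and fsync() asks the kernel to push it to the device:)

    #include <stdio.h>
    #include <unistd.h>

    int write_line_durably(FILE *f, const char *line) {
        if (fprintf(f, "%s\n", line) < 0) return -1;
        if (fflush(f) != 0) return -1;        /* stdio buffer -> kernel */
        if (fsync(fileno(f)) != 0) return -1; /* kernel cache -> device */
        return 0;
    }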

And do you trust the caching "firmware" in disk drives?

John Ousterhout

Mar 19, 2025, 6:45:36 PM
to Varun Gandhi, software-design-book
Hi Varun,

As you point out, there are other file-related system calls besides the 5 I listed in APOSD. In addition to fsync, there are specialized calls for networking such as sendmsg, and no doubt others besides these. I omitted fsync and the others because they are rarely needed. With just the 5 calls I listed you can do an enormous number of things.

I agree that if you want to build highly reliable systems, things get more complex. Fortunately, only a small number of people need to do this, so the additional complexity doesn't impact the vast majority of people who use the POSIX I/O APIs. So yes, this is a place where I'd argue that if a developer doesn't need to be aware of something, then the thing doesn't create complexity for the developer.

As for RAMCloud, it's been a while since I last worked on it so my memory is a little fuzzy, but I believe that the reason fsync isn't needed is because everything written to disk is replicated on multiple machines. RAMCloud must be able to withstand the loss of an entire machine, so recovery isn't affected by whether fsync is used or not. The only place where this matters is a datacenter-wide power outage, which could potentially affect all of the replicas. We assumed that production nodes would have enough battery backup to flush their in-memory data to disk after power is lost.

-John-


Varun Gandhi

Mar 21, 2025, 2:36:02 PM
to John Ousterhout, software-design-book
Thank you for the response John. 😄

I get your point about most people not using fsync directly, so in that sense they're not affected. However, as I see it, applications which do not use fsync are potentially at a higher risk of bugs.

For example, nowadays it's increasingly common to use "spot instances" on cloud compute which can be arbitrarily terminated in the middle of operation. If a spot instance is writing some files to a shared disk (and not interfacing through a DB), and then doing some irreversible action later (e.g. sending a network request), it's possible the network request is sent but the corresponding file was not made durable on disk if fsync was not used. This failure mode also applies to local desktop apps storing data to flat files (instead of something like SQLite), and interacting with the network.
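
(A sketch of the ordering I have in mind, with hypothetical helper names; the point is that the file must be made durable before the irreversible action:)

    #include <fcntl.h>
    #include <unistd.h>

    /* Persist buf to path; only after this succeeds is it safe to take
     * an irreversible action such as sending the network request. */
    int save_state(const char *path, const char *buf, size_t len) {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) return -1;
        if (write(fd, buf, len) != (ssize_t)len) { close(fd); return -1; }
        if (fsync(fd) != 0) { close(fd); return -1; } /* the crucial step */
        return close(fd);
    }

    /* caller:
     *     if (save_state("state.dat", buf, len) == 0)
     *         notify_server();   // hypothetical irreversible action
     */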

So in that sense, it feels like the POSIX file API design fails to create a "pit of success" the developer falls into. In Chapter 4, you've written this in the context of Java's buffered streams:

> Providing choice is good, but interfaces should be designed to make the common case as simple as possible (see the formula on page 6). Almost every user of file I/O will want buffering, so it should be provided by default. For those few situations where buffering is not desirable, the library can provide a mechanism to disable it.

This point about thinking carefully about the common case and defaults makes a lot of sense to me. But it feels like the POSIX file API (by making durability opt-in instead of opt-out) fails to satisfy this guideline.

---

For RAMCloud in particular, I believe it's still possible to have global data loss in the presence of replication when fsync is skipped, as this blog post by Redpanda shows. (The post was published in 2023, though, so about 7 years after the work on RAMCloud.)


The RAMCloud paper mentions "We assume a fail-stop model for failures [..] We do not attempt to handle Byzantine failures" which does appear to match the pre-conditions in the Redpanda blog post.

> Node crashes are the main vulnerability of running a system without fsync. However, consistent replication can withstand node crashes without compromising data consistency. So it appears that replication resolves this vulnerability and makes fsyncs unnecessary for replicated systems.
>
> [..], it is possible to minimize the likelihood of [simultaneous power loss] e.g. by using availability zones
>
> The argument presented above is a common misunderstanding. Even the loss of power on a single node, resulting in local data loss of unsynchronized data, can lead to silent global data loss in a replicated system that does not use fsync, regardless of the replication protocol in use.
>
> Note: The caveat is that most replication protocols only tolerate fail-stop faults, which means that while nodes may crash, they must have the same state (data) upon restart as they did at the moment of the crash.
>
> [Argument + examples with more details]

However, I have not analyzed Redpanda's argument carefully, and I have little expertise on fault tolerance, so I may very well be missing/misunderstanding something.

Varun

Peter Ludemann

Mar 21, 2025, 7:49:18 PM
to software-design-book
On Friday, March 21, 2025 at 11:36:02 AM UTC-7 varung...@gmail.com wrote:
Thank you for the response John. 😄

I get your point about most people not using fsync directly, so in that sense they're not affected. However, as I see it, applications which do not use fsync are potentially at a higher risk of bugs.

"Most people aren't affected" until they are ... for example, if your program crashes, you can't depend on the file output having the last write.
I've also seen plenty of code where return codes weren't checked, especially for close(). Also, if you're writing to a pipe or network, you need to take care of the case where the number of bytes written is less than the amount requested, and of how to handle EAGAIN or EINTR. And I'm pretty sure this list is not exhaustive.
(I think the C file functions take care of EINTR; but they introduce an extra layer of buffering, which can make debugging a crash even more fun.) 
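
(A sketch of the retry bookkeeping that short writes and EINTR require; a real program would also need a strategy for EAGAIN on non-blocking descriptors, e.g. polling:)

    #include <errno.h>
    #include <unistd.h>

    ssize_t write_all(int fd, const char *buf, size_t len) {
        size_t done = 0;
        while (done < len) {
            ssize_t n = write(fd, buf + done, len - done);
            if (n < 0) {
                if (errno == EINTR) continue; /* interrupted: just retry */
                return -1;                    /* real error (incl. EAGAIN here) */
            }
            done += (size_t)n;                /* short write: keep going */
        }
        return (ssize_t)done;
    }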
 

For example, nowadays it's increasingly common to use "spot instances" on cloud compute which can be arbitrarily terminated in the middle of operation. If a spot instance is writing some files to a shared disk (and not interfacing through a DB), and then doing some irreversible action later (e.g. sending a network request), it's possible the network request is sent but the corresponding file was not made durable on disk if fsync was not used. This failure mode also applies to local desktop apps storing data to flat files (instead of something like SQLite), and interacting with the network.

If your operations aren't idempotent, you're asking for trouble.
If you have multiple operations that need to be consistent, then you should use something like 2-phase commit or Chubby lock service or Paxos, or Raft, etc.

Dan Cross

Mar 21, 2025, 8:49:24 PM
to Varun Gandhi, John Ousterhout, software-design-book
On Fri, Mar 21, 2025 at 2:36 PM Varun Gandhi <varung...@gmail.com> wrote:
> Thank you for the response John. 😄
>
> I get your point about most people not using fsync directly, so in that sense they're not affected. However, as I see it, applications which do not use fsync are potentially at a higher risk of bugs.

Surely this depends on the semantics of the particular program. Not
all are concerned about ensuring durability against stable storage
(which is what `fsync` nominally gives you). Note that, for example,
in the FreeBSD bin/ source tree, the only program that uses `fsync` is
`dd`, and only when given a specific flag (the `fsync` oflag);
`fflush` only appears a handful of times, and mostly for stdout or
stderr.

Moreover, POSIX does not mandate that `fsync` is necessary after e.g.
a `write` or `pwrite`; this is up to the implementation, and one could
imagine an implementation that doesn't synchronize IO via a buffer
cache, as Unix traditionally did, and thus for which `fsync` would be
entirely superfluous.

But note that `open` can take either `O_SYNC` or `O_DSYNC` to sync all
updates or just data updates, so one can have synchronous semantics
without `fsync`.
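
(For instance, something like this hypothetical sketch: with
`O_DSYNC`, each `write` returns only after the data, though not
necessarily all metadata, has reached the device.)

    #include <fcntl.h>

    int open_sync_log(const char *path) {
        return open(path, O_WRONLY | O_CREAT | O_APPEND | O_DSYNC, 0644);
    }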

What this suggests to me is that, for most programs, it's fine to
leave actually sync'ing to the underlying storage device to the
operating system. Those that care can use `fsync`, but most don't.

> For example, nowadays it's increasingly common to use "spot instances" on cloud compute which can be arbitrarily terminated in the middle of operation. If a spot instance is writing some files to a shared disk (and not interfacing through a DB), and then doing some irreversible action later (e.g. sending a network request), it's possible the network request is sent but the corresponding file was not made durable on disk if fsync was not used. This failure mode also applies to local desktop apps storing data to flat files (instead of something like SQLite), and interacting with the network.

I think you are mentioning databases ("...not interfacing through a
DB" and "...instead of something like SQLite") because those,
generally, do the necessary dance with `fsync` or some kind of more
elaborate commit protocol to ensure data is resident on stable storage
before returning "success" for mutating operations. But note that I
only interact with them via the POSIX file API on Unix-like systems in
the crudest of manners: e.g., via a socket to some kind of server for an
RDBMS, or, in the limit, via a pipe for something like sqlite (usually
that's a library that I just link into my program). Of course, other
libraries like GDBM, BerkeleyDB, or even the venerable ndbm or dbm
libraries behave similarly.

But most programs just don't need those kinds of semantics.

> So in that sense, it feels like the POSIX file API design fails to create a "pit of success" the developer falls into. In Chapter 4, you've written this in the context of Java's buffered streams:
>
> > Providing choice is good, but interfaces should be designed to make the common case as simple as possible (see the formula on page 6). Almost every user of file I/O will want buffering, so it should be provided by default. For those few situations where buffering is not desirable, the library can provide a mechanism to disable it.
>
> This point about thinking carefully about the common case and defaults makes a lot of sense to me. But it feels like the POSIX file API (by making durability opt-in instead of opt-out) fails to satisfy this guideline.

This is predicated on the notion that all programs require the kinds
of durability semantics that are implied by using `fsync`, but NOT by
using `O_SYNC` or `O_DSYNC`. I don't think there's enough evidence
available to make that conclusion, but plenty of disconfirming
evidence. For programs that care, yes, they should use `fsync` or
something like it; for those that do not, there's no need (and the
performance disadvantage would be significant!).

- Dan C.

John Ousterhout

Mar 24, 2025, 8:32:34 PM
to Dan Cross, Varun Gandhi, software-design-book
Hi Varun,

A couple of quick responses:

On the issue of fsync, I agree with Dan: it all comes down to how often it's needed. If most developers really should be using it, then it needs to be included when thinking about the complexity of the I/O interface. You mentioned a few examples where it's needed, and I agree with those examples. But I think these are outliers. For almost all programs, losing a bit of data on a crash isn't a big enough problem to worry about syncing. I've written thousands of programs over my career, but only a handful where I felt the need for fsync. If you use fsync in most or all of your code, then fsync becomes a fundamental part of the I/O interface and so it's appropriate for you to include its complexity when thinking about the complexity of the I/O interface.

As for RAMCloud, I read over the Redpanda argument and I consider the behavior described there to be a bug. In the "Proof of impossibility" section, it describes Node {C} crashing with unsynced data, then restarting and rejoining the cluster as if it had never received that data at all. This behavior is unsafe and leads to the problems described in the posting. If a node crashes in a way that it cannot recover data it has acked, then the node must be considered permanently dead; it must not participate in the cluster any more. If a node promises that it has made data persistent, it must have actually made the data persistent!

For RAMCloud, when a node crashes (for any reason) we don't reuse any data that might have survived the crash; we recover the entire node from other replicas. When a node starts up, it can only reuse existing data if it shut down cleanly (which would include fsync). With that approach, I believe it is safe for nodes not to fsync data.
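
(A hypothetical sketch of that startup rule, with made-up names rather than the actual RAMCloud code:)

    /* Reuse on-disk data only if a "clean shutdown" marker, written
     * after an fsync during orderly shutdown, is present; otherwise
     * discard local data and recover from the replicas. */
    #include <stdio.h>
    #include <unistd.h>

    static int clean_shutdown_marker_present(const char *dir) {
        char path[512];
        snprintf(path, sizeof path, "%s/CLEAN_SHUTDOWN", dir);
        return access(path, F_OK) == 0;   /* marker file exists? */
    }

    void node_startup(const char *data_dir) {
        if (clean_shutdown_marker_present(data_dir))
            puts("clean shutdown: reusing local data");
        else
            puts("unclean shutdown: discarding local data, recovering from replicas");
    }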

-John-


Varun Gandhi

Apr 5, 2025, 3:53:38 PM
to John Ousterhout, software-design-book
Hi John,

Thanks for the detailed response. Your explanation makes sense to me.

I believe the Redpanda blog post might have been written under the additional assumption that the system itself is responsible for durability. IIUC, your point about crash recovery being possible in the absence of fsync relies on the fact that coordinator metadata is stored in an external fault-tolerant key-value store (as mentioned in Section 9 of "The RAMCloud Storage System"), so in some sense at least some of the durability in RAMCloud is externalized, which is different from the Redpanda example.

Thanks for the discussion everyone.

Varun
