NFS is based on RPC. RPC provides a synchronous service. This means
that a request is sent and the response is waited for. Thus, once an
application starts an RPC to a remote server, it will wait until it has
received a response from that service. All of this implies that all
RPC requests must be made in the context of a process or a thread,
since i/o must be done to read data from or write data to a network
connection and some mechanism for blocking must be used to wait until
the response is received.
NFS WRITE requests are defined to be synchronous on the server. This
means that when a server receives a WRITE request, it will transfer all
of the data in the WRITE request and all of the modified file system
metadata, inode and such, to disk before responding to the NFS WRITE
request. Even given today's disk technologies, this is a relatively
slow operation. The use of NVRAM or Prestoserve on the server can
speed this operation by using stable nonvolatile ram as a cache for the
disk. For even a reasonably quick NFS server without NVRAM, the
response time to WRITE requests may still average 100ms.
The server will still be mostly idling during the processing of the
WRITE request. It will be waiting for i/o to the disk to complete. In
general, most file systems and disk subsystems are not designed to be
used in a synchronous manner such as this. However, to make guarantees
about recovery after an NFS server crashes, the WRITE requests must be
performed in this fashion. The server can be doing other work in the
interim, even performing other WRITE requests to the file if they are queued.
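The server-side sequence can be sketched with the analogous local system
calls. This is an illustrative sketch only; `handle_write` is an invented
name, not part of any NFS implementation:

```python
import os, tempfile

def handle_write(fd, offset, data):
    """Sketch of a server handling one NFS WRITE request synchronously:
    the data and the modified metadata must reach the disk before the
    reply may be sent back to the client."""
    written = os.pwrite(fd, data, offset)  # transfer the data
    os.fsync(fd)                           # force data and inode to disk
    return written                         # only now may the reply go out

# Demonstration against a scratch file.
fd, path = tempfile.mkstemp()
n = handle_write(fd, 0, b"hello")
os.close(fd)
```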
An NFS client will generally be asked to write data to a file on behalf
of an application. If the WRITE requests to the server are made in the
context of the application, then NFS write throughput will be very low
because the application will only be able to generate about 10 requests
per second, based on the 100ms response time. At 8k bytes per WRITE
request, the application will see throughput of about 80k bytes per
second. This is dismal, considering that the network bandwidth on an
ethernet is about 1M bytes per second and disks are generally much faster.
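The arithmetic behind those throughput numbers:

```python
# One synchronous 8 KB WRITE every 100 ms gives roughly 80 KB/s of
# application-visible throughput (integer math, no rounding surprises).
response_time_ms = 100           # average WRITE response time
bytes_per_write = 8 * 1024       # 8 KB per WRITE request
writes_per_second = 1000 // response_time_ms
throughput = writes_per_second * bytes_per_write   # bytes per second
```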
To solve this problem, the notion of asynchronous writes on the client
is implemented. In this case, asynchronous means that the NFS WRITE
requests are being generated from a different context than the
application. In the past, these different contexts were known as
biods, after the name of the process which was run to provide the
contexts. In Solaris 2.x, these contexts are known as async threads.
This is a different name and implementation for pretty much the same
thing. When the application uses the write system call to write some
data to the file, this data is copied into a kernel buffer or page, an
i/o request is queued for a biod or async thread, and then control is
returned to the application to continue running. Thus, the application
is allowed to continue doing what it is supposed to be doing, while the
data to be written is being sent to the server. If the queuing does
not fail, then no error is returned to the application, because no
error has occurred yet (though one still may). Techniques known as
"write behind" are
also used to increase performance, with respect to the application.
The i/o for partially filled blocks is not started immediately, but is
delayed in the hope that the application will fill the rest of the
block and only one NFS WRITE request will need to be made to the server.
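The write-behind idea can be sketched as a toy buffer. The class and its
bookkeeping are invented for illustration; a real client works on kernel
pages, offsets, and timers rather than a simple byte queue:

```python
class WriteBehindBuffer:
    """Toy sketch of client-side write-behind: partial blocks are held
    back in the hope that the application fills them, so one full-block
    NFS WRITE can be sent instead of several partial ones."""
    def __init__(self, block_size, send):
        self.block_size = block_size
        self.send = send          # callback that issues the NFS WRITE
        self.pending = b""

    def write(self, data):
        self.pending += data
        while len(self.pending) >= self.block_size:
            # A full block is ready: queue it for an async thread.
            self.send(self.pending[:self.block_size])
            self.pending = self.pending[self.block_size:]

    def flush(self):
        if self.pending:          # e.g. on fsync or close
            self.send(self.pending)
            self.pending = b""

sent = []
buf = WriteBehindBuffer(8, sent.append)
buf.write(b"abcd")      # partial block: delayed
buf.write(b"efghij")    # completes one block, 2 bytes left over
buf.flush()             # remainder pushed out
```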
Explicit operations such as writes made to a file opened with O_SYNC or
fsync operations are not subject to this asynchronous queuing. In
these cases, it is expected that the application wants to know that
data which has been written actually resides in the file on the disk of
the server. Thus, all WRITE requests in these two cases are made in
the context of the application itself.
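The O_SYNC case can be illustrated with the raw os-level calls: each write
blocks until the data is on stable storage, so errors surface at write time
rather than later. A minimal sketch using a scratch file:

```python
import os, tempfile

# Opening with O_SYNC forces every write to reach stable storage before
# the call returns; errors are reported immediately, in the context of
# the application itself.
fd0, path = tempfile.mkstemp()
os.close(fd0)
fd = os.open(path, os.O_WRONLY | os.O_SYNC)
try:
    n = os.write(fd, b"critical record\n")   # blocks until on disk
finally:
    os.close(fd)
```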
The semantics of the biods or async threads can be taken advantage of
to greatly increase NFS WRITE throughput. The server can make use of
the fact that multiple NFS WRITE requests may come from the client in a
very short time frame. This is due to each biod or async thread
transmitting the WRITE request to the server and then blocking to wait
for a response to be received. The next biod or async thread can then
run and so on. Techniques known as write gathering or write clustering
can be implemented in the NFS server to gather together these various
NFS WRITE requests and process them as one WRITE request to the local
file system, thus cutting overhead and greatly reducing the number of
transactions that must be made to the disk.
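Write gathering can be sketched as follows; `gather_and_commit` is a
hypothetical helper, not an actual server interface, and real servers
gather requests that arrive within a short window rather than taking a
ready-made list:

```python
import os, tempfile

def gather_and_commit(fd, requests):
    """Sketch of server-side write gathering: several queued WRITE
    requests for one file are applied together and committed with a
    single fsync, instead of one disk transaction per request."""
    for offset, data in requests:
        os.pwrite(fd, data, offset)
    os.fsync(fd)                           # one commit covers every request
    return [len(d) for _, d in requests]   # one reply per gathered WRITE

fd, path = tempfile.mkstemp()
replies = gather_and_commit(fd, [(0, b"aaaa"), (4, b"bbbb"), (8, b"cc")])
os.close(fd)
```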
There is an important difference between writing to a file which
resides on a UFS file system and one which resides on an NFS file
system. The issue is when space in the file system gets allocated to
hold the data being written. In the UFS, space is allocated
immediately during the write system call, and if none exists because
the file system is either full or the user is over his/her quota, an
error is returned. Thus, an error can be returned through the write
system call used to write the data. For NFS, the space allocation is
made when the NFS WRITE request is being processed by the server. File
system manipulation is always made by the server. The NFS client just
relays requests from applications to the NFS server. Thus, out of
space conditions can not be detected on the client until the data has
actually been transmitted to the server.
A side effect of the use of biods or async threads is that errors from
failed NFS WRITE requests are returned to the biod process or the async
thread. The application cannot see these errors directly, because the
WRITE requests were made on its behalf and the errors went to the biod
or async thread rather than to the application. To
resolve this problem, errors occurring from failed WRITE requests are
remembered and returned to the application when possible.
These stored errors can't be returned to the application in the write
system call which generated the data because that system call has
already finished by the time that the NFS WRITE has returned with an
error. Thus, another mechanism has to be used to return these errors
to the application. There are two such mechanisms available.
The first is the fsync system call. This call is used to ensure that
all data written to the file in the past has been successfully written
to the server, or if the data has not been successfully written, an
error is returned. This call is used by applications which want to be
sure that data which was previously written resides on stable storage,
whether on local stable storage or on stable storage located on the NFS server.
The second is the close system call. This call is made by all
applications, whether explicitly or not, to close all files which were
open. Most applications contain a sequence of open, write, and then
close system calls for files to be written to. The close system call
is generally the only entry point at which NFS is given control where
an error may be returned.
Thus, correctly written applications either need to fsync open files
before closing them or to pay attention to the return value from the
close system call to determine whether any intermediate write errors
may have occurred. Using the fsync system call allows the application
to do recovery if an error is returned, but is slow performance wise
and many applications are not interested in error recovery, but just
want to be able to report any errors which may have occurred. Thus,
the close system call is the usual point at which to discover errors
which may have occurred while writing to a file. Unfortunately, many
applications are not written correctly. They either do not bother to
explicitly close opened files or pay no attention to the return value
from the close system call.
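A correctly written application might look like the following sketch
(`careful_write` is an illustrative name; the point is that both fsync and
close are checked, so deferred WRITE errors cannot pass silently):

```python
import os, sys, tempfile

def careful_write(path, data):
    """Write data and report deferred errors: fsync pushes any stored
    asynchronous WRITE errors back to the caller, and close is checked
    as a last resort. Returns True only if everything was written."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    ok = True
    try:
        os.write(fd, data)
        os.fsync(fd)          # any remembered async WRITE error shows up here
    except OSError as e:
        print(f"write failed: {e}", file=sys.stderr)
        ok = False
    finally:
        try:
            os.close(fd)      # final chance for a deferred error
        except OSError as e:
            print(f"close failed: {e}", file=sys.stderr)
            ok = False
    return ok

fd0, path = tempfile.mkstemp()
os.close(fd0)
result = careful_write(path, b"payload")
```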
The problems associated with ignoring the return value from system
calls have been widely known for years. In general, ignoring return
values from system calls is a bad programming practice. In particular,
the practice of ignoring the return value from the close system
call is widely seen, but is still not a good practice. Sometimes, the
value returned via the close system call is critical to correct
operation of the application.
The editor, vi, was modified in 1985 to fsync the file being written
and to pay attention to any error that might be returned from the fsync
operation. Version 19.11 of Xemacs contains support to fsync file data
and to pay attention to the return values from both the fsync and close operations.
None of the functionality described here is new for Solaris 2.4. It
has existed in NFS implementations industry wide for the last 10+
years. Solaris 2.4 does allow more queuing, to attempt to take
advantage of larger memory systems, but the basics are still the same.
The characteristics of most or all NFS implementations that create these
situations have existed throughout the decade that NFS has been in use,
and for most long-term users and purveyors of NFS they have assumed the
status of unstated but widely understood basic assumptions. However,
especially with the growth
in usage that has occurred in recent years, it is clear that we cannot assume
that factors once widely known and understood have remained so. It is clear,
in fact, from recent events that even those who supply systems supporting NFS,
Sun included, have not been rigorous in adhering to these assumptions.
Thus, we would like to take this opportunity to describe the issue, its
ramifications, and advise of steps that can be taken to limit exposure to it.
The basic issue:
Applications which write data to a file via NFS will not
normally see some errors relating to the file system or media
at the time of the write() operation. An example
of such an error is an "out of space" or "over quota" error.
Such errors are reported sometime later, perhaps during a
later write() operation but in any event they are guaranteed
to be reported to an fsync() or close() operation upon the file.
A number of applications (perhaps many applications) either
do not invoke fsync() or do not check for an error condition
from close(). Although the failure to check for errors is not
a good programming practice, it is nevertheless a fairly common one.
Because the error is never checked, it can go unnoticed by the
application, which may lead a user to believe that files
were written when in fact they were not, thus leading
to a "silent" loss of data and the false belief that data is
intact in the newly written files.
The two characteristics of most or all NFS implementations that create this
circumstance are the application of performance-enhancing techniques such as
"write behind" and the distributed nature of NFS that means that the
definitive state of a file system can be known only by the server for the file
system. The resulting "pipeline" of file I/O operations allows applications
to continue while the I/O is still in progress.
This phenomenon need not involve NFS. Most implementations of the UNIX(R)
operating system, for instance, use similar pipelining techniques for local
file system operations, and circumstances such as media errors can also lead
to a situation in which the failure of an application to check for errors can
also result in the silent loss of data. In principle, any performance
enhancing technique such as caching, "write-behind", write clustering or any
similar action which creates asynchrony between the application and the I/O
operation is exposed to such errors -- what may differ between specific
circumstances is the number or kind of errors to which the application is
exposed. In the case of NFS, the distributed nature of the access to the file
system means that an additional possible asynchronous error is that of running
out of space on the server file system as a result of the demands imposed on
it from all the different systems using it simultaneously.
In any case, the rarity of deferred or asynchronous errors has allowed most
of us to forget such errors can occur, and thus we do not program
robustly enough. That such programming errors can be revealed by the
characteristics of services such as NFS does not make them less of a problem.
Regretfully, we have found in investigating specific reports of these problems
that some commonly used file operations utilities in Solaris and other
software products contain such programming errors. Additionally, it is also
clear that over time, a variety of circumstances have acted to erode the
protections that have made exposure to this problem rare such that it appears
more frequently, especially in environments which employ quota-checking.
Here is what we will do:
1. Although once well-known, it is clear that the requirements for
robust programming have faded over time. One intent in posting
this message is to alert those writing programs that an asynchronous
failure, though generally rare, still requires programmer attention.
We have noted that the discussion of this issue in the documentation
of our own products has eroded over time such that recent editions
do not mention it at all.
We will augment Solaris' documentation to try to systematically
improve understanding of this issue. Additionally, in our activities
with other vendors and standards bodies, we will endeavor to ensure
that the industry at large achieves this understanding so that
we all benefit from better-written programs.
2. Solaris 2.4, with its higher NFS throughput and performance, and
running on more capable hardware, has resulted in a "deepened"
pipeline of I/O operations which increases the exposure that
applications may have to getting "deferred" error reporting.
We are investigating how we can improve the heuristics and
operations of our NFS implementations and thus return the
level of exposure to that provided by earlier product editions.
3. We will repair the affected file utilities in Solaris and other
software products and make the repaired versions available to
all users of our products without charge and without regard to
the status of warranties and service agreements.
4. It is almost certain that most of the problems that are revealed
in Solaris and other SunSoft software products also exist in the
products of many suppliers. The industry holds a number of events
at which the suppliers of NFS and related technologies meet to
test and improve the quality of all implementations, and we will
use those opportunities to share our experiences so that other
users and suppliers of NFS can gain from what we have encountered.
We recommend that users examine their disk space usage and try to avoid
long-term circumstances in which they are running at or near capacity (or at
or near quota limits) as a means of improving their protection against
application failures due to deferred errors. And of course we recommend that
users in their own programming endeavor to write as robustly as they can.
We are also investigating whether we can supply a means to transparently
"correct" existing faulty applications so that they could be largely immune
from such asynchronous failures. If we are able to create such a mechanism we
will make this technology available to all users in source form, free of
charge. We are investigating this in recognition that, despite admonitions to
write robust programs, many programs from many sources fail to achieve this goal.
Finally, we should note that we believe that practical exposure to the
problems discussed here is very rare. However, because a number of customers
have noted occurrences in recent weeks, and because we have found specific
errors in our products as a result, we felt it wise to advise the community at
large of the situation and our intentions to deal with it.
Robert A. Gingell
> It has come to Sun's attention that some customers have recently noted
> problems with a number of applications in writing files via NFS.
Thanks very much for the useful information which is much appreciated.
You are to be commended on your openness, which for me at least, does
Sun much credit.
(Intel, are you listening?)
>The problems associated with ignoring the return value from system
>calls have been widely known for years. In general, ignoring return
>values from system calls is a bad programming practice. In particular,
>the practice of ignoring the the return value from the close system
>call is widely seen, but is still not a good practice. Sometimes, the
>value returned via the close system call is critical to correct
>operation of the application.
Depends. Close() used to be defined as being able to return only
two error codes: EBADF and EINTR. Checking for those is useful
as they show errors in programming logic. The semantics of close
have now changed, and it takes people a long time to get used to them.
It should also be noted that Solaris 2.x NFS code is different in many
respects from the SunOS 4.1.x code and in such a way that it badly affects
*bad* writes. In SunOS 4.1.x, I couldn't write more than 64K bytes over
quota, and the filesize after failure was shown as 0 on the client and server.
In Solaris 2.4, I was able to write 2MB(!) past my quota and assorted bad things
started happening: server performance was killed because of error messages
spewed by the kernel (alloccg messages, which really puzzled me). Stat()ing
the file on the client caused a long delay in which the client frantically
tried to flush the 2MB to the server again, probably because the attributes
on server and client conflicted. (This is in 2.4/101945-19)
It seems to me that the Solaris 2.4 client code has a long way to go
before its performance in these cases becomes acceptable. At least it
should discard all outstanding writes to a file that are past the
server's idea of EOF once it gets disk-full errors on any of those writes.
I have worked on NFS and am well aware of the problems you mention
above. Your description was wonderful! I hope people will remember
to do fsync's and check the return code.
I'm well aware of checking the return code from both fsync and close.
It seems that if a non-zero return code is returned by fsync, then the
program can inform the user and try the fsync again.
But what is the state of the dirty buffers on the client when the
close returns a non-zero return code? I'm reasonably sure this is
implementation dependent. If the close returns a non-zero value is
the file still open? If it isn't I take it there is no way to recover
from the error?
Yes, that really is my last name.