NFS writes, NFS accelerators and biod

Larry McVoy

Dec 23, 1991, 2:42:20 AM

jo...@iastate.edu (John Hascall) writes:
> l...@slovax.Eng.Sun.COM (Larry McVoy) writes:
> }Careful here. Getattrs and lookups are relatively cheap, being satisfied
> }out of the server's cache most of the time. Some reads are also cheap,
> }by the same reasoning. Writes, because of their synchronous nature,
> }are very expensive. A system that is doing 8% writes will show quite
> }noticeable speedup from an NVRAM addition. Remember - a write can be
> }up to 3 synchronous disk writes.
>
> Ok, a couple more (hopefully not-too-newbie) questions...
>
> On our servers our "readhit" and "readahit" rates are right around
> 99% and 90% respectively (I'm guessing this is what you are talking
> about above for read caching? Do the getattrs & lookups come
> out of this too, or is there some other cache for them?)

You must be running a weirdo kernel - I can't find readhit in SunOS or
SVr3.2. I suspect that those vars are for pages/buffers, i.e., data.

However, if you are hitting the data that well then you are probably
hitting the inode cache too. The inode cache is what is queried for
the getattrs/lookups.

> I'm also guessing the "3 writes for 1" is an effect of "wsize=..."
> option? Ours is set to 1024, so I assume this is not happening here.

Not at all, a completely different thing. Unix has three writes to do
for one logical write (assuming UFS is the real target file system):

Push the inode to disk so that it reflects the new size, if changed.
Push the indirect block of pointers to disk after allocating.
Push the data to disk.

The first write happens when you are sequentially writing a new file.
Rewrites probably don't bother (they should to save the modtime, but
I'll bet that is "optimized" out).

The second write starts to happen for each block past the direct blocks.
The first 9 (12? I can never remember) block pointers are stored in the
inode, the next pointer points to a block of pointers, and so on. As
the file gets large, you're allocating out of that block of pointers;
that allocation needs to get saved to disk.

The third write is obvious (I hope).
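
The three cases above can be sketched as a small model. This is only an illustration, not any kernel's actual logic: the block size, the count of 12 direct pointers, and the function name are assumptions, and double/triple indirect blocks and sparse files are ignored.

```python
# Assumptions (not from the post): 8 KB blocks and 12 direct block
# pointers in the inode. Only the single indirect block is modelled.
BLOCK_SIZE = 8192
NUM_DIRECT = 12


def sync_writes(file_size: int, offset: int, length: int) -> list[str]:
    """List the synchronous disk writes one logical write may need."""
    end = offset + length
    writes = []
    if end > file_size:
        # The file grows, so the inode must reflect the new size.
        writes.append("inode")
    # Blocks 0 .. first_unallocated-1 already exist (ceiling division).
    first_unallocated = -(-file_size // BLOCK_SIZE)
    last_block = (end - 1) // BLOCK_SIZE
    if last_block >= first_unallocated and last_block >= NUM_DIRECT:
        # A new block past the direct pointers: its pointer lives in the
        # indirect block, which must also be pushed to disk.
        writes.append("indirect block")
    writes.append("data")
    return writes


print(sync_writes(0, 0, 8192))                                  # new file
print(sync_writes(12 * BLOCK_SIZE, 12 * BLOCK_SIZE, 8192))      # past direct blocks
print(sync_writes(12 * BLOCK_SIZE, 0, 8192))                    # rewrite
```

Under these assumptions a rewrite of an existing block costs one disk write, an append within the direct blocks costs two, and an append past them costs all three.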

> Also, how does one determine what are appropriate values for
> "timeo=" and "retrans="? (Ours are "timeo=20,retrans=16" -- the 20
> strikes me as being much too large!? -- a 1K ping all the way across
> our net is "round-trip (ms) min/avg/max = 11/11/15").

I never play with these but all my file systems are on my local net so
I'm no expert. I'd suggest you read Rick Macklem's Usenix paper
in the January '91 Usenix proceedings - he played around w/ this sort of
thing.
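
For a rough feel of what "timeo=20,retrans=16" means: timeo is given in tenths of a second, so 20 is a 2-second initial timeout. The sketch below assumes the client doubles the timeout after each retransmission and caps it at 60 seconds; that backoff policy and the function name are assumptions for illustration, and real clients differ in the details.

```python
def retransmit_schedule(timeo_tenths: int, retrans: int,
                        cap_seconds: float = 60.0) -> list[float]:
    """Return the wait, in seconds, before each of `retrans` retransmissions.

    Assumes exponential backoff (doubling) capped at `cap_seconds`.
    """
    waits = []
    wait = timeo_tenths / 10.0      # timeo is in tenths of a second
    for _ in range(retrans):
        waits.append(min(wait, cap_seconds))
        wait *= 2
    return waits


schedule = retransmit_schedule(20, 16)
print(schedule[:6])     # [2.0, 4.0, 8.0, 16.0, 32.0, 60.0]
print(sum(schedule))    # 722.0
```

Under those assumptions the client would keep retrying for about twelve minutes before giving up on a request. Note that the timeout has to cover the server's disk time as well as the network round trip, so the ping figure alone is not the whole story.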

> Many thanks,
> John

No sweat.
---
Larry McVoy (415) 336-7627 l...@sun.com

Larry McVoy

Dec 23, 1991, 2:48:20 AM

v...@rhyolite.wpd.sgi.com (Vernon Schryver) writes:
> > A system that is doing 8% writes will show quite
> > noticeable speedup from an NVRAM addition. Remember - a write can be
> > up to 3 synchronous disk writes.
>

> Unless you have a system on which you can turn on the Nasty, Evil,
> Non-Compliant Cheating mode, in which case a write is only 2, 1, or fewer
> asynchronous disk writes.

You know, Vernon, one day you'll get a bug report from Citibank. It
will say "I wrote my data, I fsync()ed the file, I closed it, my server
crashed and when it came back my data was gone". Then Citibank will
say that they want to return all their SGI machines because the
machines *lie* about data safety.
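
The write/fsync/close sequence in that hypothetical bug report looks roughly like the sketch below. The function name and file mode are assumptions for illustration; the point is that once fsync() returns, the application is entitled to believe the data is on stable storage, which is exactly the promise a server breaks if it acknowledges NFS writes before they reach disk.

```python
import os


def durable_write(path: str, data: bytes) -> None:
    """Write data, then fsync before close so it should survive a crash."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested; loop until done.
            written = os.write(fd, view)
            view = view[written:]
        # Ask the OS to push the file's data to stable storage.
        os.fsync(fd)
    finally:
        os.close(fd)
```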

Sun can't afford to cheat. Not like that. Sun has many weak points,
many places where they fall short of customer expectations. But data
written to disk is treated very carefully.

If you want to do fast writes, then rev the damn protocol to do
clustered writes w/ a following sync/release or sync/retransmit. It
works just fine, it's just a pain to get the client OS to hang on to
the data long enough to do the retransmit. Feel free to hack the code,
but please stop patting yourself on the back because you lie to
customers about their data.

Louis A. Mamakos

Dec 23, 1991, 1:02:56 PM

In article <15...@appserv.Eng.Sun.COM> l...@slovax.Eng.Sun.COM (Larry McVoy) writes:
>You know, Vernon, one day you'll get a bug report from Citibank. It
>will say "I wrote my data, I fsync()ed the file, I closed it, my server
>crashed and when it came back my data was gone". Then Citibank will
>say that they want to return all their SGI machines because the
>machines *lie* about data safety.
>
>Sun can't afford to cheat. Not like that. Sun has many weak points,
>many places where they fall short of customer expectations. But data
>written to disk is treated very carefully.

Is this the same Sun Microsystems that doesn't checksum the UDP
packets carrying NFS? Talk about the pot calling the kettle black...

louie

Geoff Collyer

Dec 23, 1991, 4:15:10 PM

I'm probably going to regret this, but I'd love to know Sun's reasoning
on this...

To defend Vern somewhat, why should data written by an NFS client be
treated differently than data written by an NFS server? If I run a make
on an NFS client and the server crashes, why should that be any different
than running the same make on the server and having it crash? That is,
if a Unix server crashes, I expect to lose the last 30 seconds or so of
work, and I probably will (depending on timing), if I work on the server.
Why shouldn't I expect the same data loss if I work on a client instead?

Or is Sun implicitly assuming that no one works on servers and that the
normal Unix writebehind is a bug?
--
Geoff Collyer world.std.com!geoff, uunet.uu.net!geoff

Vernon Schryver

Dec 23, 1991, 4:52:25 PM

0. "lies"? I think recent Sun press releases about NFS performance are
the less-than-truthful statements around here. But let's not air
that dirty linen. Let's save it for customer visits.

1. As I keep writing, it's an option. If you care, don't turn it on.

2. If Citibank really does use NFS for their money, then I'm glad
I don't have a Citibank bank card! NFS has nothing to do with a
genuine transaction database.

3. I don't think it's a "pat on the back." As we've hashed out until
we're all sick of it, there is no rational difference between a Legato
board in a Sun and some other brand of computer with a UPS. That only
part of the memory in one of those two has a battery is irrelevant for
"data integrity" concerns. It does materially affect the profits of
the company selling the expensive boards.

4. My personal experience with SGI and Sun machines in the last few years
is that both are more likely to lose data after it is already properly
on the disk than due to a crash before a sync. UNIX kernels are and
should be as robust as disk drives.
Disclaimer: My recent experience with Suns is less than with IRISes.

5. It would be great if Sun would get off its fundament and get out a
protocol turn that fixes the several bad bugs in the NFS protocol.
Sun's bureaucratic paralysis is destroying NFS. There are some modest
things that could be done to NFS to make it competitive with AFS.
Because Sun is so dead, NFS is itself moribund.

The absence of an fsync operation over the wire is a bug, although not
as important as others, such as the absence of atomic-create, which has
caused Sun and others so much trouble with a certain locking morass.

6. It is very hard for anyone but Sun to extend or fix a Sun protocol!
Maybe almost as hard as for Sun itself.
Yes, I know about RPC numbers. For example, you may have noticed that
Silicon Graphics has registered NIS to get it an IP multicast address.
I've been nagging Sun about that for years. No, Silicon Graphics makes
no claim about the ownership of that Class D address. We just want to
get away from the silliness of one NIS server/wire and got tired of
waiting.

Vernon Schryver, v...@sgi.com

Klaus Steinberger

Dec 24, 1991, 3:48:07 AM

v...@rhyolite.wpd.sgi.com (Vernon Schryver) writes:

> 1. As I keep writing, it's an option. If you care, don't turn it on.

I agree. I'm really happy that my MIPS box also has this option.
Many systems support it (SGI, MIPS, CDC, Convex, etc.);
only SUN doesn't.

> 2. If Citibank really does use NFS for their money, then I'm glad
> I don't have a Citibank bank card! NFS has nothing to do with a
> genuine transaction database.

I agree, a transaction system should have other checks for data integrity.

> 3. I don't think it's a "pat on the back." As we've hashed out until
> we're all sick of it, there is no rational difference between a Legato
> board in a Sun and some other brand of computer with a UPS. That only
> part of the memory in one of those two has a battery is irrelevant for
> "data integrity" concerns. It does materially affect the profits of
> the company selling the expensive boards.

We've put our server on a UPS, but data integrity was not the main reason.
It saves a lot of trouble restarting the whole bunch of systems. There are
always some holes a client can fall into when everything starts up again
after a power failure.

> on the disk than due to a crash before a sync. UNIX kernels are and
> should be as robust as disk drives.

I agree.

> The absence of an fsync operation over the wire is a bug, although not
> as important as others, such as the absence of atomic-create, which has
> caused Sun and others so much trouble with a certain locking morass.

I agree.

> 6. It is very hard for anyone but Sun to extend or fix a Sun protocol!
> Maybe almost as hard as for Sun itself.
> Yes, I know about RPC numbers. For example, you may have noticed that
> Silicon Graphics has registered NIS to get it an IP multicast address.
> I've been nagging Sun about that for years. No, Silicon Graphics makes
> no claim about the ownership of that Class D address. We just want to
> get away from the silliness of one NIS server/wire and got tired of
> waiting.

I think that's a great idea. I have two wires, and I have to maintain
more YP servers than I really need. I have two routers between my
wires, and only one of them is capable of serving NIS! IP multicast would help
a lot. Please, SUN, adopt this.

Sincerely,
Klaus Steinberger
--
Klaus Steinberger Beschleunigerlabor der TU und LMU Muenchen
Phone: (+49 89)3209 4287 Hochschulgelaende
FAX: (+49 89)3209 4280 D-8046 Garching, Germany
BITNET: K2@DGABLG5P Internet: k...@bl.physik.tu-muenchen.de

Brendan Eich

Dec 28, 1991, 3:37:51 PM

In article <k2.693564487@woodstock>, k...@bl.physik.tu-muenchen.de (Klaus Steinberger) writes:
> v...@rhyolite.wpd.sgi.com (Vernon Schryver) writes:

> > 6. It is very hard for anyone but Sun to extend or fix a Sun protocol!
> > Maybe almost as hard as for Sun itself.
> > Yes, I know about RPC numbers. For example, you may have noticed that
> > Silicon Graphics has registered NIS to get it an IP multicast address.
> > I've been nagging Sun about that for years. No, Silicon Graphics makes
> > no claim about the ownership of that Class D address. We just want to
> > get away from the silliness of one NIS server/wire and got tired of
> > waiting.
> I think that's a great idea. I have two wires, and I have to maintain
> more YP servers than I really need. I have two routers between my
> wires, and only one of them is capable of serving NIS! IP multicast would help
> a lot. Please, SUN, adopt this.

I registered an IP multicast address for Sun RPC with Joyce Reynolds at ISI
in September. It will show up in the "Assigned Numbers" RFC soon, I trust:

224.0.2.2 - SUN RPC PMAPPROC_CALLIT

The addition of a clnt_multicast() routine to the RPC library's pmap_rmt.c
file, which shares code with clnt_broadcast(); the definition of a manifest
IP address constant in <rpcsvc/pmap_prot.h>; and the few lines of change to
the portmapper are all trivial, and I'll make diffs to Sun Reference source
available if enough people request them.
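
To make the PMAPPROC_CALLIT idea concrete, here is a minimal sketch of the call message a client would put on the wire. It is not the clnt_multicast() code described above, which is not shown in this thread; the helper names are made up for illustration. The constants are the standard portmapper program number (100000), version 2, procedure 5, and the UDP port 111.

```python
import struct

PMAP_PROG = 100000                  # portmapper RPC program number
PMAP_VERS = 2
PMAPPROC_CALLIT = 5                 # "call this procedure for me"
PMAP_PORT = 111
RPC_MULTICAST_GROUP = "224.0.2.2"   # address quoted in the post


def xdr_opaque(data: bytes) -> bytes:
    """XDR variable-length opaque: length, data, zero padding to 4 bytes."""
    pad = (4 - len(data) % 4) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * pad


def build_callit(xid: int, prog: int, vers: int, proc: int,
                 args: bytes = b"") -> bytes:
    """Build an ONC RPC call asking the portmapper to forward prog/vers/proc."""
    header = struct.pack(
        ">IIIIII",
        xid,
        0,                  # msg_type: CALL
        2,                  # RPC protocol version
        PMAP_PROG, PMAP_VERS, PMAPPROC_CALLIT,
    )
    auth_none = struct.pack(">II", 0, 0)    # flavor AUTH_NONE, empty body
    body = struct.pack(">III", prog, vers, proc) + xdr_opaque(args)
    return header + auth_none * 2 + body    # credential, then verifier


# Example: ask every listening portmapper to call the null procedure of
# program 100004 (ypserv), version 2.
msg = build_callit(xid=1, prog=100004, vers=2, proc=0)
print(len(msg))     # 56
```

Sending it would be a single sendto() of `msg` on a UDP socket to (RPC_MULTICAST_GROUP, PMAP_PORT); each portmapper that has joined the group forwards the call to the local service and relays the reply, just as it does for a broadcast.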

In addition to ypbind, SGI plans to use this simple, portmapper-forwarded,
UDP-multicast RPC in a printer status monitor protocol and other products.
Will Sun support it? Keep those cards and letters coming.

/be

Chuck McManis

Jan 3, 1992, 12:56:13 AM

In article <u79...@sgi.sgi.com> bre...@illyria.wpd.sgi.com (Brendan Eich) writes:
>I registered an IP multicast address for Sun RPC with Joyce Reynolds at ISI
>in September. It will show up in the "Assigned Numbers" RFC soon, I trust:
>
> 224.0.2.2 - SUN RPC PMAPPROC_CALLIT
>
>The addition of a clnt_multicast() routine to the RPC library's pmap_rmt.c
>file, which shares code with clnt_broadcast(); the definition of a manifest
>IP address constant in <rpcsvc/pmap_prot.h>; and the few lines of change to
>the portmapper are all trivial, and I'll make diffs to Sun Reference source
>available if enough people request them.

How about sending us the source? Is it a hack, or did you figure out a
reasonable paradigm for one-to-many "procedure" calls? Since you changed
the portmapper, I assume you just hacked the callit() function to do
the same things with multicasts that it does with broadcasts.

Contrary to popular opinion we aren't just sitting on our hands here
wondering what will happen next. For your info, the _new_ version of
NIS (somewhat lamely named NIS+) has a multicast address (224.0.1.8),
which of course you can only get to via IP networks. (I know most everybody
reading this is using IP, but we do like to think about those OSI
folks now and again.)

We started to put this change into rpcbind (nee portmapper) but became
concerned that doing so was really just an implementation of a
bigger version of broadcast. (what's the benefit of multicasting to
an address that _every_ machine is listening on ?) This can cause
a fairly intense load on the network when used inappropriately.
(put your YP server a couple of hops away and then partition the
network) It made much more sense to us that we start by putting
multicast addresses in individual services that could use them (such
as the name service) and then look at other uses. Ideas that have
come to mind are a multicast "gateway" portmapper, running one
per network, that could keep track of services on that net that
wanted multicast semantics. Another is to use the new nameservice
like a portmapper, registering and unregistering services with it.
This has advantages over the portmapper since there are relatively
few name servers so the multicast load is lighter, and the name
service (NIS+) allows more complex searches than the portmapper
does. (multiple keys : hypothetical example, the name
"[service=nfs,fs=/usr/man],svcs.org_dir.foo.bar." might be used to
find all hosts that had registered nfs service and were exporting
/usr/man with it.)
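
The multi-key search in that hypothetical name can be sketched as follows. This only illustrates the idea of matching on several keys at once; it is not the NIS+ API, and the function names, the in-memory table, and the host names are all made up.

```python
def parse_indexed_name(name: str) -> tuple[dict[str, str], str]:
    """Split '[k=v,k=v],table.domain.' into (search criteria, table name)."""
    if not name.startswith("["):
        return {}, name
    criteria_text, _, table = name[1:].partition("],")
    criteria = dict(pair.split("=", 1) for pair in criteria_text.split(","))
    return criteria, table


def search(rows: list[dict[str, str]], criteria: dict[str, str]) -> list[dict[str, str]]:
    """Return the rows that match every key=value pair in the criteria."""
    return [row for row in rows
            if all(row.get(key) == value for key, value in criteria.items())]


# Hypothetical registrations, as if services had registered themselves.
svcs = [
    {"host": "alpha", "service": "nfs", "fs": "/usr/man"},
    {"host": "beta", "service": "nfs", "fs": "/home"},
    {"host": "gamma", "service": "lpd", "fs": ""},
]

criteria, table = parse_indexed_name(
    "[service=nfs,fs=/usr/man],svcs.org_dir.foo.bar.")
print(table)
print([row["host"] for row in search(svcs, criteria)])
```

A portmapper can only answer "which port is program N on this host"; a query like this one selects hosts by several attributes in a single lookup.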

None of these ideas are product plans but they give you an idea
of what we've looked at and why we've rejected it.

>In addition to ypbind, SGI plans to use this simple, portmapper-forwarded,
>UDP-multicast RPC in a printer status monitor protocol and other products.
>Will Sun support it? Keep those cards and letters coming.

For the reasons above we probably wouldn't. We would be interested
in a printer status monitor service that had a multicast address for it.
In particular the System 5 line printer service hasn't had nearly the
tweaking done on it that the BSD lpr/lpd stuff has so I expect it to
be a goldmine of useful hacks for several years.

--
--Chuck McManis Sun Microsystems
uucp: {anywhere}!sun!cmcmanis BIX: <none> Internet: cmcm...@Eng.Sun.COM
These opinions are my own and no one elses, but you knew that didn't you.
"I tell you this parrot is bleeding deceased!"

Venki Swaminathan

Jan 3, 1992, 9:08:26 PM

I just saw a couple of references on multicast SunRPC on
comp.protocols.nfs - does this mean people are implementing the
multicast in IP, or are they hacking it within the RPC?
And are there any PD versions of either/both available?

thanx.

-venki

Vernon Schryver

Jan 3, 1992, 11:28:31 PM

Silicon Graphics has been shipping IGMP support (IP multicasting) for
a long time. Since people played dogfight at Interop90 using mrouted
tunneling forwarders, it must be years. That Deering multicast
code is available somewhere. At Stanford, maybe? It was integrated
into source at Berkeley, but didn't make it into the right tree or
something for the last 4.3 release I was told about.

As Brendan wrote, SGI is perfectly willing to offer his extensions to YP
(oops) NIS to anyone interested.

Vernon Schryver, v...@sgi.com