open() and ESTALE error

Andrey Alekseyev

19 Jun 2003, 05:56:24
– freebsd...@freebsd.org
Hello,

I've been trying lately to develop a solution for the problem with
open() that manifests itself in ESTALE error in the following situation:

1. NFS server: echo "1111" > file01
2. NFS client: cat file01
3. NFS server: echo "2222" > file02 && mv file02 file01
4. NFS client: cat file01 (either old file01 contents or ESTALE)

My study shows that actually the problem appears to be in VOP_ACCESS()
which is called from vn_open(). If nfs_access() decides to "go to the wire"
in #4, it then uses a cached file handle which is indeed stale. Thus,
open() eventually fails with ESTALE too (ESTALE comes from underlying
nfs_request()).

I understand all the fundamental NFS-related integrity problems, but not
this one :) That is, I see no reason for open() to fail to open a file for
reading or writing if the system knows the problem is its own. Why not
just do another lookup and try to obtain a valid file handle?

I was playing with different parts of the kernel while "fixing" this for
myself. However, I believe the simplest patch would be for
vfs_syscalls.c:open() (I've also made a working patch against vn_open(),
though).

Could anyone please be so kind as to comment on this issue?

TIA

--- kern/vfs_syscalls.c.orig	Thu Jun 19 13:22:50 2003
+++ kern/vfs_syscalls.c	Thu Jun 19 13:29:11 2003
@@ -1008,6 +1008,7 @@
 	int type, indx, error;
 	struct flock lf;
 	struct nameidata nd;
+	int stale = 0;
 
 	oflags = SCARG(uap, flags);
 	if ((oflags & O_ACCMODE) == O_ACCMODE)
@@ -1025,8 +1026,15 @@
 	 * the descriptor while we are blocked in vn_open()
 	 */
 	fhold(fp);
+again:
 	error = vn_open(&nd, flags, cmode);
 	if (error) {
+		/*
+		 * if the underlying filesystem returns ESTALE
+		 * we must have used a cached file handle.
+		 */
+		if (error == ESTALE && stale++ == 0)
+			goto again;
 		/*
 		 * release our own reference
 		 */

Andrey Alekseyev

19 Jun 2003, 10:27:37
– freebsd...@freebsd.org
Corrections:
- the patch is against STABLE
- I know the second lookup will fail if the *current* directory itself
is stale :)

> Could anyone please be so kind as to comment on this issue?

In particular, I'd like to know whether I need NDINIT before entering vn_open()
again, as there are several comments throughout the code about "struct nd"
not being safe after namei() and lookup(). However, the patch seems to
work well without NDINIT. Another thing that interests me is whether any
vnode leakage is possible when entering vn_open() a second time. It seems
to me that there is not (nd.ni_vp is vput()'d inside vn_open() in case of
any error after a successful lookup with namei()).
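
If the nameidata does turn out to be unsafe to reuse, one option would be
to simply redo NDINIT() before each retry, with the same arguments open()
already uses. A minimal sketch against 4.x-STABLE (untested):

	/*
	 * Re-initialize nd before retrying vn_open(), so namei()
	 * starts from a clean nameidata rather than whatever the
	 * failed pass left behind.
	 */
again:
	NDINIT(&nd, LOOKUP, FOLLOW, UIO_USERSPACE, SCARG(uap, path), p);
	error = vn_open(&nd, flags, cmode);
	if (error) {
		if (error == ESTALE && stale++ == 0)
			goto again;
		/* fall through to the normal error path */
	}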

Andrey Alekseyev

19 Jun 2003, 12:00:58
– freebsd...@freebsd.org
Another correction:
- the statement below is valid for a configuration where nfsaccess_cache_timeout
is generally less than acmin; otherwise, chances are the failure will come from
VOP_OPEN while it requests fresh attributes with a call to VOP_GETATTR

> which is called from vn_open(). If nfs_access() decides to "go to the wire"
> in #4, it then uses a cached file handle which is indeed stale. Thus,
> open() eventually fails with ESTALE too (ESTALE comes from underlying
> nfs_request()).

Andrey Alekseyev

19 Jun 2003, 14:31:56
– freebsd...@freebsd.org
In case anyone is interested, I wrote a "paper" for my own reference on
FreeBSD NFS open() and attribute cache behavior.
It can be found here:
http://www.blackflag.ru/patches/nfs_attr.txt

Terry Lambert

20 Jun 2003, 01:26:40
– Andrey Alekseyev, freebsd...@freebsd.org
Andrey Alekseyev wrote:
> I've been trying lately to develop a solution for the problem with
> open() that manifests itself in ESTALE error in the following situation:
>
> 1. NFS server: echo "1111" > file01
> 2. NFS client: cat file01
> 3. NFS server: echo "2222" > file02 && mv file02 file01
> 4. NFS client: cat file01 (either old file01 contents or ESTALE)
>
> My study shows that actually the problem appears to be in VOP_ACCESS()
> which is called from vn_open(). If nfs_access() decides to "go to the wire"
> in #4, it then uses a cached file handle which is indeed stale. Thus,
> open() eventually fails with ESTALE too (ESTALE comes from underlying
> nfs_request()).

The real problem here is that you know you did an operation
on the file which would break the name/nfsnode relationship,
but did not flush the cached name and nfsnode data.

A more correct solution would resync the nfsnode.

The main problem with your solution is that it doesn't work
in the case that you don't know the name of the remote file
(in which case, all you really have is a stale file handle,
with no way to unstale it).

I think this is a corner case that's probably not really
very interesting to solve. Now if you remembered the rename,
and applied your knowledge of the rename semantics to the
problem, you could replace the handle in the local nfsnode
for the file. This would not be as expensive as traversing
all of the nfsnodes, since you could use the same hash that's
used to translate a fh to a vp to get the vp.
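
As a rough sketch of the idea (variable names invented, locking and the
small-handle storage details glossed over; this is not a working patch):

	/*
	 * Find the existing nfsnode through the fh-to-vp hash, then
	 * splice in the new handle obtained from a fresh lookup RPC.
	 */
	struct nfsnode *np;

	if (nfs_nget(mp, oldfh, oldfhsize, &np) == 0) {
		bcopy(newfh, np->n_fhp, newfhsize);
		np->n_fhsize = newfhsize;
		np->n_attrstamp = 0;	/* force attributes to be refetched */
		vput(NFSTOV(np));	/* nfs_nget() returns the vnode locked */
	}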

This would fix a lot more cases than the single failure you
are fixing.

In general, though, you can't fix *any* of the cases without
introducing a vnode alias for an nfsnode that may have a local
alias already: there's no way to handle the hash collision in
that case, nor would you want to, since there's no way to deal
with the different vnodes that point to the different nfsnodes,
and have their own vm_object_t's: no matter how you look at it,
you can't replace the vnode address in the open file(s) that
point to it, so you have to ESTALE.

-- Terry

Don Lewis

20 Jun 2003, 02:18:48
– ui...@blackflag.ru, freebsd...@freebsd.org

I can't get very enthusiastic about changing the file system independent
code to fix a deficiency in the NFS implementation.

If the name of the file you are attempting to open is relative to your
current working directory, and your current working directory is nuked
on the server, vn_open will return ESTALE, and your patch above will
loop forever.

NFS really doesn't work very well if modifications are made by both a
client and the server, or by multiple clients. Solaris attempts to
compensate with a mount option:
     noac      Suppress data and attribute caching. The data
               caching that is suppressed is the write-behind.
               The local page cache is still maintained, but
               data copied into it is immediately written to
               the server.


If the rename on the server was done within the attribute validity time
on the client, vn_open() will succeed even without your patch, but you
may encounter the ESTALE error when you actually try to read or write
the file.

Unless you have some sort of locking protocol or other way of
synchronizing this sequence of operations on the client and server, the
server could do the rename while the client has the file open, after
which some I/O operation on the client will encounter ESTALE.

If the problem is that open() is failing a long time after the server
did the rename, then the best solution may be for the client to time out
file handles more aggressively. If the vnode on the client is closed,
the file handle could be timed out after acregmin/acregmax or
acdirmin/acdirmax, or a new handle timeout parameter. This may decrease
performance, but nothing is free ...
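
For instance, something along these lines in the nfs_lookup() cache-hit
path (the n_fhstamp field, the nfs_handle_timeout tunable, and the dorpc
label are all invented for the sketch):

	/*
	 * Distrust the name cache hit if the handle hasn't been
	 * confirmed by a lookup RPC recently; purge the entry and
	 * fall through to the over-the-wire lookup.
	 */
	if (time_second - np->n_fhstamp > nfs_handle_timeout) {
		cache_purge(newvp);
		goto dorpc;
	}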

Andrey Alekseyev

20 Jun 2003, 03:04:22
– Terry Lambert, freebsd...@freebsd.org
Terry,

Thanks very much for your comments, but see below.

> The real problem here is that you know you did an operation
> on the file which would break the name/nfsnode relationship,
> but did not flush the cached name and nfsnode data.

nfs_request() actually calls cache_purge() on ESTALE, and vn_open()
frees the vnode with vput() if the lookup was successful but there was
an error from the underlying filesystem (like ESTALE resulting from
nfs_request(), which is eventually called from VOP_ACCESS or VOP_OPEN).

> A more correct solution would resync the nfsnode.

I think this is exactly what happens :) Actually, I believe, I'm just
getting another namecache entry with another vnode/nfsnode/file handle.

> The main problem with your solution is that it doesn't work
> in the case that you don't know the name of the remote file
> (in which case, all you really have is a stale file handle,
> with no way to unstale it).

I think, in this case (if the file was rm'd on the server), I'll just
get ENOENT from the second vn_open() attempt, which would be more
than appropriate. A real drawback is that for a stale "current"
directory it'll take another lookup to detect "true" ESTALE.

> This would fix a lot more cases than the single failure you
> are fixing.

Actually, as I said, I played with different parts of the code to solve
this (including, nfs_open(), nfs_access(), nfs_lookup() and vn_open())
only to find the previously mentioned solution to be the simplest and
most suitable for all situations (for me!) :)

Andrey Alekseyev

20 Jun 2003, 03:17:10
– Don Lewis, freebsd...@freebsd.org
Don,

> I can't get very enthusiastic about changing the file system independent
> code to fix a deficiency in the NFS implementation.

You're right. But it appears to be hard and inefficient to fix NFS for this
(I tried, though). It would certainly require passing names below the VFS.
On the other hand, there are NFS-related functions in the VFS already. See
vfs_syscalls.c:getfh(), fhopen() and similar functions. There are things
related to NFS server in the UFS/FFS code too. So, I finally decided
that my "fix" doesn't do much harm to the above mentioned concept :)

> current working directory, and your current working directory is nuked
> on the server, vn_open will return ESTALE, and your patch above will
> loop forever.

It won't loop forever :) The "stale" integer is in there exactly for that
purpose :) In case of a stale current directory, open() will still return
ESTALE. In case of a file that was rm'd from the server, I believe
it'll return something different.

> If the rename on the server was done within the attribute validity time
> on the client, vn_open() will succeed even without your patch, but you
> may encounter the ESTALE error when you actually try to read or write
> the file.

Sure! But open() will succeed and probably you'll even be lucky to get
file contents from the cache. But that's another story, related to
attributes tuning (I have another patch for that:) However, even with
the existing FreeBSD NFS attribute cache behaviour, it's ok for me.

> server could do the rename while the client has the file open, after
> which some I/O operation on the client will encounter ESTALE.

Sure. That's perfectly understood. I'm not trying to solve all the
NFS inefficiencies related to heavily shared files.

> acdirmin/acdirmax, or a new handle timeout parameter. This may decrease
> performance, but nothing is free ...

In the normal situation, namecache entry+vnode+nfsnode+file handle may
stay cached for a really long time (until re-used? deleted or renamed
on the *client*). Expiring file handles (a new mechanism?) means much the
same to me as simply obtaining a new name cache entry+other data
on ESTALE :) I may be wrong, though.

Anyway, thanks for the comments.

See also:
http://www.blackflag.ru/patches/nfs_attr.txt

Don Lewis

20 Jun 2003, 04:08:47
– ui...@blackflag.ru, freebsd...@freebsd.org
On 20 Jun, Andrey Alekseyev wrote:

> In the normal situation, namecache entry+vnode+nfsnode+file handle may
> stay cached for a really long time (until re-used? deleted or renamed
> on the *client*). Expiring file handles (a new mechanism?) means much the
> same to me as simply obtaining a new name cache entry+other data
> on ESTALE :) I may be wrong, though.

One case where there is a difference between timing out old file handles
and just invalidating them on ESTALE:

client% cmd1 > file1; cmd2 > file2
server% mv file1 tmpfile; mv file2 file1; mv tmpfile file1

wait an hour

client% cat /dev/null > file1

If file handles are cached indefinitely, and the client didn't recycle
the vnode for file1, which file on the server got truncated? Since
neither file was deleted on the server, you can't rely on ESTALE to
detect this situation.

Question: does the timeout of the directory attributes cause open() to
do an NFS lookup on the file, or does open() just find the vnode in the
cache and use its cached handle?

Terry Lambert

20 Jun 2003, 05:08:01
– Andrey Alekseyev, freebsd...@freebsd.org
Andrey Alekseyev wrote:
> Terry,
>
> Thanks much for you comments, but see below.
>
> > The real problem here is that you know you did an operation
> > on the file which would break the name/nfsnode relationship,
> > but did not flush the cached name and nfsnode data.
>
> nfs_request() actually calls cache_purge() on ESTALE, and vn_open()
> frees vnode with vput() if a lookup was successful but there were
> an error from the underlying filesystem (like ESTALE resulting from
> nfs_request() which is eventually called from VOP_ACCESS or VOP_OPEN).

The place to correct this is probably the underlying FS. I'd
argue that getting ESTALE is a poke with a sharp stick that
makes this more likely to happen. ;^).


> > A more correct solution would resync the nfsnode.
>
> I think this is exactly what happens :) Actually, I believe, I'm just
> getting another namecache entry with another vnode/nfsnode/file handle.

You can't have this for other reasons; specifically, if you have
the file open at the time of the rename, and it becomes a ".#nfs..."
file (or whatever) on the server.


> > The main problem with your solution is that it doesn't work
> > in the case that you don't know the name of the remote file
> > (in which case, all you really have is a stale file handle,
> > with no way to unstale it).
>
> I think, in this case (if the file was rm'd on the server), I'll just
> get ENOENT from the second vn_open() attempt, which would be more
> than appropriate. A real drawback is that for a stale "current"
> directory it'll take another lookup to detect "true" ESTALE.

This is more a problem in the ESTALE handling. In the case where
you are doing a lookup and get an ESTALE, it's probably correct
to translate it based on the semantics you are expecting in the
upper layer.

The problem here is that a given VOP can be called from multiple
system call implementations, and a given system call implementation
can call multiple VOPs to implement its functionality. This means
that you'd have to model the system call layer state machine within
the filesystem itself in order to return the "expected" error for
every possible case. This isn't a reasonable thing to expect.


> > This would fix a lot more cases than the single failure you
> > are fixing.
>
> Actually, as I said, I played with different parts of the code to solve
> this (including, nfs_open(), nfs_access(), nfs_lookup() and vn_open())
> only to find the previously mentioned solution to be the simpliest and
> most suitable for all situations (for me!) :)

Don Lewis has a good posting in response to you; you will likely
have read it before you read this response, so feel free to not
respond directly to this point.

Don points out that Solaris tries to fix this via the "noac" mount
option for client NFS.

What his quote:

     noac      Suppress data and attribute caching. The data
               caching that is suppressed is the write-behind.
               The local page cache is still maintained, but
               data copied into it is immediately written to
               the server.

hints at, but doesn't come right out and say, is that the cache
is flushed on write operations ("the data caching that is suppressed
is write-behind"). What this means practically, in terms of the
implementation of the NFS client code, is that everywhere there is
a client triggered change of state for metadata in the server that
could result in an ESTALE, the client cached information is flushed
out and has to be reacquired.

If this were happening in the NFS client today, then your rename
would not end up giving you an ESTALE, because the stale data would
have been discarded.

I'd also like to point out the following case:

{ A, B }
fd1 open on B
rename B -> C
rename A -> B

In this case, the FH in question would still work for B. What would
happen if it were:

{ A, B, C }
fd1 open on B
fd2 open on C
rename B -> C
rename A -> B

? With your patch, I think we would potentially convert fd2 to point
to B when it really *should* be "ESTALE", which is wrong (think in
terms of 2 or more clients doing the operations).

-- Terry

Terry Lambert

20 Jun 2003, 05:15:59
– Don Lewis, freebsd...@freebsd.org, ui...@blackflag.ru
Don Lewis wrote:
> > +again:
> >  	error = vn_open(&nd, flags, cmode);
> >  	if (error) {
> > +		/*
> > +		 * if the underlying filesystem returns ESTALE
> > +		 * we must have used a cached file handle.
> > +		 */
> > +		if (error == ESTALE && stale++ == 0)
> > +			goto again;
> >  		/*
> >  		 * release our own reference
> >  		 */
[ ... ]

> If the name of the file you are attempting to open is relative to your
> current working directory, and your current working directory is nuked
> on the server, vn_open will return ESTALE, and your patch above will
> loop forever.

No, actually he thought of that (I read the code this way
the first time too, but was lucky enough to read it through
a couple of times while looking at the NFS sources in another
window, so I caught myself before sending anything).

Specifically, see the underlined part of:

> > + if (error == ESTALE && stale++ == 0)

---------------

...he exits after the retry fails, and falls into the
standard ESTALE return case.

If this gets committed (which I think it shouldn't because I
can see a genuinely bad handle getting converted to a good one
in a couple of cases), that line should probably be rewritten
to be more obvious (e.g. move the "stale++" before the "if"
statement and adjust the compare to compensate for the difference
so no one else reads it the way we did).

Your other comments are good, though (see other post).

-- Terry

Andrey Alekseyev

20 Jun 2003, 14:34:29
– Don Lewis, freebsd...@freebsd.org
Don,

> One case where there is a difference between timing out old file handles
> and just invalidating them on ESTALE:

Frankly, I just didn't find any mechanism in the STABLE kernel that
does "timing out" for file handles. Do you mean, it would be nice to have
it or are you trying to point it out to me? ;-P

> client% cmd1 > file1; cmd2 > file2
> server% mv file1 tmpfile; mv file2 file1; mv tmpfile file1
>
> wait an hour
>
> client% cat /dev/null > file1
>
> If file handles are cached indefinitely, and the client didn't recycle
> the vnode for file1, which file on the server got truncated? Since
> neither file was deleted on the server, you can't rely on ESTALE to
> detect this situation.

Eh, but the generation number for file1 should have been changed! This will
result in a definite ESTALE error for file1 from the server. That is, I
believe that if you attempt to open("file1", O_CREAT) after an hour, you'll
get ESTALE from the server (on which nfs_request() will invalidate "file1"
namecache entry and vnode+nfsnode+old-file-handle) and the second vn_open()
will re-lookup file1 and get a valid new file handle.

Actually, this is what indeed happens if the second open() comes from the
userland application :) I'm just trying to eliminate the need of modifying
a generic application.

For my example with moves, the next "cat" will always(!) succeed.

> Question: does the timeout of the directory attributes cause open() to
> do an NFS lookup on the file, or does open() just find the vnode in the
> cache and use its cached handle?

Well, for open() without O_CREAT the sequence is this:

    open() -> vn_open() -> namei() -> lookup() -> VOP_LOOKUP() -> nfs_lookup()
                  |
                  +-> VOP_ACCESS() -> nfs_access() [-> nfs3_access_otw() <possibly>]
                  |
                  +-> VOP_OPEN() -> nfs_open()

Lookup is always done first (obviously). It may return a cached name which
contains a pointer to a cached vnode/nfsnode. The cached vnode/nfsnode is then
used in VOP_ACCESS() and VOP_OPEN(). Either function may or may not
update the file attributes cached inside the nfsnode. Neither VOP_ACCESS() nor
VOP_OPEN() ever updates the *file handle*. The file handle comes from
VOP_LOOKUP(), and VOP_LOOKUP() only places it there if the vnode/nfsnode isn't
cached, which I believe happens only if there is no cached filename in
the namecache. I really tried to do my best to describe everything in:
http://www.blackflag.ru/patches/nfs_attr.txt
Please take a look.

Whether ESTALE comes from VOP_ACCESS() or VOP_OPEN() depends on several
factors: namely, the value of the nfsaccess_cache_timeout sysctl, acmin/acmax,
and the age of the file in question.

Generally speaking, if nfsaccess_cache_timeout is less than acmin,
the VOP_ACCESS() that comes right before VOP_OPEN() in vn_open() will try to do
an "access" RPC request, and it'll fail if the file handle is stale. If
nfsaccess_cache_timeout is greater than acmin, then it's possible that
VOP_ACCESS() will answer "yes" based on the cached attributes, but
VOP_GETATTR(), which is called from nfs_open() (the VOP_OPEN() for
NFS), will in turn "go to the wire" and nfs_request() will still fail with
ESTALE.
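
Paraphrasing the relevant check from 4.x nfs_access() (simplified and from
memory, not verbatim):

	/*
	 * Reuse the cached access answer only while it is younger
	 * than nfsaccess_cache_timeout; otherwise go to the wire.
	 */
	if (np->n_modestamp &&
	    time_second < np->n_modestamp + nfsaccess_cache_timeout &&
	    cred->cr_uid == np->n_modeuid &&
	    (np->n_mode & mode) == mode)
		return (0);				/* cached answer */
	return (nfs3_access_otw(vp, wmode, p, cred));	/* go to the wire */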

Hope, I'm making it clear :)

Andrey Alekseyev

20 Jun 2003, 14:56:17
– freebsd...@freebsd.org, truc...@freebsd.org
> Eh, but the generation number for file1 should have been changed! This will

I'm sorry, the generation number is not changed in your scenario. Thus,
I believe if the sequence of actions on the server is

mv file1 tmpfile
mv file2 file1
mv tmpfile file1

like you described, it's safe to continue to use a cached file handle
for file1 on the server since it still references the original file.
And file2 just disappears from the server.

Andrey Alekseyev

20 Jun 2003, 15:20:53
– Terry Lambert, freebsd...@freebsd.org
Terry,

> The place to correct this is probably the underlying FS. I'd
> argue that getting ESTALE is a poke with a sharp stick that
> makes this more likely to happen. ;^).

Initially I was going to "fix" the underlying FS (that is, the NFS code).
But it's extremely hard to do nicely, because I need to re-lookup the name(!),
which is not referenced (easily? at all?) below the VFS.

> > I think this is exactly what happens :) Actually, I believe, I'm just
> > getting another namecache entry with another vnode/nfsnode/file handle.
>
> You can't have this for other reasons; specifically, if you have
> > the file open at the time of the rename, and it becomes a ".#nfs..."
> file (or whatever) on the server.

I didn't trace the "sillyrename" scenario much. But I believe nfs_sillyrename()
keeps it tight. At least, it uses nfs_lookitup(), which may actually
*update* the file handle. And it plays with name cache purging as well.
So I don't consider it a real problem.

However, for opens for reading/writing the scenario looks quite clear to me.
As I said in my previous message to Don, I'm just trying to eliminate
the need to modify an otherwise generic application to cope with the necessity
of doing an immediate second open() if the first open() failed with ESTALE,
for a certain more or less common situation :) And I know the second open from
the userland application always works for the case I've described.

> Don points out that Solaris tries to fix this via the "noac" mount
> option for client NFS.

It does bad things to performance, though :) I'm not trying to uncache
everything. It's safe for me to use file pagecache if open() succeeds.
I'm not trying to reach an absolute shared file integrity with NFS, believe
me :)

> { A, B, C }
> fd1 open on B
> fd2 open on C
> rename B -> C
> rename A -> B
>
> ? With your patch, I think we would potentially convert fd2 to point
> to B whien it really *should* be "ESTALE", which is wrong (think in
> terms of 2 or more clients doing the operations).

You didn't specify client or server side, though. The result heavily
depends on the exact scenario.

With a single client, a new open() for "C" will result in fd2 if the
original "C" is still opened (because of sillyrename?).
Without fd2, any new open() for "C" will get a valid file handle for what
originally was "B". And that's a correct behaviour.

If the renames were on the server, then fd1 will be valid until the last
client's close. However, any reference to the original "C" will fail.
Re-opening "C" should result in a new file handle for what originally was "B".

Am I wrong?

Don Lewis

20 Jun 2003, 16:39:58
– ui...@blackflag.ru, freebsd...@freebsd.org
On 20 Jun, Andrey Alekseyev wrote:
> Don,
>
>> One case where there is a difference between timing out old file handles
>> and just invalidating them on ESTALE:
>
> Frankly, I just didn't find any mechanism in the STABLE kernel that
> does "timing out" for file handles. Do you mean, it would be nice to have
> it or are you trying to point it out to me? ;-P

If there isn't such a mechanism, there should be.

>> client% cmd1 > file1; cmd2 > file2
>> server% mv file1 tmpfile; mv file2 file1; mv tmpfile file1
>>
>> wait an hour
>>
>> client% cat /dev/null > file1
>>
>> If file handles are cached indefinitely, and the client didn't recycle
>> the vnode for file1, which file on the server got truncated? Since
>> neither file was deleted on the server, you can't rely on ESTALE to
>> detect this situation.
>
> Eh, but the generation number for file1 should have been changed! This will
> result in a definite ESTALE error for file1 from the server. That is, I
> believe that if you attempt to open("file1", O_CREAT) after an hour, you'll
> get ESTALE from the server (on which nfs_request() will invalidate "file1"
> namecache entry and vnode+nfsnode+old-file-handle) and the second vn_open()
> will re-lookup file1 and get a valid new file handle.

If the client still has a cached copy of the file handle for file1,
won't it just use that and truncate file2 on the server? The handle
never goes stale because the file was never deleted on the server.

> Actually, this is what indeed happens if the second open() comes from the
> userland application :) I'm just trying to eliminate the need of modifying
> a generic application.
>
> For my example with moves, the next "cat" will always(!) succeed.
>
>> Question: does the timeout of the directory attributes cause open() to
>> do an NFS lookup on the file, or does open() just find the vnode in the
>> cache and use its cached handle?
>
> Well, for open() without O_CREAT the sequence is this:
>
>     open() -> vn_open() -> namei() -> lookup() -> VOP_LOOKUP() -> nfs_lookup()
>                   |
>                   +-> VOP_ACCESS() -> nfs_access() [-> nfs3_access_otw() <possibly>]
>                   |
>                   +-> VOP_OPEN() -> nfs_open()
>
> Lookup is always done first (obviously). It may return a cached name which
> contains a pointer to a cached vnode/nfsnode. The cached vnode/nfsnode is then
> used in VOP_ACCESS() and VOP_OPEN(). Either function may or may not
> update the file attributes cached inside the nfsnode. Neither VOP_ACCESS() nor
> VOP_OPEN() ever updates the *file handle*. The file handle comes from
> VOP_LOOKUP(), and VOP_LOOKUP() only places it there if the vnode/nfsnode isn't
> cached, which I believe happens only if there is no cached filename in
> the namecache. I really tried to do my best to describe everything in:
> http://www.blackflag.ru/patches/nfs_attr.txt
> Please take a look.

If the client is mostly idle, then the cached filename is unlikely to be
flushed, so even after a long period of time, namei() will return the
old vnode and its associated file handle. If the file on the server was
renamed and not deleted, the server won't return ESTALE for the handle
and open() will return a descriptor for the original file on the server
that has since been renamed, not for the new file on the server that
lives at the path name passed to open() on the client.

Another example:

client% cmd1 > file1
client% cmd2 > file2
client% more file1
^Z
suspended

server% mv file1 tmpfile; mv file2 file1; mv tmpfile file2

wait 24 hours

client% cat /dev/null > file1
client% fg

The last cat command should truncate file1 on the server, which is the
output of cmd2. When the more command resumes, it should still be able
to see the output of cmd1. The old file1 vnode and file handle
should remain valid, but the lookup to open file1 for the last cat
command needs to know that the cache entry has timed out and that the
handle associated with the cached vnode for file1 hasn't been validated
in a while. Lookup() needs to bypass the cache in this case and pass the
lookup request to the server. If the file handle returned is the same
as before, the cache entry should be freshened; if the file handle is
different, then a new vnode needs to be allocated and associated with the
name cache entry and the new handle. The old vnode and its handle need
to be retained until either an RPC using this handle returns ESTALE, or
the file is closed and the vnode is recycled.
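
In pseudo-C, with invented helper and field names, that revalidation step
would look something like:

	/*
	 * Bypass the name cache, redo the lookup RPC, and compare the
	 * handle that comes back with the cached one.
	 */
	error = nfs_lookup_rpc(dvp, cnp, &newfh, &newfhsize);
	if (error == 0 && newfhsize == np->n_fhsize &&
	    bcmp(newfh, np->n_fhp, newfhsize) == 0)
		np->n_fhstamp = time_second;	/* freshen the cache entry */
	else
		cache_purge(vp);	/* then allocate a new vnode via nfs_nget() */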


> Whether ESTALE comes from VOP_ACCESS() or VOP_OPEN() depends on several
> factors: namely, the value of the nfsaccess_cache_timeout sysctl, acmin/acmax,
> and the age of the file in question.
>
> Generally speaking, if nfsaccess_cache_timeout is less than acmin,
> the VOP_ACCESS() that comes right before VOP_OPEN() in vn_open() will try to do
> an "access" RPC request, and it'll fail if the file handle is stale. If
> nfsaccess_cache_timeout is greater than acmin, then it's possible that
> VOP_ACCESS() will answer "yes" based on the cached attributes, but
> VOP_GETATTR(), which is called from nfs_open() (the VOP_OPEN() for
> NFS), will in turn "go to the wire" and nfs_request() will still fail with
> ESTALE.
>
> Hope, I'm making it clear :)

Yeah, but the solution that you propose doesn't fix the case where
ESTALE is not returned but namei() returns a cached vnode associated
with a file on the server that doesn't exist at the specified path name.

Also, fixing open() doesn't fix similar problems that can occur with
other syscalls that take path names, such as stat() and readlink().

If the lookup code is changed so that it more frequently revalidates the
name->vnode->handle entries, then the window where open() can fail due
to ESTALE would be greatly reduced.

Don Lewis

20 Jun 2003, 16:41:12
– ui...@blackflag.ru, freebsd...@freebsd.org
On 20 Jun, Andrey Alekseyev wrote:
>> Eh, but the generation number for file1 should have been changed! This will
>
> I'm sorry, the generation number is not changed in your scenario. Thus,
> I believe if the sequence of actions on the server is
>
> mv file1 tmpfile
> mv file2 file1
> mv tmpfile file1
>
> like you described, it's safe to continue to use a cached file handle
> for file1 on the server since it still references the original file.
> And file2 just disappears from the server.

Well, just its contents ... but this still violates POLA.

Andrey Alekseyev

20 Jun 2003, 18:16:34
– Don Lewis, freebsd...@freebsd.org
Don,

> old vnode and its associated file handle. If the file on the server was
> renamed and not deleted, the server won't return ESTALE for the handle

I'm all confused and messed up :) Actually, a rename on the server is not
the same as a sillyrename on the client. If you rename a file on the
server for which there is a cached file handle on the client, the next time
the client uses its cached file handle, it'll get ESTALE from the server.
I don't know how this happens, though. Until I dig more around all the
rename paraphernalia, I won't know. If someone can clear this up, please
do. It'll be much appreciated. At this time I can't link this with the
inode generation number changes (as there is no new inode allocated when
the file is renamed).

I'm not strong in rename and sillyrename alchemy; I can just deduce something
from the code, though not much. However, I've just tested my patch with
the rename-to-other-name-on-the-server scenario, and it seems to return
ENOENT to the application after the local file pagecache is invalidated and
the client tries to actually read the file from the server using the old name
and old file handle.

> Also, fixing open() doesn't fix similar problems that can occur with
> other syscalls that take path names, such as stat() and readlink().

That's a good point. However, if the patch for open() succeeds it can be
further extended to other syscalls as well.

> If the lookup code is changed so that it more frequently revalidates the
> name->vnode->handle entries, then the window where open() can fail due
> to ESTALE would be greatly reduced.

Sorry, I've got no time for that :) I'm generally not in this area of
activities. At least for the next few years I'm an extremely busy man :)

Again, thanks a lot for your comments.

John

20 Jun 2003, 19:03:34
– Terry Lambert, freebsd...@freebsd.org, Don Lewis, ui...@blackflag.ru
----- Terry Lambert's Original Message -----

> Specifically, see the underlined part of:
>
> > > + if (error == ESTALE && stale++ == 0)
> ---------------
>
> ...he exits after the retry fails, and falls into the
> standard ESTALE return case.
>
> If this gets committed (which I think it shouldn't because I
> can see a genuinely bad handle getting converted to a good one
> in a couple of cases), that line should probably be rewritten
> to be more obvious (e.g. move the "stale++" before the "if"
> statement and adjust the compare to compensate for the difference

> so no one else reads it the way we did).

hi folks,

After looking at his original patch, I suggested modifying
it for clarity to be of the form:

	error = vn_open(&nd, flags, cmode);
	if (error == ESTALE)
		error = vn_open(&nd, flags, cmode);	/* single retry */


While I understand a number of you have reservations about
this change, I think it is worth serious consideration. Unless
someone is willing to go into each of the individual fs layers
and deal with ESTALE, this appears to be a relatively
straightforward and easy-to-understand approach.

Most of the main applications I run on clusters have
had their open routines recoded similarly to the following (this one
from ftpd):

	int try = 0;

	/* note: try must be incremented, or a persistent ESTALE loops forever */
	while ((fin = fopen(name, "r")) == NULL && errno == ESTALE && try++ < 3) {
		if (logging > 1)
			syslog(LOG_INFO, "fopen(\"%s\"): %m: attempting retry", name);
	}
	if (fin == NULL && logging > 1)
		syslog(LOG_INFO, "get fopen(\"%s\"): %m", name);


This is a real problem when using fbsd in high load / high
throughput situations where highly sequenced operations are
performed on a common set of data files from multiple machines. An
example of this environment can be seen here:

http://www.freebsd.org/~jwd/images/cluster.jpg

If no one has any patches which can provide a better solution
for handling ESTALE, I would like to see Andrey's patch given
a chance.

Of course, if we don't want to do this, then I think it is
high time we documented that open(2) can return ESTALE and provided
a library routine that wraps open() with a retry :-)
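
Such a wrapper could be as simple as this (a sketch; the name is made up):

	#include <sys/types.h>
	#include <errno.h>
	#include <fcntl.h>

	int
	open_retry(const char *path, int flags, mode_t mode)
	{
		int fd, try;

		/* Retry a bounded number of times on ESTALE only. */
		for (try = 0; try < 3; try++) {
			fd = open(path, flags, mode);
			if (fd >= 0 || errno != ESTALE)
				break;
		}
		return (fd);
	}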

-John

Don Lewis

22 Jun 2003, 02:36:46
– ui...@blackflag.ru, freebsd...@freebsd.org
On 21 Jun, Andrey Alekseyev wrote:
> Don,
>
>> old vnode and its associated file handle. If the file on the server was
>> renamed and not deleted, the server won't return ESTALE for the handle
>
> I'm all confused and messed up :) Actually, a rename on the server is not
> the same as a sillyrename on the client. If you rename a file on the
> server for which there is a cached file handle on the client, the next time
> the client uses its cached file handle, it'll get ESTALE from the server.
> I don't know how this happens, though. Until I dig more around all the
> rename paraphernalia, I won't know. If someone can clear this up, please
> do. It'll be much appreciated. At this time I can't link this with the
> inode generation number changes (as there is no new inode allocated when
> the file is renamed).

When a file is renamed on the server, its file handle remains valid.


I had some time to write some scripts to exercise this stuff and
discovered some interesting things. The NFS server is a 4.8-stable box
named mousie, and the NFS client is running 5.1-current. The tests were
run in my NFS-mounted home directory.

Here's the first script:

#!/bin/sh -v
rm -f file1 file2
ssh -n mousie rm -f file1 file2
echo foo > file1
echo bar > file2
ssh -n mousie cat file1
ssh -n mousie cat file2
tail -f file1 &
sleep 1
cat file1
cat file2
ssh -n mousie 'mv file1 tmpfile; mv file2 file1; mv tmpfile file2'
cat file1
cat file2
echo baz >> file2
sleep 1
kill $!
ssh -n mousie cat file1
ssh -n mousie cat file2

Here's the output of the script:

#!/bin/sh -v
rm -f file1 file2
ssh -n mousie rm -f file1 file2
echo foo > file1
echo bar > file2
ssh -n mousie cat file1
foo
ssh -n mousie cat file2
bar
tail -f file1 &
sleep 1
foo
cat file1
foo
cat file2
bar
ssh -n mousie 'mv file1 tmpfile; mv file2 file1; mv tmpfile file2'
cat file1
bar
cat file2
foo
echo baz >> file2
sleep 1
baz
kill $!
Terminated
ssh -n mousie cat file1
bar
ssh -n mousie cat file2
foo
baz

Notice that immediately after the files are swapped on the server, the
cat commands on the client detect that the files have been
interchanged and open the correct files. The tail
command shows that the original handle for file1 remains valid after the
rename operations and when more data is written to file2 after the
interchange, the data is appended to the file that was formerly file1.

My second script is an attempt to reproduce the open() -> ESTALE error.

#!/bin/sh -v
rm -f file1 file2
ssh -n mousie rm -f file1 file2
echo foo > file1
echo bar > file2
ssh -n mousie cat file1
ssh -n mousie cat file2
sleep 1
cat file1
cat file2
ssh -n mousie 'mv file1 file2'
cat file2
cat file1

And its output:

#!/bin/sh -v
rm -f file1 file2
ssh -n mousie rm -f file1 file2
echo foo > file1
echo bar > file2
ssh -n mousie cat file1
foo
ssh -n mousie cat file2
bar
sleep 1
cat file1
foo
cat file2
bar
ssh -n mousie 'mv file1 file2'
cat file2
foo
cat file1
cat: file1: No such file or directory

Even though file2 was unlinked and replaced by file1 on the server, the
client immediately notices the change and is able to open the proper
file.


Since my scripts weren't provoking the reported problem, I wondered if
this was a 4.x vs. 5.x problem, or if the problem didn't occur in the
current working directory, or if the problem only occurred if a
directory was specified in the file path. I modified my scripts to work
with a subdirectory and got rather different results:

#!/bin/sh -v
rm -f dir/file1 dir/file2
ssh -n mousie rm -f dir/file1 dir/file2
echo foo > dir/file1
echo bar > dir/file2
ssh -n mousie cat dir/file1
foo
ssh -n mousie cat dir/file2
bar
tail -f dir/file1 &
sleep 1
foo
cat dir/file1
foo
cat dir/file2
bar
ssh -n mousie 'mv dir/file1 dir/tmpfile; mv dir/file2 dir/file1; mv dir/tmpfile dir/file2'
sleep 120
cat dir/file1
bar
cat dir/file2
bar
echo baz >> dir/file2
sleep 1
kill $!
Terminated
ssh -n mousie cat dir/file1
bar
baz
ssh -n mousie cat dir/file2
foo

Even after waiting long enough for the cached attributes to time out,
one of the cat commands on the client opened the incorrect file, and when
the shell executed the echo command to append to one of the files, the
wrong file was opened and appended to. Conclusion: the client is
confused, and retrying open() on an ESTALE error is insufficient to fix
the problem.

By specifying a directory in the path, I was also able to reproduce
the ESTALE error one time, but now I always get:

#!/bin/sh -v
rm -f dir/file1 dir/file2
ssh -n mousie rm -f dir/file1 dir/file2
echo foo > dir/file1
echo bar > dir/file2
ssh -n mousie cat dir/file1
foo
ssh -n mousie cat dir/file2
bar
sleep 1
cat dir/file1
foo
cat dir/file2
bar
ssh -n mousie 'mv dir/file1 dir/file2'
sleep 120
cat dir/file2
foo
cat dir/file1
foo

unless I decrease the sleep time:

#!/bin/sh -v
rm -f dir/file1 dir/file2
ssh -n mousie rm -f dir/file1 dir/file2
echo foo > dir/file1
echo bar > dir/file2
ssh -n mousie cat dir/file1
foo
ssh -n mousie cat dir/file2
bar
sleep 1
cat dir/file1
foo
cat dir/file2
bar
ssh -n mousie 'mv dir/file1 dir/file2'
# sleep 120
sleep 1
cat dir/file2
cat: dir/file2: Stale NFS file handle
cat dir/file1
foo


In one of my tests, I got an xauth warning from ssh, which made me think
that maybe the manipulation of my .Xauthority file might affect the
results. When I reran the original tests without X11 forwarding, I got
results similar to those that I got when I specified a directory in the
path:

#!/bin/sh -v
rm -f file1 file2
ssh -x -n mousie rm -f file1 file2
echo foo > file1
echo bar > file2
ssh -x -n mousie cat file1
foo
ssh -x -n mousie cat file2
bar
sleep 1
cat file1
foo
cat file2
bar
ssh -x -n mousie 'mv file1 file2'
cat file2
cat: file2: Stale NFS file handle
cat file1
foo

Conclusion: relying on seeing an ESTALE error to retry is insufficient.
Depending on how files are manipulated, open() may successfully return a
descriptor for the wrong file and even enable the contents of that file
to be overwritten. The namei()/lookup() code is broken and that's what
needs to be fixed.

Andrey Alekseyev

22 Jun 2003, 07:52:15
– Don Lewis, freebsd...@freebsd.org
Don,

> When a file is renamed on the server, its file handle remains valid.

Actually, I was wrong in my assumption about how names are purged from the
namecache. And I didn't mean an operation with a file opened on the client.
And it actually happens that this sequence of commands will get you ENOENT
(not ESTALE) on the client because of a new lookup in #4:

1. server: echo "1111" > 1
2. client: cat 1
3. server: mv 1 2
4. client: cat 1 <--- ENOENT here

The name cache can be purged by nfs_lookup(), if the latter finds that the
capability numbers don't match. In this case, nfs_lookup() will send a
new "lookup" RPC request to the server. The name cache can also be purged from
getnewvnode() and vclean(). Which code does that for the above scenario
is quite obscure to me. Yes, my knowledge is limited :)

By the way, what were the values of acregmin/acregmax/acdirmin/acdirmax and
also the value of vfs.nfs.access_cache_timeout in your tests?

I believe the results of your test patterns heavily depend on the NFS
attribute cache tunables (which happen to affect all cycles of NFS
operation) and on the command execution timing as well. Moreover, I
suspect that all this is messily linked with the type and sequence of
operations on both the server and the client. Recall, I was trying to fix
just *one* common scenario :)

With different values of acmin/acmax and access_cache_timeout, and manual
operations, I was able to achieve both the result you consider "proper" above
and the "wrong" effect that you described below.

> And its output:
>
> #!/bin/sh -v
> rm -f file1 file2
> ssh -n mousie rm -f file1 file2
> echo foo > file1
> echo bar > file2
> ssh -n mousie cat file1
> foo
> ssh -n mousie cat file2
> bar
> sleep 1
> cat file1
> foo
> cat file2
> bar
> ssh -n mousie 'mv file1 file2'
> cat file2
> foo
> cat file1
> cat: file1: No such file or directory
>
> Even though file2 was unlinked and replaced by file1 on the server, the
> client immediately notices the change and is able to open the proper
> file.

My tests always eventually produce ESTALE for file2 here. However, I suspect
there must be configurations where I won't get ESTALE.

> Conclusion: relying on seeing an ESTALE error to retry is insufficient.
> Depending on how files are manipulated, open() may successfully return a
> descriptor for the wrong file and even enable the contents of that file
> to be overwritten. The namei()/lookup() code is broken and that's what
> needs to be fixed.

I don't think it's namei()/lookup() that is broken. I'm afraid the name
and attribute caching logic is somewhat far from ideal. The namecache routines
seem to work fine; they just do the actual parsing/lookup of a pathname. Other
functions manipulate the cached names based on their understanding
of the cache validity (both the namecache and the cached dir/file attributes).

I've also done a number of tcpdump's for different test patterns and I
believe what happens with the cached vnode may depend on the results of
the "access" RPC request to the server.

As I said before, I was not going to fix all the NFS inefficiencies related
to heavily shared file environments. However, I still believe that
open-retry-on-ESTALE *may* help people to avoid certain erratic conditions.
At least, I think that having this functionality switchable with an
additional sysctl variable *may* help lots of people in the black art of
tuning NFS <attribute> caching. As there are no exact descriptions of how
all of this behaves, people usually have to experiment with their own
particular environments.

Also, I agree it's not the "fix" for everything. And I didn't even say
I want this to be integrated into the source :)

Actually, I know that it works for what I've been fixing locally and just
asked for technical comments about possible "vnode leakage" and nameidata
initialization, which nobody has provided yet ;-P

I appreciate *very much* all of the answers, though. Definitely food for
thought, but I'm a little bit tired of this issue already :)

Thanks again for your efforts.

Don Lewis

29 Jun 2003, 21:24:11
– ui...@blackflag.ru, freebsd...@freebsd.org
On 22 Jun, Andrey Alekseyev wrote:
> Don,
>
>> When a file is renamed on the server, its file handle remains valid.
>
> Actually, I was wrong in my assumption about how names are purged from the
> namecache. And I didn't mean an operation with a file opened on the client.
> And it actually happens that this sequence of commands will get you ENOENT
> (not ESTALE) on the client because of a new lookup in #4:
>
> 1. server: echo "1111" > 1
> 2. client: cat 1
> 3. server: mv 1 2
> 4. client: cat 1 <--- ENOENT here

That's what it is supposed to do, but my testing would seem to indicate
that step 4 could return the file contents for an extended period of
time after the file was renamed on the server.

> The name cache can be purged by nfs_lookup(), if the latter finds that the
> capability numbers don't match. In this case, nfs_lookup() will send a
> new "lookup" RPC request to the server. The name cache can also be purged from
> getnewvnode() and vclean(). Which code does that for the above scenario
> is quite obscure to me. Yes, my knowledge is limited :)

The vpid == newvp->v_id test in nfs_lookup() just detects if the vnode
that the cache entry pointed to was recycled for another use while it
was on the free list. It doesn't detect whether the inode on the server
was recycled.

When I was thinking about this problem, the solution I came up with was
a lot like the
	if (!VOP_GETATTR(newvp, &vattr, cnp->cn_cred, td)
	    && vattr.va_ctime.tv_sec == VTONFS(newvp)->n_ctime)
code fragment, but I would have done the ctime check on both the target
and the parent directory and only ignored the cache entry if both ctimes
had been updated. Checking only the target should be more conservative,
though it would be slower because there would be more cases where the
client would have to do the RPC call.

If the file on the server associated with the cached entry on the client
is renamed on the server, its file handle will remain valid, but its
ctime will be updated, so VOP_GETATTR() will succeed, but the ctime
check should trigger and the cache entry should be purged.

If the file on the server is unlinked or another file mv'ed on top of
it, its file handle should no longer be valid, so the VOP_GETATTR() call
should fail, which should cause the cache entry to be purged and a new
lookup RPC should be done.

What I find interesting is that in order for open() to fail with the
ESTALE error, the cache entry must be used, which means that this
VOP_GETATTR() call must be succeeding, but for some reason another VOP
call after namei() returns is failing with ESTALE.

I'm using the default values for
acregmin/acregmax/acdirmin/acdirmax.

% sysctl vfs.nfs.access_cache_timeout
vfs.nfs.access_cache_timeout: 2

> I believe the results of your test patterns heavily depend on the NFS
> attribute cache tunables (which happen to affect all cycles of NFS
> operation) and on the command execution timing as well. Moreover, I
> suspect that all this is messily linked with the type and sequence of
> operations on both the server and the client. Recall, I was trying to fix
> just *one* common scenario :)

Some of my test cases waited for 120 seconds after the rename on the
server before attempting access from the client, which should be enough
time for the attribute cache to time out.

I think the main problem is namei()/lookup(). They shouldn't be
returning a vnode that is associated with a file handle that points to a
different or non-existent file on the server if the name-to-handle
association has been invalid for a long period of time. While it's not
possible to totally enforce coherency, we should be able to do a lot
better.

> I've also done a number of tcpdump's for different test patterns and I
> believe what happens with the cached vnode may depend on the results of
> the "access" RPC request to the server.

That may be an important clue. The access cache may be properly
working, but the attribute cache timeout may be broken.

> As I said before, I was not going to fix all the NFS inefficiencies related
> to heavily shared file environments. However, I still believe that
> open-retry-on-ESTALE *may* help people to avoid certain erratic conditions.
> At least, I think that having this functionality switchable with an
> additional sysctl variable *may* help lots of people in the black art of
> tuning NFS <attribute> caching. As there are no exact descriptions of how
> all of this behaves, people usually have to experiment with their own
> particular environments.
>
> Also, I agree it's not the "fix" for everything. And I didn't even say
> I want this to be integrated into the source :)

I don't think we can totally fix the problem, but I would like to see
the source fixed so that people don't have to patch their applications
or their kernel for common usage patterns.

> Actually, I know that it works for what I've been fixing locally and just
> asked for technical comments about possible "vnode leakage" and nameidata
> initialization, which nobody has provided yet ;-P

I think you're probably ok on the vnode side, but one problem might be
the flags in the struct nameidata. The lookup code tends to fiddle with
them. I was also concerned about leaking the cn_pnbuf buffer, but it
looks like it may not get allocated or may get freed in the error case,
since kern_open() doesn't call NDFREE(&nd, NDF_ONLY_PNBUF) if namei()
fails.

Don Lewis

30 Jun 2003, 16:55:23
– ui...@blackflag.ru, freebsd...@freebsd.org
On 29 Jun, I wrote:
> On 22 Jun, Andrey Alekseyev wrote:

>> The name cache can be purged by nfs_lookup(), if the latter finds that the
>> capability numbers don't match. In this case, nfs_lookup() will send a
>> new "lookup" RPC request to the server. The name cache can also be purged from
>> getnewvnode() and vclean(). Which code does that for the above scenario
>> is quite obscure to me. Yes, my knowledge is limited :)
>
> The vpid == newvp->v_id test in nfs_lookup() just detects if the vnode
> that the cache entry pointed to was recycled for another use while it
> was on the free list. It doesn't detect whether the inode on the server
> was recycled.
>
> When I was thinking about this problem, the solution I came up with was
> a lot like the
> 	if (!VOP_GETATTR(newvp, &vattr, cnp->cn_cred, td)
> 	    && vattr.va_ctime.tv_sec == VTONFS(newvp)->n_ctime)
> code fragment, but I would have done the ctime check on both the target
> and the parent directory and only ignored the cache entry if both ctimes
> had been updated. Checking only the target should be more conservative,
> though it would be slower because there would be more cases where the
> client would have to do the RPC call.

I actually meant to say the mtime of the parent directory.
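
In other words, something like this (dvp is the parent directory; dvattr
and the n_dmtime field are invented for the sketch):

	/*
	 * Trust the cache entry unless both the target's ctime and
	 * the parent directory's mtime have moved.
	 */
	if (!VOP_GETATTR(newvp, &vattr, cnp->cn_cred, td) &&
	    !VOP_GETATTR(dvp, &dvattr, cnp->cn_cred, td) &&
	    (vattr.va_ctime.tv_sec == VTONFS(newvp)->n_ctime ||
	     dvattr.va_mtime.tv_sec == VTONFS(dvp)->n_dmtime)) {
		/* cache entry still trusted; use the cached vnode */
	}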

After doing some more testing, I believe the problem I'm seeing is
caused by the rename on the server not updating the seconds field of the
file's ctime. If the file was last changed at time N, the client does
a lookup on the file and sees this ctime value, and the server renames
the file before the time on the server increments to the next second,
then the ctime check in nfs_lookup() won't detect that the cached lookup
information might be invalid.

The best way I could think of to fix this problem is to ignore the cache
entry and do the lookup RPC until we detect that the time on the server
has incremented to the next second, so that we know that the cached
lookup must be valid. The problem is that I don't know how to get a
timestamp from the server.


>> I've also done a number of tcpdump's for different test patterns and I
>> believe what happens with the cached vnode may depend on the results of
>> the "access" RPC request to the server.
>
> That may be an important clue. The access cache may be properly
> working, but the attribute cache timeout may be broken.

I'm pretty sure that the problem that you are having with open()
returning ESTALE is caused by the difference between the access cache
timeout and the attribute cache timeout. It looks like your workaround
of retrying the open only works with NFSv3, because NFSv2 relies on
VOP_GETATTR(), and if the attribute cache timeout is too long the open()
will succeed and you'll only detect the failure when you actually do the
I/O.
