
Re: Linux NFSv4 clients are getting (bad sequence-id error!)


Graham Allan

Jul 16, 2015, 8:16:12 PM

I'm curious how things are going for you with this?

Reading your thread did pique my interest since we have a lot of
Scientific Linux (RHEL clone) boxes with FreeBSD NFSv4 servers. I meant
to glance through our logs for signs of the same issue, but today I
started investigating a machine which appeared to have hung processes,
high rpciod load, and high traffic to the NFS server. Of course it is
exactly this issue.

The affected machine is running SL5 though most of our server nodes are
now SL6. I can see errors from most of them but the SL6 systems appear
less affected - I see a stream of the sequence-id errors in their logs but
things in general keep working. The one SL5 machine I'm looking at
has a single sequence-id error in today's logs, but then goes into a
stream of "state recovery failed" then "Lock reclaim failed". It's
probably partly related to the particular workload on this machine.

I would try switching our SL6 machines to NFS 4.1 to see if the
behaviour changes, but 4.1 isn't supported by our 9.3 servers (is it in
10.1?).

At the NFS servers, most of the relevant sysctl settings are already tuned
away from the defaults, e.g. tcp.highwater=100000, vfs.nfsd.tcpcachetimeo=300,
and 128-256 nfs kernel threads.
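
For reference, those server-side tunables can also be read programmatically
via sysctlbyname(3). A minimal sketch, assuming the "tcp.highwater" shorthand
above refers to the vfs.nfsd.tcphighwater sysctl:

    #include <sys/types.h>
    #include <sys/sysctl.h>
    #include <stdio.h>

    /* Print the current NFS server duplicate-request-cache tunables. */
    int
    main(void)
    {
        int hw, timeo;
        size_t len;

        len = sizeof(hw);
        if (sysctlbyname("vfs.nfsd.tcphighwater", &hw, &len, NULL, 0) == 0)
            printf("vfs.nfsd.tcphighwater = %d\n", hw);
        len = sizeof(timeo);
        if (sysctlbyname("vfs.nfsd.tcpcachetimeo", &timeo, &len, NULL, 0) == 0)
            printf("vfs.nfsd.tcpcachetimeo = %d\n", timeo);
        return (0);
    }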

Graham

On Fri, Jul 03, 2015 at 01:21:00AM +0200, Ahmed Kamal via freebsd-fs wrote:
> PS: Today (after adjusting tcp.highwater) I didn't get any screaming
> reports from users about hung vnc sessions. So maybe just maybe, linux
> clients are able to somehow recover from this bad sequence messages. I
> could still see the bad sequence error message in logs though
>
> Why isn't the highwater tunable set to something better by default ? I mean
> this server is certainly not under a high or unusual load (it's only 40 PCs
> mounting from it)
>
> On Fri, Jul 3, 2015 at 1:15 AM, Ahmed Kamal <email.ah...@googlemail.com
> > wrote:
>
> > Thanks all .. I understand now we're doing the "right thing" .. Although
> > if mounting keeps wedging, I will have to solve it somehow! Either using
> > Xin's patch .. or Upgrading RHEL to 6.x and using NFS4.1.
> >
> > Regarding Xin's patch, is it possible to build the patched nfsd code, as a
> > kernel module ? I'm looking to minimize my delta to upstream.
> >
> > Also would adopting Xin's patch and hiding it behind a
> > kern.nfs.allow_linux_broken_client be an option (I'm probably not the last
> > person on earth to hit this) ?
> >
> > Thanks a lot for all the help!
> >
> > On Thu, Jul 2, 2015 at 11:53 PM, Rick Macklem <rmac...@uoguelph.ca>
> > wrote:
> >
> >> Ahmed Kamal wrote:
> >> > Appreciating the fruitful discussion! Can someone please explain to me,
> >> > what would happen in the current situation (linux client doing this
> >> > skip-by-1 thing, and freebsd not doing it) ? What is the effect of that?
> >> Well, as you've seen, the Linux client doesn't function correctly against
> >> the FreeBSD server (and probably others that don't support this
> >> "skip-by-1"
> >> case).
> >>
> >> > What do users see? Any chances of data loss?
> >> Hmm. Mostly it will cause Opens to fail, but I can't guess what the Linux
> >> client behaviour is after receiving NFS4ERR_BAD_SEQID. You're the guy
> >> observing
> >> it.
> >>
> >> >
> >> > Also, I find it strange that netapp have acknowledged this is a bug on
> >> > their side, which has been fixed since then!
> >> Yea, I think Netapp screwed up. For some reason their server allowed this,
> >> then was fixed to not allow it and then someone decided that was broken
> >> and
> >> reversed it.
> >>
> >> > I also find it strange that I'm the first to hit this :) Is no one
> >> running
> >> > nfs4 yet!
> >> >
> >> Well, it seems to be slowly catching on. I suspect that the Linux client
> >> mounting a Netapp is the most common use of it. Since it appears that they
> >> flip-flopped w.r.t. whose bug this is, it has probably persisted.
> >>
> >> It may turn out that the Linux client has been fixed or it may turn out
> >> that most servers allowed this "skip-by-1" even though David Noveck (one
> >> of the main authors of the protocol) seems to agree with me that it should
> >> not be allowed.
> >>
> >> It is possible that others have bumped into this, but it wasn't isolated
> >> (I wouldn't have guessed it, so it was good you pointed to the RedHat
> >> discussion)
> >> and they worked around it by reverting to NFSv3 or similar.
> >> The protocol is rather complex in this area and changed completely for
> >> NFSv4.1,
> >> so many have also probably moved onto NFSv4.1 where this won't be an
> >> issue.
> >> (NFSv4.1 uses sessions to provide exactly once RPC semantics and doesn't
> >> use
> >> these seqid fields.)
> >>
> >> This is all just mho, rick
> >>
> >> > On Thu, Jul 2, 2015 at 1:59 PM, Rick Macklem <rmac...@uoguelph.ca>
> >> wrote:
> >> >
> >> > > Julian Elischer wrote:
> >> > > > On 7/2/15 9:09 AM, Rick Macklem wrote:
> >> > > > > I am going to post to nf...@ietf.org to see what they say. Please
> >> > > > > let me know if Xin Li's patch resolves your problem, even though I
> >> > > > > don't believe it is correct except for the UINT32_MAX case. Good
> >> > > > > luck with it, rick
> >> > > > and please keep us all in the loop as to what they say!
> >> > > >
> >> > > > the general N+2 bit sounds like bullshit to me.. its always N+1 in a
> >> > > > number field that has a
> >> > > > bit of slack at wrap time (probably due to some ambiguity in the
> >> > > > original spec).
> >> > > >
> >> > > Actually, since N is the lock op already done, N + 1 is the next lock
> >> > > operation in order. Since lock ops need to be strictly ordered,
> >> allowing
> >> > > N + 2 (which means N + 2 would be done before N + 1) makes no sense.
> >> > >
> >> > > I think the author of the RFC meant that N + 2 or greater fails, but
> >> it
> >> > > was poorly worded.
> >> > >
> >> > > I will pass along whatever I get from nf...@ietf.org. (There is an
> >> archive
> >> > > of it somewhere, but I can't remember where.;-)
> >> > >
> >> > > rick

--
-------------------------------------------------------------------------
Graham Allan - al...@physics.umn.edu - g...@umn.edu - (612) 624-5040
School of Physics and Astronomy - University of Minnesota
-------------------------------------------------------------------------

Ahmed Kamal via freebsd-fs

Jul 17, 2015, 7:22:34 AM

Hi Graham,

So my RHEL5 boxes certainly have trouble with nfs4 .. I'm running about 20
boxes and almost all of them develop a choking process every day or two.
I'm now in the process of upgrading our RHEL boxes to v6.x, mainly motivated
by the move to NFS4.1, although now that you say NFS4 is more tolerant on
EL6, I might just stay on that. So far I've done one week of basic testing
of a VM on el6 with nfs4.1 against my FreeBSD 10.1 and didn't hit problems
(although the testing was light). Next week, I'll probably upgrade one of
our production machines to el6 and see how it fares.

PS: I had upgraded our el5 box with an elrepo kernel (v3.2) .. which I
thought would be much newer (even newer than el6) .. But I still had trouble
with it .. so I reverted to the stock el5 kernel! Not sure if this means
Linux is not the only component at fault ?!

Graham Allan

Jul 17, 2015, 12:06:05 PM

I have maintenance scheduled this weekend, so maybe I will try adding Xin
Li's patch to one of our 9.3 servers and see if the sequence-id
messages diminish (even though it didn't help for you - possibly SL6
will behave differently).

As for SL6/NFS4 being more tolerant, I suspect the problem depends on the
specific job. This is the first time I have seen it at all (that is, with
the stuck processes and high rpciod load), and I only see one person
running this code, although looking back ~60 days in the logs I can see
the sequence-id messages occurring all over the place from other machines,
apparently without incident.

For the more intense users who are running on 200 servers at once, I
wonder if they are not hitting the NFS server in the same way - possibly
they are mostly writing somewhere else like hadoop and only reading from
NFS. However our compute farm conversions to SL6 and NFSv4 are fairly
recent, so something may yet show up.

I wonder if we have any avenue to file a bug with Red Hat. I have a very
basic subscription which only lets me look at their KB, but I could
upgrade it - then again, since I'm running a clone product I probably don't
have a viable report.

Graham

On 7/17/2015 6:21 AM, Ahmed Kamal wrote:
> Hi Graham,
>
> So my RHEL5 boxes certainly have trouble with nfs4 .. I'm running about
> 20 boxes and almost all of them develop a choking process every day or
> two. I'm now in the process of upgrading our RHEL boxes to v6.x .. This
> is motivated to migrate to NFS4.1, although now that you say NFS4 is
> more tolerant on EL6, I might just remain on that. So far I did one week
> of basic testing of a VM on el6 with nfs4.1 vs my FreeBSD 10.1, so far I
> didn't hit problems (although the testing was light). Next week, I'll
> probably upgrade one of our production machines to el6 and see how it fares.
>
> PS: I had upgraded our el5 box with elrepo kernel (v3.2) .. which I
> thought would be way much newer (even newer than el6) .. But I still had
> trouble with it .. so I reverted to stock el5 kernel! Not sure if this
> means Linux is not the only component at fault ?!
>


--
-------------------------------------------------------------------------
Graham Allan - g...@umn.edu - al...@physics.umn.edu

Rick Macklem

Jul 17, 2015, 3:31:11 PM

Graham Allan wrote:
> I'm curious how things are going for you with this?
>
> Reading your thread did pique my interest since we have a lot of
> Scientific Linux (RHEL clone) boxes with FreeBSD NFSv4 servers. I meant
> to glance through our logs for signs of the same issue, but today I
> started investigating a machine which appeared to have hung processes,
> high rpciod load, and high traffic to the NFS server. Of course it is
> exactly this issue.
>
> The affected machine is running SL5 though most of our server nodes are
> now SL6. I can see errors from most of them but the SL6 systems appear
> less affected - I see a stream of the sequence-id errors in their logs but
> things in general keep working. The one SL5 machine I'm looking at
> has a single sequence-id error in today's logs, but then goes into a
> stream of "state recovery failed" then "Lock reclaim failed". It's
> probably partly related to the particular workload on this machine.
>
Normally, reclaims for Opens and byte range Locks only work after a server
reboot (during the grace time period just after reboot). However, Opens can
be re-acquired via a normal (non-reclaim) open.

I don't know what the Linux client does in this case, but I suspect it might
be doing a normal Open when the reclaim one fails.
--> As such, things would start working normally again, but byte range Locks
    would get lost.
- I suspect the high load on rpciod occurs when a read-ahead or write-behind
  keeps retrying with the "stale" stateid. (Maybe that behaviour was changed
  for SL6?)

> I would try switching our SL6 machines to NFS 4.1 to see if the
> behaviour changes, but 4.1 isn't supported by our 9.3 servers (is it in
> 10.1?).
>
Yes, 10.1 has an NFSv4.1 server in it. (Since NFSv4.1 doesn't use the seqid#s, it
shouldn't have the "bad seqid" errors.)

Rick Macklem

Jul 17, 2015, 3:32:25 PM

Graham Allan wrote:
> I'm curious how things are going for you with this?
>
> Reading your thread did pique my interest since we have a lot of
> Scientific Linux (RHEL clone) boxes with FreeBSD NFSv4 servers. I meant
> to glance through our logs for signs of the same issue, but today I
> started investigating a machine which appeared to have hung processes,
> high rpciod load, and high traffic to the NFS server. Of course it is
> exactly this issue.
>
> The affected machine is running SL5 though most of our server nodes are
> now SL6. I can see errors from most of them but the SL6 systems appear
> less affected - I see a stream of the sequence-id errors in their logs but
> things in general keep working. The one SL5 machine I'm looking at
> has a single sequence-id error in today's logs, but then goes into a
> stream of "state recovery failed" then "Lock reclaim failed". It's
> probably partly related to the particular workload on this machine.
>
> I would try switching our SL6 machines to NFS 4.1 to see if the
> behaviour changes, but 4.1 isn't supported by our 9.3 servers (is it in
> 10.1?).
>
Btw, I've done some testing against a fairly recent Fedora and haven't seen
the problem. If either of you guys could load a recent Fedora on a test client
box, it would be interesting to see if it suffers from this. (My experience is
that the Fedora distros have more up to date Linux NFS clients.)

Ahmed Kamal via freebsd-fs

Jul 20, 2015, 11:27:24 PM

Hi folks,

I've upgraded a test client to rhel6 today, and I'll keep an eye on it to
see what happens.

During the process, I made the (I guess) mistake of doing a zfs send | recv
to a locally attached usb disk for backup purposes .. long story short, the
sharenfs property on the received filesystem was causing some nfs/mountd
errors in the logs .. I wasn't too happy with what I got .. I destroyed the
backup datasets and eventually the whole pool .. and then rebooted the whole
nas box .. After the reboot my logs are still flooded with

Jul 21 05:12:36 nas kernel: nfsrv_cache_session: no session
Jul 21 05:13:07 nas last message repeated 7536 times
Jul 21 05:15:08 nas last message repeated 29664 times

Not sure what that means .. or how it can be stopped .. Anyway, will keep
you posted on progress.

Ahmed Kamal via freebsd-fs

Jul 20, 2015, 11:52:59 PM

More info .. Just noticed nfsd is spinning the cpu at 500% :( I just did
the dtrace with:

dtrace -n 'profile-1001 { @[stack()] = count(); }'
The result is at http://paste2.org/vb8ZdvF2 (scroll to bottom)

Since rebooting the nfs server didn't fix it .. I imagine I'd have to
reboot all NFS clients .. This would be really sad .. Any advice is most
appreciated .. Thanks



Ahmed Kamal via freebsd-fs

Jul 21, 2015, 12:51:43 AM

The rhel6 servers' logs were flooded with errors like: http://paste2.org/EwLGcGF6
The FreeBSD box was being pounded with 40Mbps of nfs traffic .. probably
Linux was retrying too hard ?! I had to reboot all the PCs, and after the last
one, nfsd CPU usage dropped immediately to zero.


Rick Macklem

Jul 21, 2015, 8:25:48 PM

Ahmed Kamal wrote:
> rhel6 servers logs were flooded with errors like: http://paste2.org/EwLGcGF6
> The Freebsd box was being pounded with 40Mbps of nfs traffic .. probably
> Linux was retrying too hard ?! I had to reboot all PCs and after the last
> one, nfsd CPU usage dropped immediately to zero
>
The error 10052 is NFS4ERR_BAD_SESSION. For the Destroy_Session operation it
could be generated for several reasons, but the most likely is that the Destroy_session
operation isn't the last one in the compound (as required by RFC-5661).
Snippet of RFC-5661:
   If the COMPOUND request starts with SEQUENCE, and if the sessionids
   specified in SEQUENCE and DESTROY_SESSION are the same, then

   o  DESTROY_SESSION MUST be the final operation in the COMPOUND
      request.

If it happens again, capture packets for a short period of time (just need
one of the RPC requests with Destroy_session in it). When you look at the
RPC request in wireshark, if it isn't the last operation in the compound,
then that's why this is happening (and broken w.r.t. the RFC, I suspect).

rick
ps: If you email me a small raw packet capture with this RPC in it, I can
take a look at it.
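
To illustrate the ordering rule quoted above, here is a minimal sketch of the
check a server could apply to a decoded compound. The op numbers match RFC
5661, but the structures and function are illustrative, not the FreeBSD
server's own code:

    #include <stdbool.h>
    #include <stddef.h>

    /* Operation numbers from RFC 5661. */
    enum nfs_op { OP_SEQUENCE = 53, OP_DESTROY_SESSION = 44 };

    struct compound_op {
        enum nfs_op op;
        /* session id and other arguments omitted in this sketch */
    };

    /*
     * If a compound that starts with SEQUENCE also carries DESTROY_SESSION
     * for the same session, DESTROY_SESSION must be the final operation
     * (the session-id comparison is omitted here).
     */
    static bool
    destroy_session_placement_ok(const struct compound_op *ops, size_t nops)
    {
        for (size_t i = 0; i < nops; i++) {
            if (ops[i].op == OP_DESTROY_SESSION && i != nops - 1)
                return false;   /* violates the RFC; reply with an error */
        }
        return true;
    }

    int
    main(void)
    {
        struct compound_op ops[] = { { OP_SEQUENCE }, { OP_DESTROY_SESSION } };
        return destroy_session_placement_ok(ops, 2) ? 0 : 1;
    }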

Rick Macklem

Jul 23, 2015, 6:41:29 AM

Ahmed Kamal wrote:
> rhel6 servers logs were flooded with errors like: http://paste2.org/EwLGcGF6
> The Freebsd box was being pounded with 40Mbps of nfs traffic .. probably
> Linux was retrying too hard ?! I had to reboot all PCs and after the last
> one, nfsd CPU usage dropped immediately to zero
>
Btw, it would be interesting to know what triggers these things (overload of
the nfs server resulting in very slow response or ???). Basically Destroy_session
isn't an operation that a client would normally do. I have no idea why the Linux
client would do it. (A session is what achieves the "exactly once" semantics for
the RPCs. It should really be in the RPC layer, but the NFSv4 working group put
it in NFSv4.1 because they didn't want to replace Sun RPC. I can't think of a reason
to destroy a session except on dismount. Maybe if the client thinks the session is
broken for some reason??)

Maybe something like "vmstat -m", "vmstat -z" and "nfsstat -s -e" running repeatedly
(once/sec with timestamps via "date" or similar) so that you can see what was happening just
before the meltdowns.

A raw packet trace of just when the meltdown starts would be useful, but I can't think
of how you'd get one of reasonable size. Maybe having "tcpdump -s 0 -w <file>.pcap <client-host>"
run for 1sec and then kill/restart it repeatedly with different file names, so you might get
a useful 1sec capture at the critical time?

Anyhow, good luck with it, rick

Ahmed Kamal via freebsd-fs

Jul 23, 2015, 11:27:33 AM

Well .. The problem is now gone, so I guess I can't collect more data till
it happens (or hopefully doesn't :) happen again .. As I described, I had to
restart the FreeBSD NFS server box first .. maybe this caused the linux
clients to give up after 5 mins and attempt to destroy the session ? When
the NFS server was back up, it was being bombarded (50Mbps of traffic) with
rpc requests, probably carrying this "destroy session" message.

What I don't understand, however, is why this doesn't end. What does FreeBSD
reply with? Shouldn't it say: okay, I don't know anything about this
session, so consider it destroyed .. suit yourself, linux .. or does it
refuse to destroy it, causing Linux to keep on retrying like crazy ?

Mehmet Erol Sanliturk

Jul 23, 2015, 1:55:42 PM

On Thu, Jul 23, 2015 at 8:26 AM, Ahmed Kamal via freebsd-fs <
freeb...@freebsd.org> wrote:

> Well .. The problem is now gone, so I guess I can't collect more data till
> it happens (or hopefully doesn't :) happen again .. So as I described, I
> had to restart the FreeBSD NFS server box first .. maybe this caused linux
> clients to give up after 5 mins, and attempt to destroy the session ? When
> the NFS server was back up .. It was being bombarded (50Mbps traffic) with
> rpc traffic, probably saying this "destroy session" message.
>
> What I don't understand however is, why doesn't this end. What does FreeBSD
> reply with? Shouldn't it say, Okay, I don't know anything about this
> session, so consider it destroyed .. suit yourself linux .. or does it
> refuse to destroy, causing Linux to keep on retrying like crazy ?
>
>

My opinion is that there is a problem in the latest Linux NFS client: it
spends too much time communicating with the Linux server. For that reason I
have switched back to Fedora 18 as the client, because this "fighting" with
the server is visible from the switch lights, and a response only arrives
after a long burst of activity; it makes no sense to need so much activity
just to get a response.

The server is Fedora 19.

Mehmet Erol Sanliturk

Graham Allan

Jul 23, 2015, 2:13:36 PM

For our part, the user whose code triggered the pathological behaviour
on SL5 reran it on SL6 without incident - I still see lots of
sequence-id errors in the logs, but nothing bad happened.

I'd still like to ask them to rerun it on SL5 to see if the "accept
skipped seqid" patch had any effect, though I think we expect not. Maybe
it would be nice if I could get set up to capture rolling tcpdumps of
the nfs traffic before they run that though...

Graham

On 7/20/2015 10:26 PM, Ahmed Kamal wrote:
> Hi folks,
>
> I've upgraded a test client to rhel6 today, and I'll keep an eye on it
> to see what happens.
>
> During the process, I made the (I guess mistake) of zfs send | recv to a
> locally attached usb disk for backup purposes .. long story short,
> sharenfs property on the received filesystem was causing some nfs/mountd
> errors in logs .. I wasn't too happy with what I got .. I destroyed the
> backup datasets and the whole pool eventually .. and then rebooted the
> whole nas box .. After reboot my logs are still flooded with
>
> Jul 21 05:12:36 nas kernel: nfsrv_cache_session: no session
> Jul 21 05:13:07 nas last message repeated 7536 times
> Jul 21 05:15:08 nas last message repeated 29664 times
>
> Not sure what that means .. or how it can be stopped .. Anyway, will
> keep you posted on progress.


--
-------------------------------------------------------------------------
Graham Allan - g...@umn.edu - al...@physics.umn.edu

Rick Macklem

Jul 23, 2015, 5:53:22 PM

Graham Allan wrote:
> For our part, the user whose code triggered the pathological behaviour
> on SL5 reran it on SL6 without incident - I still see lots of
> sequence-id errors in the logs, but nothing bad happened.
>
> I'd still like to ask them to rerun again on SL5 to see if the "accept
> skipped seqid" patch had any effect, though I think we expect not. Maybe
> it would be nice if I could get set up to capture rolling tcpdumps of
> the nfs traffic before they run that though...
>
> Graham
>
> On 7/20/2015 10:26 PM, Ahmed Kamal wrote:
> > Hi folks,
> >
> > I've upgraded a test client to rhel6 today, and I'll keep an eye on it
> > to see what happens.
> >
> > During the process, I made the (I guess mistake) of zfs send | recv to a
> > locally attached usb disk for backup purposes .. long story short,
> > sharenfs property on the received filesystem was causing some nfs/mountd
> > errors in logs .. I wasn't too happy with what I got .. I destroyed the
> > backup datasets and the whole pool eventually .. and then rebooted the
> > whole nas box .. After reboot my logs are still flooded with
> >
> > Jul 21 05:12:36 nas kernel: nfsrv_cache_session: no session
> > Jul 21 05:13:07 nas last message repeated 7536 times
> > Jul 21 05:15:08 nas last message repeated 29664 times
> >
> > Not sure what that means .. or how it can be stopped .. Anyway, will
> > keep you posted on progress.
>
Oh, I didn't see the part about "reboot" before. Unfortunately, it sounds like the
client isn't recovering after the session is lost. When the server reboots, the
client(s) will get NFS4ERR_BAD_SESSION errors back because the server reboot has
deleted all sessions. The NFS4ERR_BAD_SESSION should trigger state recovery on the client.
(It doesn't sound like the clients went into recovery, starting with a Create_session
operation, but without a packet trace, I can't be sure?)

rick

Ahmed Kamal via freebsd-fs

Jul 23, 2015, 5:55:39 PM

Can you please let me know the ultimate packet trace command I'd need to
run in case of any nfs4 troubles .. I guess this should be comprehensive
even at the expense of a larger output size (which we can trim later)..
Thanks a lot for the help!

Rick Macklem

Jul 23, 2015, 5:59:34 PM

Ahmed Kamal wrote:
> Can you please let me know the ultimate packet trace command I'd need to
> run in case of any nfs4 troubles .. I guess this should be comprehensive
> even at the expense of a larger output size (which we can trim later)..
> Thanks a lot for the help!
>
tcpdump -s 0 -w <file>.pcap host <client-host-name>
(<file> refers to a file name you choose and <client-host-name> refers to
the host name of a client generating traffic.)
--> But you won't be able to allow this to run for long during the storm or the
file will be huge.

Then you look at <file>.pcap in wireshark, which knows NFS.

rick

Ahmed Kamal via freebsd-fs

Jul 28, 2015, 8:48:12 AM

Hi again Rick,

Seems that I'm still being unlucky with nfs :/ I caught one of the newly
installed RHEL6 boxes having high CPU usage and bombarding the BSD NFS box
with 10Mbps of traffic .. I captured a tcpdump as you suggested .. You can
download it here:

https://dl.dropboxusercontent.com/u/51939288/nfs41-high-client-cpu.pcap.bz2

I didn't restart the client yet .. so if you catch me in the next few hours
and want me to run any diagnostics, let me know. Thanks a lot all for
helping

Rick Macklem

Jul 28, 2015, 7:39:47 PM

Ahmed Kamal wrote:
> Hi again Rick,
>
> Seems that I'm still being unlucky with nfs :/ I caught one of the newly
> installed RHEL6 boxes having high CPU usage, and bombarding the BSD NFS box
> with 10Mbps traffic .. I caught a tcpdump as you mentioned .. You can
> download it here:
>
> https://dl.dropboxusercontent.com/u/51939288/nfs41-high-client-cpu.pcap.bz2
>
Ok, the packet trace suggests that the NFSv4 server is broken (it is replying
with NFS4ERR_STALE_CLIENTID for a recently generated ClientID).
Now, I can't be sure, but the only explanation I can come up with is...
- For some arches (I only have i386, so I wouldn't have seen this during testing),
time_t is 64bits (uint64_t).
--> If time_seconds somehow doesn't fit in the low order 32bits, then the code
would be busted for these arches because nfsrvboottime is set to time_seconds
when the server is started and then there are comparisons like:
          if (nfsrvboottime != clientid.lval[0])
                  return (NFSERR_STALECLIENTID);
          /* where clientid.lval[0] is a uint32_t */
Anyhow, if this is what is happening, the attached simple patch should fix it.
(I don't know how time_seconds would exceed 4billion, but the clock code is
pretty convoluted, so I can't say if it can possibly happen?)

rick
ps: Hmm, on i386 time_seconds ends up at 1438126486, so maybe it can exceed
4*1024*1024*1024 - 1 on amd64?
64bitboottime.patch
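
A standalone sketch of the failure mode described above: when the boot time
sits in a 64-bit variable whose high 32 bits are non-zero, comparing it
directly against the 32-bit clientid word always looks "stale", while the
cast from the patch restores the intended behaviour. The variable names are
illustrative, not the kernel's:

    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
        /* Pretend the high 32 bits somehow became non-zero. */
        uint64_t nfsrvboottime = 0x100000000ULL + 1438126486ULL;
        /* What gets handed to the client and echoed back: the low 32 bits. */
        uint32_t lval0 = (uint32_t)nfsrvboottime;

        /* Unpatched comparison: lval0 is promoted to 64 bits, so this is
         * always "stale" once the high bits are set. */
        printf("unpatched: %s\n",
            (nfsrvboottime != lval0) ? "NFSERR_STALECLIENTID" : "ok");

        /* Patched comparison: truncate the boot time first. */
        printf("patched:   %s\n",
            ((uint32_t)nfsrvboottime != lval0) ? "NFSERR_STALECLIENTID" : "ok");
        return (0);
    }

Compiled and run, the first line prints NFSERR_STALECLIENTID and the second
prints ok.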

Ahmed Kamal via freebsd-fs

Jul 29, 2015, 5:34:15 AM

hmm, if I understand you correctly, this time_seconds value is the number
of seconds till the box booted ? If so, I guess this is not really the
cause of what we're seeing as the box is only up for 8 days

bsd# uptime
11:28AM up 8 days, 6:20, 6 users, load averages: 0.94, 0.91, 0.84

The NFS client box's uptime is
linux# uptime
11:31:39 up 8 days, 5:51, 11 users, load average: 87.74, 87.43, 87.35

and yes the huge load is most likely due to this NFS bug

Rick Macklem

Jul 29, 2015, 7:35:48 AM

Ahmed Kamal wrote:
> hmm, if I understand you correctly, this time_seconds value is the number
> of seconds till the box booted ?
No, it is the number of seconds when the box booted. Once, it was supposed to
be the number of seconds since Jan. 1, 1970, but I don't know if that is still the
case. (For my i386, it is about 1.4billion when it boots, so I'd guess they
still use Jan. 1, 1970. 3600*24*365*45 = 1419120000. Yea, I didn't bother with
leap years, etc.)

Now, I don't know if the clock could somehow set it to a value > 4billion when
the nfsd starts (it copies time_seconds to nfsrvboottime as it starts up), but the
current clock code is pretty convoluted stuff, so??

rick
ps: From the NFSv4 server's point of view, it only needs a number that is unique and
changes every time the server reboots. As such, using the low order 32bits of
it would be sufficient, even if it exceeds 4billion. However, the code incorrectly
assumes it won't exceed 4*1024*1024*1024 - 1 unless you apply the patch.

Ahmed Kamal via freebsd-fs

Jul 29, 2015, 7:51:46 AM

hmm Thanks Rick ..

You mentioned the error appears when nfsrvboottime != clientid.lval[0] .. I
understand nfsrvboottime is the number of seconds since the epoch (1970) .. Can
you please explain what clientid.lval[0] is, and (if it comes from the
client?) what guarantees it should equal nfsrvboottime ?

Apart from trying to understand the problem, can you send me a small C
program that runs the same code that computes nfsrvboottime and writes the
value to the terminal window? I would like to avoid testing a kernel patch on
this system since it runs in production. And the last time I rebooted the nfs
server, I ended up having to reboot all clients (every single workstation),
so that was painful .. So if we just want to know whether the number is bigger
than 4 billion or not, I think this small app can help us get this value,
right?

Rick Macklem

Jul 29, 2015, 5:00:58 PM

Ahmed Kamal wrote:
> hmm Thanks Rick ..
>
> You mentioned the error appears when nfsrvboottime != clientid.lval[0] .. I
> understand nfsrvboottime is number of seconds since the epoch (1970) .. Can
> you please explain what clientid.lval[0] is, and (if it comes from the
> client?) what guarantees it should equal nfsrvboottime ?
>
The low order 32bits of nfsrvboottime is assigned to clientid.lval[0] when a
new clientid is generated and then sent to a client. The client simply puts the
clientid in requests (the bits are opaque to the client). Since nfsrvboottime
was assumed by the code to fit in 32bits and does not change (it is set from
time_seconds once during startup), this check detects if the clientid that the
client has was acquired from the server before the most recent reboot of the
server. (Since the server didn't reboot between ExchangeID and Create_session
in the packet trace, the value should be the same as what is in nfsrvboottime.)

However, if nfsrvboottime is 64bits (which is what time_t is on some arches) and
somehow has a non-zero high order 32bits, the value will be != because clientid.lval[0]
is a uint32_t. (The patch just casts nfsrvboottime to (uint32_t) to make the comparison
ignore the high order bits of nfsrvboottime.)
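
A sketch of the layout described above, assuming the clientid is a 64-bit
value viewed as two 32-bit words (low-order boot time in lval[0], a
per-client counter in lval[1]); the types here are illustrative rather than
the actual FreeBSD kernel definitions:

    #include <stdint.h>

    /* Opaque-to-the-client 64-bit clientid, viewed as two 32-bit words. */
    union clientid {
        uint64_t qval;
        uint32_t lval[2];
    };

    /* Server side: build a clientid from the boot time and a counter. */
    static union clientid
    make_clientid(uint64_t nfsrvboottime, uint32_t counter)
    {
        union clientid cid;

        cid.lval[0] = (uint32_t)nfsrvboottime;  /* low 32 bits of boot time */
        cid.lval[1] = counter;                  /* distinguishes clients */
        return cid;
    }

    /* Server side: a clientid issued before the last reboot is stale.
     * The (uint32_t) cast is the fix described above for 64-bit time_t. */
    static int
    clientid_is_stale(union clientid cid, uint64_t nfsrvboottime)
    {
        return ((uint32_t)nfsrvboottime != cid.lval[0]);
    }

    int
    main(void)
    {
        union clientid cid = make_clientid(1438126486ULL, 1);
        return clientid_is_stale(cid, 1438126486ULL);  /* 0: not stale */
    }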

> Apart from trying to understand the problem. Can you send me a small c
> program that runs the same code that computes nfsrvboottime and writes that
> to the terminal window. I would like to avoid testing a kernel patch on
> this system since it runs in production. And last time I rebooted the nfs
> server, I ended up having to reboot all clients (every single workstation)
> so that was painful .. So if we just want to know if the number if bigger
> than 4 billion or not, I think this small app can help us get this value
> right ?
>
Since it is in the kernel and declared "static" I don't know of a way for a C app to manipulate it.
(I just booted with a printf in the kernel for it, to see what it was set to.)

I just spotted a place in the code that allocates clientids (nfsrv_setclient()) that
looks broken when the entry exists on the last hash table element.

I will email you a patch for this. To be honest, the only way this will get resolved
is if you can test the patch(es). I understand that rebooting the server isn't
attractive (and ideally is done only when all clients are dismounted from it), but
this is all I can suggest. (Maybe create a kernel with the patches on the server and
then reboot with the new kernel when it crashes or has to be rebooted for some other
reason? This is no rush for me, so it depends on how problematic the issue is for you?)

rick

Rick Macklem

Jul 29, 2015, 6:48:06 PM

Ahmed Kamal wrote:
> hmm Thanks Rick ..
>
> You mentioned the error appears when nfsrvboottime != clientid.lval[0] .. I
> understand nfsrvboottime is number of seconds since the epoch (1970) .. Can
> you please explain what clientid.lval[0] is, and (if it comes from the
> client?) what guarantees it should equal nfsrvboottime ?
>
> Apart from trying to understand the problem. Can you send me a small c
> program that runs the same code that computes nfsrvboottime and writes that
> to the terminal window. I would like to avoid testing a kernel patch on
> this system since it runs in production. And last time I rebooted the nfs
> server, I ended up having to reboot all clients (every single workstation)
> so that was painful .. So if we just want to know if the number if bigger
> than 4 billion or not, I think this small app can help us get this value
> right ?
>
Ok, I took a closer look at the packet trace (thanks for creating it) and I saw
that the clientid value being returned by the server was bogus. (The field that
shouldn't change except when the server is rebooted was changing.)

Given the above, I think I did find the bug and the attached patch should fix it.
(This is NFSv4.1 specific and has nothing to do with the NFSv4.0 seqid issue.)
The patch actually fixes 3 things, although I don't think the other 2 would affect
you in practice:
1 - When a confirmed clientid already exists, nfsrv_setclient() wasn't setting the
clientidp argument, so the reply included garbage off the stack.
I think this is what caused your problem.
2 - If the high order 32bits of the nfsrvboottime is non-zero, the comparisons with
clientid.lval[0] would fail. I don't think this should actually happen until the
year 2138, but is fixed in the patch.
3 - It was possible to leave an unconfirmed duplicate clientid structure in the list
when the match in nfsrv_setclient() found it in the last hash list. I do not think
this would have caused any problem, since the new one would be at the head of the
list. The old one would eventually have been scavenged and cleared out, although it
would have occupied storage until then.

The attached patch fixes all of these and I think will fix your problem. Maybe you
can create a patched kernel and then find a time to reboot the server someday?

If any other reader is using the NFSv4 server, please test this patch if possible.

Thanks for creating the packet trace and sorry about this bug causing you grief, rick
ps: I'd guess you are one of the first serious users of the NFSv4.1 server, but hopefully
it will behave ok for you with this patch.
nfsv41exch.patch
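
As an illustration of fix (1), here is a toy version of the bug: an
out-parameter that is only written on some code paths hands back whatever
happened to be on the caller's stack. The function shape is hypothetical;
the real change is inside nfsrv_setclient():

    #include <stdint.h>

    struct clid { uint32_t lval[2]; };

    /*
     * Toy version of the confirmed-clientid path: before the fix, the
     * early-return path never wrote *clientidp, so the caller copied
     * uninitialized stack memory into the reply.
     */
    static int
    setclient(int confirmed, uint32_t boottime_low, uint32_t counter,
        struct clid *clientidp)
    {
        /* The fix: fill in the reply clientid on every path. */
        clientidp->lval[0] = boottime_low;
        clientidp->lval[1] = counter;
        if (confirmed)
            return (0);   /* pre-fix code returned here without the store */
        /* ... allocate and link a new clientid structure ... */
        return (0);
    }

    int
    main(void)
    {
        struct clid cid;
        return setclient(1, 1438126486U, 1, &cid);
    }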