
NFS performance limited due to very high kernel% usage


Renaat

Sep 11, 2002, 5:54:43 AM

** please reply to the group; if not, remember to leave out the "spam" part
in the email address **

Hi all,


I have a question regarding NFS and the excessive kernel% that my clients
encounter.

I have a quad-CPU E4500 server acting as the NFS server, using 2 mirrored T3
trays with loads of disks configured in RAID5. Total volume size is about
200GB. I also have a NetApp F760 for testing. These nodes are hooked up to
our network using Gigabit Ethernet.

I have several NFS clients, ranging from a T1 and an E220 to another E4500
(4 CPUs), which all run a mail application that accesses mailboxes over NFS
on the storage nodes mentioned above.

What I see is that when I stress test, the kernel activity on the clients
goes way up, to as much as 75%. It doesn't matter whether the client has one
or multiple CPUs... I would expect everything to go up, but not in this
proportion...

When I look at "iostat -xn 4" on the client, the NFS mount doesn't seem to
be that busy:

                             extended device statistics
   r/s   w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
 302.0   0.0  395.5    0.0  0.0  0.3    0.0    0.9   0  23 192.168.13.33:/global/emtc01-ds2/nfs/data
 353.3   0.0  403.8    0.0  0.0  0.3    0.0    0.9   0  28 192.168.13.33:/global/emtc01-ds2/nfs/data
 366.0   0.0  429.0    0.0  0.0  0.3    0.0    0.9   0  29 192.168.13.33:/global/emtc01-ds2/nfs/data
 351.5   0.0  423.2    0.0  0.0  0.3    0.0    0.9   0  27 192.168.13.33:/global/emtc01-ds2/nfs/data
 352.5   0.0  424.2    0.0  0.0  0.3    0.0    1.0   0  29 192.168.13.33:/global/emtc01-ds2/nfs/data
 325.5   0.0  409.5    0.0  0.0  0.3    0.0    0.9   0  24 192.168.13.33:/global/emtc01-ds2/nfs/data
 328.6   0.0  408.4    0.0  0.0  0.3    0.0    0.9   0  26 192.168.13.33:/global/emtc01-ds2/nfs/data
 314.6   0.0  401.1    0.0  0.0  0.3    0.0    0.9   0  25 192.168.13.33:/global/emtc01-ds2/nfs/data
 291.9   0.0  376.6    0.0  0.0  0.3    0.0    0.9   0  23 192.168.13.33:/global/emtc01-ds2/nfs/data
 333.1   0.0  392.4    0.0  0.0  0.3    0.0    0.9   0  26 192.168.13.33:/global/emtc01-ds2/nfs/data
 341.5   0.0  402.9    0.0  0.0  0.3    0.0    0.9   0  27 192.168.13.33:/global/emtc01-ds2/nfs/data
 325.3   0.0  395.1    0.0  0.0  0.3    0.0    1.0   0  28 192.168.13.33:/global/emtc01-ds2/nfs/data
 333.0   0.0  407.6    0.0  0.0  0.3    0.0    0.9   0  26 192.168.13.33:/global/emtc01-ds2/nfs/data
 325.5   0.0  414.8    0.0  0.0  0.3    0.0    0.9   0  25 192.168.13.33:/global/emtc01-ds2/nfs/data

I'm guessing this probably has to do with NFS tuning and the like, but what
exactly? Does any of you have an explanation for this?

On the storage node side, the disks aren't busy at all! There's no iowait%;
there is some kernel activity, but only a normal 10% of total CPU capacity,
so that seems acceptable.

                       extended device statistics
device      r/s    w/s   kr/s   kw/s  wait  actv  svc_t  %w  %b
2/md120     0.0  111.0    0.0  222.0   0.0   3.0   27.4   0   5
2/md120     0.0  109.5    0.0  222.0   0.0   2.8   25.4   0   5
2/md120     0.0  110.0    0.0  220.0   0.0   2.9   26.6   0   5
2/md120     0.0  110.5    0.0  221.0   0.0   3.0   27.5   0   5
2/md120     0.0  117.0    0.0  236.3   0.0   3.2   27.1   0   5


Thanks for throwing in some tuning hints or experiences!

Regards,

Renaat Dumon


Renaat

Sep 12, 2002, 3:10:22 AM

anyone ? :-(

"Renaat" <renaat_stopt...@hotmail.com> wrote in message
news:3d7f12db$0$189$ba62...@news.skynet.be...

Dan

Sep 12, 2002, 3:15:01 PM

"Renaat" <renaat_stopt...@hotmail.com> wrote in message
news:3d803de2$0$194$ba62...@news.skynet.be...
> anyone ? :-(
<snip>

> >
> > What I see is that when I stress test, the kernel activity on the
> > clients goes way up, to as much as 75%. It doesn't matter whether the
> > client has one or multiple CPUs... I would expect everything to go up,
> > but not in
<snip>

I just made the connection between the stress test and the use as a mail
server. Is the stress test "dumping" huge amounts of mail onto the servers?
I am wondering how sendmail reacts to a huge load of new mail. Depending on
your configuration, you can set it up to do a lot: for example any kind of
spam strategies, rejecting specific originators, screening subject lines. Is
that what is taking up the CPU? Just wondering if something is spawning an
extra process for each mail received, and when they all show up at once they
bog down the system. By that time they have made their own copy of the
message, so no additional activity shows up on NFS.

Just shooting in the dark here!

Good Luck!


Renaat

Sep 13, 2002, 3:02:00 AM

Hi, thanks for the replies,

but ... :-)

I'm not running sendmail (I work for an ISP and we're trying to set up a new
platform, so I have full control, no customers to worry about). The vendor
uses this architecture because it's scalable: one can add front-ends to
scale for processing power, and one can add storage nodes (NetApps, or any
NFS-capable config) for higher disk performance or for load spreading. I
know of customers of this vendor that are serving 7,000,000 mailboxes with
this architecture, with a couple more machines, granted, but still...

I'm not using any spam filtering techniques, which would indeed take up
CPU... Even if I did, the %usr would go up! Instead, the %kernel is way
high! I've seen it go up to 75%, and all that was left went to %usr (the
software itself).

The way I see it, it comes down to this:

1) Either the front-end is taking up considerable resources handling network
traffic (SMTP/POP3), but I doubt it, since a "netstat -an" shows the number
of parallel connections I fire at the box... There are no TIME_WAITs. So
we're talking about some 50 established connections.

2) NFS client handling is taking up all those resources... When I do an
"iostat -xn" and look at the NFS mount, I can see its reported %b is 100
(so 100% busy), so that's where my bottleneck lies. However, network
traffic isn't excessive, maybe 1MB/s in+out. I configured the NFS mount
with the rsize=8192,wsize=8192 options (see the mount line sketched below),
but that doesn't change a thing; the network here is quite good and
performant enough to keep up with large packet trains.
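
For reference, the mount is done more or less like this (a sketch from
memory: the local mount point /data and the vers/proto options are
illustrative, only the rsize/wsize part is the actual change I made):

mount -F nfs -o vers=3,proto=tcp,rsize=8192,wsize=8192 \
    192.168.13.33:/global/emtc01-ds2/nfs/data /data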

I think some NFS tuning could be in order, but I don't have any knowledge or
resources for this... What could cause %kernel to be that high (apart from
the requests that I'm making)? How do I maximise throughput capacity?

Thanks again,

Renaat

Michael Vilain <vil...@spamcop.net> wrote in message
news:news-E6083D.1...@news.tdl.com...
> In article <xJ5g9.6569$e44.3...@news4.srv.hcvlny.cv.net>,

> IIRC, sendmail will fork up to N copies of itself, each of which will
> process a mail message. If there are more than N copies, sendmail stops
> forking new ones until an existing copy finishes and exits. If there's
> some sort of wait while sendmail has to automount a user's home
> directory (to see if they have a $HOME/.forward), then this could add to
> the delay. Ideally, I would think that sendmail should not be
> processing NFS mounted /var/mail or /var/spool/mqueue directories.
> Clients accessing the mailserver:/var/mail mount point via an automount
> shouldn't be that much of a drain.
>
> --
> DeeDee, don't press that button! DeeDee! NO! Dee...


Brian

Sep 15, 2002, 7:53:51 AM

"Renaat" <renaat_stopt...@hotmail.com> wrote in message
news:3d818d6e$0$194$ba62...@news.skynet.be...

> Hi, thanks for the replies,
>
> I think some NFS tuning could be in order, but I don't have any knowledge
> or resources for this... What could cause %kernel to be that high (apart
> from the requests that I'm making)? How do I maximise throughput capacity?
>
> Thanks again,
>
> Renaat

Ripped out of the NFS Server Performance and Tuning Guide for Sun™
Hardware

For improved performance, NFS server configurations should set the number of
NFS threads. Each thread is capable of processing one NFS request. A larger
pool of threads enables the server to handle more NFS requests in parallel.
The default setting of 16 in Solaris 2.4 through Solaris 8 software
environments results in slower NFS response times. Scale the setting with
the number of processors and networks, and increase the number of NFS server
threads by editing the invocation of nfsd in /etc/init.d/nfs.server:
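
As a sketch of what that edit looks like (the stock line may differ slightly
per Solaris release, and the value 64 here is purely illustrative):

# stock invocation shipped in /etc/init.d/nfs.server
/usr/lib/nfs/nfsd -a 16
# raised thread count, e.g. for a multi-CPU server on a fast network
/usr/lib/nfs/nfsd -a 64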

Extra NFS threads do not cause a problem.

To set the number of NFS threads, take the maximum of the following three
suggestions (a worked example follows the list):
- Use 2 NFS threads for each active client process.
- Use 16 to 32 NFS threads for each CPU. Use roughly 16 for a SPARCclassic
  or a SPARCstation 5 system; use 32 NFS threads for a system with a 60 MHz
  SuperSPARC processor.
- Use 16 NFS threads for each 10 Mbits of network capacity. For example, if
  you have one SunFDDI™ interface, set the number of threads to 160. With
  two SunFDDI interfaces, set the thread count to 320, and so on.
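
As a rough back-of-the-envelope for the quad-CPU E4500 server described
earlier in the thread (assuming the ~50 client processes mentioned above,
and taking the per-10-Mbit rule literally for a gigabit interface, which
likely overshoots what is needed in practice):

# 2 per active client process:  2 * 50           = 100
# 32 per CPU:                   32 * 4           = 128
# 16 per 10 Mbits of capacity:  16 * (1000 / 10) = 1600
# maximum of the three, per the rule as written  = 1600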

The bufhwm variable, set in the /etc/system file, controls the maximum
amount of memory allocated to the buffer cache and is specified in Kbytes.
The default value of bufhwm is 0, which allows up to 2 percent of system
memory to be used. This can be increased up to 20 percent and may need to be
increased to 10 percent for a dedicated NFS file server with a relatively
small memory system. On a larger system, the bufhwm variable may need to be
limited to prevent the system from running out of the operating system
kernel virtual address space. The buffer cache is used to cache inode,
indirect block, and cylinder group related disk I/O only. The following is
an example of a buffer cache (bufhwm) setting in the /etc/system file that
can handle up to 10 Mbytes of cache. This is the highest value to which you
should set bufhwm.

set bufhwm=10240

To show the DNLC hit rate (cache hits), type vmstat -s.

% vmstat -s
... lines omitted
79062 total name lookups (cache hits 94%)
16 toolong

Directory names less than 30 characters long are cached and names that are
too long to be cached are also reported. A cache miss means that a disk I/O
may be needed to read the directory when traversing the path name components
to get to a file. A hit rate of less than 90 percent requires attention.
Cache hit rates can significantly affect NFS performance. getattr, setattr
and lookup usually represent greater than 50 percent of all NFS calls. If
the requested information isn't in cache, the request will generate a disk
operation that results in a performance penalty as significant as that of a
read or write request. The only limit to the size of the DNLC cache is
available kernel memory. If the hit rate (cache hits) is less than 90
percent and a problem does not exist with the number of longnames, tune the
ncsize variable which follows. The variable ncsize refers to the size of the
DNLC in terms of the number of name and vnode translations that can be
cached. Each DNLC entry uses about 50 bytes of extra kernel memory.

1. Set ncsize in the /etc/system file to a value higher than the default
(which is based on maxusers); a sketch follows these steps.
As an initial guideline, since dedicated NFS servers do not need a lot of
RAM, maxusers will be low and the DNLC will be small; double its size.
The default value of ncsize is: ncsize (name cache) = 17 * maxusers + 90.
2. For NFS server benchmarks, ncsize has been set as high as 16000.
3. For maxusers = 2048, ncsize works out to 34906.
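
A minimal /etc/system sketch, assuming for illustration that maxusers
defaulted to 128, so the default ncsize is 17 * 128 + 90 = 2266 and doubling
it gives 4532:

* double the default DNLC size for a (hypothetical) maxusers=128 machine
set ncsize=4532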


A memory-resident inode is used whenever an operation is performed on an
entity in the file system. The inode read from disk is cached in case it is
needed again. ufs_ninode is the size to which the UNIX file system attempts
to limit the list of idle inodes. You can have ufs_ninode set to 1 but have
10,000 active inodes. As active inodes become idle, if the number of idle
inodes goes over ufs_ninode, then memory is reclaimed by tossing out idle
inodes.

Every entry in the DNLC cache points to an entry in the inode cache, so both
caches should be sized together. The inode cache should be at least as big
as the DNLC cache. For best performance, it should be the same size in the
Solaris 2.4 through Solaris 8 operating environments.

Since it is just a limit, you can tweak ufs_ninode with adb (mdb) on a
running system with immediate effect. The only upper limit is the amount of
kernel memory used by the inodes. The tested upper limit corresponds to
maxusers = 2048, which is the same as ncsize at 34906.
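
A sketch of such a live tweak with adb (the value 4532 is the illustrative
figure used above; writes via adb take effect immediately, so be careful on
a production box):

echo "ufs_ninode/W 0t4532" | adb -kw /dev/ksyms /dev/mem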

To report the size of the kernel memory allocation use sar -k.

In the Solaris 2.4 operating environment, each inode uses 300 bytes of
kernel memory from the lg_mem pool.

In the Solaris 2.5.1, 2.6, 7, and 8 operating environments, each inode uses
320 bytes of kernel memory from the lg_mem pool. ufs_ninode is automatically
adjusted to be at least ncsize. Tune ncsize to get the hit rate up and let
the system pick the default ufs_ninode.


If the inode cache hit rate is below 90 percent, or if the DNLC requires
tuning for local disk file I/O workloads, take the following steps:
1. Increase the size of the inode cache.
2. Change the variable ufs_ninode in your /etc/system file to the same size
as the DNLC (ncsize); a combined sketch follows below.
For example, for the Solaris version 2.4 software, type: set ufs_ninode=5000
The default value of the inode cache is the same as that for ncsize:
ufs_ninode (default value) = 17 * maxusers + 90.
Caution – Do not set ufs_ninode less than ncsize. The ufs_ninode parameter
limits the number of inactive inodes, rather than the total number of active
and inactive inodes.
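
Putting the two together, a minimal /etc/system sketch using the
illustrative ncsize figure from earlier (substitute whatever value your hit
rate actually calls for):

set ncsize=4532
set ufs_ninode=4532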

See http://docs.sun.com/?q=nfs+tuning&p=/doc/806-2195-10 for the entire book

Brian
