System description:
Dual socket board: Tyan S2892, 2 * AMD Opteron 285 @ 2.6 GHz, 8 GB RAM,
Intel PRO/1000 MT Dual Port Server NIC, Areca ARC-1261 16-channel RAID
controller, with 3 sets of RAID 5 arrays attached:
System is running from: 4 * WD Raptor 150 GB (WDC WD1500ADFD-00NLR5)
VMware (XP) images used via NFS: 6 * WD Raptor 74 GB (WDC WD740GD-00FLA0)
Homes, diskless clients, appl. data: 4 * Hitachi 1 TB (HDE721010SLA330).
All filesystems are xfs. The server serves about 20 diskless PCs, most of
which use an Intel PRO/1000 GT NIC, all attached to a 3Com 3870 48-port
10/100/1000 switch.
OS is openSUSE 11.1/i586 with kernel 2.6.27.45 (the same kernel as SLE 11).
It serves mostly NFS and SMB, and does mild database (MySQL) and email
processing (Cyrus IMAP, Postfix...). It also drives an ancient (but very
important) terminal-based transport order management system that often syncs
its data. Unfortunately, it is also used for running a VMware Server
(1.0.10) XP client, which itself does simple database work (employee time
registration).
Users generally describe this system as slow, although the load on the
server is below 1.5 most of the time. Interestingly, the former system,
running ancient kernels (2.6.11, SuSE 9.3), was perceived as significantly
quicker (but not fast..).
The diskless clients are started once in the morning (taking 60-90 sec), use
an aufs2-layered NFS mount for their openSUSE 11.1 system, and plain
NFS-mounted homes and shared folders. Two thirds of them also need to run a
VMware XP client (also NFS mounted). Their CPUs range from an Athlon 64
3000+ up to a Phenom X4 955, with 2 or 4 GB RAM.
While this system usually operates fine, it suffers from delays that show up
in latencytop as "Writing page to disk: 8425.5 ms":
ftp://urpla.net/lat-8.4sec.png, but we also see them in the 1.7-4.8 sec
range: ftp://urpla.net/lat-1.7sec.png, ftp://urpla.net/lat-2.9sec.png,
ftp://urpla.net/lat-4.6sec.png and ftp://urpla.net/lat-4.8sec.png.
From other observations, this issue "feels" like it is induced by single
synchronisation points in the block layer, e.g. if I create heavy IO load on
one RAID array, say by resizing a VMware disk image, it can take up to a
minute to log in via ssh, although the ssh login does not touch this area at
all (different RAID arrays). Note that the latencytop snapshots above were
taken during normal operation, not under this kind of load..
The network side looks fine, as its main interface rarely exceeds 40 MiB/s,
and usually stays in the 1 KiB/s - 5 MiB/s range.
The xfs filesystems are mounted with rw,noatime,attr2,nobarrier,noquota
(yes, I do have a BBU on the Areca, and the disk write caches are
effectively turned off).
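For reference, the corresponding /etc/fstab entries look roughly like this
(device and mount point are just placeholders):

  /dev/sdc1  /home  xfs  rw,noatime,attr2,nobarrier,noquota  0 0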
The clients mount their system:
/:ro/rw,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nointr,nolock,proto=tcp,
timeo=600,retrans=2,sec=sys,mountvers=3,mountproto=udp
/home: similar
/shared: without nolock
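Spelled out as a full mount invocation, that is roughly (server name and
export path are placeholders):

  mount -t nfs -o ro,vers=3,rsize=1048576,wsize=1048576,hard,nointr,nolock,proto=tcp,timeo=600,retrans=2 \
      nfsserver:/exports/root /sysroot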
Might later kernels mitigate this problem? As this is a production system
that is used 6.5 days a week, I cannot do dangerous experiments, and
switching to 64 bit is a problem due to the legacy stuff described above...
OTOH, my users suffer from this, and anything that helps in this respect is
highly appreciated.
Thanks in advance,
Pete
I had very similar issues on various systems (mostly using xfs, but some
with ext3, too) with kernels before ~2.6.30 when using the cfq I/O
scheduler. Switching to noop fixed that for me, as did upgrading to a
recent kernel where cfq behaves better again.
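If it helps, noop can also be made the default for all block devices at
boot time via the elevator= kernel parameter, roughly like this (the kernel
line is just a placeholder for the existing one in /boot/grub/menu.lst):

  kernel /boot/vmlinuz root=/dev/sda2 elevator=noop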
Make sure the filesystem has the "lazy-count=1" attribute set (use
xfs_info to check, xfs_admin to change). That removes the superblock
from most transactions and significantly reduces transaction latency,
as transactions serialise while locking it...
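Roughly (mount point and device are examples; the change itself needs the
filesystem unmounted):

  xfs_info /home | grep lazy-count    # check on the mounted filesystem
  umount /home
  xfs_admin -c 1 /dev/sdb1            # enable lazy superblock counters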
Cheers,
Dave
--
Dave Chinner
da...@fromorbit.com
Seems like a 2.6.32-based kernel, which has the per-BDI writeback and "CFQ
low latency mode" changes, might help a good deal. I know that on one of
my bigger machines (similar in specs to yours), which has a lot of
processes doing a decent amount of IO, latency and load average have gone
down after going from a 2.6.31 to a 2.6.32 kernel (Fedora 11 system).
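For what it's worth, on those 2.6.32+ kernels the CFQ low latency mode is
exposed as a sysfs knob, something like this (device name is an example,
and the file only exists while cfq is the active scheduler):

  cat /sys/block/sdb/queue/iosched/low_latency
  echo 1 > /sys/block/sdb/queue/iosched/low_latency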
Like Chris suggested, I've also heard that using the noop IO scheduler
can work well on Areca controllers on some kernels and workloads.
It's worth a shot and you can even try changing it at run-time.
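At run-time that is roughly (device name is an example):

  cat /sys/block/sdb/queue/scheduler     # e.g. "noop anticipatory deadline [cfq]"
  echo noop > /sys/block/sdb/queue/scheduler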
-Dave
Christoph, thanks for this valuable suggestion: I've changed it to noop
right away, and also set:
vm.dirty_ratio = 20
vm.dirty_background_ratio = 1
since the defaults of 40 and 10 also don't seem to fit my needs. Even 20
might still be oversized with 8 GB of total memory.
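For completeness, applied roughly like this (sysctl now, /etc/sysctl.conf
so it survives a reboot):

  sysctl -w vm.dirty_ratio=20
  sysctl -w vm.dirty_background_ratio=1

  # and appended to /etc/sysctl.conf:
  vm.dirty_ratio = 20
  vm.dirty_background_ratio = 1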
Thanks,
Pete
Dave, this modification sounds promising. I will do it during the weekend.
Also, Christoph mentioned some pending patches for fdatasync and NFS metadata
updates in his XFS status report from February, which sounded _really_
exciting.
Happily awaiting these bits in the stable universe ;-)
Thanks,
Pete
Yes, already done. Hopefully my users will notice.. As I upgraded this
server and the clients only two weeks ago, calming things down has the
highest priority.
Switching kernel versions on production systems is always painful, thus I
try to avoid it, but this time I already needed to roll my own kernel for
the clients due to some aufs2 vs. apparmor disharmony. That led to the loss
of the latter - I can live without apparmor, but certainly not without a
reliable layered filesystem¹.
Anyway, thanks for your suggestion and confirmation, David. It is
appreciated.
Cheers,
Pete
¹) In a way, this is my primary justification for also using Linux on the
desktops²! Install one, and get the rest (nearly) for free..
http://download.opensuse.org/repositories/home:/frispete:/aufs2 and below..
²) Don't tell anybody that I don't like the other OS ;-)
Done that now on my local test system, but on one of its filesystems,
xfs_admin -c1 didn't succeed; it simply stopped (waiting on a futex).
Famous last syscall:
6750 futex(0x868330c8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
Consequently, xfs_repair behaved similarly, hanging in phase 6, "traversing
filesystem"... I have a huge strace of this run, if someone is interested.
It's a 3 TB RAID 5 array (4 * 1 TB disks) with one FS, also driven by the
Areca:
meta-data=/dev/sdb1              isize=256    agcount=4, agsize=183105406 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=732421623, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=32768, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
Luckily, xfs_repair -P finally did succeed. Phew..
This is with xfs_repair version 2.10.1.
After calling xfs_admin -c1, all filesystems showed differences in
superblock features (from an xfs_repair -n run). Is running xfs_repair
mandatory, or does the initial mount fix this automatically?
Thanks,
Pete
Mandatory - there are extra fields in the AGF headers that track
free space btree block usage (the tree itself) that need to be
calculated correctly. This allows the block usage in the filesystem
to be tracked from the AGFs rather than the superblock, hence
removing the single point of contention in the allocation path...
xfs_repair does this calculation for us - putting that code into the
kernel to avoid running repair is a lot of work for a relatively
rare operation....
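So per filesystem the whole conversion boils down to roughly this (device
and mount point are examples):

  umount /home
  xfs_admin -c 1 /dev/sdb1
  xfs_repair /dev/sdb1
  mount /home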
Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
That was a bad idea. I've reverted the vm.dirty_* tweaks, as they made
things even worse.
After switching to noop and activating lazy-count on all filesystems, the
pathological behaviour under concurrent heavy IO seems to be relieved, but
the latency caused by VMware Server persists:
Cause                                      Maximum     Percentage
Writing a page to disk                  435.8 msec       9.9 %
Writing buffer to disk (synchronous)    295.3 msec       1.6 %
Scheduler: waiting for cpu               80.1 msec      11.7 %
Reading from a pipe                       9.3 msec       0.0 %
Waiting for event (poll)                  5.0 msec      76.2 %
Waiting for event (select)                4.8 msec       0.4 %
Waiting for event (epoll)                 4.7 msec       0.0 %
Truncating file                           4.3 msec       0.0 %
Userspace lock contention                 3.3 msec       0.0 %

Process vmware-vmx (7907)               Total: 7635.8 msec
Writing a page to disk                  435.8 msec      43.8 %
Scheduler: waiting for cpu                9.1 msec      52.7 %
Waiting for event (poll)                  5.0 msec       3.5 %
[HostIF_SemaphoreWait]                    0.2 msec       0.0 %
Although I set writeThrough to "FALSE" on that VM, it operates on a
monolithic flat 24 GB "drive" file, it is not allowed to swap, and it is
itself only lightly used, it still always writes (whatever it writes)
synchronously and trashes the latency of the whole system. (It's nearly
always the one that latencytop shows, with combined latencies ranging from
one to eight seconds.)
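For reference, the relevant bit of that VM's .vmx config looks roughly like
this (disk index and file name are placeholders):

  scsi0:0.present = "TRUE"
  scsi0:0.fileName = "winxp.vmdk"
  scsi0:0.writeThrough = "FALSE"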
I would love to migrate that stuff to a saner VM technology (e.g. kvm), but
unfortunately the Opteron 285 CPUs are socket 940 based, and thus not
supported by any current para-virtualisation. Please correct me if I'm
wrong.
This VMware Server 1.0.* stuff is also getting in the way of upgrading to a
newer kernel. The only way up the kernel stairs might be VMware Server 2,
but without serious indications that it works considerably better, I won't
take that route. Hints welcome.
Upgrading the hardware, combined with using SSD drives, seems the only
really feasible approach, but given the economic pressure in the transport
industry, that's currently not possible either.
Anyway, thanks for your suggestions,