Hi all,
I recently put a new fileserver into service, and am a bit confused by
an unexplained high-ish load. I am wondering what sorts of things I
should look for next.
Some background: this is a newly installed CentOS 6.3 box running nfsd
and smbd. It was updated from yum last week, so it should have all the
major updates, including the latest kernel. It has a 3ware controller
attached to an external disk array, currently configured as an 11-disk
RAID6 under LVM. I didn't find any problems during pre-deployment
burn-in.
While the box is seemingly idle, the load average is reported at around 4:
20:55:03 up 4 days, 23:14, 1 user, load average: 3.95, 3.96, 3.91
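As I understand it, Linux counts tasks in uninterruptible (D) sleep
toward the load average along with runnable tasks, so the load can sit
well above zero even when the CPUs are idle. Something like this should
list the tasks that could be contributing:

# show tasks currently in R (runnable) or D (uninterruptible) state
$ ps -eo state,pid,comm | awk '$1 ~ /^[DR]/'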
The previous server would not get that high unless there was
significant disk activity. But at the time I took this reading, there
was little disk activity on the filesystem, and both nfsd and smbd were
almost always in S state. And, somewhat surprisingly, I notice no
performance degradation in either reads or writes despite the modestly
high load; on the old fileserver I would see a slight speed hit on disk
accesses when the load got this high.
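(For anyone who wants to reproduce the check: something like the
following, with iostat from the sysstat package, should show the actual
per-device activity.)

# extended per-device stats, 5-second interval, three samples
$ iostat -x 5 3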
top doesn't show anything unusual, and no single process generally uses
more than 2% of the CPU. One example (sorted by CPU, first few lines):
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1759 root 20 0 0 0 0 D 1.5 0.0 6:42.60 xfsaild/dm-3
1 root 20 0 19352 1600 1284 S 0.0 0.0 0:01.58 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.09 kthreadd
3 root RT 0 0 0 0 S 0.0 0.0 0:02.08 migration/0
4 root 20 0 0 0 0 S 0.0 0.0 0:33.47 ksoftirqd/0
The xfsaild daemons (four of them, one for each mounted XFS filesystem,
I'm guessing) are generally in D state, but web searches on this
usually point to a concurrent symptom (e.g., other XFS daemons in R
state instead of S) that this box isn't experiencing. I can't find any
other reports of xfsaild in D state being a problem on its own.
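I suppose the next step is to see exactly where the xfsaild threads are
sleeping. Something like this should do it (assuming /proc/<pid>/stack
is available, which it should be on the stock CentOS 6 kernel; run as
root):

# dump the kernel stack of each xfsaild thread
for pid in $(pgrep xfsaild); do
    echo "== PID $pid"
    cat /proc/$pid/stack
done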
powertop reports the following; nothing seems out of the ordinary:
Cn Avg residency
C0 (cpu running) ( 1.6%)
polling 0.0ms ( 0.0%)
C1 mwait 0.3ms ( 0.0%)
C2 mwait 0.9ms ( 0.1%)
C3 mwait 0.6ms ( 0.0%)
C4 mwait 22.0ms (98.4%)
P-states (frequencies)
2.40 Ghz 0.0%
2.31 Ghz 0.0%
2.21 Ghz 0.0%
2.10 Ghz 0.0%
2.00 Ghz 0.0%
1.91 Ghz 0.0%
1.80 Ghz 0.0%
1.71 Ghz 0.0%
1.60 Ghz 0.0%
1500 Mhz 0.0%
1400 Mhz 0.0%
1300 Mhz 0.0%
1200 Mhz 100.0%
Wakeups-from-idle per second : 45.9 interval: 15.0s
no ACPI power usage estimate available
Top causes for wakeups:
45.0% (641.7) <interrupt> : extra timer interrupt
21.6% (307.3) <kernel core> : hrtimer_start (tick_sched_timer)
7.0% (100.0) xfsaild/dm-1 : xfsaild (process_timeout)
7.0% (100.0) xfsaild/dm-2 : xfsaild (process_timeout)
7.0% ( 99.7) xfsaild/dm-0 : xfsaild (process_timeout)
7.0% ( 99.5) xfsaild/dm-3 : xfsaild (process_timeout)
1.3% ( 18.3) usbhid-ups : hrtimer_start_range_ns (hrtimer_wakeup)
1.2% ( 17.1) <kernel core> : hrtimer_start_range_ns (tick_sched_timer)
0.5% ( 6.9) <interrupt> : ahci
0.3% ( 4.7) <interrupt> : ehci_hcd:usb2
0.3% ( 4.0) USB device 2-1.1 : Smart-UPS 3000 RM FW:666.6.D USB FW:2.4 (American Power Conversion)
[snip]
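One thing I do notice in that output: each xfsaild is waking about 100
times a second, which looks like a 10ms timeout of some sort. I don't
know whether any of the XFS sysctl tunables govern that interval, but
they can be dumped for inspection with:

# print each XFS sysctl with its current value
$ grep . /proc/sys/fs/xfs/*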
The only difference I've found so far is the number of nfsd processes:
the new box is currently running only 8, whereas the old box ran 32. I
do plan on raising this, and perhaps that will resolve the issue, but if
the thread count were the problem I'd expect to see more nfsd processes
in D state; every time I look they are almost always in S. (If I know
I'm writing a large file I can occasionally catch an nfsd thread moving
to D state.)
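For the record, my plan for raising the thread count is the usual
CentOS route:

# current thread count (first field of the "th" line)
$ grep ^th /proc/net/rpc/nfsd
# raise it on the fly (as root)
$ rpc.nfsd 32
# and make it persistent across restarts, in /etc/sysconfig/nfs:
RPCNFSDCOUNT=32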
So: what have I missed, and what else can I check to get more
information? Or should I not be so worried about a high load that
doesn't seem to be materially impacting the system?
--keith
--
kkeller...@wombat.san-francisco.ca.us
(try just my userid to email me)
AOLSFAQ=
http://www.therockgarden.ca/aolsfaq.txt
see X- headers for PGP signature information