Hi all,
I recently put a new fileserver into service, and am a bit confused by
an unexplained high-ish load. I am wondering what sorts of things I
should look for next.
Some background: this is a newly installed CentOS 6.3 box running nfsd
and smbd. It was updated from yum last week, so it should have all the
major updates, including the latest kernel. It has a 3ware controller
attached to an external disk array, currently configured as an 11-disk
RAID6 under LVM. I didn't find any problems during pre-deployment
burn-in.
While the box is seemingly idle, the load average is reported at around 4:
20:55:03 up 4 days, 23:14, 1 user, load average: 3.95, 3.96, 3.91
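As I understand it, Linux counts tasks in uninterruptible (D) sleep
toward the load average along with runnable tasks, so the load can sit
well above zero even when the CPUs are idle. Something like this should
list the tasks that could be contributing:

# show tasks currently in R (runnable) or D (uninterruptible) state
$ ps -eo state,pid,comm | awk '$1 ~ /^[DR]/'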
The previous server would not get that high unless there was
significant disk activity. But at the time I took this reading, there
was little disk activity on the filesystem, and both nfsd and smbd were
almost always in S state. And, somewhat surprisingly, I notice no
performance degradation in either reads or writes despite the modestly
high load; on the old fileserver I would see a slight speed hit on disk
accesses when the load got this high.
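(For anyone who wants to reproduce the check: something like the
following, with iostat from the sysstat package, should show the actual
per-device activity.)

# extended per-device stats, 5-second interval, three samples
$ iostat -x 5 3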
top doesn't show anything unusual, and no single process generally uses
more than 2% of the CPU. One example (sorted by CPU, first few lines):
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1759 root 20 0 0 0 0 D 1.5 0.0 6:42.60 xfsaild/dm-3
1 root 20 0 19352 1600 1284 S 0.0 0.0 0:01.58 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.09 kthreadd
3 root RT 0 0 0 0 S 0.0 0.0 0:02.08 migration/0
4 root 20 0 0 0 0 S 0.0 0.0 0:33.47 ksoftirqd/0
The xfsaild daemons (four of them, one for each mounted XFS filesystem,
I'm guessing) are generally in D state, but web searches on this
usually point to a concurrent symptom (e.g., other XFS daemons in R
state instead of S) that this box isn't experiencing. I can't find any
other reports of xfsaild in D state being a problem on its own.
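I suppose the next step is to see exactly where the xfsaild threads are
sleeping. Something like this should do it (assuming /proc/<pid>/stack
is available, which it should be on the stock CentOS 6 kernel; run as
root):

# dump the kernel stack of each xfsaild thread
for pid in $(pgrep xfsaild); do
    echo "== PID $pid"
    cat /proc/$pid/stack
done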
powertop reports the following; nothing seems out of the ordinary:
Cn Avg residency
C0 (cpu running) ( 1.6%)
polling 0.0ms ( 0.0%)
C1 mwait 0.3ms ( 0.0%)
C2 mwait 0.9ms ( 0.1%)
C3 mwait 0.6ms ( 0.0%)
C4 mwait 22.0ms (98.4%)
P-states (frequencies)
2.40 Ghz 0.0%
2.31 Ghz 0.0%
2.21 Ghz 0.0%
2.10 Ghz 0.0%
2.00 Ghz 0.0%
1.91 Ghz 0.0%
1.80 Ghz 0.0%
1.71 Ghz 0.0%
1.60 Ghz 0.0%
1500 Mhz 0.0%
1400 Mhz 0.0%
1300 Mhz 0.0%
1200 Mhz 100.0%
Wakeups-from-idle per second : 45.9 interval: 15.0s
no ACPI power usage estimate available
Top causes for wakeups:
45.0% (641.7) <interrupt> : extra timer interrupt
21.6% (307.3) <kernel core> : hrtimer_start (tick_sched_timer)
7.0% (100.0) xfsaild/dm-1 : xfsaild (process_timeout)
7.0% (100.0) xfsaild/dm-2 : xfsaild (process_timeout)
7.0% ( 99.7) xfsaild/dm-0 : xfsaild (process_timeout)
7.0% ( 99.5) xfsaild/dm-3 : xfsaild (process_timeout)
1.3% ( 18.3) usbhid-ups : hrtimer_start_range_ns (hrtimer_wakeup)
1.2% ( 17.1) <kernel core> : hrtimer_start_range_ns (tick_sched_timer)
0.5% ( 6.9) <interrupt> : ahci
0.3% ( 4.7) <interrupt> : ehci_hcd:usb2
0.3% ( 4.0) USB device 2-1.1 : Smart-UPS 3000 RM FW:666.6.D USB FW:2.4 (American Power Conversion)
[snip]
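One thing I do notice in that output: each xfsaild is waking about 100
times a second, which looks like a 10ms timeout of some sort. I don't
know whether any of the XFS sysctl tunables govern that interval, but
they can be dumped for inspection with:

# print each XFS sysctl with its current value
$ grep . /proc/sys/fs/xfs/*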
The only difference I've found so far is the number of nfsd processes:
the new box is currently running only 8, whereas the old box ran 32. I
do plan on raising this, and perhaps that will resolve the issue, but if
the thread count were the problem I'd expect to see more nfsd processes
in D state; every time I look they are almost always in S. (If I know
I'm writing a large file I can occasionally catch an nfsd thread moving
to D state.)
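For the record, my plan for raising the thread count is the usual
CentOS route:

# current thread count (first field of the "th" line)
$ grep ^th /proc/net/rpc/nfsd
# raise it on the fly (as root)
$ rpc.nfsd 32
# and make it persistent across restarts, in /etc/sysconfig/nfs:
RPCNFSDCOUNT=32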
So: what have I missed, and what else can I check to get more
information? Or should I not be so worried about a high load that
doesn't seem to be materially impacting the system?
--keith
--
kkeller...@wombat.san-francisco.ca.us
(try just my userid to email me)
AOLSFAQ=
http://www.therockgarden.ca/aolsfaq.txt
see X- headers for PGP signature information