While I was observing one of these slow periods, I realized the
processors' IO wait time comes close to 100% and the nfsd and kjournald
processes go into D status (uninterruptible sleep). Does anyone know why
this is happening, or how I can troubleshoot it?
Below is a snapshot of my top output at one of these moments.
[root@cromo src]# top
top - 10:59:02 up 6 days, 19:30, 17 users, load average: 14.77, 13.46, 13.00
Tasks: 189 total, 1 running, 188 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.0% us, 0.4% sy, 0.0% ni, 0.2% id, 98.5% wa, 0.0% hi, 0.9% si
Cpu1 : 0.0% us, 0.4% sy, 0.0% ni, 0.0% id, 99.6% wa, 0.0% hi, 0.0% si
Cpu2 : 0.0% us, 0.2% sy, 0.0% ni, 7.8% id, 92.0% wa, 0.0% hi, 0.0% si
Cpu3 : 0.2% us, 0.8% sy, 0.0% ni, 0.0% id, 98.7% wa, 0.2% hi, 0.0% si
Mem: 4025540k total, 4018264k used, 7276k free, 7908k buffers
Swap: 4096564k total, 3288k used, 4093276k free, 3553252k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3941 root 15 0 0 0 0 D 2 0.0 80:29.30 nfsd
3938 root 15 0 0 0 0 D 1 0.0 81:02.02 nfsd
2459 root 15 0 0 0 0 D 1 0.0 30:57.04 kjournald
3937 root 15 0 0 0 0 D 1 0.0 82:07.99 nfsd
96 root 15 0 0 0 0 S 0 0.0 34:12.48 kswapd0
3940 root 15 0 0 0 0 D 0 0.0 80:22.55 nfsd
3942 root 15 0 0 0 0 D 0 0.0 80:23.66 nfsd
3939 root 15 0 0 0 0 D 0 0.0 82:22.55 nfsd
4283 root 16 0 20936 8136 1276 S 0 0.2 0:44.47 hald
21289 denismpa 16 0 6280 1092 776 R 0 0.0 0:00.12 top
1 root 16 0 4752 552 460 S 0 0.0 0:01.28 init
2 root RT 0 0 0 0 S 0 0.0 0:00.21 migration/0
3 root 34 19 0 0 0 S 0 0.0 0:00.05 ksoftirqd/0
4 root RT 0 0 0 0 S 0 0.0 0:00.28 migration/1
5 root 34 19 0 0 0 S 0 0.0 0:00.09 ksoftirqd/1
6 root RT 0 0 0 0 S 0 0.0 0:00.15 migration/2
7 root 34 19 0 0 0 S 0 0.0 0:00.05 ksoftirqd/2
8 root RT 0 0 0 0 S 0 0.0 0:00.15 migration/3
9 root 34 19 0 0 0 S 0 0.0 0:00.06 ksoftirqd/3
10 root 5 -10 0 0 0 S 0 0.0 0:01.01 events/0
11 root 5 -10 0 0 0 S 0 0.0 0:00.11 events/1
12 root 5 -10 0 0 0 S 0 0.0 0:00.06 events/2
13 root 5 -10 0 0 0 S 0 0.0 0:00.12 events/3
14 root 5 -10 0 0 0 S 0 0.0 0:00.00 khelper
15 root 15 -10 0 0 0 S 0 0.0 0:00.00 kacpid
63 root 5 -10 0 0 0 S 0 0.0 0:00.00 kblockd/0
64 root 5 -10 0 0 0 S 0 0.0 0:00.00 kblockd/1
65 root 5 -10 0 0 0 S 0 0.0 0:00.00 kblockd/2
66 root 5 -10 0 0 0 S 0 0.0 0:00.00 kblockd/3
67 root 15 0 0 0 0 S 0 0.0 0:00.06 khubd
97 root 12 -10 0 0 0 S 0 0.0 0:00.00 aio/0
98 root 5 -10 0 0 0 S 0 0.0 0:00.00 aio/1
99 root 5 -10 0 0 0 S 0 0.0 0:00.00 aio/2
100 root 5 -10 0 0 0 S 0 0.0 0:00.00 aio/3
244 root 23 0 0 0 0 S 0 0.0 0:00.00 kseriod
359 root 20 0 0 0 0 S 0 0.0 0:00.00 scsi_eh_0
360 root 15 0 0 0 0 S 0 0.0 0:00.00 aacraid
379 root 5 -10 0 0 0 S 0 0.0 0:00.00 ata/0
380 root 5 -10 0 0 0 S 0 0.0 0:00.00 ata/1
381 root 5 -10 0 0 0 S 0 0.0 0:00.00 ata/2
382 root 8 -10 0 0 0 S 0 0.0 0:00.00 ata/3
388 root 20 0 0 0 0 S 0 0.0 0:00.00 scsi_eh_1
389 root 21 0 0 0 0 S 0 0.0 0:00.00 scsi_eh_2
403 root 15 0 0 0 0 D 0 0.0 0:03.87 kjournald
1708 root 6 -10 3604 444 364 S 0 0.0 0:00.00 udevd
Thanks and regards
--
Denis Anjos.
Cisco Certified Network Associate.
Universidade Federal do ABC
Santo André - SP - BR
nfsd is usually busy because it is serving file access to the compute nodes.
Are you running any jobs that require scratch directory access? If so,
you need to instruct those programs to use a local directory instead.
Also, try
dmesg
on the frontend and compute nodes. It may show some NFS errors.
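As a sketch of how to check this from the server side (assuming the standard nfs-utils statistics file /proc/net/rpc/nfsd; its "th" line records how often all nfsd threads were busy at once):

```shell
# Summarize the nfsd thread-busy histogram. On the "th" line, field 2 is
# the thread count and the last bucket counts seconds spent with (nearly)
# all threads busy; a large last bucket suggests too few nfsd threads.
summarize_nfsd_threads() {
    awk '/^th/ { printf "nfsd threads: %s, top-busy bucket: %s\n", $2, $NF }'
}

# On a live NFS server you would run:
#   summarize_nfsd_threads < /proc/net/rpc/nfsd
```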
Hope this helps,
--
-----------------------------------------------------------------------------------
Somsak Sriprayoonsakul
Thai National Grid Center
Software Industry Promotion Agency
Ministry of ICT, Thailand
soms...@thaigrid.or.th
-----------------------------------------------------------------------------------
I did the following to help me out (assuming it is NFS slowing you down).
..from my notes...
30) Increase the number of nfs daemons (threads)
so the cluster won’t get bogged down when there are a lot of jobs running.
a) vi /etc/rc.d/init.d/nfs (around line 42)
- change [ -z "$RPCNFSDCOUNT" ] && RPCNFSDCOUNT=8
- to [ -z "$RPCNFSDCOUNT" ] && RPCNFSDCOUNT=16
b) /etc/rc.d/init.d/nfs reload
c) /etc/rc.d/init.d/nfs restart
Good Luck!
Paul
______________________________________________________
Paul Kopec
Project Manager
University of Michigan
Dept. of Human Genetics
1241 E. Catherine Street
5928 Buhl Building
Ann Arbor, MI 48109-0618
734-763-5411
pko...@umich.edu
With your load average so high, it looks like whatever you are using for
your storage controller (SATA, SCSI, ...) is slowing your IO down. This
is usually the case under heavy load with "cheap" network or storage
controllers.
> top - 10:59:02 up 6 days, 19:30, 17 users, load average: 14.77, 13.46, 13.00
Loads of 14.77 and so on, when CPUs are effectively idle, usually means
IO is backing up in queue. You can see this with vmstat
vmstat 1
and look at the "b" column (usually second from left). If this number
is not zero, it is likely close to the actual user load reported.
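To make that check hands-off, a minimal sketch (the only assumption is vmstat's usual layout: two header lines, then one sample per line with "b" as the second column):

```shell
# Print only the vmstat samples where processes are blocked on IO.
# Column 2 ("b") counts processes in uninterruptible (D) sleep.
blocked_only() {
    awk 'NR > 2 && $2 > 0 { print "blocked:", $2, "|", $0 }'
}

# Live usage:
#   vmstat 1 | blocked_only
```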
Could you show several lines of vmstat output? This is mine from my laptop
landman@lightning:~$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0  38516 342460 353356 535632    0    0     2     9    1    36 15  8 77  0
 0  0  38516 342612 353356 535656    0    0     0   432  517  1430  1  1 98  0
 1  0  38516 342468 353356 535656    0    0     0     0  604  2302 30  4 66  0
 3  0  38516 342508 353356 535656    0    0     0     0  530  1482  4  4 91  0
 1  0  38516 342372 353360 535652    0    0     0   172  529  1561  3  0 97  0
> 359 root 20 0 0 0 0 S 0 0.0 0:00.00 scsi_eh_0
> 360 root 15 0 0 0 0 S 0 0.0 0:00.00 aacraid
Ah... you are using aacraid. Similar (lack of) performance to 3ware.
Under heavy load, 3ware has problems keeping up, and user load rises
dramatically. Last I checked aacraid based units had similar issues.
I might suggest either an alternative RAID card, or using software RAID
if possible.
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: lan...@scalableinformatics.com
web : http://www.scalableinformatics.com
http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax : +1 866 888 3112
cell : +1 734 612 4615
In the message dated: Mon, 17 Dec 2007 10:00:24 EST,
The pithy ruminations from Paul Kopec on
<Re: [Rocks-Discuss] NFS , Kjournald - status D (uninterruptible sleep) - Slow
front-end occasionally> were:
=> Denis,
=>
=> I did the following to help me out (assuming it is NFS slowing you down).
=>
=> ..from my notes...
=> 30) Increase the number of nfs daemons (threads)
=> so the cluster won't get bogged down when there are a lot of jobs running.
=> a) vi /etc/rc.d/init.d/nfs (around line 42)
Ick! This is really not a good way to make this kind of change, for several
reasons:
- it's not the expected way of doing things, so anyone else who works
  on this system won't necessarily think to look here
- IMHO, editing scripts increases the chances of introducing errors over
  editing config files
- the change will be overwritten if the init script is updated
- the change will be overridden by a setting in /etc/sysconfig/nfs
I strongly suggest that you become familiar with the preferred (RedHat) methods
of changing system daemon parameters. Most of them are controlled via files in
/etc/sysconfig that correspond to the daemon name.
For example, the /etc/sysconfig/nfs file might contain:
===========================================================================
# Pass any additional options for mountd.
# MOUNTD_OPTIONS=
# Pin mountd to a given port rather than random one from portmapper
# MOUNTD_PORT=4002
# Don't advertise TCP for mount.
# MOUNTD_TCP=no
# NFS V3
# MOUNTD_NFS_V3=auto|yes|no
MOUNTD_NFS_V3=auto
# NFS V2
# MOUNTD_NFS_V2=auto|yes|no
MOUNTD_NFS_V2=auto
# The number of open file descriptors
MOUNTD_OPEN_FILES=512
# Pass the number of instances of nfsd (8 is default; 16 or more
# might be needed to handle heavy client traffic)
RPCNFSDCOUNT=24
# Increase the memory limits on the socket input queues for
# the nfs processes .. NFS benchmark SPECsfs demonstrate a
# need for a larger than default size (64kb) .. setting
# TUNE_QUEUE to yes will set the values to 256kb.
TUNE_QUEUE="yes"
NFS_QS=262144
# Set fixed ports for lockd
# LOCKD_TCPPORT=4001
# LOCKD_UDPPORT=4001
# Set fixed port for remote quota server port
# RQUOTAD_PORT=""
# Set fixed port for statd
# STATD_PORT=4000
# Set fixed port for statd (outgoing)
# STATD_OUTGOING_PORT=4000
# Set statd hostname
# STATD_HOSTNAME=
# This is for NFS4
SECURE_NFS=no
# This is options for the gssd daemon
RPCGSSD_OPTIONS="-m"
# This is options for the idmap daemon
RPCIDMAPD_OPTIONS=""
# This is options for the svcgssd daemon
RPCSVCGSSD_OPTIONS=""
===========================================================================
=> - change [ -z "$RPCNFSDCOUNT" ] && RPCNFSDCOUNT=8
=> - to [ -z "$RPCNFSDCOUNT" ] && RPCNFSDCOUNT=16
=> b) /etc/rc.d/init.d/nfs reload
=> c) /etc/rc.d/init.d/nfs restart
=>
=>
=> Good Luck!
=>
=> Paul
=>
[SNIP!]
=> ______________________________________________________
=> Paul Kopec
=> Project Manager
=> University of Michigan
=> Dept. of Human Genetics
=> 1241 E. Catherine Street
=> 5928 Buhl Building
=> Ann Arbor, MI 48109-0618
=> 734-763-5411
=> pko...@umich.edu
=>
=>
----
Mark Bergman mark.b...@uphs.upenn.edu
System Administrator
Section of Biomedical Image Analysis 215-662-7310
Department of Radiology, University of Pennsylvania
http://pgpkeys.pca.dfn.de:11371/pks/lookup?search=mark.bergman%40.uphs.upenn.edu
I just wanted to make sure you got the message: I was doing it the wrong way.
Looks like it is better to modify the RPCNFSDCOUNT parameter in
/etc/sysconfig/nfs.
Paul
______________________________________________________
I wrote:
> > While I was observing one of these slow periods, I realized the
> > processors' IO wait time comes close to 100% and the nfsd and kjournald
> > processes go into D status (uninterruptible sleep).
> > [root@cromo src]# top
> > top - 10:59:02 up 6 days, 19:30, 17 users, load average: 14.77, 13.46, 13.00
> > Tasks: 189 total, 1 running, 188 sleeping, 0 stopped, 0 zombie
> > Cpu0 : 0.0% us, 0.4% sy, 0.0% ni, 0.2% id, 98.5% wa, 0.0% hi, 0.9% si
> > Cpu1 : 0.0% us, 0.4% sy, 0.0% ni, 0.0% id, 99.6% wa, 0.0% hi, 0.0% si
> > Cpu2 : 0.0% us, 0.2% sy, 0.0% ni, 7.8% id, 92.0% wa, 0.0% hi, 0.0% si
> > Cpu3 : 0.2% us, 0.8% sy, 0.0% ni, 0.0% id, 98.7% wa, 0.2% hi, 0.0% si
> > Mem: 4025540k total, 4018264k used, 7276k free, 7908k buffers
> > Swap: 4096564k total, 3288k used, 4093276k free, 3553252k cached
> >
> > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> >
> > 3941 root 15 0 0 0 0 D 2 0.0 80:29.30 nfsd
> > 3938 root 15 0 0 0 0 D 1 0.0 81:02.02 nfsd
> > 2459 root 15 0 0 0 0 D 1 0.0 30:57.04 kjournald
> > 3937 root 15 0 0 0 0 D 1 0.0 82:07.99 nfsd
> > 96 root 15 0 0 0 0 S 0 0.0 34:12.48 kswapd0
> > 3940 root 15 0 0 0 0 D 0 0.0 80:22.55 nfsd
> > 3942 root 15 0 0 0 0 D 0 0.0 80:23.66 nfsd
> > 3939 root 15 0 0 0 0 D 0 0.0 82:22.55 nfsd
you wrote:
> How many compute nodes do you have?
I have 11 nodes.
>
> nfsd is usually busy because it is serving file access to the compute nodes.
>
> Are you running any jobs that require scratch directory access? If so,
> you need to instruct those programs to use a local directory instead.
The users of this cluster have already been advised to use local scratch.
Maybe someone isn't using it by mistake.
How can I debug which jobs, users, or processes are writing/reading through nfsd?
>
> Also, try
>
> dmesg
>
> on the frontend and compute nodes
>
> It may show some NFS errors.
I took a look at the dmesg outputs, but everything looks OK there.
>
> Hope this helps,
Thanks a lot!
Cheers,
2007/12/17, Joe Landman <lan...@scalableinformatics.com>:
> Denis wrote:
> > Dears, I've got a double dual Xeon front end, which sometimes is
> > getting too slow.
> >
> >
> > While I was observing one of these slow periods, I realized the
> > processors' IO wait time comes close to 100% and the nfsd and kjournald
> > processes go into D status (uninterruptible sleep). Does anyone know why
> > this is happening, or how I can troubleshoot it?
>
> With your load average so high, it looks like whatever you are using for
> your storage controller (SATA,SCSI, ...) is slowing your IO down. This
> is usually the case under heavy load with "cheap" network or storage
> controllers.
I'm using this controller:
06:03.0 RAID bus controller: Adaptec AAC-RAID (rev 01)
Subsystem: Adaptec AAR-2810SA PCI SATA 8ch (Corsair-8)
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop-
ParErr+ Stepping- SERR+ FastB2B+
Status: Cap+ 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=slow >TAbort-
<TAbort- <MAbort- >SERR- <PERR-
Latency: 64 (250ns min, 250ns max), Cache Line Size 10
Interrupt: pin A routed to IRQ 217
Region 0: Memory at b8000000 (32-bit, prefetchable) [size=64M]
Expansion ROM at ffff8000 [disabled] [size=32K]
Capabilities: [80] Power Management version 2
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 PME-Enable- DSel=0 DScale=0 PME-
Here are my HDs:
Disk /dev/sda: 319.9 GB, 319998918656 bytes
255 heads, 63 sectors/track, 38904 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/sda1 * 1 2550 20482843+ 83 Linux
/dev/sda2 2551 3060 4096575 82 Linux swap
/dev/sda3 3061 3570 4096575 83 Linux
/dev/sda4 3571 38904 283820355 5 Extended
/dev/sda5 3571 38904 283820323+ 83 Linux
Disk /dev/sdb: 959.9 GB, 959994396672 bytes
255 heads, 63 sectors/track, 116712 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Device Boot Start End Blocks Id System
/dev/sdb1 * 1 116712 937489108+ 83 Linux
hdparm reports these stats:
[root@cromo etc]# hdparm -Tt /dev/sda
/dev/sda:
Timing cached reads: 5104 MB in 2.00 seconds = 2552.39 MB/sec
HDIO_DRIVE_CMD(null) (wait for flush complete) failed: Inappropriate
ioctl for device
Timing buffered disk reads: 186 MB in 3.03 seconds = 61.48 MB/sec
HDIO_DRIVE_CMD(null) (wait for flush complete) failed: Inappropriate
ioctl for device
[root@cromo etc]# hdparm -Tt /dev/sdb
/dev/sdb:
Timing cached reads: 10320 MB in 2.00 seconds = 5160.79 MB/sec
HDIO_DRIVE_CMD(null) (wait for flush complete) failed: Inappropriate
ioctl for device
Timing buffered disk reads: 134 MB in 11.41 seconds = 11.74 MB/sec
HDIO_DRIVE_CMD(null) (wait for flush complete) failed: Inappropriate
ioctl for device
[...]
> Could you show several lines of vmstat output? This is mine from my laptop
Here we go:
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 1 3612 239600 2828 3364300 0 0 1494 1300 10 4 0 3 80 17
0 1 3612 187756 2872 3413420 0 0 44 24608 4274 10679 0 12 72 17
0 1 3612 153196 2900 3446100 0 0 28 17248 3349 6018 0 7 72 20
0 4 3612 134804 2908 3453980 0 0 8 61140 1795 4528 0 3 31 65
1 0 3612 144592 2920 3458796 0 0 4 18972 1689 2102 0 2 37 62
0 2 3612 86936 2980 3512048 0 0 56 26656 4554 14563 0 13 69 18
0 2 3612 45424 3024 3552532 0 0 40 19076 3774 8148 0 9 71 20
0 1 3612 15792 3048 3579368 0 0 44 25696 4430 11954 0 12 72 16
0 4 3612 7892 3064 3576972 0 0 16 74808 2430 12205 0 6 57 37
0 1 3612 17828 3080 3581512 0 0 8 18832 2423 6117 0 5 50 45
1 0 3612 16944 3068 3581048 0 0 40 26816 4126 15803 0 12 66 22
0 0 3612 16032 3128 3578200 0 0 60 32640 5137 23987 0 15 68 17
1 0 3612 15664 3144 3576620 0 0 56 33200 5178 17395 0 15 71 14
0 5 3612 7956 3160 3573272 0 0 20 74092 2077 5504 0 4 54 41
0 6 3612 12212 3168 3578568 0 0 8 30936 1964 3848 0 3 42 55
1 1 3612 15752 3200 3579760 0 0 24 22352 2975 9692 0 8 60 32
0 1 3612 16696 3240 3577340 0 0 40 22496 3954 15224 0 12 71 17
2 0 3612 15472 3212 3576824 0 0 40 26624 4466 14382 0 13 71 16
1 3 3612 5908 3228 3575924 0 0 16 72304 2328 8965 0 5 39 55
0 1 3612 14188 3244 3579512 0 0 8 17892 2169 4147 0 4 57 39
0 2 3612 17136 3292 3575520 0 0 56 28100 4672 16158 0 15 69 16
[...]
> Ah... you are using aacraid. Similar (lack of) performance to 3ware.
> Under heavy load, 3ware has problems keeping up, and user load rises
> dramatically. Last I checked aacraid based units had similar issues.
>
> I might suggest either an alternative RAID card, or using software RAID
> if possible.
Do you think that something like a module or firmware update
could help improve the performance?
How can I see which version of the module my Rocks install is using?
Do you really think that a software RAID could be a good idea instead
of a hardware one?
>
>
> --
> Joseph Landman, Ph.D
> Founder and CEO
> Scalable Informatics LLC,
> email: lan...@scalableinformatics.com
> web : http://www.scalableinformatics.com
> http://jackrabbit.scalableinformatics.com
> phone: +1 734 786 8423
> fax : +1 866 888 3112
> cell : +1 734 612 4615
>
Thanks and regards,
> I'm using this controller:
>
> 06:03.0 RAID bus controller: Adaptec AAC-RAID (rev 01)
> Subsystem: Adaptec AAR-2810SA PCI SATA 8ch (Corsair-8)
> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop-
> ParErr+ Stepping- SERR+ FastB2B+
> Status: Cap+ 66Mhz+ UDF- FastB2B+ ParErr- DEVSEL=slow >TAbort-
> <TAbort- <MAbort- >SERR- <PERR-
> Latency: 64 (250ns min, 250ns max), Cache Line Size 10
> Interrupt: pin A routed to IRQ 217
> Region 0: Memory at b8000000 (32-bit, prefetchable) [size=64M]
> Expansion ROM at ffff8000 [disabled] [size=32K]
> Capabilities: [80] Power Management version 2
I am looking this adapter up. It looks like a PCI-X based unit, built
around an old (and slow) IO processor with a small RAM cache (64 MB).
[...]
> A hdparm reports these stats:
>
> [root@cromo etc]# hdparm -Tt /dev/sda
>
> /dev/sda:
> Timing cached reads: 5104 MB in 2.00 seconds = 2552.39 MB/sec
> HDIO_DRIVE_CMD(null) (wait for flush complete) failed: Inappropriate
> ioctl for device
Looks like the driver doesn't implement something here.
> Timing buffered disk reads: 186 MB in 3.03 seconds = 61.48 MB/sec
> HDIO_DRIVE_CMD(null) (wait for flush complete) failed: Inappropriate
> ioctl for device
This is what I would expect for a single device (or a RAID1 mirror) on
slower devices.
> [root@cromo etc]# hdparm -Tt /dev/sdb
>
> /dev/sdb:
> Timing cached reads: 10320 MB in 2.00 seconds = 5160.79 MB/sec
> HDIO_DRIVE_CMD(null) (wait for flush complete) failed: Inappropriate
> ioctl for device
> Timing buffered disk reads: 134 MB in 11.41 seconds = 11.74 MB/sec
Wow... look at that number: 11.7 MB/s. No wonder requests are stacking up.
It does look like the disk is blocking. Could be a combination of a
slow adapter and slow drives.
>
> [...]
>
>> Ah... you are using aacraid. Similar (lack of) performance to 3ware.
>> Under heavy load, 3ware has problems keeping up, and user load rises
>> dramatically. Last I checked aacraid based units had similar issues.
>>
>> I might suggest either an alternative RAID card, or using software RAID
>> if possible.
>
> Do you think that something like a module or firmware update
> could help improve the performance?
It is possible, though I am wondering if you can do other bits of tuning
first. Let's start with this:
blockdev --setra 1024 /dev/sdb
and retry that hdparm test. If this doesn't help, then it is fairly
likely that the other bits of tuning won't either.
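For reference, --setra counts 512-byte sectors, so 1024 means 512 KB of readahead. A small helper makes the unit conversion explicit (the live commands are commented out since they need root and the real device):

```shell
# Convert a blockdev readahead value (in 512-byte sectors) to KB.
ra_to_kb() { echo $(( $1 * 512 / 1024 )); }

ra_to_kb 1024   # prints 512, i.e. --setra 1024 means 512 KB of readahead

# Live usage (as root):
#   blockdev --getra /dev/sdb        # current readahead, in sectors
#   blockdev --setra 1024 /dev/sdb
#   hdparm -t /dev/sdb               # re-run the buffered-read test
```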
>
> How can I see which version of the module my Rocks install is using?
try
dmesg | less
and then look for aacraid
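A couple of other ways to get at the driver version (a sketch; modinfo availability and the exact banner wording vary by kernel, so the grep pattern here is an assumption):

```shell
# Pull the aacraid banner line out of dmesg-style text.
aacraid_banner() { grep -i 'aacraid' | head -n 1; }

# Live usage:
#   dmesg | aacraid_banner
#   modinfo -F version aacraid    # if modinfo is available on the system
```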
>
> Do you really think that a software RAID could be a good idea instead
> of a hardware one?
Well ... for this device, no. It looks like it probably couldn't keep
up with your drives. Assume your drives could talk at 60 MB/s. This
unit looks like it is on a 500 MB/s PCI-X link (66 MHz, 64-bit, though the
memory looks like it is 32-bit, so it is possible it is addressing it
that way, in which case you might be at 256 MB/s maximum on this unit).
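The bus arithmetic above, spelled out (decimal MB/s, peak rates ignoring protocol overhead, which is why the rounder 500/256 figures appear in the text):

```shell
# Peak bus bandwidth in MB/s: clock (MHz) * width (bits) / 8 bits per byte.
bus_mb_s() { echo $(( $1 * $2 / 8 )); }

bus_mb_s 66 64   # 64-bit PCI-X at 66 MHz: prints 528
bus_mb_s 66 32   # addressed 32 bits wide: prints 264
```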
Low-end RAID controllers are generally (badly) underpowered at
performing the XOR RAID calculations. Software RAID is often quite a bit
faster than the lower-end hardware RAID. If you have a PCIe x8 slot,
and extra drive bays, I might recommend a migration to a faster RAID
card, and drives.
FWIW: low end JackRabbit units show hdparm numbers in excess of 600
MB/s, as compared to the 11.7 MB/s numbers above.
You might give a serious look at the LSI SATA RAID cards. I can't say
they would solve your problem, but it looks like something is severely
interfering with your IO performance, and my first guess would be the
RAID card.
Also, as a sanity check, see if your RAID is running in degraded mode.
If so, this has a known tendency to adversely impact IO performance.
What is your RAID configuration by the way? RAID5? RAID6? RAID 10?
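For that degraded-mode sanity check, one low-tech sketch is to scan the kernel log for array health keywords (the keyword list is an assumption, since aacraid's exact wording varies by driver version; a vendor CLI such as afacli or arcconf, if installed, gives the authoritative answer):

```shell
# Scan text (e.g. dmesg output) for RAID array health keywords.
raid_health_scan() {
    grep -iE 'degrad|rebuild|offline|fail' || echo "no array health messages found"
}

# Live usage:
#   dmesg | raid_health_scan
```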