[Lustre-discuss] Fragmented I/O


Kevin Hildebrand

May 11, 2011, 8:07:14 PM
to lustre-...@lists.lustre.org

Hi, I'm having some performance issues on my Lustre filesystem, and it
looks to me like they're related to I/Os getting fragmented before being
written to disk, but I can't figure out why. This system is RHEL5,
running Lustre 1.8.4.

All of my OSTs look pretty much the same:

                          read          |          write
pages per bulk r/w      rpcs    % cum % |      rpcs    % cum %
1:                     88811   38    38 |     46375   17    17
2:                      1497    0    38 |      7733    2    20
4:                      1161    0    39 |      1840    0    21
8:                      1168    0    39 |      7148    2    24
16:                      922    0    40 |      3297    1    25
32:                      979    0    40 |      7602    2    28
64:                     1576    0    41 |      9046    3    31
128:                    7063    3    44 |     16284    6    37
256:                  129282   55   100 |    162090   62   100


                          read          |          write
disk fragmented I/Os     ios    % cum % |       ios    % cum %
0:                     51181   22    22 |         0    0     0
1:                     45280   19    42 |     82206   31    31
2:                     16615    7    49 |     29108   11    42
3:                      3425    1    50 |     17392    6    49
4:                    110445   48    98 |    129481   49    98
5:                      1661    0    99 |      2702    1    99

                          read          |          write
disk I/O size            ios    % cum % |       ios    % cum %
4K:                    45889    8     8 |     56240    7     7
8K:                     3658    0     8 |      6416    0     8
16K:                    7956    1    10 |      4703    0     9
32K:                    4527    0    11 |     11951    1    10
64K:                  114369   20    31 |    134128   18    29
128K:                   5095    0    32 |     17229    2    31
256K:                   7164    1    33 |     30826    4    35
512K:                 369512   66   100 |    465719   64   100

Oddly, there's no 1024K row in the I/O size table...


...and these values seem small to me as well, but I can't seem to change
them. Writing new values to either one doesn't change anything:

# cat /sys/block/sdb/queue/max_hw_sectors_kb
320
# cat /sys/block/sdb/queue/max_sectors_kb
320
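
For example, attempting to raise it (the exact value here is just
illustrative) leaves it unchanged:

# echo 1024 > /sys/block/sdb/queue/max_sectors_kb
# cat /sys/block/sdb/queue/max_sectors_kb
320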

Hardware in question is DELL PERC 6/E and DELL PERC H800 RAID
controllers, with MD1000 and MD1200 arrays, respectively.


Any clues on where I should look next?

Thanks,

Kevin

Kevin Hildebrand
University of Maryland, College Park
Office of Information Technology
_______________________________________________
Lustre-discuss mailing list
Lustre-...@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss

Kevin Van Maren

May 11, 2011, 10:28:00 PM
to Kevin Hildebrand, lustre-...@lists.lustre.org
You didn't say, but I think they are LSI-based: are you using the mptsas
driver with the PERC cards? Which driver version?

First, max_sectors_kb should normally be set to a power-of-2 value, like
256, rather than an odd size like 320. This number should also match the
native RAID stripe size of the device, to avoid read-modify-write
cycles. (See Bug 22886 for why not to make it > 1024 in general.)
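
For example (device name and stripe size are illustrative here,
assuming a 256KB full stripe), that would be something like:

# echo 256 > /sys/block/sdb/queue/max_sectors_kb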

See Bug 17086 for patches to increase the max_sectors_kb limitation for
the mptsas driver to 1MB, or the true hardware maximum, rather than a
driver limit; however, the hardware may still be limited to sizes < 1MB.

Also, to clarify the sizes: the smallest bucket >= transfer_size is the
one incremented, so a 320KB IO increments the 512KB bucket. Since your
HW says it can only do a 320KB IO, there will never be a 1MB IO.

You may want to instrument your HBA driver to see what is going on
(i.e., why max_hw_sectors_kb is < 1024).

Kevin

Kevin Hildebrand

May 12, 2011, 7:11:29 AM
to Kevin Van Maren, lustre-...@lists.lustre.org

The PERC 6 and H800 use megaraid_sas; I'm currently running
00.00.04.17-RH1.

The max_sectors values (320) are what gets set by default. I am able
to set max_sectors_kb to something smaller than 320, but not larger.

Kevin

Kevin Van Maren

May 12, 2011, 9:16:31 AM
to Kevin Hildebrand, lustre-...@lists.lustre.org
Kevin Hildebrand wrote:
>
> The PERC 6 and H800 use megaraid_sas; I'm currently running
> 00.00.04.17-RH1.
>
> The max_sectors values (320) are what gets set by default. I am able
> to set max_sectors_kb to something smaller than 320, but not larger.

Right. You cannot set max_sectors_kb larger than max_hw_sectors_kb
(Linux normally defaults most drivers to 512, but Lustre sets them to be
the same): you may want to instrument your HBA driver to see what is
going on (i.e., why max_hw_sectors_kb is < 1024). I don't know if it
is due to a driver limitation or a true hardware limit.
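
As a starting point (just a sketch, not a definitive procedure):
"modinfo -p megaraid_sas" lists the driver's module parameters, in case
your driver version exposes a max_sectors knob, and grepping the driver
source for max_sectors should show where the 320KB limit is coming from:

# modinfo -p megaraid_sas
# grep -rn max_sectors drivers/scsi/megaraid/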

Most drivers have a limit of 512KB by default; see Bug 22850 for the
patches that fixed the QLogic and Emulex fibre channel drivers.

Kevin

Kevin Hildebrand

May 12, 2011, 10:52:40 AM
to lustre-...@lists.lustre.org

One of the oddities I'm seeing, which has me grasping at write
fragmentation and I/O sizes, may not be directly related to those things
at all. Periodically, iostat (example invocation below) shows that one
or more of my OST disks is running at 99% utilization. Reads per second
are somewhere in the 150-200 range, while read kB/second is quite small.
In addition, the average request size is also very small. llobdstat
output on the OST in question usually shows zero or very small values
for reads and writes, and values for stats/punches/creates/deletes in
the ones and twos.
While this is happening, Lustre starts complaining about 'slow commitrw',
'slow direct_io', etc. At the same time, accesses from clients are
usually hanging.
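
For reference, the iostat numbers above come from watching something
along these lines on the OSS, looking at the r/s, rkB/s, avgrq-sz, and
%util columns for the OST disks:

# iostat -xk 5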

Why would the disk(s) be pegged while llobdstat shows zero activity?

After a few minutes in this state, the %util drops back down to
single-digit percentages and normal I/O resumes on the clients.

Thanks,
Kevin

Jason Rappleye

May 12, 2011, 10:57:41 AM
to Kevin Hildebrand, lustre-...@lists.lustre.org

On May 12, 2011, at 7:52 AM, Kevin Hildebrand wrote:

>
> One of the oddities I'm seeing, which has me grasping at write
> fragmentation and I/O sizes, may not be directly related to those things
> at all. Periodically, iostat shows that one or more of my OST disks is
> running at 99% utilization. Reads per second are somewhere in the
> 150-200 range, while read kB/second is quite small.

That sounds familiar. You're probably experiencing these:

https://bugzilla.lustre.org/show_bug.cgi?id=24183
http://jira.whamcloud.com/browse/LU-15

Jason

--
Jason Rappleye
System Administrator
NASA Advanced Supercomputing Division
NASA Ames Research Center
Moffett Field, CA 94035
