[Lustre-discuss] HW RAID - fragmented I/O


Wojciech Turek

Jun 8, 2011, 11:53:31 AM
to lustre-discuss
I am setting up a new Lustre filesystem using LSI Engenio based disk
enclosures with integrated dual RAID controllers. I configured the disks
into 8+2 RAID6 groups with a 128KB segment size (chunk size). This
hardware uses the mpt2sas kernel module on the Linux host side. I use the
whole block device for an OST (to avoid any alignment issues). When
running sgpdd-survey I see high throughput numbers (~3GB/s write,
~5GB/s read), and the controller stats show that the number of IOPS equals
the number of MB/s. However, as soon as I put ldiskfs on the OSTs,
obdfilter shows slower results (~2GB/s write, ~2GB/s read) and the
controller stats show more than double the IOPS compared to MB/s. Looking
at the output of iostat -m -x 1 and brw_stats I can see that a large
number of I/O operations are smaller than 1MB, mostly 512KB. I know that
some work was done on optimising the kernel block device layer to process
1MB I/O requests and that those changes were committed to Lustre 1.8.5,
so I guess this I/O chopping happens below the Lustre stack, maybe in the
mpt2sas driver?
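
For reference, this is roughly how I watch the request sizes reaching the
block devices (nothing beyond stock sysstat is assumed here):

# iostat -m -x 1

The avgrq-sz column is reported in 512-byte sectors, so values around 1024
indicate 512KB requests, while full 1MB requests would show up as roughly
2048.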

I am hoping that someone in the Lustre community can shed some light on
my problem.

In my setup I use:
Lustre 1.8.5
CentOS-5.5

Some parameters I tuned from defaults in CentOS:
deadline I/O scheduler

max_hw_sectors_kb=4096
max_sectors_kb=1024
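
These are applied per block device through sysfs, roughly like this (sdb
is just a placeholder for each OST device):

# echo deadline > /sys/block/sdb/queue/scheduler
# cat /sys/block/sdb/queue/max_hw_sectors_kb
# echo 1024 > /sys/block/sdb/queue/max_sectors_kb

max_hw_sectors_kb is reported by the driver and is read-only;
max_sectors_kb can be raised up to that value.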


brw_stats output (left-hand columns are reads, right-hand columns are writes)
--

find /proc/fs/lustre/obdfilter/ -name "testfs-OST*" | while read ost;
do cat $ost/brw_stats ; done | grep "disk I/O size" -A9

disk I/O size ios % cum % | ios % cum %
4K: 206 0 0 | 521 0 0
8K: 224 0 0 | 595 0 1
16K: 105 0 1 | 479 0 1
32K: 140 0 1 | 1108 1 3
64K: 231 0 1 | 1470 1 4
128K: 536 1 2 | 2259 2 7
256K: 1762 3 6 | 5644 6 14
512K: 31574 64 71 | 30431 35 50
1M: 14200 28 100 | 42143 49 100
--
disk I/O size ios % cum % | ios % cum %
4K: 187 0 0 | 457 0 0
8K: 244 0 0 | 598 0 1
16K: 109 0 1 | 481 0 1
32K: 129 0 1 | 1100 1 3
64K: 222 0 1 | 1408 1 4
128K: 514 1 2 | 2291 2 7
256K: 1718 3 6 | 5652 6 14
512K: 32222 65 72 | 29810 35 49
1M: 13654 27 100 | 42202 50 100
--
disk I/O size ios % cum % | ios % cum %
4K: 196 0 0 | 551 0 0
8K: 206 0 0 | 551 0 1
16K: 79 0 0 | 513 0 1
32K: 136 0 1 | 1048 1 3
64K: 232 0 1 | 1278 1 4
128K: 540 1 2 | 2172 2 7
256K: 1681 3 6 | 5679 6 13
512K: 31842 64 71 | 31705 37 51
1M: 14077 28 100 | 41789 48 100
--
disk I/O size ios % cum % | ios % cum %
4K: 190 0 0 | 486 0 0
8K: 200 0 0 | 547 0 1
16K: 93 0 0 | 448 0 1
32K: 141 0 1 | 1029 1 3
64K: 240 0 1 | 1283 1 4
128K: 558 1 2 | 2125 2 7
256K: 1716 3 6 | 5400 6 13
512K: 31476 64 70 | 29029 35 48
1M: 14366 29 100 | 42454 51 100
--
disk I/O size ios % cum % | ios % cum %
4K: 209 0 0 | 511 0 0
8K: 195 0 0 | 621 0 1
16K: 79 0 0 | 558 0 1
32K: 134 0 1 | 1135 1 3
64K: 245 0 1 | 1390 1 4
128K: 509 1 2 | 2219 2 7
256K: 1715 3 6 | 5687 6 14
512K: 31784 64 71 | 31172 36 50
1M: 14112 28 100 | 41719 49 100
--
disk I/O size ios % cum % | ios % cum %
4K: 201 0 0 | 500 0 0
8K: 241 0 0 | 604 0 1
16K: 82 0 1 | 584 0 1
32K: 130 0 1 | 1092 1 3
64K: 230 0 1 | 1331 1 4
128K: 547 1 2 | 2253 2 7
256K: 1695 3 6 | 5634 6 14
512K: 31501 64 70 | 31836 37 51
1M: 14343 29 100 | 41517 48 100

--
Wojciech Turek
_______________________________________________
Lustre-discuss mailing list
Lustre-...@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss

Kevin Van Maren

Jun 8, 2011, 12:30:08 PM
to Wojciech Turek, lustre-discuss
Yep, with 1.8.5 the problem is most likely in the (mpt2sas) driver, not
in the rest of the kernel. Driver limits are not normally noticed by
(non-Lustre) people, because the default kernel limits I/O to 512KB.

You may want to see Bug 22850 for the changes that were required, e.g.,
for the Emulex/lpfc driver.

Glancing at the stock RHEL5 kernel, it looks like the issue is
MPT2SAS_SG_DEPTH, which is limited to 128. This appears to be set to
match the default kernel limit, but it is possible there is also a
driver/HW limit. You should be able to increase that to 256 and see if
it works...


Also note that the size buckets are power-of-2, so a "1MB" entry is any
IO > 512KB and <= 1MB.

If you can't get the driver to reliably do full 1MB IOs, change to a
64KB chunk and set max_sectors_kb to 512. This will help ensure you get
aligned, full-stripe writes.
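
For the arithmetic: with an 8+2 RAID6 and a 64KB segment, the full data
stripe is 8 x 64KB = 512KB, so max_sectors_kb=512 lets every request map
onto exactly one stripe and avoids read-modify-write of the parity. With
your current 128KB segment the full stripe is 8 x 128KB = 1MB, which is
why you need reliable 1MB IOs to get the same effect.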

Kevin

Wojciech Turek

Jun 10, 2011, 8:00:14 AM
to Kevin Van Maren, lustre-discuss
Hi Kevin,

Thanks for the very helpful answer. I tried your suggestion and recompiled the mpt2sas driver with the following changes:

--- mpt2sas_base.h      2010-01-16 20:57:30.000000000 +0000
+++ new_mpt2sas_base.h  2011-06-10 12:53:35.000000000 +0100
@@ -83,13 +83,13 @@
#ifdef CONFIG_SCSI_MPT2SAS_MAX_SGE
#if     CONFIG_SCSI_MPT2SAS_MAX_SGE  < 16
#define MPT2SAS_SG_DEPTH       16
-#elif CONFIG_SCSI_MPT2SAS_MAX_SGE  > 128
-#define MPT2SAS_SG_DEPTH       128
+#elif CONFIG_SCSI_MPT2SAS_MAX_SGE  > 256
+#define MPT2SAS_SG_DEPTH       256
#else
#define MPT2SAS_SG_DEPTH       CONFIG_SCSI_MPT2SAS_MAX_SGE
#endif
#else
-#define MPT2SAS_SG_DEPTH       128 /* MAX_HW_SEGMENTS */
+#define MPT2SAS_SG_DEPTH       256 /* MAX_HW_SEGMENTS */
#endif

#if defined(TARGET_MODE)

However, I can still see that almost 50% of writes and slightly over 50% of reads fall under 512KB I/Os.
I am using device-mapper-multipath to manage active/passive paths; do you think that could have something to do with the I/O fragmentation?

Best regards,

Wojciech

Wojciech Turek

Jun 10, 2011, 8:29:47 AM
to Kevin Van Maren, lustre-discuss
Hi Kevin,

In my kernel .config I find the following lines:

CONFIG_SCSI_MPT2SAS=m
CONFIG_SCSI_MPT2SAS_MAX_SGE=128
CONFIG_SCSI_MPT2SAS_LOGGING=y

I changed the SGE value to 256.

Do I need to recompile the kernel before building the new module based on that .config?

Kevin Van Maren

Jun 10, 2011, 8:38:40 AM
to Wojciech Turek, lustre-discuss
It's possible there is another issue, but are you sure you (or Red Hat)
are not setting CONFIG_SCSI_MPT2SAS_MAX_SGE in your .config, which would
prevent it from being set to 256? I don't have a machine using this
driver.

You could put a #warning in the code to see if you hit the non-256 code
path when building, or printk the max_sgl_entries value in
_base_allocate_memory_pools.

Kevin

Kevin Van Maren

Jun 10, 2011, 8:42:00 AM
to Wojciech Turek, lustre-discuss
Wojciech Turek wrote:
> Hi Kevin,
>
> In my kernel .config I find the following lines:
>
> CONFIG_SCSI_MPT2SAS=m
> CONFIG_SCSI_MPT2SAS_MAX_SGE=128
> CONFIG_SCSI_MPT2SAS_LOGGING=y
>
> I changed the SGE value to 256.
>
> Do I need to recompile the kernel before building the new module based
> on that .config?

No, but you do need to do something like "make oldconfig" to propagate
the change in .config to the header files, and then rebuild the driver.
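
Roughly something like this (from memory and untested; the source tree
path is just a placeholder for wherever you build the module from):

# cd /path/to/kernel/source
# make oldconfig
# make prepare
# make modules SUBDIRS=drivers/scsi/mpt2sas

make oldconfig / make prepare regenerate the autoconf header so the new
CONFIG_SCSI_MPT2SAS_MAX_SGE value is actually seen when the module is
rebuilt. Then copy the resulting mpt2sas.ko over the installed copy under
/lib/modules/$(uname -r)/ and run depmod -a before reloading the driver.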

Kevin

Shipman, Galen M.

Jun 10, 2011, 9:25:36 AM
to Wojciech Turek, Dillow, David A., lustre-discuss
Wojciech,

We have seen similar issues with DM-Multipath. Can you experiment with going straight to the block device without DM-Multipath?

Thanks,

Galen

Wojciech Turek

Jun 13, 2011, 11:41:10 AM
to Shipman, Galen M., lustre-discuss
Hi Galen,

I have tried your suggestion and mounted the OSTs directly on the /dev/sd<x> devices, but that didn't help and the I/O is still being fragmented.

Best regards,

Wojciech

Kevin Van Maren

Jun 13, 2011, 3:38:23 PM
to Wojciech Turek, Lustre discuss
Did you printk the SGE value in the driver, to make sure it is being set
properly?

sg_tablesize may be limited elsewhere, although the kernel patches in
1.8.5 should prevent that.


do this:

# cat /sys/class/scsi_host/host*/sg_tablesize
This should be 256. If not, then this is still the issue.


# cat /sys/block/sd*/queue/max_hw_sectors_kb
This should be >= 1024

# cat /sys/block/sd*/queue/max_sectors_kb
This should be 1024 (Lustre mount sets it to max_hw_sectors_kb)


_base_allocate_memory_pools prints a bunch of helpful info using
MPT2SAS_INFO_FMT (which goes to KERN_INFO) and dinitprintk (the
MPT_DEBUG_INIT flag). Turn up the kernel verbosity and set the module
parameter logging_level=0x20.
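
For example (a sketch; only do this with the OSTs unmounted, since it
reloads the driver):

# modprobe -r mpt2sas
# modprobe mpt2sas logging_level=0x20
# dmesg | grep -i mpt2sas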

If you still don't have an answer, then look at these values in
drivers/scsi/scsi_lib.c:

blk_queue_max_hw_segments(q, shost->sg_tablesize);
blk_queue_max_phys_segments(q, SCSI_MAX_PHYS_SEGMENTS);
blk_queue_max_sectors(q, shost->max_sectors);
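
As a sanity check on the numbers: with sg_tablesize capped at 128 and 4KB
pages, a request built from non-contiguous pages can carry at most
128 x 4KB = 512KB, which matches the 512KB ceiling you are seeing; you
need 256 entries to reach 1MB.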


Kevin


Wojciech Turek wrote:
> Hi Kevin,
>

> Unfortunately still no luck with 1MB I/O. I have forced my OSS to do
> 512KB I/O following your suggestion, setting max_sectors_kb to 512. I
> also recreated my HW RAID with 64KB chunks so that 512KB I/Os align
> with full stripes. I can see from brw_stats and the controller
> statistics that it does indeed do twice as many IOPS as MB/s of
> throughput, but performance isn't any better than before.
> From sgpdd-survey I know that this controller can do around 3GB/s
> write and 4GB/s read. Also, when running sgpdd-survey the controller
> stats show that the I/O is not fragmented (number of IOPS = throughput
> in MB/s). I also tried to bypass the multipath layer by mounting the sd
> devices directly, but that did not make any difference.
>
> If you have any more suggestions I will be happy to try them out.
>
> Best regards,
>
> Wojciech
>
>
> On 13 June 2011 15:13, Kevin Van Maren <kevin.v...@oracle.com> wrote:
>
> Did you get it doing 1MB IOs?
>
> Kevin
