I am hoping that someone in the Lustre community can shed some light on
my problem.
In my setup I use:
Lustre 1.8.5
CentOS-5.5
Parameters I tuned away from the CentOS defaults:
deadline I/O scheduler
max_hw_sectors_kb=4096
max_sectors_kb=1024
brw_stats output
--
find /proc/fs/lustre/obdfilter/ -name "testfs-OST*" | while read ost; do
    cat $ost/brw_stats
done | grep "disk I/O size" -A9
                       read          |         write
disk I/O size     ios   %  cum %  |    ios   %  cum %
    4K:           206   0    0    |    521   0    0
    8K:           224   0    0    |    595   0    1
   16K:           105   0    1    |    479   0    1
   32K:           140   0    1    |   1108   1    3
   64K:           231   0    1    |   1470   1    4
  128K:           536   1    2    |   2259   2    7
  256K:          1762   3    6    |   5644   6   14
  512K:         31574  64   71    |  30431  35   50
    1M:         14200  28  100    |  42143  49  100
--
                       read          |         write
disk I/O size     ios   %  cum %  |    ios   %  cum %
    4K:           187   0    0    |    457   0    0
    8K:           244   0    0    |    598   0    1
   16K:           109   0    1    |    481   0    1
   32K:           129   0    1    |   1100   1    3
   64K:           222   0    1    |   1408   1    4
  128K:           514   1    2    |   2291   2    7
  256K:          1718   3    6    |   5652   6   14
  512K:         32222  65   72    |  29810  35   49
    1M:         13654  27  100    |  42202  50  100
--
                       read          |         write
disk I/O size     ios   %  cum %  |    ios   %  cum %
    4K:           196   0    0    |    551   0    0
    8K:           206   0    0    |    551   0    1
   16K:            79   0    0    |    513   0    1
   32K:           136   0    1    |   1048   1    3
   64K:           232   0    1    |   1278   1    4
  128K:           540   1    2    |   2172   2    7
  256K:          1681   3    6    |   5679   6   13
  512K:         31842  64   71    |  31705  37   51
    1M:         14077  28  100    |  41789  48  100
--
                       read          |         write
disk I/O size     ios   %  cum %  |    ios   %  cum %
    4K:           190   0    0    |    486   0    0
    8K:           200   0    0    |    547   0    1
   16K:            93   0    0    |    448   0    1
   32K:           141   0    1    |   1029   1    3
   64K:           240   0    1    |   1283   1    4
  128K:           558   1    2    |   2125   2    7
  256K:          1716   3    6    |   5400   6   13
  512K:         31476  64   70    |  29029  35   48
    1M:         14366  29  100    |  42454  51  100
--
                       read          |         write
disk I/O size     ios   %  cum %  |    ios   %  cum %
    4K:           209   0    0    |    511   0    0
    8K:           195   0    0    |    621   0    1
   16K:            79   0    0    |    558   0    1
   32K:           134   0    1    |   1135   1    3
   64K:           245   0    1    |   1390   1    4
  128K:           509   1    2    |   2219   2    7
  256K:          1715   3    6    |   5687   6   14
  512K:         31784  64   71    |  31172  36   50
    1M:         14112  28  100    |  41719  49  100
--
                       read          |         write
disk I/O size     ios   %  cum %  |    ios   %  cum %
    4K:           201   0    0    |    500   0    0
    8K:           241   0    0    |    604   0    1
   16K:            82   0    1    |    584   0    1
   32K:           130   0    1    |   1092   1    3
   64K:           230   0    1    |   1331   1    4
  128K:           547   1    2    |   2253   2    7
  256K:          1695   3    6    |   5634   6   14
  512K:         31501  64   70    |  31836  37   51
    1M:         14343  29  100    |  41517  48  100
--
Wojciech Turek
_______________________________________________
You may want to see Bug 22850 for the changes required, e.g., for the
Emulex/lpfc driver.
Glancing at the stock RHEL5 kernel, it looks like the issue is
MPT2SAS_SG_DEPTH, which is limited to 128. This appears to be set to
match the default kernel limit, but it is possible there is also a
driver/HW limit. You should be able to increase it to 256 and see if
that works.
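For illustration, the clamp in the upstream mpt2sas driver of that era
looks roughly like the following (in mpt2sas_base.h; this is a sketch
only, so verify against your actual source tree before editing):

/* Stock code caps the SG depth at 128; raising the clamp to 256
 * allows 256 x 4KB pages = 1MB per I/O request. */
#if defined(CONFIG_SCSI_MPT2SAS_MAX_SGE)
#if CONFIG_SCSI_MPT2SAS_MAX_SGE < 16
#define MPT2SAS_SG_DEPTH 16
#elif CONFIG_SCSI_MPT2SAS_MAX_SGE > 256        /* was 128 */
#define MPT2SAS_SG_DEPTH 256                   /* was 128 */
#else
#define MPT2SAS_SG_DEPTH CONFIG_SCSI_MPT2SAS_MAX_SGE
#endif
#else
#define MPT2SAS_SG_DEPTH 128
#endif

With the clamp raised, you would also set CONFIG_SCSI_MPT2SAS_MAX_SGE=256
in .config and rebuild the driver.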
Also note that the size buckets are powers of two, so a "1M" entry is
any I/O > 512KB and <= 1MB.
If you can't get the driver to reliably do full 1MB I/Os, change the
RAID to a 64KB chunk size and set max_sectors_kb to 512. This will help
ensure you get aligned, full-stripe writes.
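For example (device name illustrative; repeat for each OST block
device):

# echo 512 > /sys/block/sdb/queue/max_sectors_kb
# cat /sys/block/sdb/queue/max_sectors_kb
512

Since the Lustre mount adjusts this value, re-check it after mounting
the OSTs.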
Kevin
_______________________________________________
You could put a #warning in the code to see whether you hit the non-256
code path when building, or printk max_sgl_entries in
_base_allocate_memory_pools.
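A rough sketch of both checks, assuming the RHEL5-era mpt2sas source
layout (identifiers may differ in your tree):

/* mpt2sas_base.h: fail loudly at compile time if the 256 code
 * path was not taken. */
#if MPT2SAS_SG_DEPTH != 256
#warning MPT2SAS_SG_DEPTH is not 256
#endif

/* mpt2sas_base.c, inside _base_allocate_memory_pools(): print the
 * module parameter and the resulting SG table size. */
printk(KERN_INFO "mpt2sas: max_sgl_entries=%d sg_tablesize=%d\n",
       max_sgl_entries, ioc->shost->sg_tablesize);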
Kevin
_______________________________________________
No, but you do need to do something like "make oldconfig" to propagate
the change in .config to the header files, and then rebuild the driver.
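Roughly (the source path is illustrative):

# cd /path/to/kernel/source
# make oldconfig
# make M=drivers/scsi/mpt2sas modules

and then install the rebuilt mpt2sas.ko and reload it.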
Kevin
_______________________________________________
We have seen similar issues with DM-Multipath. Can you experiment with
going straight to the block device, without DM-Multipath?
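For example (device and mount point illustrative), mount the underlying
sd device instead of the dm-multipath device:

# umount /mnt/ost0
# mount -t lustre /dev/sdc /mnt/ost0    (instead of /dev/mapper/mpath2)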
Thanks,
Galen
_______________________________________________
sg_tablesize may still be limited elsewhere, although the kernel patches
in 1.8.5 should prevent that. Check the following:
# cat /sys/class/scsi_host/host*/sg_tablesize
This should be 256. If not, then this is still the issue.
# cat /sys/block/sd*/queue/max_hw_sectors_kb
This should be >= 1024
# cat /sys/block/sd*/queue/max_sectors_kb
This should be 1024 (Lustre mount sets it to max_hw_sectors_kb)
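A quick way to dump all three at once (grep prints each file name
alongside its value):

# grep . /sys/class/scsi_host/host*/sg_tablesize \
      /sys/block/sd*/queue/max_hw_sectors_kb \
      /sys/block/sd*/queue/max_sectors_kb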
_base_allocate_memory_pools prints a lot of helpful info via
MPT2SAS_INFO_FMT (which goes to KERN_INFO) and via dinitprintk (gated by
the MPT_DEBUG_INIT flag). Turn up the kernel log verbosity and set the
module parameter logging_level=0x20.
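For example (reloading the module is only safe if the root disk is not
behind this HBA):

# modprobe -r mpt2sas
# modprobe mpt2sas logging_level=0x20
# dmesg | tail -50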
If you still don't have an answer, then look at these values in
drivers/scsi/scsi_lib.c:
blk_queue_max_hw_segments(q, shost->sg_tablesize);      /* SG entries the HBA accepts per request */
blk_queue_max_phys_segments(q, SCSI_MAX_PHYS_SEGMENTS); /* kernel cap on physical segments */
blk_queue_max_sectors(q, shost->max_sectors);           /* cap on request size in sectors */
Kevin
Wojciech Turek wrote:
> Hi Kevin,
>
> Unfortunately, still no luck with 1MB I/O. I have forced my OSS to do
> 512KB I/O following your suggestion, setting max_sectors_kb to 512. I
> also recreated my HW RAID with 64KB chunks to align it for 512KB
> full-stripe writes. I can see from the brw_stats and controller
> statistics that it does indeed do twice as many IOPS relative to the
> throughput in MB/s, but performance isn't any better than before.
> From sgpdd-survey I know that this controller can do around 3GB/s
> write and 4GB/s read. Also, when running sgpdd-survey, the controller
> stats show that the I/O is not fragmented (number of IOPS = throughput
> in MB/s). I also tried to bypass the multipath layer by mounting the
> sd devices directly, but that did not make any difference.
>
> If you have any more suggestions I will be happy to try them out.
>
> Best regards,
>
> Wojciech
>
>
> On 13 June 2011 15:13, Kevin Van Maren <kevin.v...@oracle.com> wrote:
>
> Did you get it doing 1MB IOs?
>
> Kevin