I am hoping that someone in the Lustre community can shed some light on
my problem.
In my setup I use:
Lustre 1.8.5
CentOS-5.5
Some parameters I tuned away from the CentOS defaults:
deadline I/O scheduler
max_hw_sectors_kb=4096
max_sectors_kb=1024
brw_stats output
--
find /proc/fs/lustre/obdfilter/ -name "testfs-OST*" | while read ost;
do cat $ost/brw_stats ; done | grep "disk I/O size" -A9
disk I/O size          ios   % cum % |  ios   % cum %
4K:                    206   0   0   |  521   0   0
8K:                    224   0   0   |  595   0   1
16K:                   105   0   1   |  479   0   1
32K:                   140   0   1   | 1108   1   3
64K:                   231   0   1   | 1470   1   4
128K:                  536   1   2   | 2259   2   7
256K:                 1762   3   6   | 5644   6  14
512K:                31574  64  71   | 30431  35  50
1M:                  14200  28 100   | 42143  49 100
--
disk I/O size          ios   % cum % |  ios   % cum %
4K:                    187   0   0   |  457   0   0
8K:                    244   0   0   |  598   0   1
16K:                   109   0   1   |  481   0   1
32K:                   129   0   1   | 1100   1   3
64K:                   222   0   1   | 1408   1   4
128K:                  514   1   2   | 2291   2   7
256K:                 1718   3   6   | 5652   6  14
512K:                32222  65  72   | 29810  35  49
1M:                  13654  27 100   | 42202  50 100
--
disk I/O size          ios   % cum % |  ios   % cum %
4K:                    196   0   0   |  551   0   0
8K:                    206   0   0   |  551   0   1
16K:                    79   0   0   |  513   0   1
32K:                   136   0   1   | 1048   1   3
64K:                   232   0   1   | 1278   1   4
128K:                  540   1   2   | 2172   2   7
256K:                 1681   3   6   | 5679   6  13
512K:                31842  64  71   | 31705  37  51
1M:                  14077  28 100   | 41789  48 100
--
disk I/O size          ios   % cum % |  ios   % cum %
4K:                    190   0   0   |  486   0   0
8K:                    200   0   0   |  547   0   1
16K:                    93   0   0   |  448   0   1
32K:                   141   0   1   | 1029   1   3
64K:                   240   0   1   | 1283   1   4
128K:                  558   1   2   | 2125   2   7
256K:                 1716   3   6   | 5400   6  13
512K:                31476  64  70   | 29029  35  48
1M:                  14366  29 100   | 42454  51 100
--
disk I/O size          ios   % cum % |  ios   % cum %
4K:                    209   0   0   |  511   0   0
8K:                    195   0   0   |  621   0   1
16K:                    79   0   0   |  558   0   1
32K:                   134   0   1   | 1135   1   3
64K:                   245   0   1   | 1390   1   4
128K:                  509   1   2   | 2219   2   7
256K:                 1715   3   6   | 5687   6  14
512K:                31784  64  71   | 31172  36  50
1M:                  14112  28 100   | 41719  49 100
--
disk I/O size          ios   % cum % |  ios   % cum %
4K:                    201   0   0   |  500   0   0
8K:                    241   0   0   |  604   0   1
16K:                    82   0   1   |  584   0   1
32K:                   130   0   1   | 1092   1   3
64K:                   230   0   1   | 1331   1   4
128K:                  547   1   2   | 2253   2   7
256K:                 1695   3   6   | 5634   6  14
512K:                31501  64  70   | 31836  37  51
1M:                  14343  29 100   | 41517  48 100
-- 
Wojciech Turek
_______________________________________________
You may want to see Bug 22850 for the changes required, e.g. for the 
Emulex/lpfc driver.
Glancing at the stock RHEL5 kernel, it looks like the issue is 
MPT2SAS_SG_DEPTH, which is limited to 128.  This appears to be set to 
match the default kernel limit, but it is possible there is also a 
driver/HW limit.  You should be able to increase that to 256 and see if 
it works...
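For example, if you have the kernel source handy, you can check where that 
cap comes from before patching (a rough sketch; the source path, and whether 
the limit is driven by a Kconfig symbol, are assumptions about your build):

KSRC=/usr/src/linux    # example path -- adjust to wherever your tree lives
grep -n "MPT2SAS_SG_DEPTH" $KSRC/drivers/scsi/mpt2sas/mpt2sas_base.h
grep -n "MPT2SAS_MAX_SGE" $KSRC/.config    # only present if a Kconfig option governs it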
Also note that the size buckets are power-of-2, so a "1MB" entry is any 
IO > 512KB and <= 1MB.
If you can't get the driver to reliably do full 1MB IOs, change the RAID 
to a 64KB chunk and set max_sectors_kb to 512.  This will help ensure you 
get aligned, full-stripe writes.
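As a concrete example (device name and geometry are illustrative only): with 
8 data disks and a 64KB chunk, one 512KB request covers exactly one full 
stripe, and max_sectors_kb can be set per block device:

# 8 data disks x 64KB chunk = 512KB full stripe (example geometry)
echo 512 > /sys/block/sdb/queue/max_sectors_kb
cat /sys/block/sdb/queue/max_sectors_kb     # verify it took effect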
Kevin
_______________________________________________
You could put a #warning in the code to see whether you hit the non-256 
code path when building, or printk the max_sgl_entries in 
_base_allocate_memory_pools.
Kevin
_______________________________________________
No, but you do need to do something like "make oldconfig" to propagate 
the change in .config to the header files, and then rebuild the driver.
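Roughly along these lines (a sketch only; the paths assume you are building 
inside the full RHEL5 kernel source tree, and depending on its state you may 
need a fuller build first):

cd /usr/src/linux                       # example: your kernel source tree
make oldconfig                          # propagate the .config change
make modules_prepare                    # regenerate the generated headers
make M=drivers/scsi/mpt2sas modules     # rebuild just the mpt2sas driver
cp drivers/scsi/mpt2sas/mpt2sas.ko \
   /lib/modules/$(uname -r)/kernel/drivers/scsi/mpt2sas/
depmod -a                               # then reload the module (or reboot)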
Kevin
We have seen similar issues with DM-Multipath. Can you experiment with going straight to the block device without DM-Multipath?
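For example (device and mount point names are purely illustrative), you can 
map the multipath device back to its underlying sd devices and mount one of 
them directly:

multipath -ll                        # lists which sdX paths back each mpath map
umount /mnt/ost0                     # example OST mount point
mount -t lustre /dev/sdb /mnt/ost0   # mount one underlying path directly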
Thanks,
Galen
_______________________________________________
sg_tablesize may still be limited elsewhere, although the kernel patches 
in v1.8.5 should prevent that.
Do this:
# cat /sys/class/scsi_host/host*/sg_tablesize
This should be 256.  If not, then this is still the issue.
# cat /sys/block/sd*/queue/max_hw_sectors_kb
This should be >= 1024
# cat /sys/block/sd*/queue/max_sectors_kb
This should be 1024 (Lustre mount sets it to max_hw_sectors_kb)
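On a node with many OSTs it may be handy to dump all of these in one pass, 
e.g. (just a convenience sketch):

cat /sys/class/scsi_host/host*/sg_tablesize
for d in /sys/block/sd*; do
    echo "$d: hw=$(cat $d/queue/max_hw_sectors_kb)KB cur=$(cat $d/queue/max_sectors_kb)KB"
done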
_base_allocate_memory_pools prints a bunch of helpful info via 
MPT2SAS_INFO_FMT (which goes to KERN_INFO) and via dinitprintk (the 
MPT_DEBUG_INIT flag).  Turn up the kernel log verbosity and set the 
module parameter logging_level=0x20.
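For instance (a sketch; whether logging_level is writable at runtime depends 
on the driver build, otherwise set it at module load time):

echo 0x20 > /sys/module/mpt2sas/parameters/logging_level   # if writable at runtime
# or reload the driver with the parameter set:
modprobe -r mpt2sas && modprobe mpt2sas logging_level=0x20
dmesg | grep -i mpt2sas    # look for the memory pool / SGE info lines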
If you still don't have an answer, then look at these values in 
drivers/scsi/scsi_lib.c:
        blk_queue_max_hw_segments(q, shost->sg_tablesize);
        blk_queue_max_phys_segments(q, SCSI_MAX_PHYS_SEGMENTS);
        blk_queue_max_sectors(q, shost->max_sectors);
Kevin
Wojciech Turek wrote:
> Hi Kevin,
>
> Unfortunately still no luck with 1MB I/O. I have forced my OSS to do 
> 512KB I/O following your suggestion, setting max_sectors_kb to 512. I 
> also recreated my HW RAID with a 64KB chunk size to align it for 512KB 
> I/Os. I can see from the brw_stats and controller statistics that it 
> does indeed do twice as many IOPS relative to the throughput in MB/s, 
> but performance isn't any better than before.
> From the sgpdd-survey I know that this controller can do around 3GB/s 
> write and 4GB/s read. Also, when running sgpdd-survey, the controller 
> stats show that the I/O is not fragmented (number of IOPS = throughput 
> in MB/s). I also tried to bypass the multipath layer by mounting the sd 
> devices directly, but that did not make any difference.
>
> If you have any more suggestions I will be happy to try them out.
>
> Best regards,
>
> Wojciech
>
>
> On 13 June 2011 15:13, Kevin Van Maren <kevin.v...@oracle.com> wrote:
>
>     Did you get it doing 1MB IOs?
>
>     Kevin