[Lustre-discuss] DDN hints?

John R. Dunning

May 18, 2007, 7:56:56 AM

to lustre-...@clusterfs.com

I'm getting my first exposure to a ddn storage array and it's enlightening :-}

Mostly it just does what I expected of it with little trouble, but I'm having
a hard time getting the read side working as fast as it ought to be able to go.
Does anybody have experience they'd like to share, tuning the kernel/driver or
the array?

I'm using 2.6.15 kernel, and qlogic 2462 hbas with 8.01.07 driver. Using the
anticipatory scheduler, and tweaking up the readahead size for the blockdev, I
can get around 300MB/s by using 4 threads on a port, or about 3/4 of the
expected max. Writes max out easily. The ddn's stats say that the large
majority of my reads are only 256K, even though the requests are larger than
that.

I tried incorporating the blkdev-max-io-size-selection and
increase-sglist-size patches from cfs, but that didn't really help, my reads
are still maxing out at 256K.

If anybody's been through this kind of thing and has experiences, rumors, or
war stories about what kinds of tuning in this area yield good results, I'd
love to talk to you!

_______________________________________________
Lustre-discuss mailing list
Lustre-...@clusterfs.com
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss

Daniel Leaberry

May 18, 2007, 10:06:35 AM

to John R. Dunning, lustre-...@clusterfs.com

John R. Dunning wrote:
> I'm getting my first exposure to a ddn storage array and it's enlightening :-}
>
> Mostly it just does what I expected of it with little trouble, but I'm having
> a hard time getting the read side working as fast as it ought to be able to go.
> Does anybody have experience they'd like to share, tuning the kernel/driver or
> the array?
>
> I'm using 2.6.15 kernel, and qlogic 2462 hbas with 8.01.07 driver. Using the
> anticipatory scheduler, and tweaking up the readahead size for the blockdev, I
> can get around 300MB/s by using 4 threads on a port, or about 3/4 of the
> expected max. Writes max out easily. The ddn's stats say that the large
> majority of my reads are only 256K, even though the requests are larger than
> that.
>
> I tried incorporating the blkdev-max-io-size-selection and
> increase-sglist-size patches from cfs, but that didn't really help, my reads
> are still maxing out at 256K.
>
> If anybody's been through this kind of thing and has experiences, rumors, or
> war stories about what kinds of tuning in this area yield good results, I'd
> love to talk to you!
>

I know this won't help you, but for posterity's sake use IBGD 1.8.2 if
you have an infiniband DDN array. I tried for 4 days with OFED 1.1.1 to
get decent io performance and never could. 30 minutes to install IBGD
and I was pushing 700MB/sec through the port using dd.

Makia Minich

May 18, 2007, 10:13:19 AM

to lustre-...@clusterfs.com

On Friday 18 May 2007 10:06:35 am Daniel Leaberry wrote:
> John R. Dunning wrote:
> > I'm getting my first exposure to a ddn storage array and it's
> > enlightening :-}
> >
> > Mostly it just does what I expected of it with little trouble, but I'm
> > having a hard time getting the read side working as fast as it ought to be
> > able to go. Does anybody have experience they'd like to share, tuning the
> > kernel/driver or the array?
> >
> > I'm using 2.6.15 kernel, and qlogic 2462 hbas with 8.01.07 driver. Using
> > the anticipatory scheduler, and tweaking up the readahead size for the
> > blockdev, I can get around 300MB/s by using 4 threads on a port, or about
> > 3/4 of the expected max. Writes max out easily. The ddn's stats say
> > that the large majority of my reads are only 256K, even though the
> > requests are larger than that.
> >
> > I tried incorporating the blkdev-max-io-size-selection and
> > increase-sglist-size patches from cfs, but that didn't really help, my
> > reads are still maxing out at 256K.
> >
> > If anybody's been through this kind of thing and has experiences, rumors,
> > or war stories about what kinds of tuning in this area yield good
> > results, I'd love to talk to you!
>
> I know this won't help you, but for posterity's sake use IBGD 1.8.2 if
> you have an infiniband DDN array. I tried for 4 days with OFED 1.1.1 to
> get decent io performance and never could. 30 minutes to install IBGD
> and I was pushing 700MB/sec through the port using dd.

Do you have actual numbers for your OFED test? If so, please send a message
to the OpenFabrics General mailing list (gen...@lists.openfabrics.org)
letting them know of this performance degradation. The more details we have
of a slow-down in the SRP performance, the more chance we have of OFED
finally fixing whatever the problem is (or at least getting Mellanox to pony
up what is the difference between IBGD's and OFED's SRP client code and
explain why they haven't submitted changes).


--
Makia Minich <min...@ornl.gov>
National Center for Computation Science
Oak Ridge National Laboratory
Phone: 865.574.7460
--*--
Imagine no possessions
I wonder if you can
- John Lennon

chas williams - CONTRACTOR

May 18, 2007, 10:38:29 AM

to John R. Dunning, lustre-...@clusterfs.com

In message <17997.38024....@gs105.sicortex.com>, "John R. Dunning" writes:

>I tried incorporating the blkdev-max-io-size-selection and
>increase-sglist-size patches from cfs, but that didn't really help, my reads
>are still maxing out at 256K.

the srp initiator creates a virtual scsi device driver. this virtual
device driver has a .max_sectors parameter associated with it. you can
tune this with the max_sect= option during login for the openfabrics stack.
no idea how this is tuned on ibgold.

take a look at /sys/block/sd<whatever>/queue/{max_hw_sectors_kb,max_sectors_kb}

if you aren't using direct i/o, use direct i/o. you could just tune
the page size of the ddn to 256k.
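
A minimal sketch of checking the two queue limits named above, written in Python. The helper name read_queue_limits and its sysfs_root parameter are made up for illustration; only the two file names under queue/ come from the message. It assumes a kernel that exposes both files.

```python
from pathlib import Path


def read_queue_limits(device, sysfs_root="/sys/block"):
    """Return {limit name: value in KiB} for one block device.

    Files that do not exist are skipped rather than treated as errors.
    """
    queue_dir = Path(sysfs_root) / device / "queue"
    limits = {}
    for name in ("max_hw_sectors_kb", "max_sectors_kb"):
        limit_file = queue_dir / name
        if limit_file.is_file():
            limits[name] = int(limit_file.read_text().strip())
    return limits


if __name__ == "__main__":
    root = Path("/sys/block")
    # Only look at sd* devices, matching the path quoted in the message.
    devices = sorted(p.name for p in root.glob("sd*")) if root.is_dir() else []
    for dev in devices:
        print(dev, read_queue_limits(dev))
```

If max_sectors_kb comes back smaller than max_hw_sectors_kb, the soft limit is what caps request size; it can normally be raised, as root, by writing a new value (no larger than the hardware limit) into max_sectors_kb.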

Makia Minich

May 18, 2007, 11:14:49 AM

to lustre-...@clusterfs.com

How much luck did you have with this tuning and OFED's SRP? What performance
are you seeing? We had done quite a bit of testing playing with this option,
but saw very little improvement in performance (if I remember correctly, the
block sizes did increase, but performance was still down).

--

Makia Minich <min...@ornl.gov>
National Center for Computation Science
Oak Ridge National Laboratory
Phone: 865.574.7460
--*--
Imagine no possessions
I wonder if you can
- John Lennon

Daniel Leaberry

May 18, 2007, 11:54:33 AM

to Makia Minich, lustre-...@clusterfs.com

Makia Minich wrote:
> How much luck did you have with this tuning and OFED's SRP? What performance
> are you seeing? We had done quite a bit of testing playing with this option,
> but saw very little improvement in performance (if I remember correctly, the
> block sizes did increase, but performance was still down).
>
>

That's what I saw as well. I eventually got great write performance
with /dev/sg* devices by tuning srp_sg_tablesize (it defaults to 12,
which sent 48KB IOs to the array), but I could never get /dev/sd*
devices to perform, and reading was always stuck at 128KB IOs no matter
what I passed in to srp_sg_tablesize.
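
The 48KB figure is consistent with each scatter-gather entry mapping one 4 KiB page (12 x 4 KiB = 48 KiB). A small sketch of that arithmetic follows. The one-page-per-entry mapping and the 4 KiB page size are assumptions, not something the thread states, and the function names are illustrative only.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, one page per scatter-gather entry


def max_io_bytes(sg_entries, page_size=PAGE_SIZE):
    """Largest single IO that fits in a scatter-gather table of this size."""
    return sg_entries * page_size


def sg_entries_needed(io_bytes, page_size=PAGE_SIZE):
    """Scatter-gather entries needed for one IO (ceiling division)."""
    return -(-io_bytes // page_size)


if __name__ == "__main__":
    print(max_io_bytes(12) // 1024, "KiB with the default of 12 entries")
    print(sg_entries_needed(128 * 1024), "entries for a 128 KiB IO")
    print(sg_entries_needed(1024 * 1024), "entries for a 1 MiB IO")
```

Under the same assumption, reads stuck at 128KB correspond to 32 entries, and a 1 MiB IO would need 256. Whether the driver accepts a table that large is a separate question this sketch does not answer.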

chas williams - CONTRACTOR

May 18, 2007, 11:57:05 AM

to Makia Minich, lustre-...@clusterfs.com

well... i suspect tuning the i/o sizes to be larger didn't make a big
difference on reads. it helps to get the write to at least match the page
size on the ddn's memory cache (2MB as i recall, but this can be tuned to
a smaller value). this will let most devices "write through" the memory
cache directly to disk. as you get farther and farther away from your
storage, you need to increase the message size to offset bandwidth*delay.
after a bit of fiddling, we managed to get:

Using Minimum Record Size 1024 KB
Auto Mode 2. This option is obsolete. Use -az -i0 -i1
O_DIRECT feature enabled
Command line used: /data1/iozone.ia64 -f testfile -y 1024k -A -I
Output is in Kbytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 Kbytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.

              KB  reclen   write rewrite    read  reread
          524288    1024  270275  430822  354823  355126
          524288    2048  412733  545186  673329  679848
          524288    4096  533884  619260 1048551 1053483
          524288    8192  606596  665102 1201478 1192968
          524288   16384  662077  698136 1333838 1334341

this was a single host using 2 ddr adapters striped across 8 luns on
the ddn. each lun on the ddn was across 14 tiers. obviously a 512MB
test file fits inside the ddn cache. the ddn should be able to go faster,
but my single host couldn't push harder.

Andreas Dilger

May 18, 2007, 2:43:48 PM

to John R. Dunning, lustre-...@clusterfs.com

On May 18, 2007 07:56 -0400, John R. Dunning wrote:
> I'm using 2.6.15 kernel, and qlogic 2462 hbas with 8.01.07 driver. Using the
> anticipatory scheduler, and tweaking up the readahead size for the blockdev, I

For a DDN you should probably use noop or deadline scheduler. Anticipatory
is really tuned for desktop workloads.
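
For reference, a minimal sketch of checking which scheduler a device is using, assuming the usual /sys/block/<dev>/queue/scheduler file in which the active scheduler is shown in square brackets. The function name and the sda device in the example are hypothetical.

```python
from pathlib import Path


def parse_scheduler(text):
    """Split a scheduler line into (active name or None, all listed names)."""
    active = None
    available = []
    for name in text.split():
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]  # the bracketed entry is the active scheduler
            active = name
        available.append(name)
    return active, available


if __name__ == "__main__":
    sched_file = Path("/sys/block/sda/queue/scheduler")  # sda is an assumption
    if sched_file.is_file():
        print(parse_scheduler(sched_file.read_text()))
```

Switching is done by writing one of the listed names into the same file as root; the sketch only reads it.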

> can get around 300MB/s by using 4 threads on a port, or about 3/4 of the
> expected max. Writes max out easily. The ddn's stats say that the large
> majority of my reads are only 256K, even though the requests are larger than
> that.

What tool are you using to measure performance? I'd strongly suggest using
the lustre-iokit, which has several components in order to test bare-disk,
local filesystem, network, and lustre-filesystem components independently.

Lustre can consistently generate 1MB IOs to the underlying filesystem because
it submits the IO in 1MB chunks, unlike the kernel's read() and write() calls
which submit IO in 4kB chunks and hope the elevator can merge them.

See also the DDN tuning section in the Lustre manual.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

John R. Dunning

May 18, 2007, 3:10:45 PM

to Andreas Dilger, lustre-...@clusterfs.com

    From: Andreas Dilger <adi...@clusterfs.com>
    Date: Fri, 18 May 2007 12:43:48 -0600

    On May 18, 2007 07:56 -0400, John R. Dunning wrote:
    > I'm using 2.6.15 kernel, and qlogic 2462 hbas with 8.01.07 driver. Using the
    > anticipatory scheduler, and tweaking up the readahead size for the blockdev, I

    For a DDN you should probably use noop or deadline scheduler. Anticipatory
    is really tuned for desktop workloads.

Yes, others have said the same thing. I've tried them both but so far there's
not much difference. The evidence is that something in the block layer is
breaking up read requests, which seems to negate any effect I might be getting
from the iosched.

I found /proc/sys/vm/block_dump, added some extra instrumentation to it, and
turned it on. On the write side, I'm seeing nice big requests (though the
sizes are a bit all over the place) but on the read side it seems to be
willing to go up to 32 elements in the bio and never go any higher. That
statement so far seems to be true regardless of what I use for readahead
values, what scheduler tuning params I give it, what kind of request size the
higher level thinks it's issuing etc. It's behaving like there's something
which has an arbitrary limit on the size of a read request, but I haven't yet
figured out what that is.

    > can get around 300MB/s by using 4 threads on a port, or about 3/4 of the
    > expected max. Writes max out easily. The ddn's stats say that the large
    > majority of my reads are only 256K, even though the requests are larger than
    > that.

    What tool are you using to measure performance?

Various. Mostly iozone and timing dd and stuff like that. I'm not (yet)
running lustre against the ddn.

    I'd strongly suggest using
    the lustre-iokit, which has several components in order to test bare-disk,
    local filesystem, network, and lustre-filesystem components independently.

Ok. I tried an older version of it last year, and it didn't seem to be
telling me anything I hadn't already found out by other means. EEB shipped me a
newer version, which I've unpacked, and am currently trying to figure out how
to build. It seems to be set up such that I have to autoconf it, but trying
to do that causes errors. Hints?

    Lustre can consistently generate 1MB IOs to the underlying filesystem because
    it submits the IO in 1MB chunks, unlike the kernel's read() and write() calls
    which submit IO in 4kB chunks and hope the elevator can merge them.

    See also the DDN tuning section in the Lustre manual.

Ok, will do. Thanks.

John R. Dunning

May 18, 2007, 3:18:03 PM

to John R. Dunning, lustre-...@clusterfs.com, Andreas Dilger

    From: "John R. Dunning" <j...@sicortex.com>
    Date: Fri, 18 May 2007 15:10:45 -0400

    It seems to be set up such that I have to autoconf it, but trying
    to do that causes errors. Hints?

Ok, never mind, I realized that this version doesn't contain ior, so it's all
shell scripts, no building required.
