[Lustre-discuss] Multiple IB ports


Brian O'Connor

Mar 20, 2011, 11:53:08 PM
to lustre-...@lists.lustre.org

Hi,

    Anybody actually using multiple IB ports on a client for an aggregated connection?

I.e. many OSS nodes with one QDR IB port each, and clients with 4 QDR IB ports. Assuming the normal issues with bus bandwidth etc., what sort of performance can I expect?

QDR ~ 3-4 GB/s

I'm trying to size a cluster and clients to get ~10 GB/s on *one* client node.

If I can aggregate IB linearly, the next step will be to try and figure out how to get 10 GB/s to local storage :-(

Sometimes customers are crazy...

Brian O'Connor

 

-------------------------------------------------
SGI Consulting
Email: bri...@sgi.com, Mobile +61 417 746 452
Phone: +61 3 9963 1900, Fax: +61 3 9963 1902
357 Camberwell Road, Camberwell, Victoria, 3124
AUSTRALIA http://www.sgi.com/support/services
-------------------------------------------------

 

Andreas Dilger

Mar 21, 2011, 3:41:49 AM
to Brian O'Connor, lustre-...@lists.lustre.org
On 2011-03-21, at 4:53 AM, Brian O'Connor wrote:
> Anybody actually using multiple IB ports on a client for an aggregated connection?
>
> I.e. many OSS nodes with one QDR IB port each, and clients with 4 QDR IB ports. Assuming the
> normal issues with bus bandwidth etc., what sort of performance can I expect?
>
> QDR ~ 3-4 GB/s
>
> I'm trying to size a cluster and clients to get ~10 GB/s on *one*
> client node.

I believe this is possible to some limited extent today. The main issue is that the primary NID addresses for the OST IB cards need to be on different subnets so that the clients will route the traffic to the OSTs via the different IB HCAs.
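For illustration only (the interface names, network numbering, and config location here are hypothetical, not taken from any site in this thread), the split might look like this in an LNet modprobe configuration:

  # client with four HCAs, one o2ib network per port
  options lnet networks="o2ib1(ib0),o2ib2(ib1),o2ib3(ib2),o2ib4(ib3)"

  # each OSS is placed on exactly one of those networks, e.g. an OSS in the second group:
  options lnet networks="o2ib2(ib0)"

With the OSTs spread across the networks, each client HCA then only carries traffic for the OSSs on its own network.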

I don't have low-level details on this myself, but I believe there are a couple of sites that have done this.

> If I can aggregate IB linearly, the next step will be to try and figure out
> how to get 10 GB/s to local storage :-(
>
> Sometimes customers are crazy...
>
> Brian O'Connor


Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.

_______________________________________________
Lustre-discuss mailing list
Lustre-...@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss

Sebastien Piechurski

Mar 21, 2011, 5:18:47 AM
to Brian O'Connor, lustre-...@lists.lustre.org
Hi Brian,
 
From my understanding (confirmation from more skilled people on the list would be welcome), using multiple IB ports with a Lustre client will be difficult to manage and will probably not bring any performance improvement.
I was told by a colleague that there are currently too many internal locks in the client to sustain high throughput. Lustre is designed for aggregate throughput across many clients, not for throughput on an individual client.
I can observe this at my site, where I have enough storage and servers to reach 21 GB/s globally, but I am unable to get more than 300 MB/s on a single client, even though the DDR IB network would sustain 800+ MB/s ...


From: lustre-disc...@lists.lustre.org [mailto:lustre-disc...@lists.lustre.org] On Behalf Of Brian O'Connor
Sent: Monday, 21 March 2011 04:53
To: lustre-...@lists.lustre.org
Subject: [Lustre-discuss] Multiple IB ports

Andreas Dilger

Mar 21, 2011, 7:37:46 AM
to Sebastien Piechurski, lustre-...@lists.lustre.org
On 2011-03-21, at 10:18 AM, Sebastien Piechurski wrote:
> From my understanding (confirmation from more skilled people on the list would be welcome), using multiple IB ports with a Lustre client will be difficult to manage and will probably not bring any performance improvement.
> I was told by a colleague that there are currently too many internal locks in the client to sustain high throughput. Lustre is designed for aggregate throughput across many clients, not for throughput on an individual client.
> I can observe this at my site, where I have enough storage and servers to reach 21 GB/s globally, but I am unable to get more than 300 MB/s on a single client, even though the DDR IB network would sustain 800+ MB/s ...

There must be something wrong with your configuration, or the code has some bug, because we have had single clients doing 2 GB/s in the past. What version of Lustre did you test on?

Is this a single-threaded write? With single-threaded IO the bottleneck often happens in the kernel copy_{to,from}_user() that is copying data to/from userspace in order to do data caching in the client. Having multiple threads doing the IO allows multiple cores to do the data copying.
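A rough way to see whether this is the limit (the mount point and sizes below are made up for illustration) is to compare one write stream against several parallel streams writing separate files:

$ dd if=/dev/zero of=/mnt/lustre/test.0 bs=1M count=16384 conv=fsync
$ for i in 1 2 3 4; do dd if=/dev/zero of=/mnt/lustre/test.$i bs=1M count=16384 conv=fsync & done; wait

If the aggregate of the parallel streams is much higher than the single stream, the per-thread data copying (and checksumming) is the bottleneck rather than the network.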

Is the Lustre debugging disabled? Try "lctl set_param debug=0" and see if this helps.

Is the Lustre network checksum disabled? Try "lctl set_param osc.*.checksums=0". There is a patch to allow hardware-assisted checksums, but it needs some debugging before it can be landed in the production release.


> From: lustre-disc...@lists.lustre.org [mailto:lustre-disc...@lists.lustre.org] On Behalf Of Brian O'Connor
> Sent: Monday, 21 March 2011 04:53
> To: lustre-...@lists.lustre.org
> Subject: [Lustre-discuss] Multiple IB ports
>
> Hi,
> Anybody actually using multiple IB ports on a client for an aggregated connection?
>
> I.e. many OSS nodes with one QDR IB port each, and clients with 4 QDR IB ports. Assuming the normal
> issues with bus bandwidth etc., what sort of performance can I expect?
>
> QDR ~ 3-4 GB/s
>
> I'm trying to size a cluster and clients to get ~10 GB/s on *one*
> client node.
>
> If I can aggregate IB linearly, the next step will be to try and figure out
> how to get 10 GB/s to local storage :-(
>
> Sometimes customers are crazy...
>
> Brian O'Connor

Paul Nowoczynski

Mar 21, 2011, 11:28:35 AM
to Brian O'Connor, lustre-...@lists.lustre.org
Hi Brian,
I don't think it's crazy to strive for that rate, especially when there
are machines on the market which can accommodate multiple terabytes of
memory. Assuming my math is mostly correct, loading or unloading a 16 TB
data set into a single machine (with 16 TB of memory) would take about an
hour and a half over a single QDR interface:

(16*1024^4) / (3.4*1000^3) / 60 ≈ 86 minutes

The ratio of memory capacity to I/O bandwidth is a critical issue for
most large machines. Typically in HPC, we'd like to dump all of memory
in 5 to 10 minutes.
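Turning that target around (same back-of-the-envelope assumptions as the calculation above), dumping 16 TiB in 10 minutes needs roughly:

$ echo "16*1024^4 / 600 / 1000^3" | bc -l
29.32031007402666666666

i.e. on the order of 29 GB/s sustained, or 8-9 QDR links' worth, and about twice that for a 5-minute dump.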
thanks,
paul


Brian O'Connor wrote:
>
> Hi,
>
> Anybody actually using multiple IB ports on a client for an aggregated connection?
>
> I.e. many OSS nodes with one QDR IB port each, and clients with 4 QDR IB ports.
> Assuming the normal issues with bus bandwidth etc., what sort of performance can I expect?
>
> QDR ~ 3-4 GB/s
>
> I'm trying to size a cluster and clients to get ~10 GB/s on *one* client node.
>
> If I can aggregate IB linearly, the next step will be to try and figure out
> how to get 10 GB/s to local storage :-(
>
> Sometimes customers are crazy...
>
> Brian O'Connor

Jeremy Filizetti

Mar 21, 2011, 12:23:53 PM
to Andreas Dilger, lustre-...@lists.lustre.org
> I was told by a colleague that there are currently too many internal locks in the client to sustain high throughput. Lustre is designed for aggregate throughput across many clients, not for throughput on an individual client.

The LNet SMP scaling fixes/enhancements should help, but I don't believe they are coming until 2.1.

> I can observe this at my site, where I have enough storage and servers to reach 21 GB/s globally, but I am unable to get more than 300 MB/s on a single client, even though the DDR IB network would sustain 800+ MB/s ...

You probably need to disable checksums, and a DDR link should be able to sustain about 1.5 GB/s. I've seen close to these rates with LNet self-tests, but I don't usually see them in normal operation with the file system layered on top.
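For reference, a minimal LNet self-test session looks roughly like the following (the NIDs are placeholders, and the exact syntax should be checked against the manual for your Lustre version):

$ export LST_SESSION=$$
$ lst new_session read_test
$ lst add_group servers 10.10.0.1@o2ib
$ lst add_group clients 10.10.0.100@o2ib
$ lst add_batch bulk_rw
$ lst add_test --batch bulk_rw --from clients --to servers brw read size=1M
$ lst run bulk_rw
$ lst stat clients servers
$ lst end_session

This exercises only the LNet/LND path, so it gives an upper bound before the filesystem layers are added on top.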


> There must be something wrong with your configuration, or the code has some bug, because we have had single clients doing 2 GB/s in the past. What version of Lustre did you test on?

I've never seen as high as 2 GB/s from a single client, but I've only been focused on single-threaded IO. For that I've seen between 1.3 and 1.4 GB/s peak.

I spent a little time trying to figure out where that limit was using SystemTap, but I only looked at the read case. It looked like the per-page locking penalty can be high. Monitoring each ll_readpage call, I was seeing a median of 2.4 us for the read scenario, while the mode was only 0.5 us. IIRC it was the llap locking that accounted for most of the ll_readpage time. I didn't look at the penalty for rebalancing the cache between the various CPUs.

Using those numbers:
>>> ((1/.000002406) * 4096)/2**20
1623.5453034081463

That gives a best-case scenario of ~1.6 GB/s. I thought about working on the read case further, but realized the effort probably wasn't worth putting into 1.8, and I would have to wait until 2.0 to test more. Unfortunately I haven't had the time to look at 2.0+ yet.

> Is this a single-threaded write? With single-threaded IO the bottleneck often happens in the kernel copy_{to,from}_user() that is copying data to/from userspace in order to do data caching in the client. Having multiple threads doing the IO allows multiple cores to do the data copying.

Even so, copy_{to,from}_user() should be able to provide at least 5 GB/s. I've seen about 5.5 GB/s reading cached data on a client with lots of memory.
 
Jeremy

Sebastien Piechurski

Mar 21, 2011, 1:01:31 PM
to Andreas Dilger, lustre-...@lists.lustre.org

Thanks for the correction.

I guess I need to redo some benchmarks and go through the tunables ...

> > Sometimes customers are crazy...

Atul Vidwansa

Mar 22, 2011, 1:15:35 AM
to Brian O'Connor, lustre-...@lists.lustre.org

Hi Brian,

 

With one 4x QDR IB port, you can achieve 2 GB/s on a single client with a multi-threaded workload, provided that you have the right storage (with enough bandwidth) at the other end. We have tested this multiple times at DDN.

I have seen sites that do IB bonding across 2 ports, but mostly in a failover configuration. Getting 10 GB/s to a single node requires aggregating 5 QDR IB ports. You will need to confirm with your IB vendor (Mellanox?), OS vendor (SGI/RedHat/Novell) and Lustre vendor whether they support aggregating that many links. I think the challenge will be finding a Lustre client node that has enough x8 PCIe slots to sustain 3 dual-port InfiniBand adapters at full rate (and think of multiple such nodes in a typical Lustre filesystem; not so economical). The other alternative is to find a server that supports an 8X or 12X QDR IB port on the motherboard to get more bandwidth.

With typical Lustre client memory of 24-64 GB and memory-to-CPU bandwidth of 10 GB/s (with standard DDR3-1333MHz DIMMs), it is not possible to fit a dataset larger than 2/3 of memory. If you still want 10 GB/s of bandwidth between storage and memory, there are clever alternatives: stage your data into memory beforehand, keep the memory pages locked, and continue feeding data as those pages are consumed. It is a lot harder than it seems on paper.

Cheers,

-Atul

 

 

From: lustre-disc...@lists.lustre.org [mailto:lustre-disc...@lists.lustre.org] On Behalf Of Brian O'Connor
Sent: Monday, 21 March 2011 9:23 AM
To: lustre-...@lists.lustre.org
Subject: [Lustre-discuss] Multiple IB ports

 

Hi,

Peter Kjellström

Mar 22, 2011, 8:23:36 AM
to lustre-...@lists.lustre.org
On Tuesday, March 22, 2011 06:15:35 am Atul Vidwansa wrote:
> Hi Brian,
>
> With one 4x QDR IB port, you can achieve 2 GB/Sec on single client,
> multi-threaded workload provided that you have right storage (with enough
> bandwidth) at other end. We have tested this multiple times at DDN.
>
> I have seen sites that do IB-bonding across 2 ports but mostly in failover
> configuration. To get 10GB/Sec to a single node requires aggregating 5 QDR
> IB ports. You will need to confirm from your IB vendor (Mellanox? ), OS
> vendor (SGI/RedHat/Novell) and Lustre vendor whether they support
> aggregating so many links. I think the challenge you will have is to find
> a Lustre client node that has enough x8 PCIe slots to sustain 3 dual-port
> Infiniband adapters at full rate

Just adding a small detail: a single QDR port consumes all of the HCA's PCIe bandwidth, so you would need 5 x8 IB HCAs for a total of 40 lanes of PCI Express. This will of course change with the introduction of future PCI Express generations...

/Peter


Mike Hanby

Mar 22, 2011, 10:30:31 AM
to lustre-...@lists.lustre.org
I'm curious about the checksums.

The manual tells you how to turn both types of checksum on or off (client in memory, and wire/network):
$ echo 0 > /proc/fs/lustre/llite/<fsname>/checksum_pages

Then it tells you how to check the status of wire checksums:
$ /usr/sbin/lctl get_param osc.*.checksums

It's not clear whether a 0 in the checksum_pages file overrides the osc.*.checksums setting, or the opposite (assuming the results of get_param show all OSTs with "...checksums=1").

Also, what's the typical recommendation for 1.8 sites? In-memory off and wire on?

-----Original Message-----
From: lustre-disc...@lists.lustre.org [mailto:lustre-disc...@lists.lustre.org] On Behalf Of Peter Kjellström
Sent: Tuesday, March 22, 2011 7:24 AM
To: lustre-...@lists.lustre.org
Subject: Re: [Lustre-discuss] Multiple IB ports


Andreas Dilger

Mar 22, 2011, 11:30:19 AM
to Mike Hanby, lustre-...@lists.lustre.org
On 2011-03-22, at 3:30 PM, Mike Hanby wrote:
> I'm curious about the checksums.
>
> The manual tells you how to turn both types of checksum on or off (client in memory, and wire/network):
> $ echo 0 > /proc/fs/lustre/llite/<fsname>/checksum_pages

This enables/disables the in-memory page checksums, as well as the network RPC checksums. The assumption is that there is no value in doing the in-memory checksums without the RPC checksums. It is possible to enable/disable the RPC checksums independently.

> Then it tells you how to check the status of wire checksums:
> $ /usr/sbin/lctl get_param osc.*.checksums
>
> It's not clear if 0 in the checksum_pages file overrides the osc.*.checksums setting,

Yes, it does.

> or the opposite (assuming the results of get_param show all OSTs with "...checksums=1").
>
> Also, what's the typical recommendation for 1.8 sites? in-memory off and wire on?

The default is in-memory off, RPC checksums on, which is recommended. The only time I suggest disabling the RPC checksums is if single-threaded IO performance is a bottleneck for specific applications, and disabling the checksum CPU usage is a significant performance boost.
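In practice that means something like the following (using the lctl parameter names quoted earlier in this thread; the llite parameter should correspond to the /proc file above):

$ lctl get_param llite.*.checksum_pages   # expect 0: in-memory page checksums off
$ lctl get_param osc.*.checksums          # expect 1: wire/RPC checksums on
$ lctl set_param osc.*.checksums=0        # only if single-threaded IO is checksum-bound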

Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.


Rick Mohr

Apr 1, 2011, 5:34:09 PM
to Brian O'Connor, lustre-...@lists.lustre.org
On Sun, 2011-03-20 at 22:53 -0500, Brian O'Connor wrote:

> Any body actually using multiple IB ports on a client for an
> aggregated connection?

I am trying to do something like what you mentioned. I am working on a
machine with multiple IB ports, but rather than trying to aggregate
links, I am just trying to direct Lustre traffic over different IB ports
so there will essentially be a single QDR IB link dedicated to each
MDS/OSS server. Below are some of the main details. (I can provide
more detailed info if you think it would be useful.)

The storage is a DDN SFA10k couplet with 28 LUNs. Each controller in
the couplet has 4 QDR IB ports, but only 2 on each controller are
connected to the IB fabric. There is a single MGS/MDS server and 4 OSS
servers. All servers have a single QDR IB port connected to the fabric.
Each OSS node does an SRP login to a different DDN port and serves out 7 of
the 28 OSTs. The Lustre client is an SGI UV1000 (1024 cores, 4 TB RAM)
with 24 QDR IB ports (of which we are currently only using 5).

The 5 MDS/OSS servers have their single IB ports configured on 2
different LNets. All 5 servers have o2ib0 configured, as well as a
server-specific LNet (oss1 -> o2ib1, oss2 -> o2ib2, ..., mds -> o2ib5). The client
has LNets o2ib[1-5] configured (one on each of the 5 IB ports). I also had
to configure some static IP routes on the client so that each Lustre server
could ping the corresponding port on the client.
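As a sketch only (interface names and addressing below are invented, not Rick's actual settings), that kind of split can be expressed with LNet module options along these lines:

  # on oss2: its single port carries the common network plus its dedicated one
  options lnet networks="o2ib0(ib0),o2ib2(ib0)"

  # on the UV client: one dedicated network per HCA in use
  options lnet networks="o2ib1(ib0),o2ib2(ib1),o2ib3(ib2),o2ib4(ib3),o2ib5(ib4)"

plus static routes on the client (e.g. "ip route add <oss2 subnet> dev ib1") so that each server's network reaches the matching client port.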

I am still doing performance testing and playing around with
configuration parameters. In general, I am getting performance that is
better than a single QDR IB link, but it certainly is not scaling
up linearly. I can't say for sure where the bottleneck is. It could be
a misconfiguration on my part, some limitation I am hitting within
Lustre, or just the natural result of running Lustre on a giant
single-system-image SMP machine. (Although I am pretty sure that at least
part of the problem is due to poor NUMA remote memory access performance.)

--
Rick Mohr
HPC Systems Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu/
