Hi,
Anybody actually using multiple IB ports on a client for an aggregated connection?
I.e. many OSSs with one QDR IB port each, and clients with 4 QDR IB ports. Assuming the normal
issues with bus bandwidth etc., what sort of performance can I expect?
QDR ~ 3-4 GB/s
I'm trying to size a cluster and clients to get ~10 GB/s on *one*
client node.
If I can aggregate IB linearly, the next step will be to try and figure out
how to get 10 GB/s to local storage :(
Sometimes customers are crazy...
Brian O'Connor
-------------------------------------------------
SGI Consulting
Email: bri...@sgi.com, Mobile +61 417 746 452
Phone: +61 3 9963 1900, Fax: +61 3 9963 1902
357 Camberwell Road, Camberwell, Victoria, 3124
AUSTRALIA http://www.sgi.com/support/services
-------------------------------------------------
I believe this is possible to some limited extent today. The main issue is that the primary NID addresses for the OST IB cards need to be on different subnets so that the clients will route the traffic to the OSTs via the different IB HCAs.
I don't have low-level details on this myself, but I believe there are a couple of sites that have done this.
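As a rough illustration of the subnet split described above (the interface names, network numbers, and option values here are hypothetical, not taken from any real deployment), the module options might look something like:

```shell
# /etc/modprobe.d/lustre.conf on the 4-port client (hypothetical):
# one LNet network per HCA port, so each port carries its own traffic.
options lnet networks="o2ib0(ib0),o2ib1(ib1),o2ib2(ib2),o2ib3(ib3)"

# On each OSS (single port), configure the network matching the client
# port that should serve it, e.g. for OSSs behind the client's second port:
options lnet networks="o2ib1(ib0)"
```

With the OSTs then spread across OSSs on different networks, striping a file over those OSTs spreads a single file's traffic across the client's ports.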
Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.
_______________________________________________
Lustre-discuss mailing list
Lustre-...@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
From: lustre-disc...@lists.lustre.org [mailto:lustre-disc...@lists.lustre.org] On Behalf Of Brian O'Connor
Sent: Monday, 21 March 2011 04:53
To: lustre-...@lists.lustre.org
Subject: [Lustre-discuss] Multiple IB ports
There must be something wrong with your configuration, or the code has a bug, because we have had single clients doing 2 GB/s in the past. What version of Lustre did you test on?
Is this a single-threaded write? With single-threaded IO the bottleneck often happens in the kernel copy_{to,from}_user() that is copying data to/from userspace in order to do data caching in the client. Having multiple threads doing the IO allows multiple cores to do the data copying.
Is the Lustre debugging disabled? Try "lctl set_param debug=0" to see if this helps.
Is the Lustre network checksum disabled? Try "lctl set_param osc.*.checksums=0". There is a patch to allow hardware-assisted checksums, but it needs some debugging before it can be landed into the production release.
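A quick way to see the multi-threaded effect is several concurrent writers. This is only a sketch: DIR defaults to the current directory for illustration, so point it at a Lustre mount (and use much larger sizes) for a real measurement.

```shell
# Four concurrent dd writers, one file each, so the user->kernel data
# copies are spread across several cores instead of one.
DIR=${DIR:-.}
for i in 0 1 2 3; do
    dd if=/dev/zero of="$DIR/ddtest.$i" bs=1M count=16 2>/dev/null &
done
wait
ls "$DIR"/ddtest.*
```

Comparing the aggregate rate against a single dd of the same total size shows how much of the bottleneck is the per-thread data copy.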
(16*1024^4) / (3.4*1000^3) / 60 ≈ 86 minutes
The ratio of memory capacity to I/O bandwidth is a critical issue for
most large machines. Typically in HPC, we'd like to dump all of memory
in 5 to 10 minutes.
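Spelling out the arithmetic above (16 TiB is just the example machine size):

```shell
# Time to dump 16 TiB of memory over a single QDR link at ~3.4 GB/s:
awk 'BEGIN { printf "%.1f minutes\n", 16 * 1024^4 / 3.4e9 / 60 }'   # -> 86.2 minutes
# Bandwidth needed to dump the same memory within a 10-minute window:
awk 'BEGIN { printf "%.1f GB/s\n", 16 * 1024^4 / 600 / 1e9 }'       # -> 29.3 GB/s
```

So a 10-minute dump window on a machine this size needs roughly 29 GB/s, i.e. the equivalent of 8-9 QDR links' worth of client bandwidth.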
thanks,
paul
> I was told by a colleague that there are currently too many internal locks in the clients to sustain a big throughput. Lustre is designed for global throughput across many clients, but not on individual clients.
> I can observe this at my site, where I have enough storage and servers to reach 21 GB/s globally, but am unable to get more than 300 MB/s on a single client even though the DDR IB network would sustain 800+ MB/s ...
I guess I need to redo some benchmarks and go through the tunables ...
> > Sometimes customers are crazy ...
Hi Brian,
With one 4x QDR IB port, you can achieve 2 GB/s on a single client with a multi-threaded workload, provided that you have the right storage (with enough bandwidth) at the other end. We have tested this multiple times at DDN.
I have seen sites that do IB bonding across 2 ports, but mostly in a failover configuration. To get 10 GB/s to a single node requires aggregating 5 QDR IB ports. You will need to confirm with your IB vendor (Mellanox?), OS vendor (SGI/RedHat/Novell) and Lustre vendor whether they support aggregating that many links. I think the challenge will be finding a Lustre client node that has enough x8 PCIe slots to sustain 3 dual-port InfiniBand adapters at full rate (think multiple such nodes in a typical Lustre filesystem; not so economical). The other alternative is to find a server that supports an 8X or 12X QDR IB port on the motherboard to get more bandwidth.
With a typical Lustre client memory of 24-64 GB and memory-to-CPU bandwidth of 10 GB/s (with standard DDR3-1333 DIMMs), it is not possible to fit a dataset larger than about two-thirds of memory. If you still want to achieve 10 GB/s of bandwidth between storage and memory, there are clever alternatives: you will have to stage your data into memory beforehand, keep the memory pages locked, and continue feeding data as those pages are consumed. It is a lot harder than it seems on paper.
Cheers,
-Atul
Hi,
Just adding a small detail: a single QDR port consumes all of the HCA's PCIe bandwidth, so you would need 5 x8 IB HCAs for a total of 40 lanes of PCI Express. This will of course change with the introduction of future PCI Express generations...
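The back-of-the-envelope numbers behind this, assuming PCIe 2.0 (5 GT/s per lane, the current generation at the time) and 8b/10b encoding on both link types:

```shell
# 4x QDR IB: 4 lanes x 10 Gb/s signalling; 8b/10b leaves 80% for data.
awk 'BEGIN { printf "4x QDR:   %.1f GB/s\n", 4 * 10e9 * 0.8 / 8 / 1e9 }'   # -> 4.0 GB/s
# PCIe 2.0 x8: 8 lanes x 5 GT/s, same 8b/10b encoding.
awk 'BEGIN { printf "PCIe2 x8: %.1f GB/s\n", 8 * 5e9 * 0.8 / 8 / 1e9 }'    # -> 4.0 GB/s
```

Both come out around 4 GB/s before protocol overhead, which is why one QDR port can saturate an entire x8 Gen2 slot.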
/Peter
The manual tells you how to turn both types of checksum on or off (client in memory, and wire/network):
$ echo 0 > /proc/fs/lustre/llite/<fsname>/checksum_pages
Then it tells you how to check the status of wire checksums:
$ /usr/sbin/lctl get_param osc.*.checksums
It's not clear if 0 in the checksum_pages file overrides the osc.*.checksums setting, or the opposite (assuming the results of the get_param show all OSTs with "...checksums=1").
Also, what's the typical recommendation for 1.8 sites? In-memory off and wire on?
-----Original Message-----
From: lustre-disc...@lists.lustre.org [mailto:lustre-disc...@lists.lustre.org] On Behalf Of Peter Kjellström
Sent: Tuesday, March 22, 2011 7:24 AM
To: lustre-...@lists.lustre.org
Subject: Re: [Lustre-discuss] Multiple IB ports
This is enabling/disabling the in-memory page checksums, as well as the network RPC checksums. The assumption is that there is no value in doing the in-memory checksums without the RPC checksums. It is possible to enable/disable the RPC checksums independently.
> Then it tells you how to check the status of wire checksums:
> $ /usr/sbin/lctl get_param osc.*.checksums
>
> It's not clear if 0 in the checksum_pages file overrides the osc.*.checksums setting,
Yes, it does.
> or the opposite (assuming the results of the get_param show all OSTs with "...checksums=1").
>
> Also, what's the typical recommendation for 1.8 sites? in-memory off and wire on?
The default is in-memory off, RPC checksums on, which is recommended. The only time I suggest disabling the RPC checksums is if single-threaded IO performance is a bottleneck for specific applications, and disabling the checksum CPU usage is a significant performance boost.
Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.
> Any body actually using multiple IB ports on a client for an
> aggregated connection?
I am trying to do something like what you mentioned. I am working on a
machine with multiple IB ports, but rather than trying to aggregate
links, I am just trying to direct Lustre traffic over different IB ports
so there will essentially be a single QDR IB link dedicated to each
MDS/OSS server. Below are some of the main details. (I can provide
more detailed info if you think it would be useful.)
The storage is a DDN SFA10k couplet with 28 LUNs. Each controller in
the couplet has 4 QDR IB ports, but only 2 on each controller are
connected to the IB fabric. The is a single MGS/MDS server and 4 OSS
servers. All servers have a single QDR IB port connected to the fabric.
Each OSS node does SRP login to a different DDN port and serves out 7 of
the 28 OSTs. The Lustre client is an SGI UV1000 (1024 cores, 4 TB RAM)
with 24 QDR IB ports (of which we are currently only using 5 ports).
The 5 MDS/OSS servers have their single IB ports configured on 2
different lnets. All 5 servers have o2ib0 configured as well as a
specific LNet for that server (oss1 -> o2ib1, oss2 -> o2ib2, ...,
mds -> o2ib5). The client has LNets o2ib[1-5] configured (one on each of
the 5 IB ports). I also had to configure some static ip routes on the
client so that each lustre server could ping the corresponding port on
the client.
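A sketch of what this layout might look like in config terms (the network numbers, interface names, and addresses below are all made up for illustration, not the actual site configuration):

```shell
# On oss1: its single port sits on both the shared net and its private net.
#   options lnet networks="o2ib0(ib0),o2ib1(ib0)"
# On the UV client: five ports, one private net per port.
#   options lnet networks="o2ib1(ib0),o2ib2(ib1),o2ib3(ib2),o2ib4(ib3),o2ib5(ib4)"
# Plus static routes on the client so each server reaches "its" port, e.g.:
#   ip route add 10.11.1.0/24 dev ib0 src 10.11.1.100
```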
I am still doing performance testing and playing around with
configuration parameters. In general, I am getting performance that is
better than using a single QDR IB link, but it certainly is not scaling
up linearly. I can't say for sure where the bottleneck is. It could be
a misconfiguration on my part, some limitation I am hitting within
lustre, or just the natural result of running lustre on a giant single
system image SMP machine. (Although I am pretty sure that at least part
of the problem is due to poor NUMA remote memory access performance.)
--
Rick Mohr
HPC Systems Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu/