We have a 128-node (8 cores/node) 4x DDR IB cluster with 2:1
oversubscription, and I use the IB network for:
- OpenMPI
- Lustre
- Admin (may change in future)
I'm very interested in using IB QoS, as in the near future I will be
deploying AMD processors with 24 cores/node, so I want to put a
barrier on traffic so that no traffic class (especially OpenMPI) is
starved by the others (especially Lustre I/O). So I read all the
documentation I could get
(http://www.mail-archive.com/lustre-...@lists.lustre.org/msg04092.html was really very helpful)
and made the configuration shown below.
I would really be very grateful if someone on the list could give me
his/her opinion on the proposed configuration below. Any comment will
be welcome, even if the whole thing is complete nonsense, as no one
in my area (as far as I know) is using IB with QoS, and it is really painful.
Personal doubts:
- Am I properly taking 'latency' considerations into account?
- Is there any need to define the 'QoS Switch Port 0 options'?
- Is it worthwhile to use different configurations for the CAs and for
the switches' external ports?
- Is it really necessary to strictly follow the rule that 'the weighting
values for each VL should be multiples of 64', at least in vlarb_high?
- Any other weights you would suggest? (See the rough bandwidth-share
sketch right after the configuration below.)
Thanks in Advance
----- /etc/opensm/qos-policy.conf --------------------
# SL assignment to flows. GUIDs are Port GUIDs
qos-ulps
default :0 # default SL (OPENMPI)
any, target-port-guid 0x0002c90200279295 :1 # SL for Lustre MDT
any, target-port-guid 0x0002c9020029fda9,0x0002c90200285ed5 :2
# SL for Lustre OSTs
ipoib :3 # SL for Administration
end-qos-ulps
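
For reference, the port GUIDs used above can be double-checked on each
server by reading the port's link-local GID from sysfs. A minimal sketch,
assuming the usual /sys/class/infiniband layout; the device name mlx4_0
and port number 1 are just examples:

# Read a port GUID out of sysfs to confirm the GUIDs listed in
# qos-policy.conf (run this on the MDS/OSS itself). Assumes the standard
# /sys/class/infiniband layout; adjust the device/port to your HCA.
from pathlib import Path

def port_guid(device="mlx4_0", port=1):
    gid0 = Path("/sys/class/infiniband/{}/ports/{}/gids/0".format(device, port))
    # GID index 0 is the link-local GID: fe80:: prefix + 64-bit port GUID.
    groups = gid0.read_text().strip().split(":")
    return "0x" + "".join(groups[4:])

if __name__ == "__main__":
    print(port_guid())  # compare against the target-port-guid entries above
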
----- /etc/opensm/opensm.conf -----------------------
#
# QoS OPTIONS
#
# Enable QoS setup
qos TRUE
# QoS policy file to be used
qos_policy_file /etc/opensm/qos-policy.conf
# QoS default options
qos_max_vls 4
qos_high_limit 4
qos_vlarb_high 0:128,1:64,2:0,3:0
qos_vlarb_low 0:192,1:16,2:64,3:8
qos_sl2vl 0,1,2,3,15,15,15,15,15,15,15,15,15,15,15,15
# QoS CA options
qos_ca_max_vls 4
qos_ca_high_limit 4
qos_ca_vlarb_high 0:128,1:64,2:0,3:0
qos_ca_vlarb_low 0:192,1:16,2:64,3:8
qos_ca_sl2vl 0,1,2,3,15,15,15,15,15,15,15,15,15,15,15,15
# QoS Switch Port 0 options
#qos_sw0_max_vls 0
#qos_sw0_high_limit -1
#qos_sw0_vlarb_high (null)
#qos_sw0_vlarb_low (null)
#qos_sw0_sl2vl (null)
# QoS Switch external ports options
qos_swe_max_vls 4
qos_swe_high_limit 255
qos_swe_vlarb_high 0:192,1:16,2:64,3:8
qos_swe_vlarb_low 0:0,1:0,2:0,3:0
qos_swe_sl2vl 0,1,2,3,15,15,15,15,15,15,15,15,15,15,15,15
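
About the weights I asked about above: as far as I understand the VL
arbitration tables, within a single table the arbiter is a weighted
round-robin with weights in units of 64 bytes, so under sustained load
each VL's share of that table is roughly proportional to its weight; I
believe the 'multiples of 64' advice only makes each weight correspond to
whole 4 KB packets and does not change the ratios. A minimal sketch of
that proportional model (plain Python, nothing OpenSM runs; it
deliberately ignores the high/low table interleaving governed by
qos_high_limit):

# Rough model of how the proposed vlarb tables split bandwidth among VLs.
# Within one table the arbiter is (approximately) weighted round-robin, so
# each busy VL gets about weight/sum(weights) of that table's bandwidth.
def shares(vlarb):
    """Parse an opensm 'vl:weight,...' string into per-VL bandwidth fractions."""
    entries = [tuple(int(x) for x in item.split(":")) for item in vlarb.split(",")]
    total = sum(weight for _, weight in entries)
    return {vl: round(weight / total, 3) for vl, weight in entries if weight > 0}

# Tables proposed above (VL0=MPI, VL1=MDT, VL2=OSTs, VL3=IPoIB/admin)
print("CA  high:", shares("0:128,1:64,2:0,3:0"))   # {0: 0.667, 1: 0.333}
print("CA  low :", shares("0:192,1:16,2:64,3:8"))  # {0: 0.686, 1: 0.057, 2: 0.229, 3: 0.029}
print("SWE high:", shares("0:192,1:16,2:64,3:8"))  # same ratios on switch external ports

If that model is roughly right, OpenMPI keeps about two thirds of each
table it appears in, which is the kind of protection I am after.
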
--
Ramiro Alba
Centre Tecnològic de Tranferència de Calor
http://www.cttc.upc.edu
Escola Tècnica Superior d'Enginyeries
Industrial i Aeronàutica de Terrassa
Colom 11, E-08222, Terrassa, Barcelona, Spain
Tel: (+34) 93 739 86 46
--
My own experience was that Lustre traffic often fell victim to
aggressive MPI behavior, especially during collective communications.
> ----- /etc/opensm/qos-policy.conf --------------------
>
>
> # SL assignment to flows. GUIDs are Port GUIDs
> qos-ulps
> default :0 # default SL (OPENMPI)
> any, target-port-guid 0x0002c90200279295 :1 # SL for Lustre MDT
> any, target-port-guid 0x0002c9020029fda9,0x0002c90200285ed5 :2
> # SL for Lustre OSTs
> ipoib :3 # SL for Administration
> end-qos-ulps
My understanding is that the SL is determined only once for each
connected QP (which is what Lustre uses), during connection
establishment. The configuration above seemed to me to be able to catch
connections from clients to servers, but not the other way around.
Servers do connect to clients, though that's not the usual case.
Moreover, Lustre QPs are persistent, so you might end up with quite a
few Lustre QPs on the default SL. I've never done any IB QoS
configuration, but it'd be
good to double check that the config above does catch all connections.
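
Just to make the directionality point concrete, here is a toy sketch
(plain Python, not anything OpenSM actually runs) of how I read the
matching in the qos-policy.conf above: the SL is picked from the target
port GUID at connection setup, so only connections whose target is one of
the listed server ports get SL 1 or 2; the client GUID below is made up.

# Toy model of the qos-policy.conf matching as I read it -- NOT OpenSM code.
MDT_GUID = 0x0002c90200279295
OST_GUIDS = {0x0002c9020029fda9, 0x0002c90200285ed5}

def sl_for_connection(target_port_guid):
    """SL a new connected QP would get, judging only by its target GUID."""
    if target_port_guid == MDT_GUID:
        return 1            # 'target-port-guid ... :1' rule (MDT)
    if target_port_guid in OST_GUIDS:
        return 2            # 'target-port-guid ... :2' rule (OSTs)
    return 0                # falls through to the default SL (MPI)

print(sl_for_connection(0x0002c9020029fda9))  # client -> OST: 2, as intended
print(sl_for_connection(0x0002c90200aabbcc))  # server -> client: 0 (made-up client GUID)
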
If the servers run more than just Lustre, it's possible to distinguish ULP
traffic further by the Lustre ServiceID. If servers serve more than
one Lustre file system, you can divide the traffic further by
assigning each file system a different PKey. But it's probably beyond
your concerns.
Cheers,
Isaac
OK, but the question is whether this unwanted traffic going to the
default SL is significant enough to matter. What do you think?
> good to double check that the config above does catch all connections.
>
> If the servers run more than just Lustre, it's possible to distinguish ULP
> traffic further by the Lustre ServiceID. If servers serve more than
Yes. I saw this possibility on the Lustre mailing list:
http://lists.lustre.org/pipermail/lustre-discuss/2009-May/010563.html
but it is said to have a drawback:
..........................................................................
The next step is to tell OpenSM to assign an SL to this service-id.
Here is an extract of our "QoS policy file":
qos-ulps
default : 0
any, service-id=0x.....: 3
end-qos-ulps
The major drawback of this solution is that the modification we made in
the ofa-kernel is not OpenFabrics Alliance compliant, because the
portspace list is defined in the IB standard.
...........................................................................
> one Lustre file system, you can divide the traffic further by
That's not my case at the moment.
> assigning each file system a different PKey. But it's probably beyond
> your concerns.
What do you think about the 'weights' policy I've suggested in my
configuration?
Thanks for your answer
Kind Regards