Specifying NIC (by component_id) when connecting two types of nodes


jie...@umich.edu

Feb 20, 2022, 1:01:02 AM
to cloudlab-users
Hi!


I want to connect a c6320 node to an r7525 node through the ConnectX-5 25 Gb NIC, so that I can use the other NIC (the BlueField-2) for another purpose. I have verified that 'eth2' is the CX-5 NIC and that 'eth4'/'eth5' are the BF2 ports.
So I specified component_id='eth2' in my profile for the r7525 nodes, which leads to the error below (whereas specifying 'eth4' or 'eth5' does not fail):

============== Start of Error Message ==============
Could not map all requested links to physical resources. Not enough free resources currently. Please try again later.

Nodes:
nfs clnode162
node-1 clgpu010
node-2 clgpu022
lan/nfsLan h-itc-cf31-d6000-016
End Nodes
Edges:
linklan/nfsLan/nfs:0 direct link-clnode162:eth1-h-itc-cf31-d6000-016:0/18 (clnode162/eth1,h-itc-cf31-d6000-016/0/18) link-clnode162:eth1-h-itc-cf31-d6000-016:0/18 (clnode162/eth1,h-itc-cf31-d6000-016/0/18)
linklan/nfsLan/node-1:0 Mapping Failed
linklan/nfsLan/node-2:0 Mapping Failed
End Edges
============== End of Error Message ==============
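
For reference, here is roughly how the interface is pinned in the profile; this is a simplified sketch using the standard geni-lib portal API (node and interface names are illustrative, not the full profile):

import geni.portal as portal

pc = portal.Context()
request = pc.makeRequestRSpec()

node = request.RawPC("node-1")
node.hardware_type = "r7525"

# Pin this node's LAN interface to a specific physical port.
iface = node.addInterface("if1")
iface.component_id = "eth2"   # the ConnectX-5 25Gb port

lan = request.LAN("nfsLan")
lan.addInterface(iface)

pc.printRequestRSpec(request)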

Best,
Jie

Mike Hibler

Feb 20, 2022, 12:08:46 PM
to cloudla...@googlegroups.com
What is the topology you are trying to set up?

I know you have an NFS server node connected with the other nodes in one LAN,
and then a link connecting the NFS server with the iSCSI blockstore. The NFS
server will have its link and LAN sharing the same physical interface out of
necessity, since it has only one physical interface. Hence the need for
best_effort, vlan_tagging, and link_multiplexing (all of which you have done)
on the link and LAN. However, you also set the BW on the node LAN interfaces,
trying to force them onto the 25Gbps interface. Since you are also specifying
"eth2" for those, you should not need the BW. At best it will be ignored
because of the "best_effort" flag; at worst it might be causing problems like
the ones you are seeing. I am not sure about that. Have you tried "eth2" but
without the BW spec?
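
In profile terms, what I mean is roughly this for the LAN (just a sketch, untested, with illustrative names; "request" is the usual portal request object):

lan = request.LAN("nfsLan")
lan.best_effort = True
lan.vlan_tagging = True
lan.link_multiplexing = True

nfs = request.RawPC("nfs")
nfs.hardware_type = "c6320"
lan.addInterface(nfs.addInterface("nfs-if"))

for i in (1, 2):
    node = request.RawPC("node-%d" % i)
    node.hardware_type = "r7525"
    iface = node.addInterface("if%d" % i)
    iface.component_id = "eth2"   # ConnectX-5 25Gb port; no iface.bandwidth set
    lan.addInterface(iface)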

What is your "other purpose" for the BF2 NIC? If you want to further connect
the nodes using those interfaces, the links need to be declared in the profile
topology; otherwise we won't set up switch VLANs for them and they won't be
able to communicate. Declaring them will also cause the interfaces to be
initialized with IPv4 addresses, but you can tear that part down if you are
using DPDK or other custom drivers. By creating links/LANs with those
interfaces and setting their BW to 100000000 (but NOT setting best_effort,
vlan_tagging, or link_multiplexing), you should be able to force the correct
interfaces. This by itself may also force the mapper to put the other LANs
on the 25Gb interfaces.
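
Something along these lines for the BF2 interfaces (again just a sketch, untested; "node1"/"node2" stand for two of your r7525 nodes and "request" is the portal request object):

# Declare the BlueField-2 ports as an explicit link so a switch VLAN gets
# set up for them: fixed bandwidth, and no best_effort / vlan_tagging /
# link_multiplexing.
bf1 = node1.addInterface("bf1")
bf1.component_id = "eth4"
bf2 = node2.addInterface("bf2")
bf2.component_id = "eth4"

bflink = request.Link("bf-link")
bflink.addInterface(bf1)
bflink.addInterface(bf2)
bflink.bandwidth = 100000000   # i.e. 100Gbps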

All of which is a long-winded way of saying: I don't know for sure what
caused the error. :-) But try "eth2" without the 20000000 BW setting if you
have not already. Otherwise, try declaring the two 100Gbps interfaces (which
you need to do anyway) and see if that forces the NFS LAN onto the 25Gbps
interface.

jie...@umich.edu

Feb 22, 2022, 9:16:04 PM
to cloudlab-users
Thank you for your reply!

To report some of the results we have found:

Testing with 3 nodes (one c6320 + two r7525):
1. Tried "eth2" without any BW setting (resulting in a 3-node star topology): did not work.
2. Tried "eth4" without any BW setting (same 3-node star topology): it works.
3. Same as (2) but with "eth5": it works.
4. Did not specify any "component_id", but created a 3-node star topology (specifying 10Gbps) and added 2 links between the r7525s (100Gbps each). It did not behave as expected: all VLANs were mapped to the BlueField-2 card.

Then I tried with just 2 nodes, removing the c6320 node (so we host NFS directly on an r7525 node):
5. Tried creating 3 links, specifying BW (see https://www.cloudlab.us/show-profile.php?uuid=62323c51-d57d-11eb-8fd9-e4434b2381fc): it works.
6. Tried changing the profile in (4) so that one of the links is a LAN() instead of a Link(): did not work (profile https://www.cloudlab.us/show-profile.php?uuid=539bb1dc-9433-11ec-b318-e4434b2381fc, experiment https://www.cloudlab.us/status.php?uuid=29231c17-9436-11ec-b318-e4434b2381fc).

At this point I only have some basic hypotheses:
1. It is impossible to connect a c6320 node to an r7525's 25Gbps interface; it works only if I set component_id to "eth4"/"eth5" (the 100Gbps ports).
2. Forcing the traffic onto the 25Gbps interface can work (see (5)), but only with a Link(), not a LAN(). I am particularly confused about why (5) worked and (6) did not; the two profiles are very similar.

Best,
Jie



jie...@umich.edu

Feb 22, 2022, 10:37:09 PM
to cloudlab-users
The reason we are trying this is that we want to multiplex the r7525 nodes (which have both a GPU and a SmartNIC) so that we can run GPU experiments and SmartNIC experiments at the same time, increasing utilization ;-)

To make sure the GPU experiment does not suffer interference from the BF2 card, we want the distributed-ML traffic to use the 25Gbps card instead. For an example topology of the setup (basically a 4-node version of https://www.cloudlab.us/show-profile.php?uuid=539bb1dc-9433-11ec-b318-e4434b2381fc), see the attached image.
The intention is to create a basic star topology over the 25Gbps NICs, while occupying the BF2 card with 2*100Gbps links between each 2-node pair. But this did not work even with a simple 2-node setup; see case (6).

-Jie

MicrosoftTeams-image.png

Leigh Stoller

Feb 23, 2022, 9:47:13 AM
to cloudla...@googlegroups.com

> 1. It is impossible to connect a c6320 node to an r7525's 25Gbps interface; it works only if I set component_id to "eth4"/"eth5" (the 100Gbps ports).

Hi. Just to clear one thing up about c6320 nodes: see this info page about
our resources: https://www.cloudlab.us/portal-hardware.php

Most node types in CloudLab have only 1 or 2 physical interfaces
wired to the experimental fabric. The c6320 nodes have one interface,
so there is no way to create the topology in the diagram unless you
use link multiplexing (layering two or more logical links over the
one physical interface). We can tell you how to specify that in your
profile if that is something you want to try.
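
For reference, specifying that looks roughly like this (a sketch only, untested, with illustrative names):

import geni.portal as portal

pc = portal.Context()
request = pc.makeRequestRSpec()

# Layer two logical links over the c6320's single experimental interface
# by enabling link multiplexing (plus best_effort and vlan_tagging) on both.
nfs = request.RawPC("nfs")
nfs.hardware_type = "c6320"

for i in (1, 2):
    peer = request.RawPC("node-%d" % i)
    peer.hardware_type = "r7525"
    link = request.Link("link-%d" % i)
    link.best_effort = True
    link.vlan_tagging = True
    link.link_multiplexing = True
    link.addInterface(nfs.addInterface("nfs-if%d" % i))
    link.addInterface(peer.addInterface("peer-if%d" % i))

pc.printRequestRSpec(request)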

I am looking at case 6 now …

Leigh


Leigh Stoller

Feb 23, 2022, 11:02:02 AM
to cloudla...@googlegroups.com

> I am looking at case 6 now …

So on this one, you can make this work with four r7525 nodes
by getting rid of:

lan.best_effort = True
lan.vlan_tagging = True
lan.link_multiplexing = True

Also get rid of the bandwidth on the LAN. Since the r7525 nodes
have three physical interfaces, there is no need to specify all of
that, and dropping it in this case will bypass a problem in the
resource mapper that you tripped over.
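
In other words, something like this (a sketch only; "request" is the portal request object, and the "eth2" pinning is kept from before):

lan = request.LAN("lan")
# No best_effort / vlan_tagging / link_multiplexing, and no lan.bandwidth.
for i in range(4):
    node = request.RawPC("node-%d" % i)
    node.hardware_type = "r7525"
    iface = node.addInterface("if%d" % i)
    iface.component_id = "eth2"   # ConnectX-5 25Gb port
    lan.addInterface(iface)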

Let us know …

Leigh



Leigh Stoller

Feb 23, 2022, 4:41:21 PM
to cloudla...@googlegroups.com
OK, I have installed a possible fix at Clemson for this. You can
try the c6320 nodes in your topology, using the LAN settings above.

Let us know how it goes.

Thanks
Leigh
