Control network


Nurlan Nazaraliyev

Nov 14, 2024, 8:43:32 PM
to cloudlab-users
Hello,

I am running two xl170 nodes. Every time I update the driver or downgrade the Linux version, the connection drops and I have to reload the node. I believe this is caused by a network mismatch on the control link. Is there any way to have a control link that is independent of the link between the nodes?

I don't want my control network to go down as I make modifications to the system. Is that possible?


Thanks in advance
Best,
Nurlan 

Nurlan Nazaraliyev

Nov 15, 2024, 1:58:04 AM
to cloudlab-users
I previously had two r7525 nodes connected to each other. I made similar modifications but never lost the SSH connection to the nodes.
The profile I used for r7525:

<rspec xmlns="http://www.geni.net/resources/rspec/3" xmlns:emulab="http://www.protogeni.net/resources/rspec/ext/emulab/1" xmlns:tour="http://www.protogeni.net/resources/rspec/ext/apt-tour/1" xmlns:jacks="http://www.protogeni.net/resources/rspec/ext/jacks/1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.geni.net/resources/rspec/3    http://www.geni.net/resources/rspec/3/request.xsd" type="request"><rspec_tour xmlns="http://www.protogeni.net/resources/rspec/ext/apt-tour/1"><description xmlns="" type="markdown">RDMA connection</description></rspec_tour>
<node client_id="bf1" exclusive="true" component_manager_id="urn:publicid:IDN+clemson.cloudlab.us+authority+cm">
    <sliver_type name="raw">
      <disk_image name="urn:publicid:IDN+emulab.net+image+emulab-ops//UBUNTU22-64-STD"/>
    </sliver_type>
    <hardware_type name="r7525"/>
    <interface client_id="bf1:interface-0"/>
    <interface client_id="bf1:interface-2"/>
    <interface client_id="bf1:interface-4"/>
  <services xmlns="http://www.geni.net/resources/rspec/3"/></node><node client_id="bf2" exclusive="true" component_manager_id="urn:publicid:IDN+clemson.cloudlab.us+authority+cm">
    <sliver_type name="raw">
      <disk_image name="urn:publicid:IDN+emulab.net+image+emulab-ops//UBUNTU22-64-STD"/>
    </sliver_type>
    <hardware_type name="r7525"/>
    <interface client_id="bf2:interface-1"/>
    <interface client_id="bf2:interface-3"/>
    <interface client_id="bf2:interface-5"/>
  <services xmlns="http://www.geni.net/resources/rspec/3"/></node><link client_id="link-0">
    <interface_ref client_id="bf2:interface-1"/>
    <interface_ref client_id="bf1:interface-0"/>
   
   
    <ns0:site xmlns:ns0="http://www.protogeni.net/resources/rspec/ext/jacks/1" id="undefined"/>
  <property xmlns="http://www.geni.net/resources/rspec/3" source_id="bf2:interface-1" dest_id="bf1:interface-0" capacity="40000000"/><property xmlns="http://www.geni.net/resources/rspec/3" source_id="bf1:interface-0" dest_id="bf2:interface-1" capacity="40000000"/><component_manager name="urn:publicid:IDN+clemson.cloudlab.us+authority+cm"/></link><link client_id="link-1">
    <interface_ref client_id="bf1:interface-2"/>
    <interface_ref client_id="bf2:interface-3"/>
   
   
    <ns1:site xmlns:ns1="http://www.protogeni.net/resources/rspec/ext/jacks/1" id="undefined"/>
  <property xmlns="http://www.geni.net/resources/rspec/3" source_id="bf1:interface-2" dest_id="bf2:interface-3" capacity="40000000"/><property xmlns="http://www.geni.net/resources/rspec/3" source_id="bf2:interface-3" dest_id="bf1:interface-2" capacity="40000000"/><component_manager name="urn:publicid:IDN+clemson.cloudlab.us+authority+cm"/></link><link client_id="link-2">
    <interface_ref client_id="bf1:interface-4"/>
    <interface_ref client_id="bf2:interface-5"/>
   
   
    <ns2:site xmlns:ns2="http://www.protogeni.net/resources/rspec/ext/jacks/1" id="undefined"/>
  <property xmlns="http://www.geni.net/resources/rspec/3" source_id="bf1:interface-4" dest_id="bf2:interface-5" capacity="1000000"/><property xmlns="http://www.geni.net/resources/rspec/3" source_id="bf2:interface-5" dest_id="bf1:interface-4" capacity="1000000"/><component_manager name="urn:publicid:IDN+clemson.cloudlab.us+authority+cm"/></link><emulab:portal name="cloudlab" url="https://www.cloudlab.us/status.php?uuid=a53a7f7c-a31a-11ef-af1a-e4434b2381fc" project="dms" experiment="gpuvm" sequence="1731652036"/></rspec>

The profile I use for xl170 nodes:

<rspec xmlns="http://www.geni.net/resources/rspec/3" xmlns:emulab="http://www.protogeni.net/resources/rspec/ext/emulab/1" xmlns:tour="http://www.protogeni.net/resources/rspec/ext/apt-tour/1" xmlns:jacks="http://www.protogeni.net/resources/rspec/ext/jacks/1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.geni.net/resources/rspec/3 http://www.geni.net/resources/rspec/3/request.xsd" type="request">
 
  <rspec_tour xmlns="http://www.protogeni.net/resources/rspec/ext/apt-tour/1">
    <description xmlns="" type="markdown">RDMA Project on CloudLab XL170 Nodes with a 25Gb Link</description>
  <instructions xmlns="" type="markdown">check paper</instructions></rspec_tour>
 
  <!-- Node 1 -->
  <node client_id="xl1" exclusive="true">
    <sliver_type name="raw">
      <disk_image name="urn:publicid:IDN+emulab.net+image+emulab-ops//UBUNTU20-64-STD"/>
    </sliver_type>
    <hardware_type name="xl170"/>
    <interface client_id="xl1:interface-0"/>
  </node>
 
  <!-- Node 2 -->
  <node client_id="xl2" exclusive="true">
    <sliver_type name="raw">
      <disk_image name="urn:publicid:IDN+emulab.net+image+emulab-ops//UBUNTU20-64-STD"/>
    </sliver_type>
    <hardware_type name="xl170"/>
    <interface client_id="xl2:interface-1"/>
  </node>
 
  <!-- 25 Gbps Link -->
  <link client_id="link-25Gb">
    <interface_ref client_id="xl1:interface-0"/>
    <interface_ref client_id="xl2:interface-1"/>
    <property source_id="xl1:interface-0" dest_id="xl2:interface-1" capacity="25000000"/>
    <property source_id="xl2:interface-1" dest_id="xl1:interface-0" capacity="25000000"/>
  </link>
</rspec>

Is there anything missing in the xl170 profile? 

Appreciate any help. 

Thanks in advance,
Best,
Nurlan

David M Johnson

Nov 15, 2024, 7:45:00 PM
to cloudla...@googlegroups.com
On 11/14/24 18:43, 'Nurlan Nazaraliyev' via cloudlab-users wrote:
> Hello,
>
> I am running 2 xl170 nodes. Every time I update the driver or I
> downgrade the linux version, the connection goes off (I need to reload
> in this case). It is because of the network mismatch on the control link
> (I believe). Is there any way to have an independent/separate control
> link (than the link between the nodes)?
>
> I don't want my control network to go down as I make modifications to
> the system. Is that possible?

I'm not sure exactly which link you mean. What we call the "control
net" are the interfaces with public IPs; the "experiment net" is
collectively any links you define in your profile (in this case, a
single 25Gbps link between two nodes). On the xl170s, there are two
dual-port Mellanox ConnectX-4s, one with a 10Gbps control net port and
a 10Gbps port for experiment networks; and another with a 25Gbps port that
can be used for experiment networks. This means the same ethernet
driver is used for both NICs, so you do have to be careful updating it.
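
To see which kernel driver backs each interface on a node, a minimal sketch along these lines reads the standard Linux sysfs layout. The function names are made up for illustration, and the interface names on a real node will differ; on ConnectX-4 hardware the driver would typically show up as `mlx5_core`.

```python
import os
from pathlib import Path


def nic_drivers(sysfs_net="/sys/class/net"):
    """Map each network interface to the kernel driver bound to it."""
    drivers = {}
    for iface in sorted(Path(sysfs_net).iterdir()):
        driver_link = iface / "device" / "driver"
        # Virtual interfaces (lo, bridges, ...) have no backing device,
        # so they have no driver symlink and are skipped.
        if driver_link.is_symlink():
            drivers[iface.name] = os.path.basename(os.readlink(driver_link))
    return drivers


def shared_driver_groups(drivers):
    """Return only the drivers that serve more than one interface."""
    groups = {}
    for iface, driver in drivers.items():
        groups.setdefault(driver, []).append(iface)
    return {drv: ifaces for drv, ifaces in groups.items() if len(ifaces) > 1}


if __name__ == "__main__":
    found = nic_drivers()
    for iface, driver in found.items():
        print(f"{iface}: {driver}")
    for driver, ifaces in shared_driver_groups(found).items():
        print(f"driver {driver} is shared by: {', '.join(ifaces)}")
```

If the control net interface and the experiment interfaces are listed under the same driver, replacing that driver affects both at once.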

The loss of the control net connection after upgrade/reboot can result
from many things: e.g. old kernel with insufficient driver support,
installation of NetworkManager due to package dependencies. For the
latter, please search the forum for old posts that tell you how to
disable NetworkManager, e.g.
https://groups.google.com/g/cloudlab-users/c/B6rNj7Vhltk/m/rwkHf_kwAgAJ .
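
Before rebooting after a package upgrade, it may help to check whether NetworkManager got pulled in and started. The sketch below only wraps `systemctl is-active`; the `unit_state` helper and its injectable `run` parameter are illustrative assumptions, and it does not perform the disabling steps from the linked post.

```python
import subprocess


def unit_state(unit, run=subprocess.run):
    """Return systemd's one-word state for a unit, e.g. 'active' or 'inactive'."""
    result = run(["systemctl", "is-active", unit],
                 capture_output=True, text=True)
    # systemctl prints the state on stdout even when the exit code is non-zero.
    return result.stdout.strip() or "unknown"


if __name__ == "__main__":
    print("NetworkManager:", unit_state("NetworkManager.service"))
```

Anything other than `inactive` here suggests NetworkManager may try to manage the interfaces on the next boot.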

> status
> page: https://www.cloudlab.us/status.php?uuid=610603da-a2b7-11ef-af1a-e4434b2381fc

Sorry we didn't get to look at this before it expired during the night.

> Nurlan

David

Nurlan Nazaraliyev

Nov 15, 2024, 8:57:15 PM
to cloudlab-users
Thanks for your reply. 

Right now, I have a project (https://www.cloudlab.us/status.php?uuid=0541718a-a3b5-11ef-af1a-e4434b2381fc) with a similar problem.

I have turned off NetworkManager by following the older post you linked. I have installed a custom Linux image (5.14-rc5) and want to boot into it. However, watching the console, I can see that the node keeps rebooting. Why is this happening?

Any help appreciated.

Best,
Nurlan

Nurlan Nazaraliyev

Nov 15, 2024, 8:58:20 PM
to cloudlab-users
That node is 'node-0' and it can boot successfully with the default image (5.15.0-122-generic) but cannot boot with my custom image.

David M Johnson

Nov 15, 2024, 9:11:13 PM
to cloudla...@googlegroups.com
On 11/15/24 18:57, 'Nurlan Nazaraliyev' via cloudlab-users wrote:
> Thanks for your reply.
>
> Right now, I have a project
> (https://www.cloudlab.us/status.php?uuid=0541718a-a3b5-11ef-af1a-e4434b2381fc) with a similar problem.
>
> I have turned off the network manager by following the link from the
> older posts you shared. I have installed a custom linux image (5.14-rc5)
> and want to boot into that image. However, I am using consol and as I
> see, the node keeps rebooting. Why is this happening?

If you look at the Console Log for this node (List View, gear icon on
right end of node row), you can see that the boot for 5.14-rc5 doesn't
say anything about finding the mellanox NICs, and the systemd services
fail to find the network (since the NICs are not available).

Most likely your custom kernel is not built with the `mlx5` driver;
you'll need to verify and rebuild if not. Here is a previous post where
I suggest a strategy to build custom kernels by reusing the Ubuntu
kernel config as your starting point, so that your custom kernel can
boot on a wide range of hardware
(https://groups.google.com/g/cloudlab-users/c/y3SwBbUs2oE/m/M67jRhFSAAAJ).
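
One way to verify this before rebooting is to scan the kernel `.config` for the Mellanox options. The sketch below is an assumption about which symbols matter: `CONFIG_MLX5_CORE` and `CONFIG_MLX5_CORE_EN` are the usual mlx5 core and Ethernet options, but the `REQUIRED` list and the helper names are illustrative, not taken from the linked post.

```python
import re
import sys

# Assumed minimum for mlx5 Ethernet support; extend as needed.
REQUIRED = ("CONFIG_MLX5_CORE", "CONFIG_MLX5_CORE_EN")


def parse_kconfig(text):
    """Parse kernel .config text into {option: value}; unset options map to 'n'."""
    opts = {}
    for line in text.splitlines():
        match = re.match(r"^(CONFIG_\w+)=(.*)$", line)
        if match:
            opts[match.group(1)] = match.group(2).strip('"')
            continue
        match = re.match(r"^# (CONFIG_\w+) is not set$", line)
        if match:
            opts[match.group(1)] = "n"
    return opts


def missing_options(opts, required=REQUIRED):
    """List required options that are neither built in (y) nor modules (m)."""
    return [name for name in required if opts.get(name, "n") not in ("y", "m")]


if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else ".config"
    with open(path) as config_file:
        missing = missing_options(parse_kconfig(config_file.read()))
    print("missing:", ", ".join(missing) if missing else "none")
```

Note that an option set to `m` only helps if the module is actually installed and available at boot, so a passing check here is necessary but not sufficient.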

> Nurlan

David

Nurlan Nazaraliyev

Nov 18, 2024, 1:08:33 PM
to cloudlab-users
Thanks a lot. I solved the problem by modifying the .config file. 