Regarding Sm110p nodes - Wisc Cluster


Subitsha Kamal

Jan 10, 2024, 11:58:32 PM
to cloudlab-users
Hi, 

I have started an experiment with two nodes of type sm110p. When I checked which NICs are present, I found this:

subitsha@node0:~$ lspci | grep Mellanox
0000:51:00.0 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
0000:51:00.1 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
0000:8a:00.0 Ethernet controller: Mellanox Technologies MT2894 Family [ConnectX-6 Lx]
0000:8a:00.1 Ethernet controller: Mellanox Technologies MT2894 Family [ConnectX-6 Lx]

In this screenshot, you can see that only mlx5_2, which corresponds to the ConnectX-6 Lx, shows its port as active. All the others show their ports as down.

However, the CloudLab hardware page (refer here) states that on sm110p nodes the dual-port Mellanox ConnectX-6 LX 25Gb NIC is not available for experiment use, while both ports of the dual-port Mellanox ConnectX-6 DX 100Gb NIC are available for experiment use. The PSID of the ConnectX-6 Lx that shows the active port is DEL0000000031, and I suspect that is why I am unable to do IPsec crypto offload, as the settings below show. Hence I would like to know how I can use the ConnectX-6 Dx ports. I don't know why the ports are down on the Dx. Kindly help me out with this.

Screenshot 2024-01-10 at 8.12.05 PM.png

Screenshot 2024-01-10 at 9.04.05 PM.png
Screenshot 2024-01-10 at 9.04.33 PM.png

Mike Hibler

Jan 11, 2024, 12:56:24 AM
to cloudla...@googlegroups.com
Our startup script will configure the experiment interface with a 10.x.x.x
address and bring it up. But when I logged into one of your nodes, the
interface was down. When I reconfigured it, ibv_devinfo shows it as up:

hca_id: mlx5_1
        transport: InfiniBand (0)
        fw_ver: 22.36.1010
        node_guid: b83f:d203:0013:0c57
        sys_image_guid: b83f:d203:0013:0c56
        vendor_id: 0x02c9
        vendor_part_id: 4125
        hw_ver: 0x0
        board_id: MT_0000000437
        phys_port_cnt: 1
                port: 1
                        state: PORT_ACTIVE (4)
                        max_mtu: 4096 (5)
                        active_mtu: 4096 (5)
                        sm_lid: 0
                        port_lid: 0
                        port_lmc: 0x00
                        link_layer: Ethernet

(On the other node it is mlx5_0 instead of mlx5_1). So something you did
after the initial boot caused the interfaces to go down.
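
To check port state across several devices at once, the ibv_devinfo text can be parsed instead of read by eye. The following is a minimal Python sketch and not part of the original exchange: the function name parse_ibv_devinfo is made up, and the embedded sample is trimmed from the output above with a hypothetical second device added to show a down port.

```python
def parse_ibv_devinfo(text):
    """Return one dict per port found in `ibv_devinfo` output.

    Each dict has the keys hca_id, port, state and link_layer.
    Indentation is ignored, so flattened copies of the output also work.
    """
    ports = []
    hca = None
    current = None
    for raw in text.splitlines():
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue  # blank line or a line without "key: value"
        key, value = key.strip(), value.strip()
        if key == "hca_id":
            hca = value
            current = None  # fields up to the next "port:" describe the device
        elif key == "port":
            current = {"hca_id": hca, "port": int(value)}
            ports.append(current)
        elif current is not None and key in ("state", "link_layer"):
            # "PORT_ACTIVE (4)" -> "PORT_ACTIVE"
            current[key] = value.split()[0] if value else ""
    return ports


# Sample text: the first device is trimmed from the thread; the second
# device is hypothetical and only illustrates a port that is down.
SAMPLE = """\
hca_id: mlx5_1
fw_ver: 22.36.1010
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
link_layer: Ethernet

hca_id: mlx5_0
phys_port_cnt: 1
port: 1
state: PORT_DOWN (1)
link_layer: Ethernet
"""

if __name__ == "__main__":
    for p in parse_ibv_devinfo(SAMPLE):
        print(p["hca_id"], p["port"], p["state"], p["link_layer"])
```

On a real node the input would be the captured output of ibv_devinfo rather than SAMPLE.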

Subitsha Kamal

Jan 11, 2024, 1:08:49 AM
to cloudlab-users
Hi, how did you reconfigure it? Did you install the Mellanox OFED driver? What I did was this: once I SSHed into the nodes, I installed the MLNX_OFED driver MLNX_OFED_LINUX-5.8-3.0.7.0-ubuntu22.04-x86_64.tgz, and only after that did ibv_devinfo show the output that I shared with you. Here is my experiment page: https://www.cloudlab.us/status.php?uuid=3a396a31-adce-11ee-9f39-e4434b2381fc

Also, I need to do IPsec crypto offload with the ConnectX-6 NIC. NVIDIA's website says that the ConnectX-6 Dx and Lx both support IPsec offload. However, this is the output I get when I try it: it says "off" and "fixed" for the TLS offload and ESP hardware offload features.

Screenshot 2024-01-10 at 8.12.05 PM.png
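
For anyone reproducing this check without the screenshot: the "off" and "fixed" states come from the feature list printed by `ethtool -k <interface>`. Below is a minimal Python sketch that parses that text. It assumes the feature names esp-hw-offload, tls-hw-tx-offload and tls-hw-rx-offload as printed by recent ethtool versions; the function names, the interface name and the sample text are illustrative and not taken from the thread.

```python
import re

# Feature names as printed by `ethtool -k`; assumed, not copied from the thread.
CRYPTO_FEATURES = ("esp-hw-offload", "tls-hw-tx-offload", "tls-hw-rx-offload")

# Matches lines such as "esp-hw-offload: off [fixed]".
_FEATURE_LINE = re.compile(r"^\s*([\w-]+):\s+(on|off)\b(.*)$")


def parse_features(ethtool_k_text):
    """Map feature name -> (enabled, fixed) from `ethtool -k` output."""
    features = {}
    for line in ethtool_k_text.splitlines():
        match = _FEATURE_LINE.match(line)
        if match:
            name, state, rest = match.groups()
            features[name] = (state == "on", "[fixed]" in rest)
    return features


def crypto_offload_report(ethtool_k_text):
    """Summarize the crypto-related offload features in plain words."""
    features = parse_features(ethtool_k_text)
    report = {}
    for name in CRYPTO_FEATURES:
        if name not in features:
            report[name] = "not reported"
            continue
        enabled, fixed = features[name]
        if enabled:
            report[name] = "on"
        elif fixed:
            report[name] = "off, cannot be enabled"
        else:
            report[name] = "off, can be toggled"
    return report


# Illustrative sample; the interface name and the values are made up.
SAMPLE = """\
Features for enp81s0f0:
rx-checksumming: on
esp-hw-offload: off [fixed]
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off
"""

if __name__ == "__main__":
    for feature, status in crypto_offload_report(SAMPLE).items():
        print(feature, "->", status)
```

A feature marked "[fixed]" cannot be changed with `ethtool -K`, so "off [fixed]" means the driver does not expose that offload on the device as currently set up.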

Mike Hibler

Jan 11, 2024, 10:35:18 AM
to cloudla...@googlegroups.com
All I did was run our startup script to configure the IP address.
If you really want to know:

sudo /usr/local/etc/emulab/rc/rc.ifconfig boot

Or you could reboot the machine.

The key thing from the standpoint of ibv_devinfo was that the interface
was not up. So when you installed the drivers that would have reset the
state of the interfaces.

It is possible that there is a vendor-branded version of the firmware
running on the cards that restricts what operations can be done, but I
doubt it. I will engage the Wisconsin people who purchased the machines
and see what they know.
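
Since the key factor for ibv_devinfo was whether the interface was up, a quick way to see interface state is to read it from Linux sysfs. This is a minimal Python sketch that assumes the standard /sys/class/net/<iface>/operstate files; the function name interface_states is made up for this example.

```python
from pathlib import Path


def interface_states(sys_class_net="/sys/class/net"):
    """Return {interface name: operstate} for every network interface.

    The directory is a parameter so the function can be pointed at a
    test directory instead of the real sysfs tree.
    """
    states = {}
    for entry in sorted(Path(sys_class_net).iterdir()):
        operstate = entry / "operstate"
        if operstate.is_file():
            states[entry.name] = operstate.read_text().strip()
    return states


if __name__ == "__main__":
    for name, state in interface_states().items():
        print(name, state)
```

If the experiment interface shows "down" after a driver install, rerunning the startup script quoted above or rebooting the node is what restored it in this thread.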


Mike Hibler

Jan 11, 2024, 11:11:34 AM
to cloudla...@googlegroups.com
Wisconsin people say that the SM nodes have non-crypto versions of the
ConnectX-6. I believe the Clemson "r650" and "r6525" nodes have
crypto-enabled versions.

Subitsha Kamal

Jan 14, 2024, 11:42:21 PM
to cloudlab-users
Hi. I tried creating an experiment with both r650 and r6525 nodes, but it looks like these nodes are unavailable. What can I do now? Is there any way to register for these nodes? I need two nodes of either type.

"*** Resource reservation violation: 2 nodes of type r6525 requested, but only 1 available because of existing resource reservations to other projects or users. TIMESTAMP: 23:35:45:048830 Released the mapper lock after RunAssign1"

Mike Hibler

Jan 15, 2024, 10:08:23 AM
to cloudla...@googlegroups.com
You need to make a reservation:
http://docs.cloudlab.us/reservations.html
Those node types are popular, so you are unlikely to get them as
a "walk in".

Subitsha Kamal

Jan 23, 2024, 3:23:47 PM
to cloudlab-users
Sure. I now have nodes of type r6525 from the Clemson cluster. When I SSHed into the nodes, I found that there are two types of NICs available, ConnectX-6 Dx and ConnectX-5. How can I use the ConnectX-6 Dx NIC? Is there a command or some other way to enable the ConnectX-6 Dx NIC for my project? I basically want to use the ConnectX-6 Dx at 81:00.0. But when I checked ip route, it shows an interface that runs on the ConnectX-5, and the ConnectX-5 does not have the IPsec offload feature.

Screenshot 2024-01-23 at 1.19.16 PM.png
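
As a first step toward the question above, the network interface that belongs to a given PCI address can be looked up in Linux sysfs. This is a minimal Python sketch, assuming the standard /sys/bus/pci/devices/<address>/net/ layout and PCI domain 0000 when the address is given in short form; the function name netdevs_for_pci is made up for this example.

```python
from pathlib import Path


def netdevs_for_pci(pci_addr, sysfs_root="/sys"):
    """Return the network interface names bound to a PCI address.

    Accepts "81:00.0" or "0000:81:00.0". The short form is assumed to
    live in PCI domain 0000. Returns an empty list when the device has
    no network interface or does not exist.
    """
    if pci_addr.count(":") == 1:
        pci_addr = "0000:" + pci_addr
    net_dir = Path(sysfs_root) / "bus" / "pci" / "devices" / pci_addr / "net"
    if not net_dir.is_dir():
        return []
    return sorted(p.name for p in net_dir.iterdir())


if __name__ == "__main__":
    # 81:00.0 is the ConnectX-6 Dx address mentioned in the post above.
    print(netdevs_for_pci("81:00.0"))
```

This only identifies the interface name for the card; it does not change which NIC CloudLab assigns to experiment links.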