C6220 infiniband not working

ruihong wang

Jun 16, 2022, 2:59:36 PM
to cloudlab-users
Hi,

I found that InfiniBand is not available on many of the C6220 nodes.
For example, on apt14 the self test shows that there are no adapters on the PCIe bus.

Ruihong@node-14:~$ sudo /usr/bin/hca_self_test.ofed

---- Performing Adapter Device Self Test ----
Number of CAs Detected ................. 0
PCI Device Check ....................... FAIL
   REASON: no CAs in the system
 

On apt 41, 46 and 48, ibstat shows that the port is down.

Ruihong@node-13:~$ ibstat
CA 'mlx4_0'
       CA type: MT4099
       Number of ports: 1
       Firmware version: 2.36.5000
       Hardware version: 1
       Node GUID: 0x0002c90300168fd0
       System image GUID: 0x0002c90300168fd3
       Port 1:
               State: Down
               Physical state: Disabled
               Rate: 40
               Base lid: 0
               LMC: 0
               SM lid: 0
               Capability mask: 0x00010000
               Port GUID: 0x0202c9fffe168fd1
               Link layer: Ethernet


Can anyone help me to fix that?

Thanks,

Ruihong

Leigh Stoller

Jun 16, 2022, 3:05:56 PM
to cloudla...@googlegroups.com

> I found the Infiniband is not available in many nodes of C6220.
> For example in apt14, it shows there is no adapters on PCIe bus.

Hi. The first thing to try is to power cycle the nodes to see if
that brings the interfaces back online. There is a power cycle
option on the status page.

Leigh


ruihong wang

Jun 16, 2022, 3:45:08 PM
to cloudlab-users
That worked for apt14, but the IB ports are still down on apt 41, 46 and 48 after the power cycle. Any other suggestions?

Thanks,

Ruihong

Kirk Webb

Jun 16, 2022, 3:49:57 PM
to cloudlab-users
Send a link to your experiment and I'll have a look.

ruihong wang

Jun 16, 2022, 3:57:46 PM
to cloudlab-users

Kirk Webb

Jun 16, 2022, 6:13:42 PM
to cloudlab-users
OK, I believe I've fixed all IB port state issues for the nodes in
your experiment. There were a variety of issues, largely incorrect
Mellanox NIC port type settings (hard set to Ethernet). apt002 is
faulty, however, and I do not even see the Mellanox adapter on the PCI
bus. We'll probably need to take a closer look at that one later, so
I will schedule it to go out of service when your experiment
terminates.

-Kirk

ruihong wang

Jun 16, 2022, 9:25:16 PM
to cloudlab-users
Thank you for your help.

May I know how I can avoid the incorrect port type setting when I initialize the servers again?
I would like to rebuild the cluster without the faulty node, but I am afraid that some ports will still be down.
I also found that apt19's RDMA latency and bandwidth are worse than those of the other nodes when running perftest. How can this problem be fixed?

Thanks,

Ruihong

Kirk Webb

Jun 17, 2022, 1:30:37 AM
to cloudlab-users
It is completely unclear how to fix the issues you describe with
apt019 without going through additional troubleshooting, which could
ultimately be extensive. It would take too much time to describe the
steps taken to clear up the various port issues. If you run into
further problems, please report them.

My advice: Allocate a couple more nodes than you will need in case you
run into IB connectivity problems with a few of them.

-Kirk
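
One way to act on the advice above is to check each node's `ibstat` output right after allocation and set aside any node whose port is not usable. The following is only a minimal sketch, not something posted in this thread: the function names `parse_ibstat` and `port_problems` are hypothetical, and it assumes that a healthy port reports `State: Active` and `Link layer: InfiniBand`. The sample text is the `ibstat` output quoted earlier in the thread.

```python
import re

# ibstat output quoted earlier in this thread (port down, link layer Ethernet).
SAMPLE = """\
CA 'mlx4_0'
       CA type: MT4099
       Number of ports: 1
       Firmware version: 2.36.5000
       Hardware version: 1
       Node GUID: 0x0002c90300168fd0
       System image GUID: 0x0002c90300168fd3
       Port 1:
               State: Down
               Physical state: Disabled
               Rate: 40
               Base lid: 0
               LMC: 0
               SM lid: 0
               Capability mask: 0x00010000
               Port GUID: 0x0202c9fffe168fd1
               Link layer: Ethernet
"""


def parse_ibstat(text):
    """Parse ibstat text into a list of dicts, one per port (hypothetical helper)."""
    ports = []
    ca = None
    port = None
    for line in text.splitlines():
        s = line.strip()
        m = re.match(r"CA '(.+)'$", s)
        if m:
            # A new adapter section starts; fields until the next
            # "Port N:" line belong to the adapter, not to a port.
            ca = m.group(1)
            port = None
            continue
        m = re.match(r"Port (\d+):$", s)
        if m:
            port = {"ca": ca, "port": int(m.group(1))}
            ports.append(port)
            continue
        if port is not None and ":" in s:
            key, _, value = s.partition(":")
            port[key.strip()] = value.strip()
    return ports


def port_problems(text):
    """Return human-readable problems; an empty list means every port looks usable."""
    ports = parse_ibstat(text)
    if not ports:
        return ["no adapter ports found"]
    problems = []
    for p in ports:
        label = f"{p['ca']} port {p['port']}"
        # Assumption: a usable IB port is Active with an InfiniBand link layer.
        if p.get("State") != "Active":
            problems.append(f"{label}: state is {p.get('State')}")
        if p.get("Link layer") != "InfiniBand":
            problems.append(f"{label}: link layer is {p.get('Link layer')}")
    return problems


if __name__ == "__main__":
    for problem in port_problems(SAMPLE):
        print(problem)
```

To use it on a real node, pass in the text captured from running `ibstat` there instead of `SAMPLE`. A check like this only flags a bad port; it does not change the port type setting, and it would not detect a node that is merely slower than the others.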