Experimental IP unavailable

Amanda Baran

Apr 24, 2026, 1:32:42 PM
to cloudlab-users
Hi,

It looks like some of the nodes in my experiment are failing to bind to the requested IP address and RDMA devices because the experimental link is not up. I am seeing this specifically on node1 and node2 in my experiment, but other nodes may be affected as well. Is there a way to fix this on my end? I can't seem to bring up the link.

https://www.cloudlab.us/status.php?uuid=21d81089-db96-4997-a695-30fe8eb082a7

Mike Hibler

Apr 24, 2026, 4:32:33 PM
to cloudla...@googlegroups.com
Sometimes the Mellanox NIC on those machines is not seen by the OS on boot.
If you power cycle them (from the node action dropdown) then that usually
fixes it. Note you must power cycle and not just "reboot".

Mike Hibler

Apr 24, 2026, 4:36:53 PM
to cloudla...@googlegroups.com
Oh, and I have already power cycled node1 and node2 in your experiment.

Amanda Baran

May 5, 2026, 1:32:41 PM
to cloudlab-users
Hi Mike,

I seem to have this issue with every experiment launched on c6525-25g or c6525-100g nodes, which are the only node types I can reserve that have both a strong enough CPU and a reasonably recent RNIC. I am currently experiencing this problem on this experiment: https://www.cloudlab.us/status.php?uuid=b39020eb-48bf-4a26-85fb-d2c1af935347#

I tried power cycling all nodes as suggested, but I still cannot bring up the experimental link. See the output below.
Are there any other steps to remedy this? My research suggests this can only be fixed at the administrative level or requires adjustments to the switch itself.

@node0:~$ sudo ethtool enp65s0f0np0
Settings for enp65s0f0np0:
        Supported ports: [ Backplane ]
        Supported link modes:   1000baseKX/Full
                                10000baseKR/Full
                                40000baseKR4/Full
                                40000baseCR4/Full
                                40000baseSR4/Full
                                40000baseLR4/Full
                                25000baseCR/Full
                                25000baseKR/Full
                                25000baseSR/Full
                                50000baseCR2/Full
                                50000baseKR2/Full
                                100000baseKR4/Full
                                100000baseSR4/Full
                                100000baseCR4/Full
                                100000baseLR4_ER4/Full
        Supported pause frame use: Symmetric
        Supports auto-negotiation: Yes
        Supported FEC modes: None RS BASER
        Advertised link modes:  1000baseKX/Full
                                10000baseKR/Full
                                40000baseKR4/Full
                                40000baseCR4/Full
                                40000baseSR4/Full
                                40000baseLR4/Full
                                25000baseCR/Full
                                25000baseKR/Full
                                25000baseSR/Full
                                50000baseCR2/Full
                                50000baseKR2/Full
                                100000baseKR4/Full
                                100000baseSR4/Full
                                100000baseCR4/Full
                                100000baseLR4_ER4/Full
        Advertised pause frame use: Symmetric
        Advertised auto-negotiation: Yes
        Advertised FEC modes: Not reported
        Speed: Unknown!
        Duplex: Unknown! (255)
        Auto-negotiation: on
        Port: Direct Attach Copper
        PHYAD: 0
        Transceiver: internal
        Supports Wake-on: d
        Wake-on: d
        Link detected: no
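
For anyone who needs to check many nodes, the fields that matter in the output above ("Speed" and "Link detected") can be pulled out with a few lines of Python. This is only a rough sketch: the function names parse_ethtool and link_is_up are made up for illustration, and SAMPLE is an abbreviated copy of the output posted above, not a fresh capture.

```python
def parse_ethtool(text):
    """Collect 'Key: value' fields from `ethtool <iface>` output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        # Lines without a colon (continuation lines of the link-mode lists)
        # and header lines with nothing after the colon are skipped.
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


def link_is_up(text):
    """True only when ethtool reports 'Link detected: yes'."""
    return parse_ethtool(text).get("Link detected") == "yes"


# Abbreviated copy of the output posted above (assumption: same field layout).
SAMPLE = """\
Settings for enp65s0f0np0:
        Supported ports: [ Backplane ]
        Supported link modes:   1000baseKX/Full
                                10000baseKR/Full
        Speed: Unknown!
        Duplex: Unknown! (255)
        Auto-negotiation: on
        Port: Direct Attach Copper
        Link detected: no
"""

if __name__ == "__main__":
    fields = parse_ethtool(SAMPLE)
    print("speed:", fields.get("Speed"))
    print("link up:", link_is_up(SAMPLE))
```

On a live node the text would come from running ethtool itself (for example via subprocess) rather than from a pasted string. For multi-line fields such as "Supported link modes", only the first value is kept, which is enough for a link check.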

Mike Hibler

May 5, 2026, 1:35:40 PM
to cloudla...@googlegroups.com
Sorry, your message got stuck wanting approval. I have whitelisted your
email for the future.

I will look at this.

Mike Hibler

May 5, 2026, 1:53:14 PM
to cloudla...@googlegroups.com
So Aleks explained what is happening with the -100g nodes.

You say that you have "this issue" with the -25g nodes as well. What is the
issue in this case? Is it that the interface does not appear in the OS? That
there is no link on the configured interface? Next time it happens on one of
those nodes, let us know.

Note that on the Topology View tab of your experiment, there is a
"Run Linktest" button on the bottom right of the frame. That will run a
quick connectivity test of the interfaces we enabled and configured to see
if things are configured as your profile indicates.
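
The two cases asked about above (the interface does not appear in the OS at all, versus it appears but has no link) can be told apart from the node itself. The following is a minimal sketch, assuming a Linux node that exposes the standard /sys/class/net tree; the function name interface_state is hypothetical, and this is not part of Linktest.

```python
import os
import sys


def interface_state(iface, sysfs_root="/sys/class/net"):
    """Return 'missing' if the OS has no such interface, else its operstate.

    An operstate of 'down' covers both an administratively down interface
    and one with no carrier, so bring the interface up before concluding
    that the link itself is the problem.
    """
    path = os.path.join(sysfs_root, iface)
    if not os.path.isdir(path):
        return "missing"  # the NIC port was not seen by the OS
    try:
        with open(os.path.join(path, "operstate")) as f:
            return f.read().strip()  # typically 'up', 'down', or 'unknown'
    except OSError:
        return "unknown"


if __name__ == "__main__":
    # Usage: pass one or more interface names, e.g. enp65s0f0np0
    for name in sys.argv[1:]:
        print(name, interface_state(name))
```

A result of "missing" corresponds to the case where a power cycle was suggested earlier in the thread; "down" on a configured interface is the no-link case worth reporting.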


Amanda Baran

May 5, 2026, 2:10:10 PM
to cloudlab-users
Thanks, Mike and Aleks. We were unaware of the need to specify the link speed in the profile when instantiating the experiment. I believe the issue I observed on the -25g nodes was resolved with a power cycle.
Once we finalize a few things, I will need a reservation of 19 -100g nodes to run our full test suite with 3 replicas and 16 servers. Hopefully, this has resolved most of our questions and recurring issues.