Experiment fails to terminate


Christopher Canel

May 18, 2026, 1:35:09 PM
to cloudlab-users
Hello CloudLab folks,

My experiment https://www.cloudlab.us/status.php?uuid=6a097963-95c8-4fb6-ad3b-2cea47f7ef0d was due to expire over the weekend, but it seems to be stuck terminating. I suspect this is because the switch ualloc-mlnx1 is stuck in a state change. 

Would you please help terminate this experiment?

Thanks,
Chris


Mike Hibler

May 18, 2026, 2:09:35 PM
to cloudla...@googlegroups.com
Since your experiment involved an actual L2 switch, its experiment links
traversed our L1 switch. At some point during your experiment, that switch
stopped responding on its management interface. Thus when you went to
terminate your experiment, we could not talk to that switch and your
experiment would not terminate.

So we will try to get the switch running again so that your experiment
terminates or we will manually clean it up. Thanks for bringing this
to our attention.

Christopher Canel

May 18, 2026, 2:33:36 PM
to cloudla...@googlegroups.com
Hi Mike,

Thanks for your help.

I observed that ualloc-mlnx1 was experiencing problems with its management interface right from the start. It appeared as if the switch was unable to check in after provisioning, and the experiment was marked as failed to start. Rebooting the switch didn’t fix the management issue. But the data plane was working fine, so I bypassed the error and used the experiment anyway.

I’d like to use ualloc-mlnx1 for future experiments, so it would be awesome to fix this. Perhaps I’m setting the switch up incorrectly in my profile? I used the examples in the manual as a reference.

Thanks,
Chris

Mike Hibler

May 18, 2026, 3:04:29 PM
to cloudla...@googlegroups.com
The current problem is not the Mellanox L2 switch; it is how our infrastructure
wires the L2 switch and the nodes together. That is done by programming the
Netscout L1 switch to which all xl170 nodes and the ualloc-* switches are
connected, creating layer 1 paths between a node NIC port and the L2 switch
port. The Netscout switch appears to have died after your experiment was set up,
so the dataplane should still work because those L1 paths are already in place.
We just cannot tear down those paths now since we cannot talk to the management
interface on the Netscout to do so.

There may well be issues with the Mellanox switch too. The management port for
that switch is a serial port that we give you proxied access to.

Anyway, if we cannot get the Netscout switch working again with a power cycle
and/or reseating cards, then there will be no way to do experiments with the
ualloc switches going forward. Parts for the Netscout switches are just too
expensive, and at that point both Netscout switches will have died.

Christopher Canel

May 19, 2026, 2:20:11 AM
to cloudla...@googlegroups.com
Hi Mike,

Thank you for explaining.

The experiment I'm trying to perform is pretty simple: run DCTCP over an incast topology where the bottleneck switch queue is configured with ECN marking. Do any other CloudLab switches have ECN marking enabled, or is there a process where I could request a one-off configuration of ECN marking for the one switch egress queue/port corresponding to my receiver host? Do you know what other CloudLab users do if they need ECN marking? I really appreciate any advice you have about how to achieve this.

Thanks,
Chris