Help restart one r7525 node

Songyu Zhang

Feb 19, 2026, 2:19:56 AM
to cloudlab-users
Hi admins,

In our current experiment (CloudLab - Experiment Status), one node (clgpu022) failed to restart during GPU driver installation.
It has been stuck in the notready state for a long time.
Could you help us reboot the machine and check it for potential issues?

Thanks
Songyu

Songyu Zhang

Feb 19, 2026, 6:01:17 AM
to cloudlab-users
Hi admins,

I reset the machine and retried the GPU driver installation.
It booted successfully this time, but only after a very long wait.

Thanks
Songyu

Mike Hibler

Feb 19, 2026, 8:29:44 AM
to 'Songyu Zhang' via cloudlab-users
This generally indicates that the processor on the BlueField2 smart NIC
has become inaccessible to the system, and reboots can take tens of minutes
as a result. The solution is to reinitialize the BF2 card which itself is
a lengthy process. We have a process that involves loading a custom disk
image with all the necessary packages installed and a startup script to
fire off the necessary scripts. We generally only do this between experiments
since it wipes out the contents of the local disk.

If it is not affecting your experiment now, then we can just let it be,
though any reboot of the node will take a long time.

Songyu Zhang

Feb 19, 2026, 1:42:10 PM
to cloudlab-users
Hi Mike,

Thanks for this useful information.
The node looks fine now.

Songyu