Need urgent help


ashiq rahman

Jul 2, 2025, 1:16:38 PM
to cloudla...@googlegroups.com
Dear Concern,

The server instance is not booting.
image.png

Status: "changing" but still unresponsive.

Thanks
Ashiqur

Mike Hibler

Jul 2, 2025, 1:45:31 PM
to cloudla...@googlegroups.com
Did it come up when you first instantiated the experiment?
If so, did you install any packages or are you using a custom kernel?

On Wed, Jul 02, 2025 at 01:16:17PM -0400, ashiq rahman wrote:
> Dear Concern,
>
> The server instance is not booting.
> image.png
>
> Status: "changing" but still unresponsive.
> https://www.cloudlab.us/status.php?uuid=baf3d8d1-5762-11f0-af1a-e4434b2381fc#
>
> Please have a look.
>
> Thanks
> Ashiqur


ashiq rahman

Jul 2, 2025, 3:15:20 PM
to cloudlab-users
Hello,

I have installed

sudo apt install nvidia-driver-535 nvidia-dkms-535

Is this driver OK?

David M Johnson

Jul 2, 2025, 3:35:19 PM
to cloudla...@googlegroups.com
On 7/2/25 13:15, ashiq rahman wrote:
> Hello,
>
> I have installed
>
> sudo apt install nvidia-driver-535 nvidia-dkms-535
>
> Is this driver OK?

The driver is fine, but the NVIDIA packages do two things that can make
your node become unresponsive. First, they install NetworkManager,
which conflicts with our management of the control network interface.
To disable NetworkManager, see
https://groups.google.com/g/cloudlab-users/c/B6rNj7Vhltk/m/rwkHf_kwAgAJ
(`sudo systemctl disable NetworkManager NetworkManager-wait-online`).
Second, sometimes the packages will suspend the processor; see
https://groups.google.com/g/cloudlab-users/c/Dyn1HYUEkqc/m/IkYcovRBAgAJ
for a workaround (`sudo systemctl mask sleep.target suspend.target
hibernate.target hybrid-sleep.target`).

You can login on serial console using the provided root password and
make these changes to your node.
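
As a rough sketch of how the two workarounds above could be applied from the serial console, the script below wraps the same `systemctl` commands in a dry-run helper so they can be reviewed before anything changes. The `run` helper and the `DRY_RUN` variable are assumptions made for this illustration; they are not part of CloudLab or systemd.

```shell
#!/bin/sh
# Sketch only: applies the two workarounds described above.
# DRY_RUN and run() are hypothetical names introduced for this example.
# By default nothing is changed; set DRY_RUN=0 to execute the commands.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        # Print the command instead of executing it.
        echo "would run: $*"
    else
        "$@"
    fi
}

# 1. Stop NetworkManager from being started at boot, so it does not
#    interfere with the control network interface.
run sudo systemctl disable NetworkManager NetworkManager-wait-online

# 2. Mask the sleep-related targets so the node cannot suspend or hibernate.
run sudo systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target
```

Running it with `DRY_RUN=0` applies the changes. Since `systemctl disable` only affects future boots, rebooting afterwards is a reasonable way to confirm that the node comes back up.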

David


ashiq rahman

Jul 2, 2025, 3:58:11 PM
to cloudlab-users
I don't have the root password.

"You can login on serial console using the provided root password and
make these changes to your node."

Where can I obtain the root password?

Leigh Stoller

Jul 2, 2025, 4:00:56 PM
to cloudlab-users

> On Jul 2, 2025, at 12:58 PM, ashiq rahman <ashiqur.r...@gmail.com> wrote:
>
> I don't have the root password.
>
> "You can login on serial console using the provided root password and
> make these changes to your node."
>
> Where can I obtain the root password?

Hi. Click on the node in the Topology diagram. Choose the
Console option. The console tab that pops up will have a
little box to click that will show you the password.

Leigh

ashiq rahman

Jul 2, 2025, 11:16:09 PM
to cloudlab-users
Dear Concern,

I am really stuck. I have worked with this node type (d8545, Wisconsin) before without encountering any problems. But now, with the same configuration, I can't use PyTorch. It says "no CUDA GPU", but nvidia-smi reports 4 A100 GPUs. Please help me with this.

1) NVIDIA driver: 535
2) Python: 3.10
3) PyTorch: pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

But simple Python code is not able to access CUDA:

UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:109.)

  return torch._C._cuda_getDeviceCount() > 0

Failure. PyTorch cannot see your GPU.

Please check your NVIDIA driver and CUDA installation. 
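
When reporting a mismatch like this (nvidia-smi lists the GPUs but PyTorch cannot initialize CUDA), it can help to gather the driver version and the CUDA version the PyTorch wheel was built for in one place. The following is a minimal sketch of such a check, not a diagnosis of this particular failure; `check_gpu_stack` is a hypothetical helper name chosen for this example.

```shell
#!/bin/sh
# Sketch only: prints the driver-side and PyTorch-side version facts.
# check_gpu_stack is a hypothetical name, not part of any tool.
check_gpu_stack() {
    # Driver side: GPU model and driver version as reported by nvidia-smi.
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi --query-gpu=name,driver_version --format=csv,noheader \
            || echo "nvidia-smi: query failed"
    else
        echo "nvidia-smi: not found"
    fi

    # PyTorch side: installed version, the CUDA version the wheel was
    # built for, and whether CUDA initializes in this environment.
    if command -v python3 >/dev/null 2>&1; then
        python3 -c 'import torch; print("torch", torch.__version__, "built for CUDA", torch.version.cuda, "available:", torch.cuda.is_available())' \
            || echo "python3: could not import torch"
    else
        echo "python3: not found"
    fi
}

check_gpu_stack
```

Including this output in a follow-up message lets readers compare the installed driver (535 here) against the CUDA build of the wheel (cu124 here) without guessing.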