Wisconsin node c4130-110133 becomes unresponsive frequently

Rajesh Shashi Kumar

Apr 11, 2022, 5:47:53 PM
to cloudlab-users
Hi,

This issue was reported earlier on a different node in this thread (https://groups.google.com/g/cloudlab-users/c/DjhzgXv4xGQ/m/56GUGWQRBAAJ).

I have been facing the same issue on the following c4130 node in the Wisconsin cluster for the past few days. The machine becomes unresponsive every 15-20 minutes, and a reboot from the dashboard is necessary to bring it back to a usable state.

Could you please let me know if I'm missing something in my configuration?

ID: node-0
Node: c4130-110133
Type: c4130
Cluster: Wisc

Thanks,
Rajesh Shashi Kumar

Leigh Stoller

Apr 11, 2022, 6:06:18 PM
to cloudla...@googlegroups.com

> I'm facing the same issue on the following c4130 node in the Wisconsin cluster from the past few days. Machine becomes unresponsive after every 15-20 minutes. A reboot from the dashboard is necessary to bring it to usable state.
>
> Could you please let me know if I'm missing something in configuration?

Good question. The same problem followed you to a different node at
a different cluster. :-) The thing to do at this point is tell us
what packages you installed and what config changes you made to the
node.

Thanks
Leigh

Rajesh Shashi Kumar

Apr 11, 2022, 6:18:09 PM
to cloudlab-users
Thank you for the quick reply. I only installed CUDA on top of the provided RSPEC:

sudo apt-get install linux-headers-$(uname -r)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"
sudo apt-get update
sudo apt-get -y install cuda

RSPEC used:
# Import the Portal object.
import geni.portal as portal
# Import the ProtoGENI library.
import geni.rspec.pg as pg
# Import the Emulab specific extensions.
import geni.rspec.emulab as emulab

# Create a portal object,
pc = portal.Context()

# Create a Request object to start building the RSpec.
request = pc.makeRequestRSpec()

# Node node-0
node_0 = request.RawPC('node-0')
node_0.hardware_type = 'c4130'
node_0.disk_image = 'urn:publicid:IDN+emulab.net+image+emulab-ops//UBUNTU20-64-STD'


# Print the generated rspec
pc.printRequestRSpec(request)

Thanks,
Rajesh

Leigh Stoller

Apr 11, 2022, 6:19:09 PM
to cloudla...@googlegroups.com


> On Apr 11, 2022, at 3:18 PM, 'Rajesh Shashi Kumar' via cloudlab-users <cloudla...@googlegroups.com> wrote:
>
> Thank you for the quick reply. I only installed CUDA on top of the provided RSPEC:
>

Ah. Go to the cloudlab-users group and search for CUDA. The first
match will tell you what is going wrong. https://groups.google.com/g/cloudlab-users/

Leigh

Rajesh Shashi Kumar

Apr 11, 2022, 9:50:08 PM
to cloudlab-users
Hi,

I terminated the previous experiment and started a new one.

This time, before attempting the CUDA installation, I followed the instructions from https://groups.google.com/g/cloudlab-users/c/B6rNj7Vhltk/m/i6wxDdwhAgAJ
The suggested fix there is to run `systemctl disable NetworkManager` before installing.

I still encounter the same issue. Please let me know if I am referring to the correct workaround.

Thanks,
Rajesh

Mike Hibler

Apr 12, 2022, 12:40:25 AM
to 'Rajesh Shashi Kumar' via cloudlab-users
I have the console working again for this particular node. So if it hangs
up again, see if you can connect to the console and get a login prompt.

Rajesh Shashi Kumar

Apr 12, 2022, 1:51:58 AM
to cloudlab-users
Hi,

It is unresponsive again. Connecting to the console from the CloudLab dashboard does not seem to work. I am leaving it without a reboot for now in case that helps.

Just to double check, here's what I had done:
sudo -s
systemctl disable NetworkManager
<install CUDA>

Thank you for your time,
Rajesh
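
Note: the workaround steps above can be checked after the CUDA install finishes. The following is a minimal sketch, not something posted in the thread. It assumes a systemd-based image and that the unit is named `NetworkManager`. The helper names `unit_state` and `workaround_applied` are hypothetical. It is worth checking both states because `systemctl disable` only prevents the unit from starting at boot and does not stop a unit that is already running.

```python
import subprocess


def unit_state(unit):
    """Return the 'is-enabled' and 'is-active' answers systemctl gives for a unit."""
    state = {}
    for query in ("is-enabled", "is-active"):
        try:
            result = subprocess.run(
                ["systemctl", query, unit],
                capture_output=True, text=True, check=False,
            )
            # systemctl prints the state on stdout; errors (e.g. unknown unit) go to stderr.
            state[query] = result.stdout.strip() or result.stderr.strip() or "unknown"
        except FileNotFoundError:
            state[query] = "systemctl not found"
    return state


def workaround_applied(state):
    """True when the unit is neither enabled at boot nor currently running."""
    return state.get("is-enabled") != "enabled" and state.get("is-active") != "active"


if __name__ == "__main__":
    # Assumption: the unit name matches the one used in the workaround.
    s = unit_state("NetworkManager")
    print(s, "workaround applied:", workaround_applied(s))
```

If the check reports the unit as active, `systemctl stop NetworkManager` (or `systemctl disable --now NetworkManager`) stops it as well.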

Mike Hibler

Apr 12, 2022, 10:05:42 AM
to 'Rajesh Shashi Kumar' via cloudlab-users
Are you using all four GPUs on the Wisconsin node? Maybe you should try
using only one or two and see what happens. It is possible there is a thermal
issue. I don't see NetworkManager running, so I assume the workaround was applied correctly.
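
Note: one way to try the suggestion above is to restrict the workload to a subset of GPUs and watch the temperatures while it runs. This is a rough sketch under stated assumptions, not a procedure given in the thread. It assumes `nvidia-smi` is on the PATH after the CUDA install and that the workload honors the standard `CUDA_VISIBLE_DEVICES` environment variable. The helper names are hypothetical.

```python
import os
import subprocess


def restrict_gpus(indices):
    """Expose only the given GPU indices to CUDA programs launched from this process.

    Must be called before CUDA is initialized; child processes inherit the setting.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in indices)
    return os.environ["CUDA_VISIBLE_DEVICES"]


def parse_temperatures(csv_text):
    """Parse 'index, temperature' CSV lines into {gpu_index: degrees_celsius}."""
    temps = {}
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        try:
            temps[int(fields[0])] = int(fields[1])
        except ValueError:
            continue  # skip readings that are not plain integers
    return temps


def read_temperatures():
    """Query nvidia-smi once for the current temperature of every GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_temperatures(out)
```

Calling `restrict_gpus([0, 1])` before launching the workload and logging `read_temperatures()` every few seconds to a file on disk would leave a record to inspect after a hang, which could help confirm or rule out the thermal explanation.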