Clemson Node clgpu001 (c4130) becomes unresponsive

Umakant Kulkarni

Apr 4, 2022, 6:50:22 PM
to cloudlab-users
Hello,

I'm running the following experiment: https://www.cloudlab.us/status.php?uuid=2e7e795e-b3d8-11ec-b318-e4434b2381fc on the Clemson CloudLab cluster, on node clgpu001 (type c4130).

I'm running some machine learning models on the Tesla GPUs. While the experiment is in progress, the node becomes unresponsive and I get dropped from my ssh session. The only way to resume is to restart the node from the CloudLab portal. FYI, the portal still shows the node in the "Ready/Green" state. The node is currently in that unresponsive state, in case you want to take a look at the logs.
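As a hedged aside on looking at the logs: after rebooting, the kernel log from the previous boot can sometimes show what happened just before a hang. The sketch below assumes the node keeps a persistent systemd journal and that the hang is GPU-related, which this thread does not confirm. The helper name filter_gpu_errors is hypothetical; it simply keeps lines containing NVRM or Xid, the markers the NVIDIA kernel driver uses in its messages.

```shell
# Hypothetical helper: keep only NVIDIA driver lines (NVRM / Xid) from stdin.
filter_gpu_errors() {
    grep -i -E 'nvrm|xid' || true   # no match is not treated as an error
}

if command -v journalctl >/dev/null 2>&1; then
    # -k: kernel messages only; -b -1: the boot before the current one.
    # Requires a persistent journal; may need sudo depending on permissions.
    journalctl -k -b -1 --no-pager 2>/dev/null | filter_gpu_errors
fi
```

If nothing is printed, either the journal is not persistent or the driver logged nothing before the hang.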

The same program runs fine on another node (ibm8335), so I assume the issue is not related to the program itself.

Is there any issue with this node or its GPUs?

Thanks,
Umakant 

Leigh Stoller

Apr 5, 2022, 8:41:22 AM
to cloudla...@googlegroups.com

> I'm running some machine learning models on the Tesla GPUs. While the experiment is in progress, the node becomes unresponsive and I get dropped from my ssh session. The only way to resume is to restart the node from the CloudLab portal. FYI, the portal still shows the node in the "Ready/Green" state. The node is currently in that unresponsive state, in case you want to take a look at the logs.

Hi. We will look at the node today; please leave it as is. Thanks!

Leigh


Umakant Kulkarni

Apr 5, 2022, 11:23:29 AM
to cloudlab-users
Sure, it's still in that state.

Thanks again,
Umakant

Umakant Kulkarni

Apr 6, 2022, 11:15:18 AM
to cloudlab-users
Hi,

Did you get a chance to look at this issue?

-Umakant

Leigh Stoller

Apr 6, 2022, 11:28:30 AM
to cloudla...@googlegroups.com

> Did you get a chance to look at this issue?

Hi. This should be resolved as of a few minutes ago. Some faulty
hardware on the node was replaced. Let us know if you have further
problems.

Leigh


Umakant Kulkarni

Apr 6, 2022, 11:29:54 AM
to cloudlab-users
Oh, thank you so much!

Let me try the experiment again.

Regards,
Umakant

Umakant Kulkarni

Apr 6, 2022, 1:47:07 PM
to cloudlab-users
Hi, 

I'm still facing the same issue even while simply installing the packages.

Leigh Stoller

Apr 6, 2022, 1:49:07 PM
to cloudla...@googlegroups.com

> I'm still facing the same issue even while simply installing the packages.

OK, we are looking at it again. Sorry about that.

Leigh

Umakant Kulkarni

Apr 6, 2022, 1:50:25 PM
to cloudlab-users
No worries and thank you again!

Umakant Kulkarni

Apr 7, 2022, 12:07:53 PM
to cloudlab-users
Hi,

Did you get a chance to look at this issue?

-Umakant

Leigh Stoller

Apr 7, 2022, 1:15:24 PM
to cloudla...@googlegroups.com

>
> Did you get a chance to look at this issue?

Hi. Unfortunately, we are not going to be able to fix this node today;
we are going to have to take it offline for diagnostics. Can you please
terminate your experiment so we can regain control of it?

Thanks!
Leigh


Umakant Kulkarni

Apr 7, 2022, 1:16:48 PM
to cloudlab-users
Sure, I've terminated my experiment!

Thanks,
Umakant

Rajesh Shashi Kumar

Apr 11, 2022, 5:42:09 PM
to cloudlab-users
Hi,

Was there a resolution on this thread?

I have been facing the same issue for the past few days on the following c4130 node in the Wisconsin cluster. The machine becomes unresponsive every 15-20 minutes, and a reboot from the dashboard is necessary to bring it back to a usable state.

ID: node-0
Node: c4130-110133
Type: c4130
Cluster: Wisc

Thanks,
Rajesh Shashi Kumar

Umakant Kulkarni

Apr 15, 2022, 6:25:48 PM
to cloudlab-users
I just worked around this issue by making only one GPU visible to CUDA.
I added the following line to ~/.bashrc:

export CUDA_VISIBLE_DEVICES=0

where 0 is the index of the GPU device.
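A minimal sketch of trying the same workaround without editing ~/.bashrc: it sets the variable for the current shell session only and prints it back. The nvidia-smi call is optional and assumes the NVIDIA driver tools are installed. Note that nvidia-smi lists the physical GPUs and does not honor CUDA_VISIBLE_DEVICES, so it will still show every device on the node.

```shell
# Restrict CUDA to GPU 0 for this shell session only (nothing is persisted).
export CUDA_VISIBLE_DEVICES=0

# Confirm what CUDA applications launched from this shell will see.
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"

# Optionally list the physical GPUs so the index can be matched to a device.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L || true   # ignore failures if the driver is not loaded
fi
```

The variable can also be set for a single command by prefixing it, which leaves the rest of the session untouched.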

Rajesh Shashi Kumar

Apr 15, 2022, 6:29:22 PM
to cloudlab-users
I found that this issue occurs only on Ubuntu 20.04, not on Ubuntu 18.04, with the latest CUDA in both cases.

I used the following before the CUDA install to work around the NetworkManager issue:
# Ubuntu 16.04
sudo ln -s /dev/null /etc/systemd/system/NetworkManager.service
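For reference, a unit file symlinked to /dev/null is what systemd treats as a masked unit, so the command above has the same effect as systemctl mask. The sketch below, which assumes the same unit path as the command above, checks whether that symlink is in place and notes how it could be undone later.

```shell
# Path used by the symlink command above.
unit=/etc/systemd/system/NetworkManager.service

# A unit file that is a symlink to /dev/null is treated by systemd as masked.
if [ "$(readlink "$unit" 2>/dev/null)" = "/dev/null" ]; then
    status="masked"
else
    status="not masked"
fi
echo "NetworkManager.service: ${status}"

# To undo the workaround later (run manually):
#   sudo systemctl unmask NetworkManager.service
```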

Thanks,
Rajesh Shashi Kumar