to cloudlab-users
Hi all,
In my experiment zhouaea-103112, my image-backed dataset, which should mount at /usr/local/cuda-11.2, is not mounting. I suspect that this might have something to do with the recent maintenance changes, and may have a fix similar to this discussion post. I could mount image-backed datasets before, and the only change I made was switching to a different image-backed dataset that I'm pretty sure should work.
Thanks for your time!
Neo
Mike Hibler
Jul 27, 2021, 1:14:34 PM
to cloudla...@googlegroups.com
The local part of the dataset setup seems to have gotten into some sort
of half-completed state. I cleared the local setup and am running the
full setup again. The dataset is downloading now. It will probably be
another 10-15 minutes before it is finished. I will let you know.
> --
> You received this message because you are subscribed to the Google Groups
> "cloudlab-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an email
> to cloudlab-user...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/cloudlab-users/bf8f1f6e-dbd4-4b07-80a8-9b5809930831n%40googlegroups.com.
Neo Zhou
Jul 27, 2021, 1:49:48 PM
to cloudlab-users
Sounds good. Did something go wrong? I've found that creating images can take a while.
Mike Hibler
Jul 27, 2021, 2:13:15 PM
to cloudla...@googlegroups.com
The machine was locked up originally when I started looking at the problem,
but I didn't think much of it and just power cycled and started looking at
your problem. However, about 15 minutes into loading your image, it locked
up again.
Unfortunately, that was about an hour ago, before I got side-tracked.
Were you doing anything on the node about an hour ago; e.g., using the GPU?
Neo Zhou
to cloudlab-users
Hmm, interesting. Looking at my console, I set up a process using the screen command. Then I attempted to cd into the mounted directory, and used the df command to see if my mounted directory was there. I never closed the screen command since I got a pipe disconnect on my computer as I was reading through the google forum, so maybe it's eating up resources? I think it can be closed by doing screen -r, and then typing exit. I can do that if that would help.
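For anyone checking the same thing, here is a minimal sketch of testing whether a directory is really a mount point, assuming a Linux node where /proc/mounts is readable. The helper name is_mounted and its optional second argument are hypothetical, added only for illustration; the path in the usage note is the one from the first message.

```shell
#!/bin/sh
# is_mounted DIR [MOUNTS_FILE]
# Hypothetical helper: exits 0 if DIR is listed as a mount point in
# MOUNTS_FILE (default /proc/mounts), and non-zero otherwise.
# DIR must be written exactly as it appears in the mount table
# (absolute path, no trailing slash).
is_mounted() {
    dir="$1"
    mounts="${2:-/proc/mounts}"
    # The second whitespace-separated field of each line is the mount point.
    awk -v d="$dir" '$2 == d { found = 1 } END { exit !found }' "$mounts"
}
```

Usage would look like: is_mounted /usr/local/cuda-11.2 && echo "mounted" || echo "not mounted". Unlike a plain cd into the directory, this distinguishes an empty directory from a dataset that was never mounted.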
Neo Zhou
Jul 27, 2021, 3:06:23 PM
to cloudlab-users
Never mind, I tested the screen thing, but it seems that after rebooting, screen processes automatically close, so that shouldn't have been the issue. Is my only option to reinstall everything in that dataset and try to make an image dataset again?
Mike Hibler
Jul 27, 2021, 3:29:26 PM
to cloudla...@googlegroups.com
I don't think anything is wrong with the dataset; I think the machine is
having problems. It locked up again just sitting there. Let me look at the
management interface.
Neo Zhou
to cloudlab-users
Interesting, it's not supposed to do that? For the past month that I've been working with the VM, whenever I don't type anything for a little while, my ssh connection drops with a broken pipe, and then I have to reboot the VM to get it working again. Maybe I messed with the operating system by accident?
Mike Hibler
Jul 27, 2021, 4:12:46 PM
to cloudla...@googlegroups.com
I don't know for sure what you mean by "the VM"? What you have allocated
now is an actual physical node. Unless you are starting up a VM on it?
Anyway, no it should not be locking up like that. Has it been doing that
in just this instantiation of your profile or in others as well?
I had just power cycled it, logged in and determined that your dataset was
now mounted, went away for a few minutes, and it locked up again. The
management interface shows nothing unusual. Had it been a power or cooling
problem, it would have powered the node off. But it shows it as still
powered on.
Neo Zhou
to cloudlab-users
Yeah, I meant the actual physical node. This has been happening with every single instantiation of every profile I've used. Good to know about the management interface.
Mike Hibler
Jul 27, 2021, 4:30:24 PM
to cloudla...@googlegroups.com
Ah, you are running a custom image. Do you have a custom kernel?
Note that the "management interface" is one that you cannot access directly.
We proxy parts of it so you can power cycle and access the SOL console,
but that is all you can do.
Neo Zhou
to cloudlab-users
I think I'm using the base linux kernel. I used the cloudlab ubuntu 20.04 disk image as a base, and all other disk images just contain that same OS with different data installed.
Mike Hibler
Jul 27, 2021, 5:48:39 PM
to cloudla...@googlegroups.com
All I can suggest is that we load your current machine with the standard
Ubuntu 20 image and see if it happens there. If so, it might be a hardware
problem, if not then it is a side-effect of some software you loaded on the
machine.
If I load the standard image on there, it will wipe out anything you have
installed on the machine since you instantiated the experiment though.
Neo Zhou
to cloudlab-users
Ok, sounds good. I can try this, though I think my issue is definitely a software issue since this has been happening on every single machine I've loaded my profiles on. I'll try redownloading the necessary packages one by one to see which one causes my machine to lock.
Mike Hibler
Jul 27, 2021, 6:50:02 PM
Neo Zhou
to cloudlab-users
It seems that my machine starts locking once I download the CUDA toolkit and don't type anything for a bit. I'm not sure why... Is there a cloudlab timeout where the physical machine will freeze after no inputs?
Leigh Stoller
Jul 28, 2021, 4:23:42 PM
> It seems that my machine starts locking once I download the CUDA toolkit and don't type anything for a bit. I'm not sure why... Is there a cloudlab timeout where the physical machine will freeze after no inputs?
Hi. No, there is no such timeout. So how do you know the machine has
stopped responding? Did you attach to the console? Maybe just your ssh
timed out (which *does* happen).
Leigh
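To help tell an idle ssh timeout apart from a node that has really stopped responding, one option is to turn on client-side keepalives. The sketch below is an assumption about a typical OpenSSH client setup, not something prescribed in this thread. The helper name add_keepalive is hypothetical, and the 60-second interval and count of 3 are arbitrary example values.

```shell
#!/bin/sh
# add_keepalive [CONFIG_FILE]
# Hypothetical helper: appends OpenSSH client keepalive settings to
# CONFIG_FILE (default ~/.ssh/config) unless a ServerAliveInterval
# line is already present.
add_keepalive() {
    config="${1:-$HOME/.ssh/config}"
    # Leave the file alone if keepalives are already configured.
    if [ -f "$config" ] && grep -qi '^[[:space:]]*ServerAliveInterval' "$config"; then
        return 0
    fi
    # ServerAliveInterval: seconds of silence before the client probes the server.
    # ServerAliveCountMax: unanswered probes tolerated before disconnecting.
    printf '\nHost *\n    ServerAliveInterval 60\n    ServerAliveCountMax 3\n' >> "$config"
}
```

With these settings an idle session should stay open while the node is healthy, and a session to a node that has locked up should drop after roughly three minutes instead of hanging indefinitely.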
Neo Zhou
Jul 28, 2021, 4:27:09 PM
to cloudlab-users
Hi Leigh,
Ok, good to know. After I get "client_loop: send disconnect: Broken pipe" on my end, when I try to reconnect again via ssh, I don't get any response.
Leigh Stoller
Jul 28, 2021, 4:28:07 PM
> Ok, good to know. After I get "client_loop: send disconnect: Broken pipe" on my end, when I try to reconnect again via ssh, I don't get any response.
OK, look at the console next time, maybe log into it and see if you
can spot something odd?
Leigh
Message has been deleted
Leigh Stoller
Jul 28, 2021, 4:36:24 PM
> Ok, sounds good, I'll take a look at that. Not sure if this is noteworthy, but when I try to open a shell via cloudlab, I get the message attached. Is this typical?
Looks like you got logged out. Reload the page, log in.
Leigh
Neo Zhou
Jul 28, 2021, 5:20:58 PM
to cloudlab-users
Whoops, sorry about that, and thanks for your patience. I logged in via the console and the machine did eventually freeze again. When I looked at the console logs, I didn't see anything out of the ordinary, just a blank prompt.
Neo Zhou
to cloudlab-users
I could tell that it froze since anything I typed didn't register on the screen.
Mike Hibler
Jul 28, 2021, 5:44:26 PM
to cloudla...@googlegroups.com
Yes, the console is unresponsive in these cases as well. The management
interface works, and doesn't reveal anything wrong.
Leigh Stoller
> After some testing, it seems that the cloudlab disk image of Ubuntu 20.04 might be causing the problems. After 10-15 minutes of inactivity, the machine locks without me installing anything.
Hi, do you have a current experiment that is locked up?
Thanks
Leigh
Neo Zhou
Aug 2, 2021, 10:38:43 AM
to cloudlab-users
Hi, after switching from a disk image of Ubuntu 20.04 to 16.04, I no longer have issues with locking.
David M Johnson
Aug 2, 2021, 11:54:10 AM
to cloudla...@googlegroups.com
I glanced briefly at one of your experiments over the weekend. Didn't
have enough time to be conclusive, but the problem I saw was that some
package install was pulling in NetworkManager, and we've seen that with
some nvidia or cuda package before. NetworkManager effectively undoes
part of our control network configuration and will end up disabling the
control network. You probably won't be able to block NetworkManager
from being installed, but you can prevent it from starting up. Here is
one coarse method to do that (run before you install the package that
requires NetworkManager):
I did this on a node running UBUNTU20-64-STD, then installed
network-manager, and it stayed up fine, of course.
David
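The commands from the message above are not preserved in this copy of the thread. As a rough sketch only, and not necessarily the method David used, one common way to keep a systemd service from starting is to mask its unit before the package is installed. This assumes a systemd-based Ubuntu image where the unit is named NetworkManager.service; the helper name mask_unit and its optional directory argument are hypothetical.

```shell
#!/bin/sh
# mask_unit UNIT [UNIT_DIR]
# Hypothetical helper: masks a systemd unit by symlinking its name to
# /dev/null in UNIT_DIR (default /etc/systemd/system). systemd treats
# such a unit as masked and refuses to start it, even as a dependency.
# Writing to the default directory requires root.
mask_unit() {
    unit="$1"
    unit_dir="${2:-/etc/systemd/system}"
    ln -sfn /dev/null "$unit_dir/$unit"
}
```

On a real node, running sudo systemctl mask NetworkManager.service has the same effect and is the more usual form; sudo systemctl unmask NetworkManager.service reverses it.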
Neo Zhou
to cloudlab-users
Hi David,
Thank you, I really appreciate your help with this issue. Interestingly, after switching to Ubuntu 18.04, downloading all of my required dependencies, and making a disk image and image dataset, I have yet to encounter any issues with locking, or the mounting problem I was originally having. It's possible that the CUDA dependencies designed for Ubuntu 18.04 do not use NetworkManager. If any issues arise, I'll be sure to use your modification. Thanks again for the help!