Experiment gets stuck in booting phase and fails eventually


Umakant Kulkarni

Feb 15, 2022, 6:44:29 PM
to cloudlab-users
Hi,

I'm trying to create a simple experiment, but it gets stuck in the booting phase and eventually fails. This has happened four times since this morning. The same profile worked earlier. I tried different nodes at different locations, but no luck. Is anyone else facing a similar issue?

Thank you for any and all help,
Umakant

Leigh Stoller

Feb 15, 2022, 6:55:09 PM
to cloudla...@googlegroups.com

> I'm trying to create a simple experiment, but it gets stuck in the booting phase and eventually fails. This has happened four times since this morning. The same profile worked earlier. I tried different nodes at different locations, but no luck. Is anyone else facing a similar issue?

Hi. The first thing you can do is tell us which experiment and which node,
and give us a link to the status page of a failed experiment.
You know, details that make it possible for us to assist you. :-)

Leigh

Umakant Kulkarni

Feb 15, 2022, 7:15:57 PM
to cloudlab-users
Hi,

[Reply-all]

It is an experiment consisting of a single node, amd275, of type c6525-100g at the CloudLab Utah site. It uses the sfc_profile2 profile with a custom disk image (urn:publicid:IDN+utah.cloudlab.us+image+sfcs-PG0:sfc_u20_5g_k8s). The experiment also requests 200GB of additional temporary storage.

Here is a link to the experiment, which is currently stuck in the booting phase: https://www.cloudlab.us/status.php?uuid=3effd130-8eb6-11ec-b318-e4434b2381fc

Please let me know if you need any additional details.

Thanks,
Umakant

Umakant Kulkarni

Feb 15, 2022, 7:44:37 PM
to cloudlab-users
It just failed again with the following error:
Experiment setup on the Cloudlab Utah cluster failed: SliverStart: Unable to OS setup nodes

Name of the profile and experiment: 'sfc_profile_2/umakant-118110'.

Mike Hibler

Feb 15, 2022, 8:03:51 PM
to cloudla...@googlegroups.com
It looks like the image file is missing and we are failing in a not at
all obvious way. I would guess the image metadata still exists, but the
actual file got deleted from /proj/sfcs-PG0/images. The mod time on the
directory is 10:27am MST, so about 7 hours ago.

Any idea how that might have happened? Note that any node in any experiment
in the SFCs project would have access to /proj/sfcs-PG0, so someone might
have been attempting to clean up space.

The image is still in a snapshot from last night, so I have copied it back
in place.
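
For anyone who wants to check for this kind of problem from a node in their own project, here is a minimal sketch in Python. It only lists the files in a project image directory and reports which expected ones are absent. The function names and the example file name are hypothetical; the actual on-disk name of the image is not given in this thread, and this is not how the CloudLab tooling itself checks images.

```python
from pathlib import Path


def describe_image_dir(image_dir):
    """Return the directory's mod time and the regular files it contains."""
    root = Path(image_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"{root} is not a directory")
    # (name, size in bytes) for every regular file directly in the directory
    files = sorted(
        (p.name, p.stat().st_size) for p in root.iterdir() if p.is_file()
    )
    return {"dir_mtime": root.stat().st_mtime, "files": files}


def missing_images(image_dir, expected_names):
    """List the expected file names that are absent from image_dir."""
    present = {name for name, _size in describe_image_dir(image_dir)["files"]}
    return [name for name in expected_names if name not in present]


if __name__ == "__main__":
    # Assumption: "example-image.ndz" is a placeholder, not the real file name.
    try:
        print(missing_images("/proj/sfcs-PG0/images", ["example-image.ndz"]))
    except FileNotFoundError as err:
        print(f"cannot check: {err}")
```

Running something like this before and after a cleanup script would show whether the script removed anything from the image directory.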

Umakant Kulkarni

Feb 15, 2022, 8:21:08 PM
to cloudlab-users
Oh, okay. I was working on the experiment this morning and was running some scripts.

I fear they might have done something like this. In any case, I will refrain from running them and will check again.

Thanks again, Mike, for restoring the image from the snapshot.

Best,
Umakant


Umakant Kulkarni

Feb 15, 2022, 9:00:57 PM
to cloudlab-users
I restarted the experiment, but it looks like it is still getting stuck in the same booting phase. Any idea why?

Umakant Kulkarni

Feb 15, 2022, 9:07:32 PM
to cloudlab-users
Please ignore; it's up now!