Experiment destroyed

Ben Greenman

Aug 24, 2022, 10:35:24 PM
to cloudlab-users
Hi folks,

I started a big experiment at 7pm today with 115 m510 nodes. Around
9pm, it got destroyed. I don't know what happened.

Is there some way to find out why the experiment went down?

https://www.cloudlab.us/memlane.php?uuid=eba6fcf2-2402-11ed-b318-e4434b2381fc

Ben

Mike Hibler

Aug 25, 2022, 10:51:40 AM
to cloudla...@googlegroups.com
The problem is that the experiment did not set up completely: one of the
nodes failed. This can take a long time with a large experiment. There are
two ways to avoid this. One is a checkbox that should be on the instantiate
path, labeled "Ignore errors"; you can check that. The other is an option
you can set on each node in your profile to tell it to ignore failures to
boot correctly:

node.setFailureAction('nonfatal')

With these, the experiment will finish successfully even if one or more
nodes don't boot properly. You may not be able to use those nodes during
the experiment, but maybe that is okay for your scenario.
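
For a profile with many nodes, the per-node option can be applied in a loop. The following is only a minimal sketch, not a tested profile: the helper name `add_tolerant_nodes` and the node names `node0`, `node1`, ... are made up for illustration, and it assumes `request` is a geni-lib request object whose `RawPC()` returns nodes that accept `hardware_type` and `setFailureAction()`.

```python
def add_tolerant_nodes(request, count, hardware_type="m510"):
    """Add `count` raw PCs to `request`, each marked so that a failure
    to boot does not fail the whole experiment.

    `request` is assumed to be a geni-lib request RSpec object.
    The function and node names here are illustrative only.
    """
    nodes = []
    for i in range(count):
        node = request.RawPC("node%d" % i)
        node.hardware_type = hardware_type
        # The option from the message above: a boot failure on this
        # node is treated as non-fatal for the experiment.
        node.setFailureAction('nonfatal')
        nodes.append(node)
    return nodes


# In a CloudLab profile script this would typically be driven by
# geni-lib, roughly as follows (not run here):
#
#   import geni.portal as portal
#   pc = portal.Context()
#   request = pc.makeRequestRSpec()
#   add_tolerant_nodes(request, 115)
#   pc.printRequestRSpec(request)
```

Nodes that fail to boot may still be unusable during the experiment; the setting only keeps their failure from being fatal to setup.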

Ben Greenman

Aug 25, 2022, 11:00:53 AM
to cloudla...@googlegroups.com
Thanks Mike. Ignoring failures is fine for my experiment.

In fact, that's what I thought I was doing yesterday! The experiment
seemed to set up fine except for the failed node, and I'd connected
to each of the others to kick things off. It was a big surprise to see
the whole thing go down a few hours later.