Halting but not done (VMDIRAC/sequential user jobs with one pilot)

Daniela Bauer

Sep 24, 2021, 11:52:06 AM

to diracgrid-forum

Hi All,

Before you say "it's a VMDIRAC issue, so it's your problem", hear me out, please.

While testing the Python 3 transition of VMDIRAC we came across an issue where machines would go into the running state, but the pilot monitor would never show them as 'Done', even when the machine was long gone. Simon's analysis of the situation was:

"It looks to me like the VM stop policy is probably elastic as that's the
default:

self.vmStopPolicy = self.op.getValue("Cloud/%s/VMStopPolicy", 'elastic')
(Yes, that's an unexpanded %s, not sure it's meant to be there, but it
always has been).

What I think happens is that the VM runs the job, it then sits waiting for
other jobs, the load drops below 0.01 and the VMM requests that the
instance be removed (which is what elastic means). The instance getting
killed causes the pilot to never report back the final "Done" message as
it's still looking for jobs at the time.

This used to work because the pilot would only request one job, so the node
would never be idle for any length of time."
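
The sequence Simon describes can be sketched as a small stand-alone simulation. Nothing below is VMDIRAC code: `FakeOperations`, `Pilot`, `should_halt`, the endpoint name `MyEndpoint` and the policy value `never` are hypothetical stand-ins. Only the option path, the 'elastic' default and the 0.01 load threshold are taken from the analysis above.

```python
# Stand-alone sketch with no DIRAC imports. All class and function names are hypothetical.

LOAD_THRESHOLD = 0.01  # load below which an 'elastic' VM is removed (from the analysis)


class FakeOperations:
    """Minimal stand-in for a config helper with a getValue(path, default) method."""

    def __init__(self, options):
        self.options = options

    def getValue(self, path, default):
        return self.options.get(path, default)


class Pilot:
    """Tracks the status the pilot monitor would show."""

    def __init__(self):
        self.status = "Running"
        self.alive = True

    def finish(self):
        # Only a pilot that is still alive can send its final report.
        if self.alive:
            self.status = "Done"


def should_halt(stop_policy, load):
    """An 'elastic' policy removes the VM as soon as it is idle."""
    return stop_policy == "elastic" and load < LOAD_THRESHOLD


# Hypothetical configuration in which the endpoint sets a non-default policy.
ops = FakeOperations({"Cloud/MyEndpoint/VMStopPolicy": "never"})

# The path still contains a literal "%s", so in this sketch the lookup misses
# and the default is returned.
unexpanded = ops.getValue("Cloud/%s/VMStopPolicy", "elastic")
expanded = ops.getValue("Cloud/%s/VMStopPolicy" % "MyEndpoint", "elastic")

pilot = Pilot()
idle_load = 0.0  # the job has finished and the pilot is waiting for the next one
if should_halt(unexpanded, idle_load):
    pilot.alive = False  # the instance is removed while the pilot is still polling
pilot.finish()  # the final report is never sent, so the status stays "Running"
```

With a pilot that requests only one job, `finish()` would run before the load ever dropped, which matches the remark that this used to work.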

Does anyone have an elegant idea for how we could fix this without breaking too much in the process? As far as I can tell it doesn't affect the running of the jobs, but it does screw up the reporting.

Cheers,

Daniela

André Sailer

Sep 24, 2021, 11:59:16 AM

to Daniela Bauer, diracgrid-forum

Hi Daniela,

Please consider using https://github.com/DIRACGrid/DIRAC/discussions

Why doesn't the VMM set the pilot to "Done", when it requests that the
VM is removed?

Cheers,
Andre


Daniela Bauer

Sep 24, 2021, 12:09:49 PM

to André Sailer, diracgrid-forum

Because I really hate git discussions and it's Friday?

As for the more obvious question: how does an external function convince the pilot that it's done?

--

Sent from my guinea pig enhanced living room

-----------------------------------------------------------
daniel...@imperial.ac.uk
HEP Group/Physics Dep
Imperial College
London, SW7 2BW
Tel: Working from home, please use email.
http://www.hep.ph.ic.ac.uk/~dbauer/

Andre Sailer

Sep 24, 2021, 1:12:26 PM

to Daniela Bauer, diracgrid-forum

Maybe I am missing the obvious, but I don't know because it is a VMDIRAC problem.

But something like this?

https://github.com/DIRACGrid/DIRAC/blob/d3d030502c5782727073569843a40cd44311753e/src/DIRAC/WorkloadManagementSystem/Agent/JobAgent.py#L746-L748

Because the VMM ends the pilot processes inside the VM and you only care about the status in the DB, right?
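
The suggestion above, having the VMM record the final status itself when it removes the instance, could look roughly like the sketch below. It is a stand-alone illustration under stated assumptions: `PilotStatusStore`, `set_pilot_status`, `halt_instance` and the pilot reference string are hypothetical names, not the DIRAC or VMDIRAC API, and the code at the linked JobAgent lines is not reproduced here.

```python
# Stand-alone sketch with no DIRAC imports. All names are hypothetical.


class PilotStatusStore:
    """In-memory stand-in for the pilot status database."""

    def __init__(self):
        self._status = {}

    def set_pilot_status(self, pilot_ref, status, reason=""):
        self._status[pilot_ref] = (status, reason)

    def get_pilot_status(self, pilot_ref):
        return self._status[pilot_ref][0]


def halt_instance(store, pilot_ref, running_jobs):
    """Hypothetical VMM hook called when the stop policy asks for removal.

    Refuses to halt while a job is still running; otherwise records the
    final status on behalf of the pilot before the instance goes away.
    """
    if running_jobs > 0:
        return False
    store.set_pilot_status(pilot_ref, "Done", reason="Instance halted by VM manager")
    # The actual request to remove the instance would follow here.
    return True


store = PilotStatusStore()
store.set_pilot_status("vm://example/instance-1", "Running")
halted = halt_instance(store, "vm://example/instance-1", running_jobs=0)
```

Writing the status from the VMM side means the result no longer depends on the pilot surviving long enough to report, which is the part that fails under the elastic policy.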

