Hi All,
before you say "it's a VMDIRAC issue, so it's your problem", hear me out, please.
While testing the Python 3 transition of VMDIRAC we came across the issue that machines go into the running state, but the pilot monitor then never shows them as 'Done', even when the machine is long gone. Simon's analysis of the situation was:
"It looks to me like the VM stop policy is probably elastic as that's the
default:
self.vmStopPolicy = self.op.getValue("Cloud/%s/VMStopPolicy", 'elastic')
(Yes, that's an unexpanded %s, not sure it's meant to be there, but it
always has been).
What I think happens is that the VM runs the job, it then sits waiting for
other jobs, the load drops below 0.01 and the VMM requests that the
instance be removed (which is what elastic means). The instance getting
killed causes the pilot to never report back the final "Done" message as
it's still looking for jobs at the time.
This used to work because the pilot would only request one job, so the node
would never be idle for any length of time."
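To make the suspected sequence concrete, here is a toy Python sketch of it. It is not VMDIRAC code: the function name, its arguments and the status strings are made up for illustration, and the only value taken from the analysis above is the 0.01 load threshold.

```python
# Toy model only, NOT VMDIRAC code. All names here are hypothetical.

IDLE_LOAD_THRESHOLD = 0.01  # load figure quoted in the analysis above


def last_reported_status(max_jobs_per_pilot, jobs_available):
    """Return the last status the pilot manages to report from a VM
    that is running under an 'elastic' stop policy."""
    status = "Running"
    jobs_run = 0
    while jobs_run < max_jobs_per_pilot:
        if jobs_available == 0:
            # The pilot is still polling for work, so the VM sits idle.
            load = 0.0
            if load < IDLE_LOAD_THRESHOLD:
                # Elastic policy: the instance is removed while the pilot
                # is waiting, so the final "Done" is never sent.
                return status
        # A job is available: run it and count it.
        jobs_available -= 1
        jobs_run += 1
    # The pilot reached its job limit and can exit cleanly.
    return "Done"


# Old behaviour (assumed): one job per pilot, so the pilot finishes first.
print(last_reported_status(max_jobs_per_pilot=1, jobs_available=1))
# New behaviour (assumed): the pilot keeps asking and the VM goes idle.
print(last_reported_status(max_jobs_per_pilot=5, jobs_available=1))
```

In this simplified picture the single-job pilot ends on "Done", while the multi-job pilot is cut off while still "Running", which matches what we see in the pilot monitor.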
Does anyone have an elegant idea on how we could fix this without breaking too much in the process? As far as I can tell it doesn't affect the running of the jobs, but it does screw up the reporting.
Cheers,
Daniela