IIUC, when you suspend a job it remains in memory but with no CPU time
allocated. If you reboot the node, the job state is lost (unless it uses
checkpointing). When you restarted the jobs, they actually began a new
run (Slurm doesn't know if they use checkpointing or not). You've been
lucky that your jobs seem to use checkpointing...
The procedure we follow when a node reboot is required is to create
a reservation (or drain the nodes), let the jobs run until completion or
time limit, and once the nodes are free we reboot 'em.
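In case it helps, here is a rough sketch of the drain variant of that procedure. It only builds and prints the commands, it doesn't run anything. The node list "node[01-04]", the reason text and the helper names are made up for illustration, and the last step assumes a RebootProgram is configured in slurm.conf (otherwise reboot the nodes by hand once they are idle).

```python
import shlex


def drain_command(nodes, reason="reboot pending"):
    """Command that stops new jobs from starting on the nodes (running jobs continue)."""
    return ["scontrol", "update", f"NodeName={nodes}", "State=DRAIN", f"Reason={reason}"]


def running_jobs_command(nodes):
    """Command that lists the job IDs still on the nodes (empty output = nodes are free)."""
    return ["squeue", "--noheader", "--nodelist", nodes, "--format", "%i"]


def reboot_command(nodes):
    """Command that asks Slurm to reboot the nodes and return them to service afterwards.

    Assumption: RebootProgram is set in slurm.conf.
    """
    return ["scontrol", "reboot", "nextstate=RESUME", nodes]


def reboot_plan(nodes, reason="reboot pending"):
    """Ordered steps: drain, check for leftover jobs, then reboot."""
    return [
        drain_command(nodes, reason),
        running_jobs_command(nodes),
        reboot_command(nodes),
    ]


if __name__ == "__main__":
    # Hypothetical node list; replace with your own.
    for cmd in reboot_plan("node[01-04]"):
        print(shlex.join(cmd))
```

Run the second command repeatedly (or just watch squeue) and only go on to the reboot once it prints nothing. With a reservation instead of a drain the idea is the same, only the first step changes.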
Diego
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
--
slurm-users mailing list -- slurm...@lists.schedmd.com
To unsubscribe send an email to slurm-us...@lists.schedmd.com