That can help. Usually this happens due to laggy storage the job is
using taking time flushing the job's data. So making sure that your
storage is up, responsive, and stable will also cut these down.
-Paul Edmon-
On 11/30/2020 12:52 PM, Robert Kudyba wrote:
> I've seen where this was a bug that was fixed
> https://urldefense.proofpoint.com/v2/url?u=https-3A__bugs.schedmd.com_show-5Fbug.cgi-3Fid-3D3941&d=DwIDaQ&c=aqMfXOEvEJQh2iQMCb7Wy8l0sPnURkcqADc2guUW8IM&r=X0jL9y0sL4r4iU_qVtR3lLNo4tOL1ry_m7-psV3GejY&m=uhj_tXWcDUyyhKZogEh3zXEjkcPHj3Yzkzh7dZnMLiI&s=Chhfs3vBdTd3SG3KKgQmrBf3W_B6tjn5lP4qS-YRrh8&e=
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__bugs.schedmd.com_show-5Fbug.cgi-3Fid-3D3941&d=DwIDaQ&c=aqMfXOEvEJQh2iQMCb7Wy8l0sPnURkcqADc2guUW8IM&r=X0jL9y0sL4r4iU_qVtR3lLNo4tOL1ry_m7-psV3GejY&m=uhj_tXWcDUyyhKZogEh3zXEjkcPHj3Yzkzh7dZnMLiI&s=Chhfs3vBdTd3SG3KKgQmrBf3W_B6tjn5lP4qS-YRrh8&e= > but this happens
Hello Robert,
I've been having the same issue with BCM, CentOS 8.2 BCM 9.0 Slurm 20.02.3. It seems to have started to occur when I enabled proctrack/cgroup and changed select/linear to select/con_tres.
Are you using cgroup process tracking and have you manipulated the cgroup.conf file? Do jobs complete correctly when not cancelled?
Regards,
Willy Markuske |
|
HPC Systems Engineer |
![]() |
Research Data Services |
|
been having the same issue with BCM, CentOS 8.2 BCM 9.0 Slurm 20.02.3. It seems to have started to occur when I enabled proctrack/cgroup and changed select/linear to select/con_tres.
Are you using cgroup process tracking and have you manipulated the cgroup.conf file?
Do jobs complete correctly when not cancelled?