[slurm-users] Slurmdbd purge and reported downtime


Davide DelVento via slurm-users

Aug 22, 2024, 3:33:40 PM
to Slurm User Community List
I am confused by the amounts of Down and PLND Down time reported by sreport. According to it, our cluster would have had a significant amount of downtime, which I know didn't happen (downtime being, per the documentation, "time that slurmctld was not responding"; see https://slurm.schedmd.com/sreport.html)

Could my purge settings be causing this problem? How can I check (maybe in some logs, maybe going forward) whether slurmctld was actually not responding? The expected long-term numbers should be less than the ones reported for last month, when we had an issue with a few nodes...
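One way I could cross-check this, I think, is to list the node events that slurmdbd recorded for the reporting window, assuming they haven't been purged yet (the cluster name and dates below are placeholders):

```shell
# List node down/drain events recorded by slurmdbd for the reporting window.
# "cluster" and the dates are placeholders; adjust to the real values.
sacctmgr show event cluster=cluster start=2024-02-01 end=2024-08-21 \
    format=NodeName,TimeStart,TimeEnd,State,Reason

# Summing TimeEnd-TimeStart over these events should roughly match the
# Down time sreport charges for the same period.
```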

Thanks!


[davide@login ~]$ grep Purge /opt/slurm/slurmdbd.conf
#JobPurge=12
#StepPurge=1
PurgeEventAfter=1month
PurgeJobAfter=12month
PurgeResvAfter=1month
PurgeStepAfter=1month
PurgeSuspendAfter=1month

[davide@login ~]$ sreport -t percent -T cpu,mem cluster utilization start=2/1/22
--------------------------------------------------------------------------------
Cluster Utilization 2022-02-01T00:00:00 - 2024-08-21T23:59:59
Usage reported in Percentage of Total
--------------------------------------------------------------------------------
  Cluster      TRES Name       Allocated            Down      PLND Down             Idle   Planned         Reported
--------- -------------- --------------- --------------- -------------- ---------------- --------- ----------------
  cluster            cpu          19.50%          12.07%          3.92%           64.36%     0.15%          100.03%
  cluster            mem          16.13%          13.17%          4.56%           66.13%     0.00%           99.99%

[davide@login ~]$    sreport -t percent -T cpu,mem cluster utilization start=2/1/23
--------------------------------------------------------------------------------
Cluster Utilization 2023-02-01T00:00:00 - 2024-08-21T23:59:59
Usage reported in Percentage of Total
--------------------------------------------------------------------------------
  Cluster      TRES Name       Allocated            Down      PLND Down            Idle   Planned         Reported
--------- -------------- --------------- --------------- -------------- --------------- --------- ----------------
  cluster            cpu          28.74%          18.80%          6.44%          45.77%     0.24%          100.02%
  cluster            mem          22.52%          20.54%          7.38%          49.55%     0.00%           99.98%

[davide@login ~]$  sreport -t percent -T cpu,mem cluster utilization start=2/1/24
--------------------------------------------------------------------------------
Cluster Utilization 2024-02-01T00:00:00 - 2024-08-21T23:59:59
Usage reported in Percentage of Total
--------------------------------------------------------------------------------
  Cluster      TRES Name      Allocated            Down      PLND Down            Idle  Planned        Reported
--------- -------------- -------------- --------------- -------------- --------------- -------- ---------------
  cluster            cpu         29.92%          24.88%         17.73%          27.45%    0.02%         100.00%
  cluster            mem         20.07%          28.60%         19.57%          31.76%    0.00%         100.00%

[davide@login ~]$  sreport -t percent -T cpu,mem cluster utilization start=8/8/24
--------------------------------------------------------------------------------
Cluster Utilization 2024-08-08T00:00:00 - 2024-08-21T23:59:59
Usage reported in Percentage of Total
--------------------------------------------------------------------------------
  Cluster      TRES Name     Allocated         Down PLND Dow           Idle  Planned       Reported
--------- -------------- ------------- ------------ -------- -------------- -------- --------------
  cluster            cpu        15.96%        2.53%    0.00%         81.51%    0.00%        100.00%
  cluster            mem         9.18%        2.22%    0.00%         88.60%    0.00%        100.00%

[davide@login ~]$  sreport -t percent -T cpu,mem cluster utilization start=7/7/24
--------------------------------------------------------------------------------
Cluster Utilization 2024-07-07T00:00:00 - 2024-08-21T23:59:59
Usage reported in Percentage of Total
--------------------------------------------------------------------------------
  Cluster      TRES Name      Allocated          Down PLND Dow           Idle  Planned       Reported
--------- -------------- -------------- ------------- -------- -------------- -------- --------------
  cluster            cpu         27.07%         2.57%    0.00%         70.34%    0.02%        100.00%
  cluster            mem         17.35%         2.26%    0.00%         80.40%    0.00%        100.00%


Ole Holm Nielsen via slurm-users

Aug 23, 2024, 10:00:02 AM
to slurm...@lists.schedmd.com
Hi Davide,

On 8/22/24 21:30, Davide DelVento via slurm-users wrote:
> I am confused by the reported amount of Down and PLND Down by sreport.
> According to it, our cluster would have had a significant amount of
> downtime, which I know didn't happen (or, according to the documentation
> "time that slurmctld was not responding", see
> https://slurm.schedmd.com/sreport.html)
>
> Could it be my purge settings causing this problem? How can I check (maybe
> in some logs, maybe in the future) if actually slurmctld was not
> responding? The expected long-term numbers should be less than the ones
> reported for last month when we had an issue with a few nodes....

Which version of Slurm are you using? There was an sreport bug that
should be fixed in 23.11: https://support.schedmd.com/show_bug.cgi?id=17689
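Note that the client tools, the controller, and slurmdbd can be at different versions, so it may be worth checking each (a sketch; exact output depends on your installation):

```shell
sinfo --version                               # version of the client tools
scontrol show config | grep -i SLURM_VERSION  # version the controller reports
slurmdbd -V                                   # version of the slurmdbd binary
```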

/Ole



--
slurm-users mailing list -- slurm...@lists.schedmd.com
To unsubscribe send an email to slurm-us...@lists.schedmd.com

Davide DelVento via slurm-users

Aug 23, 2024, 1:06:45 PM
to Ole Holm Nielsen, slurm...@lists.schedmd.com
Thanks Ole,
this is very helpful; I was unaware of that issue. From the bug report, it's not clear to me whether it was just an sreport (display) issue or whether the problem was in how the data was stored.

In fact, I am running 23.11.5, which I installed in April. The numbers I see for the last few months (including April) are fine; the earlier numbers (from when I was running an earlier version) are the ones affected by this problem. So if the issue was in how the data was stored, that explains it, and I can live with it (even though I can't provide an accurate report for my management now), knowing that the problem won't happen again in the future.

Thanks and have a great weekend