[slurm-users] enabling job script archival

Davide DelVento

Sep 28, 2023, 1:41:23 PM
to Slurm User Community List
In my current slurm installation (recently upgraded to slurm v23.02.3), I only have

AccountingStoreFlags=job_comment

I now intend to add both

AccountingStoreFlags=job_script
AccountingStoreFlags=job_env

leaving the default 4MB value for max_script_size
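
In other words (if I understand the slurm.conf syntax correctly, AccountingStoreFlags being a comma-separated list), the combined line would presumably be:

AccountingStoreFlags=job_comment,job_script,job_env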

Do I need to do anything on the DB myself, or will slurm take care of the additional tables if needed? 

Any comments/suggestions/gotchas/pitfalls/horror stories to share? I know about the additional disk space and the potential extra load, and with our resources and typical workload I should be okay with that.

Thanks!

Paul Edmon

Sep 28, 2023, 1:49:22 PM
to slurm...@lists.schedmd.com
Slurm should take care of it when you add it.

As far as horror stories go, under previous versions our database size
ballooned to be so massive that it actually prevented us from upgrading,
and we had to drop the columns containing the job_script and job_env.
This was back before slurm started hashing the scripts so that it would
only store one copy of duplicate scripts. After that point we found
that the job_script data stayed at a fairly reasonable size, as most
users use functionally the same script each time. However, the job_env
data continued to grow like crazy, as there are variables in our environment
that change fairly consistently depending on where the user is. Thus the
job_envs ended up being too massive to keep around and we had to drop
them. Frankly, we never really used them for debugging. The job_scripts,
though, are super useful and not that much overhead.

In summary, my recommendation is to store only job_scripts. job_envs add
too much storage for little gain, unless your job_envs are basically the
same for each user in each location.
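
In slurm.conf terms (assuming you keep job_comment as well), that recommendation would look something like:

AccountingStoreFlags=job_comment,job_script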

Also, it should be noted that there is currently no way to prune out
job_scripts or job_envs. So the only way to get rid of them if they get
large is to zero out the column in the table. You can ask SchedMD for the
mysql command to do this, as we had to do that here for our job_envs.

-Paul Edmon-

Ryan Novosielski

Sep 28, 2023, 1:56:01 PM
to Slurm User Community List
Thank you; we'll put in a feature request for improvements in that area, and also thanks for the warning. I had thought of that in passing, but the real-world experience is really useful. I could easily see wanting that stuff to be retained for less time than the main records, which is what I'd ask for.

I assume that archiving, in general, would also remove this stuff, since old jobs themselves will be removed?

--
#BlackLivesMatter
____
|| \\UTGERS,     |---------------------------*O*---------------------------
||_// the State  |         Ryan Novosielski - novo...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
     `'

Ryan Novosielski

Sep 28, 2023, 1:58:22 PM
to Slurm User Community List
Sorry for the duplicate e-mail in a short time: does anyone know when the hashing was added? We were planning to enable this on 21.08, but then had to delay our upgrade to it. I'm assuming the hashing came later than that, as I believe 21.08 is when the storage feature itself was added.

Paul Edmon

Sep 28, 2023, 2:00:15 PM
to slurm...@lists.schedmd.com

No, all the archiving does is remove the pointer. What slurm does right now is create a hash of the job_script/job_env and then check whether that hash matches one already on record. If not, it adds it to the record; if it does match, it adds a pointer to the appropriate existing record. So you can think of the job_script/job_env storage as an internal database of all the various scripts and envs that slurm has ever seen, and what ends up in the job record is a pointer into that database. This way slurm can deduplicate scripts/envs that are the same. This works great for job_scripts, as they are functionally the same and thus you have many jobs pointing to the same script, but less so for job_envs.
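
For what it's worth, from the user side sacct follows that pointer for you; the flags below are the ones in recent releases (check your version's man page):

sacct -j <jobid> --batch-script   # print the stored batch script for the job
sacct -j <jobid> --env-vars       # print the stored job environment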

-Paul Edmon-

Paul Edmon

Sep 28, 2023, 2:04:00 PM
to slurm...@lists.schedmd.com

Yes, it was later than that. If you are on 23.02 you are good. We've been running with job_script storage on for years at this point, and that part of the database only uses up 8.4G. Our entire database takes up 29G on disk, so it's about 1/3 of the database. We also have database compression, which helps with the on-disk size; raw and uncompressed, our database is about 90G. We keep 6 months of data in our active database.
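
If you want to see where the space is going in your own database, something like this should show the per-table footprint (assuming the default database name slurm_acct_db and that you can run the mysql client with admin rights; adjust names as needed):

mysql -e "SELECT table_name,
                 ROUND((data_length + index_length)/1024/1024/1024, 2) AS size_gb
          FROM information_schema.tables
          WHERE table_schema = 'slurm_acct_db'
          ORDER BY size_gb DESC;"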

-Paul Edmon-

Davide DelVento

Sep 29, 2023, 7:50:02 AM
to Slurm User Community List
Fantastic, this is really helpful, thanks!

Davide DelVento

Oct 2, 2023, 10:58:31 AM
to Slurm User Community List
I deployed the job_script archival and it is working; however, it can be queried only by root.

A regular user can run sacct -lj against any job (even those by other users, and that's okay in our setup) with no problem. However, if they run sacct -j job_id --batch-script, even against a job they own themselves, nothing is returned and I get a

slurmdbd: error: couldn't get information for this user (null)(xxxxxx)

in the slurmdbd logs, where xxxxxx is the POSIX ID of the user running the query.

Neither config file (slurmdbd.conf or slurm.conf) has any "permission" setting. FWIW, we use LDAP.

Is that the expected behavior, in that by default only root can see the job scripts? I was assuming the users themselves should be able to debug their own jobs... Any hint on what could be changed to achieve this?

Thanks!


Paul Edmon

Oct 2, 2023, 11:07:59 AM
to slurm...@lists.schedmd.com

At least in our setup, users can see their own scripts by doing sacct -B -j JOBID

I would make sure that the scripts are actually being stored, and check how you have PrivateData set.
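
A quick way to check both, as root or the SlurmUser (the paths below are the common defaults; adjust for your installation):

# confirm the script really made it into the database
sacct -j <jobid> --batch-script

# see which PrivateData settings are configured / in effect
scontrol show config | grep -i PrivateData
grep -i PrivateData /etc/slurm/slurm.conf /etc/slurm/slurmdbd.conf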

-Paul Edmon-

Davide DelVento

Oct 2, 2023, 11:21:52 AM
to Slurm User Community List
Thanks Paul, this helps.

I don't have any PrivateData line in either config file. According to the docs, "By default, all information is visible to all users", so this should not be an issue. I tried adding a "PrivateData=jobs" line to the conf files, just in case, but that didn't change the behavior.

Davide DelVento

Oct 3, 2023, 9:02:48 AM
to Slurm User Community List
By increasing the slurmdbd verbosity level, I got additional information, namely the following:

slurmdbd: error: couldn't get information for this user (null)(xxxxxx)
slurmdbd: debug: accounting_storage/as_mysql: as_mysql_jobacct_process_get_jobs: User  xxxxxx  has no associations, and is not admin, so not returning any jobs.

again, where xxxxxx is the POSIX ID of the user running the query, as seen in the slurmdbd logs.

I suspect this is due to the fact that our user base is small enough (we are a departmental HPC) that we don't need to use allocations and the like, so I have not configured any associations (and have not even studied their configuration: at my previous site, which did use associations, someone else took care of slurm administration).

Anyway, I read the fantastic document by one of our own members at https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_accounting/#associations and in fact I have not even configured any slurm users:

# sacctmgr show user
      User   Def Acct     Admin
---------- ---------- ---------
      root       root Administ+
#

So is that the issue? Should I just add all users? Any suggestions on the minimal (but robust) way to do that?

Thanks!

Paul Edmon

Oct 3, 2023, 9:42:55 AM
to slurm...@lists.schedmd.com

You will probably need to.

The way we handle it is that we add users when they first submit a job, via the job_submit.lua script. This way the database autopopulates with active users.

-Paul Edmon-

Davide DelVento

Oct 3, 2023, 7:44:54 PM
to Slurm User Community List
For others who find this via a mailing-list search: yes, I needed that, which of course required creating a charge account, something I wasn't otherwise using. So I ran

sacctmgr add account default_account
sacctmgr add -i user $user Accounts=default_account

with an appropriate loop over $user (a sketch of which is below), and everything is working fine now.
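
For reference, the loop was nothing fancy; a sketch, assuming regular users have UID >= 1000 and are enumerated from getent passwd (adjust the cutoff and source for your site):

# add every regular user to the default account
getent passwd | awk -F: '$3 >= 1000 {print $1}' | while read -r user; do
    sacctmgr add -i user "$user" Accounts=default_account
done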

Thanks everybody!

Davide DelVento

Oct 4, 2023, 9:48:35 PM
to Slurm User Community List
And weirdly enough, it has now stopped working again, after I did the power-save experimentation described in the other thread.
That is really strange. At the highest verbosity level the logs just say

slurmdbd: debug:  REQUEST_PERSIST_INIT: CLUSTER:cluster VERSION:9984 UID:1457 IP:192.168.2.254 CONN:13

I reconfigured and reverted things, with no change. Does anybody have any clue?

Davide DelVento

Oct 5, 2023, 9:06:42 AM
to Slurm User Community List
Okay, so perhaps this is another bug. At each reconfigure, users lose access to the jobs they submitted before the reconfigure itself and start with a "clean slate". Newly submitted jobs can be queried normally. The slurm administrator can query everything at all times, so the data is not lost, but this is really unfortunate...

Has anybody experienced this issue, or could anyone try querying some of their old jobs that completed before a reconfigure and confirm whether this is happening for them too?
Does anybody know whether this is already a known bug, and/or should I go ahead and submit one?

Thanks!