Implementation Suggestion: job not running due to some SLURM parameters


Michela Benazzi

Jul 16, 2025, 11:31:38 AM
to cp2k
Good morning,

I recently noticed that I was not allocating nearly enough memory for my jobs (25 GB allocated, while about 32 GB was actually being used).

Those jobs failed without any trace - no output, restart, log, error files at all.

Is there any way (or could one be implemented, if it does not exist yet) to add a feature where:

1. The job outputs information up to the point where the memory capacity is exceeded
2. There is an error message explaining the cause in the .out or .err files
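(As an aside for anyone hitting the same problem: when SLURM accounting is enabled on the cluster, `sacct` can show after the fact whether a job was killed for exceeding its memory request and how much memory it peaked at. Below is a minimal sketch; the job id is a placeholder and the `to_mb` helper name is illustrative, and it handles integer sizes only:)

```shell
#!/bin/bash
# Sketch: compare a finished job's peak memory (MaxRSS) against the request.
# sacct reports sizes with K/M/G suffixes; to_mb normalizes them to MB.

to_mb() {
  local v=$1
  case $v in
    *K) echo $(( ${v%K} / 1024 )) ;;   # kilobytes -> MB
    *M) echo "${v%M}" ;;               # already MB
    *G) echo $(( ${v%G} * 1024 )) ;;   # gigabytes -> MB
    *)  echo "$v" ;;                   # no suffix: pass through unchanged
  esac
}

# Usage on a real cluster (job id is a placeholder):
#   sacct -j 1234567 --format=JobID,State,ExitCode,MaxRSS,ReqMem -P
# A State of OUT_OF_MEMORY, or MaxRSS close to ReqMem, points to the
# 25 GB allocated vs 32 GB used situation described above.
to_mb 32G    # prints 32768
```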

Thank you!

Michela

Krack Matthias

Jul 16, 2025, 12:01:08 PM
to cp...@googlegroups.com

Hi Michela

You can try to activate the TRACE input key in the &GLOBAL section (or TRACE_MASTER if the output becomes messy because of the large number of MPI ranks).
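(For reference, a minimal &GLOBAL section with tracing enabled might look like the sketch below; the project name and run type are placeholders:)

```
&GLOBAL
  PROJECT   my_project    ! placeholder name
  RUN_TYPE  MD
  TRACE     T             ! log routine entries/exits
  ! TRACE_MASTER T        ! alternative: limit trace output to the master rank
&END GLOBAL
```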

HTH

Matthias


Michela Benazzi

Jul 16, 2025, 4:30:02 PM
to cp2k
Thank you, Dr. Krack,

I will give it a try! 

Michela

Michela Benazzi

Jul 21, 2025, 1:55:43 PM
to cp2k
Hello Dr. Krack and CP2K community,

I hope you are doing well! I have two additional questions after my jobs again failed without a trace - SLURM exit code 15, which supposedly indicates a problem on the side of the software/application being run.

My improvements so far: I have added TRACE T under &GLOBAL, and have fixed my bash script settings (I was not requesting enough time and memory). I am seeing positive results there.

I wish I could troubleshoot this myself, but because my jobs failed immediately upon starting without leaving a trace, there are no .err files to refer to. Can I please get some assistance? I am not attaching any more input files, because multiple different jobs have failed, so I do not think the input is the issue. My SLURM script is included below my signature for reference.

Thank you,

Michela

#!/bin/bash
#SBATCH -p short
#SBATCH --mem=36G ## memory per node (--mem requests memory per node, not per task)
#SBATCH --job-name=en1
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=7
#SBATCH -N 1 ## containers can only run on one node, and c*n = 56 or 128 (intel nodes top out at 56 cores, zen2 at 128)
#SBATCH --constraint=ib
#SBATCH --time=42:00:00
#SBATCH -o %j.out
#SBATCH -e %j.err
 
# Set the number of OpenMP threads:
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK} ## matches --cpus-per-task (7 here)
 
# The setup_new file contains all the instructions for setting up the correct environment before the user can compile and/or run CP2K:

export PATH=$PATH:/shared/centos7/cp2k/cp2k-6.1.0/data
dir=/home/benazzi.m/BHT/
inputFile=$dir/120h_bulk-1.restart

apptainer run -B /projects:/projects /shared/container_repository/cp2k/cp2k_2024.1_openmpi_generic_psmp.sif mpirun -n 8 --oversubscribe cp2k.psmp -i $inputFile

Krack Matthias

Jul 22, 2025, 4:42:57 AM
to cp...@googlegroups.com

Hi Michela

Although the downloadable CP2K containers (apptainers) may work on many systems, that is not the case for all clusters, given the large variety of cluster and SLURM configurations.

Could you successfully run the apptainer self-test? E.g. with

apptainer run -B /projects:/projects /shared/container_repository/cp2k/cp2k_2024.1_openmpi_generic_psmp.sif run_tests

If that test does not work, the container is not suited for your cluster system, and you should try to build a CP2K binary from scratch using the appropriate compiler and MPI modules installed on your cluster. I recommend asking one of the sysadmins for assistance with that task.

Best

Matthias

Michela Cavalieri

Jul 24, 2025, 12:47:59 PM
to cp2k
Hello Dr. Krack,

I found that the issue was the TRACE T keyword that I recently added under &GLOBAL (see earlier in this same thread).

I did not suspect it at all, because it was suggested to me here. I took it out and things are running smoothly.

My guess as to what happened is that the TRACE output is massive - see the attached file, which I captured from the compute node's output. Perhaps the job timed out before computing any MD steps (I allocated 42 hours).

I captured the attached file on the compute node (as I mentioned before, nothing was written to my working directory, even after submitting multiple times) within less than 10 minutes of runtime. I will keep running this on our HPC compute node and see if it outputs any errors that could give us a better idea of why nothing was printed.

I am not sure why it would not still print the output, and why it left me without any restart files at all! That is the complete opposite of what I was trying to achieve with TRACE T!

I would like to raise this as an issue, unless you can perhaps correct something that I did wrong.

Thank you,

Michela

troubleshoot.txt

Michela Cavalieri

Jul 24, 2025, 12:50:11 PM
to cp2k
I am guessing that the workaround is TRACE_MASTER, as you suggested in case the output became messy. I would still like an explanation for why nothing was printed out - should a note about this be added to the CP2K manual?
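(A possible sketch of a less verbose trace setup, assuming the TRACE_MASTER and TRACE_MAX keywords available in the &GLOBAL section of recent CP2K versions:)

```
&GLOBAL
  TRACE         T
  TRACE_MASTER  T      ! only the master rank writes trace lines
  TRACE_MAX     1000   ! cap the number of trace lines printed
&END GLOBAL
```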

Thank you kindly for your time,

Michela