[slurm-users] Performance Discrepancy between Slurm and Direct mpirun for VASP Jobs.


Hongyi Zhao via slurm-users
May 24, 2024, 9:35:01 AM
to slurm...@lists.schedmd.com
Dear Slurm Users,

I am experiencing a significant performance discrepancy when running
the same VASP job through the Slurm scheduler compared to running it
directly with mpirun. I am hoping for some insights or advice on how
to resolve this issue.

System Information:

Slurm Version: 21.08.5
OS: Ubuntu 22.04.4 LTS (Jammy)


Job Submission Script:

#!/usr/bin/env bash
#SBATCH -N 1
#SBATCH -D .
#SBATCH --output=%j.out
#SBATCH --error=%j.err
##SBATCH --time=2-00:00:00
#SBATCH --ntasks=36
#SBATCH --mem=64G

echo '#######################################################'
echo "date = $(date)"
echo "hostname = $(hostname -s)"
echo "pwd = $(pwd)"
echo "sbatch = $(which sbatch | xargs realpath -e)"
echo ""
echo "WORK_DIR = $WORK_DIR"
echo "SLURM_SUBMIT_DIR = $SLURM_SUBMIT_DIR"
echo "SLURM_JOB_NUM_NODES = $SLURM_JOB_NUM_NODES"
echo "SLURM_NTASKS = $SLURM_NTASKS"
echo "SLURM_NTASKS_PER_NODE = $SLURM_NTASKS_PER_NODE"
echo "SLURM_CPUS_PER_TASK = $SLURM_CPUS_PER_TASK"
echo "SLURM_JOBID = $SLURM_JOBID"
echo "SLURM_JOB_NODELIST = $SLURM_JOB_NODELIST"
echo "SLURM_NNODES = $SLURM_NNODES"
echo "SLURMTMPDIR = $SLURMTMPDIR"
echo '#######################################################'
echo ""

module purge > /dev/null 2>&1
module load vasp
ulimit -s unlimited
mpirun vasp_std
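
For reference, a sketch of the same launch through Slurm's own launcher,
assuming the MPI library was built with PMI2 support (--cpu-bind=cores
pins each rank to a physical core):

srun --mpi=pmi2 --cpu-bind=cores vasp_std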


Performance Observation:

When running the job through Slurm:

werner@x13dai-t:~/Public/hpc/servers/benchmark/Cr72_3x3x3K_350eV_10DAV$
grep LOOP OUTCAR
LOOP: cpu time 14.4893: real time 14.5049
LOOP: cpu time 14.3538: real time 14.3621
LOOP: cpu time 14.3870: real time 14.3568
LOOP: cpu time 15.9722: real time 15.9018
LOOP: cpu time 16.4527: real time 16.4370
LOOP: cpu time 16.7918: real time 16.7781
LOOP: cpu time 16.9797: real time 16.9961
LOOP: cpu time 15.9762: real time 16.0124
LOOP: cpu time 16.8835: real time 16.9008
LOOP: cpu time 15.2828: real time 15.2921
LOOP+: cpu time 176.0917: real time 176.0755

When running the job directly with mpirun:


werner@x13dai-t:~/Public/hpc/servers/benchmark/Cr72_3x3x3K_350eV_10DAV$
mpirun -n 36 vasp_std
werner@x13dai-t:~/Public/hpc/servers/benchmark/Cr72_3x3x3K_350eV_10DAV$
grep LOOP OUTCAR
LOOP: cpu time 9.0072: real time 9.0074
LOOP: cpu time 9.0515: real time 9.0524
LOOP: cpu time 9.1896: real time 9.1907
LOOP: cpu time 10.1467: real time 10.1479
LOOP: cpu time 10.2691: real time 10.2705
LOOP: cpu time 10.4330: real time 10.4340
LOOP: cpu time 10.9049: real time 10.9055
LOOP: cpu time 9.9718: real time 9.9714
LOOP: cpu time 10.4511: real time 10.4470
LOOP: cpu time 9.4621: real time 9.4584
LOOP+: cpu time 110.0790: real time 110.0739


Could you provide any insights or suggestions on what might be causing
this performance issue? Are there any specific configurations or
settings in Slurm that I should check or adjust to align the
performance more closely with the direct mpirun execution?

Thank you for your time and assistance.

Best regards,
Zhao
--
Assoc. Prof. Hongsheng Zhao <hongy...@gmail.com>
Theory and Simulation of Materials
Hebei Vocational University of Technology and Engineering
No. 473, Quannan West Street, Xindu District, Xingtai, Hebei province

--
slurm-users mailing list -- slurm...@lists.schedmd.com
To unsubscribe send an email to slurm-us...@lists.schedmd.com

Hongyi Zhao via slurm-users
May 24, 2024, 9:41:40 AM
to slurm...@lists.schedmd.com
The attachment is the test example used above.
Cr72_3x3x3K_350eV_10DAV.zip

Hermann Schwärzler via slurm-users
May 24, 2024, 12:02:30 PM
to slurm...@lists.schedmd.com
Hi Zhao,

my guess is that in your faster case you are using hyperthreading,
whereas in the Slurm case you are not.

Can you check what performance you get when you add

#SBATCH --hint=multithread

to your Slurm script?

Another difference between the two might be
a) the communication channel/interface that is used.
b) the number of nodes involved: when using mpirun you might run things
on more than one node.
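
A quick way to check both points is something like the following sketch
(<jobid> is a placeholder):

# which node(s) and how many CPUs the job actually received
scontrol show job <jobid> | grep -E 'NodeList|NumNodes|NumCPUs'

# CPU affinity of each rank, as seen from inside the job script
srun bash -c 'echo "$(hostname): $(taskset -cp $$)"' | sort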

Regards,
Hermann

Hongyi Zhao via slurm-users
May 24, 2024, 7:52:54 PM
to slurm...@lists.schedmd.com
On Sat, May 25, 2024 at 12:02 AM Hermann Schwärzler via slurm-users
<slurm...@lists.schedmd.com> wrote:
>
> Hi Zhao,
>
> my guess is that in your faster case you are using hyperthreading,
> whereas in the Slurm case you are not.
>
> Can you check what performance you get when you add
>
> #SBATCH --hint=multithread
>
> to your Slurm script?

I tried adding the above directive to the Slurm script, but the job
just gets stuck there forever. Here is the output 10 minutes after the
job was submitted:


werner@x13dai-t:~/Public/hpc/servers/benchmark/Cr72_3x3x3K_350eV_10DAV$
cat sub.sh.o6
#######################################################
date = Sat May 25 07:31:31 CST 2024
hostname = x13dai-t
pwd =
/home/werner/Public/hpc/servers/benchmark/Cr72_3x3x3K_350eV_10DAV
sbatch = /usr/bin/sbatch

WORK_DIR =
SLURM_SUBMIT_DIR =
/home/werner/Public/hpc/servers/benchmark/Cr72_3x3x3K_350eV_10DAV
SLURM_JOB_NUM_NODES = 1
SLURM_NTASKS = 36
SLURM_NTASKS_PER_NODE =
SLURM_CPUS_PER_TASK =
SLURM_JOBID = 6
SLURM_JOB_NODELIST = localhost
SLURM_NNODES = 1
SLURMTMPDIR =
#######################################################

running 36 mpi-ranks, on 1 nodes
distrk: each k-point on 36 cores, 1 groups
distr: one band on 4 cores, 9 groups
vasp.6.4.3 19Mar24 (build May 17 2024 09:27:19) complex

POSCAR found type information on POSCAR Cr
POSCAR found : 1 types and 72 ions
Reading from existing POTCAR
scaLAPACK will be used
Reading from existing POTCAR
-----------------------------------------------------------------------------
| |
| ----> ADVICE to this user running VASP <---- |
| |
| You have a (more or less) 'large supercell' and for larger cells it |
| might be more efficient to use real-space projection operators. |
| Therefore, try LREAL= Auto in the INCAR file. |
| Mind: For very accurate calculation, you might also keep the |
| reciprocal projection scheme (i.e. LREAL=.FALSE.). |
| |
-----------------------------------------------------------------------------

LDA part: xc-table for (Slater+PW92), standard interpolation
POSCAR, INCAR and KPOINTS ok, starting setup
FFT: planning ... GRIDC
FFT: planning ... GRID_SOFT
FFT: planning ... GRID
WAVECAR not read


> Another difference between the two might be
> a) the communication channel/interface that is used.

I tried `mpirun', `mpiexec', and `srun --mpi pmi2', and they all
behave similarly, as described above.

> b) the number of nodes involved: when using mpirun you might run things
> on more than one node.

This is a single-node cluster with two sockets.

> Regards,
> Hermann

Regards,
Zhao

Hongyi Zhao via slurm-users
May 24, 2024, 9:51:39 PM
to slurm...@lists.schedmd.com
Ultimately, I found that the cause of the problem was that
hyper-threading was enabled by default in the BIOS. After disabling
hyper-threading, I observed that the computational efficiency is
consistent between running through Slurm and running mpirun directly.
Therefore, it appears that hyper-threading should not be enabled in the
BIOS when using Slurm.
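
For anyone who wants to double-check this on their own node, a small
sketch (standard commands; the values will of course differ per machine):

# "Thread(s) per core: 2" means Hyper-Threading is active, 1 means it is off
lscpu | grep -E '^Thread|^Core|^Socket|^CPU\(s\)'

# what slurmd itself detects (CPUs, Sockets, CoresPerSocket, ThreadsPerCore)
slurmd -C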

Hongyi Zhao via slurm-users
May 24, 2024, 10:08:11 PM
to slurm...@lists.schedmd.com
Regarding the reason, I think the description here [1] is reasonable:

It is recommended to disable processor hyper-threading. In
applications that are compute-intensive rather than I/O-intensive,
enabling HyperThreading is likely to decrease the overall performance
of the server. Intuitively, the physical memory available per core is
reduced after hyper-threading is enabled.

[1] https://gist.github.com/weijianwen/acee3cd49825da8c8dfb4a99365b54c8#%E5%85%B3%E9%97%AD%E5%A4%84%E7%90%86%E5%99%A8%E8%B6%85%E7%BA%BF%E7%A8%8B
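
As a rough illustration with assumed numbers (not measured on this node):
256 GB of RAM over 36 physical cores is about 7 GB per rank, while the
same 256 GB over 72 logical CPUs is only about 3.5 GB per rank, and the
two hyper-threads of a core also share its caches and execution units.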

Regards,
Zhao

Hongyi Zhao via slurm-users
May 24, 2024, 10:19:50 PM
to slurm...@lists.schedmd.com
See here [1] for the related discussion.

[1] https://www.vasp.at/forum/viewtopic.php?t=19557

Regards,

Ole Holm Nielsen via slurm-users
May 26, 2024, 2:42:36 AM
to slurm...@lists.schedmd.com
On 25-05-2024 03:49, Hongyi Zhao via slurm-users wrote:
> Ultimately, I found that the cause of the problem was that
> hyper-threading was enabled by default in the BIOS. If I disable
> hyper-threading, I observed that the computational efficiency is
> consistent between using slurm and using mpirun directly. Therefore,
> it appears that hyper-threading should not be enabled in the BIOS when
> using slurm.

Whether or not to enable Hyper-Threading (HT) on your compute nodes
depends entirely on the properties of applications that you wish to run
on the nodes. Some applications are faster without HT, others are
faster with HT. When HT is enabled, the "virtual CPU cores" obviously
will have only half the memory available per core.

The VASP code is highly CPU- and memory-intensive, and HT should
probably be disabled for optimal performance with VASP.

Slurm doesn't affect the performance of your codes with or without HT.
Slurm just schedules tasks to run on the available cores.

/Ole

Bjørn-Helge Mevik via slurm-users
May 27, 2024, 2:58:59 AM
to slurm...@schedmd.com
Ole Holm Nielsen via slurm-users <slurm...@lists.schedmd.com> writes:

> Whether or not to enable Hyper-Threading (HT) on your compute nodes
> depends entirely on the properties of applications that you wish to
> run on the nodes. Some applications are faster without HT, others are
> faster with HT. When HT is enabled, the "virtual CPU cores" obviously
> will have only half the memory available per core.

Another consideration, if you keep HT enabled, is whether you want
Slurm to hand out physical cores to jobs or logical CPUs (hyperthreads).
Again, what is best depends on your workload. On our systems, we tend
to either turn off HT or hand out cores.
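
For the record, the two styles roughly correspond to the following
slurm.conf sketch (hypothetical node and core counts; check the
slurm.conf man page before copying anything):

# hand out whole physical cores (an allocation is rounded up to full cores,
# i.e. both hyper-threads of each core)
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
NodeName=node01 Sockets=2 CoresPerSocket=18 ThreadsPerCore=2 RealMemory=256000

# hand out individual hyperthreads: declare only the logical CPU count and use CR_CPU
# SelectTypeParameters=CR_CPU_Memory
# NodeName=node01 CPUs=72 RealMemory=256000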

--
B/H

Hongyi Zhao via slurm-users
May 27, 2024, 4:45:37 AM
to Bjørn-Helge Mevik, slurm...@schedmd.com
With Hyper-Threading (HT) enabled, is it possible to configure Slurm
to behave as follows:

1. As long as the total number of cores requested by jobs is below the
number of physical cores, hand out physical cores to jobs.
2. Once the total exceeds the number of physical cores, use logical
CPUs for the excess.

I have heard of the following method for achieving this, but have not
tried it so far:

To configure Slurm for managing jobs more effectively when
Hyper-Threading (HT) is enabled, you can implement a strategy that
involves distinguishing between physical and logical cores. Here's a
possible approach to meet the requirements you described:

1. Configure Slurm to Recognize Physical and Logical Cores

First, ensure that Slurm can differentiate between physical and
logical cores. This typically involves setting the CpuBind and
TaskPlugin parameters correctly in the Slurm configuration file
(usually slurm.conf).

# Settings in slurm.conf
TaskPlugin=task/affinity
CpuBind=cores


2. Use Gres (Generic Resources) to Identify Physical and Logical Cores

You can utilize the GRES (Generic RESources) feature to define
additional resource types, such as physical and logical cores. First,
these resources need to be defined in the slurm.conf.

# Define resources in slurm.conf
NodeName=NODENAME Gres=cpu_physical:16,cpu_logical:32 CPUs=32 Boards=1
Sockets=2 CoresPerSocket=8 ThreadsPerCore=2


Here, cpu_physical and cpu_logical are custom resource names, followed
by the number of resources. The CPUs field should be set to the total
number of physical and logical cores.

3. Write Job Submission Scripts

When submitting a job, users need to request the appropriate type of
cores based on their needs. For example, if a job requires more cores
than the number of physical cores available, it can request a
combination of physical and logical cores.

#!/bin/bash
#SBATCH --gres=cpu_physical:8,cpu_logical:4
#SBATCH --ntasks=12
#SBATCH --cpus-per-task=1

# Your job execution command

This script requests 8 physical cores and 4 logical cores, totaling 12 cores.

> --
> B/H

Regards,
Zhao

Hermann Schwärzler via slurm-users
May 27, 2024, 9:45:07 AM
to slurm...@lists.schedmd.com
Hi everybody,

On 5/26/24 08:40, Ole Holm Nielsen via slurm-users wrote:
[...]
> Whether or not to enable Hyper-Threading (HT) on your compute nodes
> depends entirely on the properties of applications that you wish to run
> on the nodes.  Some applications are faster without HT, others are
> faster with HT.  When HT is enabled, the "virtual CPU cores" obviously
> will have only half the memory available per core.
>
> The VASP code is highly CPU- and memory intensive, and HT should
> probably be disabled for optimal performance with VASP.
>
> Slurm doesn't affect the performance of your codes with or without HT.
> Slurm just schedules tasks to run on the available cores.

This is how we are handling Hyper-Threading in our cluster:
* It's enabled in the BIOS/system settings.
* The important parts in our slurm.conf are:
TaskPlugin=task/affinity,task/cgroup
CliFilterPlugins=cli_filter/lua
NodeName=DEFAULT ... ThreadsPerCore=2
* We make "--hint=nomultithread" the default for jobs by having this in
cli_filter.lua:
function slurm_cli_setup_defaults(options, early_pass)
    options['hint'] = 'nomultithread'
    return slurm.SUCCESS
end
So users can still use Hyper-Threading by specifying
"--hint=multithread" in their job script, which will give them two
"CPUs/Threads" per Core. Without this option they will get one Core
per requested CPU.
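
A job that wants the hyperthreads back then only has to request them,
e.g. (minimal sketch; my_program is just a placeholder):

#!/usr/bin/env bash
#SBATCH --ntasks=36
# override the site-wide nomultithread default from cli_filter.lua:
#SBATCH --hint=multithread

srun my_program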

This works for us and our users. There is only one small side-effect:
when a job is pending, the expected number is displayed in the "CPUS"
column of the output of "squeue". But when a job is running, twice that
number is displayed (as Slurm counts both Hyper-Threads per Core as "CPUs").

Regards,
Hermann

Hongyi Zhao via slurm-users
May 28, 2024, 8:57:45 AM
to slurm...@lists.schedmd.com
---------- Forwarded message ---------
From: Hermann Schwärzler <hermann.s...@uibk.ac.at>
Date: Tue, May 28, 2024 at 4:10 PM
Subject: Re: [slurm-users] Re: Performance Discrepancy between Slurm
and Direct mpirun for VASP Jobs.
To: Hongyi Zhao <hongy...@gmail.com>


Hi Zhao,

On 5/28/24 03:08, Hongyi Zhao wrote:
[...]
>
> What's the complete content of cli_filter.lua and where should I put this file?
[...]

Below you find the complete content of our cli_filter.lua.
It has to be put into the same directory as "slurm.conf".

--------------------------------- 8< ---------------------------------
-- see https://github.com/SchedMD/slurm/blob/master/etc/cli_filter.lua.example

function slurm_cli_pre_submit(options, pack_offset)
    return slurm.SUCCESS
end

function slurm_cli_setup_defaults(options, early_pass)
    -- Make --hint=nomultithread the default behavior.
    -- If users specify another --hint=XX option,
    -- it will override the setting done here.
    options['hint'] = 'nomultithread'

    return slurm.SUCCESS
end

function slurm_cli_post_submit(offset, job_id, step_id)
    return slurm.SUCCESS
end

--------------------------------- >8 ---------------------------------

Hopefully this helps...

Regards,
Hermann


--
Assoc. Prof. Hongsheng Zhao <hongy...@gmail.com>
Theory and Simulation of Materials
Hebei Vocational University of Technology and Engineering
No. 473, Quannan West Street, Xindu District, Xingtai, Hebei province
