segmentation fault for simulations with ntomp > 1


Carlos Henrique
Apr 17, 2025, 5:45:38 AM
to PLUMED users
Dear all,

I am getting a segmentation fault in well-tempered metadynamics simulations (62,610 atoms) run with GROMACS 2023 patched with PLUMED 2.9.0 whenever ntomp is greater than 1. With ntomp equal to 1 the simulation does not crash, but performance is at least 10x lower.
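
For reference, the two invocations differ only in -ntomp (the full command line is also visible in the job-script line at the end of the error output below):

  # crashes after a few steps:
  gmx_mpi mdrun -v -s ${name}.tpr -deffnm ${name}-run -cpi ${name}-run.cpt -noappend -ntomp 6 -ntmpi 1 -plumed plumed.dat
  # completes, but at least 10x slower:
  gmx_mpi mdrun -v -s ${name}.tpr -deffnm ${name}-run -cpi ${name}-run.cpt -noappend -ntomp 1 -ntmpi 1 -plumed plumed.dat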

Any suggestion or comment is welcome.

Details:

GPU card: NVIDIA A40

GROMACS version: 2023-plumed_2.9.0
Precision: mixed
Memory model: 64 bit
MPI library: MPI (GPU-aware: CUDA)
OpenMP support: enabled (GMX_OPENMP_MAX_THREADS = 128)
GPU support: CUDA
NB cluster size: 8
SIMD instructions: AVX2_256
CPU FFT library: fftw-3.3.10-avx
GPU FFT library: cuFFT
Multi-GPU FFT: none
RDTSCP usage: enabled
TNG support: enabled
Hwloc support: disabled
Tracing support: disabled
C compiler: /share/apps/gcc-9.2.0/bin/gcc GNU 9.2.0
C compiler flags: -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -pthread -O3 -DNDEBUG
C++ compiler: /share/apps/gcc-9.2.0/bin/g++ GNU 9.2.0
C++ compiler flags: -fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -pthread -Wno-cast-function-type-strict -fopenmp -O3 -DNDEBUG
BLAS library: External - detected on the system
LAPACK library: External - detected on the system
CUDA compiler: /share/apps/cuda-12.0/bin/nvcc nvcc: NVIDIA (R) Cuda compiler driver;Copyright (c) 2005-2023 NVIDIA Corporation;Built on Fri_Jan__6_16:45:21_PST_2023;Cuda compilation tools, release 12.0, V12.0.140;Build cuda_12.0.r12.0/compiler.32267302_0
CUDA compiler flags:-std=c++17;--generate-code=arch=compute_50,code=sm_50;--generate-code=arch=compute_52,code=sm_52;--generate-code=arch=compute_60,code=sm_60;--generate-code=arch=compute_61,code=sm_61;--generate-code=arch=compute_70,code=sm_70;--generate-code=arch=compute_75,code=sm_75;--generate-code=arch=compute_80,code=sm_80;--generate-code=arch=compute_86,code=sm_86;--generate-code=arch=compute_89,code=sm_89;--generate-code=arch=compute_90,code=sm_90;-Wno-deprecated-gpu-targets;--generate-code=arch=compute_53,code=sm_53;--generate-code=arch=compute_80,code=sm_80;-use_fast_math;-Xptxas;-warn-double-usage;-Xptxas;-Werror;;-fexcess-precision=fast -funroll-all-loops -mavx2 -mfma -Wno-missing-field-initializers -pthread -Wno-cast-function-type-strict -fopenmp -O3 -DNDEBUG
CUDA driver: 12.40
CUDA runtime: 12.0

Error:

1 GPU selected for this run.
Mapping of GPU IDs to the 2 GPU tasks in the 1 rank on this node:
PP:0,PME:0
PP tasks will do (non-perturbed) short-ranged interactions on the GPU
PP task will update and constrain coordinates on the CPU
PME tasks will do all aspects on the GPU
Using 1 MPI process
Using 6 OpenMP threads

NOTE: The number of threads is not equal to the number of (logical) cpus
and the -pin option is set to auto: will not pin threads to cpus.
This can lead to significant performance degradation.
Consider using -pin on (and -pinoffset in case you run multiple jobs).
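
A pinned launch along the lines of this NOTE would look as follows; -pinoffset 0 is only an illustration and matters mainly when several jobs share the node:

  gmx_mpi mdrun -v -s ${name}.tpr -deffnm ${name}-run -cpi ${name}-run.cpt -noappend -ntomp 6 -ntmpi 1 -pin on -pinoffset 0 -plumed plumed.dat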

158649 steps, infinite ps (continuing from step 158650, 317.3 ps).
step 158650 performance: 0.9 ns/day

WARNING: Listed nonbonded interaction between particles 760 and 767
at distance 3.223 which is larger than the table limit 1.936 nm.
This is likely either a 1,4 interaction, or a listed interaction inside
a smaller molecule you are decoupling during a free energy calculation.
Since interactions at distances beyond the table cannot be computed,
they are skipped until they are inside the table limit again. You will
only see this message once, even if it occurs for several interactions.

IMPORTANT: This should not happen in a stable simulation, so there is
probably something wrong with your system. Only change the table-extension
distance in the mdp file if you are really sure that is the reason


step 158660: One or more water molecules can not be settled.
Check for bad contacts and/or reduce the timestep if appropriate.

Wrote pdb files with previous and current coordinates

[behemoth:03402] *** Process received signal ***
[behemoth:03402] Signal: Segmentation fault (11)
[behemoth:03402] Signal code: Address not mapped (1)
[behemoth:03402] Failing at address: 0xc30000053b
[behemoth:03402] [ 0] /lib64/libpthread.so.0(+0xf630)[0x2b2556843630]
[behemoth:03402] [ 1] /lib64/libcuda.so.1(+0x24b01e)[0x2b256424b01e]
[behemoth:03402] [ 2] /lib64/libcuda.so.1(+0x24b5a8)[0x2b256424b5a8]
[behemoth:03402] [ 3] /lib64/libcuda.so.1(+0x2f18d3)[0x2b25642f18d3]
[behemoth:03402] [ 4] gmx_mpi[0x135d940]
[behemoth:03402] [ 5] gmx_mpi[0x13baa48]
[behemoth:03402] [ 6] gmx_mpi[0xb53f2d]
[behemoth:03402] [ 7] gmx_mpi[0xb433a1]
[behemoth:03402] [ 8] gmx_mpi[0x56454a]
[behemoth:03402] [ 9] gmx_mpi[0x9643d1]
[behemoth:03402] [10] gmx_mpi[0x4501fd]
[behemoth:03402] [11] gmx_mpi[0x4df085]
[behemoth:03402] [12] gmx_mpi[0x4df1dd]
[behemoth:03402] [13] gmx_mpi[0x5849ee]
[behemoth:03402] [14] gmx_mpi[0x4bd64d]
[behemoth:03402] [15] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2b25577a9555]
[behemoth:03402] [16] gmx_mpi[0x4dbd55]
[behemoth:03402] *** End of error message ***
/opt/gridengine/default/spool/behemoth/job_scripts/5453100: line 132: 3402 Segmentation fault gmx_mpi mdrun -v -s ${name}.tpr -deffnm ${name}-run -cpi ${name}-run.cpt -noappend -ntomp 6 -ntmpi 1 -plumed plumed.dat
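
Regarding the SETTLE message above, the suggested timestep reduction would go into the run's .mdp roughly as below; the values are only illustrative, as I am not sure whether the instability causes the crash or is itself a symptom of it:

  ; illustrative .mdp fragment, not my actual production settings
  dt = 0.001    ; e.g. halved from 0.002 ps, as the SETTLE message suggests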