CUDA RUNTIME API error: EventRecord failed with error cudaErrorInvalidResourceHandle

singlebook

Feb 3, 2021, 8:20:50 PM
to cp2k

Hello, All

I just installed CP2K v8.1 on my workstation, which has 12 NVIDIA K80 GPUs. It was built with GCC 6.5 and CUDA 10.0.

I want to perform AIMD for SiC, but when I use more than 6 GPUs, it always gives me the error:

CUDA RUNTIME API error: EventRecord failed with error cudaErrorInvalidResourceHandle
error: GPU API call : invalid resource handle
terminate called after throwing an instance of 'std::runtime_error'
  what():  GPU ERROR

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x7fc42ccc626f in ???
#1  0x7fc42ccc61f7 in ???
#2  0x7fc42ccc78e7 in ???
#3  0x7fc43d68193c in _ZN9__gnu_cxx27__verbose_terminate_handlerEv
    at ../../../../gcc-6.5.0/libstdc++-v3/libsupc++/vterminate.cc:95
#4  0x7fc43d67f905 in _ZN10__cxxabiv111__terminateEPFvvE
    at ../../../../gcc-6.5.0/libstdc++-v3/libsupc++/eh_terminate.cc:47
#5  0x7fc43d67f950 in _ZSt9terminatev
    at ../../../../gcc-6.5.0/libstdc++-v3/libsupc++/eh_terminate.cc:57
#6  0x7fc43d67fb68 in __cxa_throw
    at ../../../../gcc-6.5.0/libstdc++-v3/libsupc++/eh_throw.cc:87
#7  0x2b12c82 in check_runtime_status
    at /local/src/cp2k-8.1/tools/toolchain/build/cosma-2.2.0/libs/Tiled-MM/src/Tiled-MM/util.hpp:17
#8  0x2b12c82 in _ZNK3gpu13device_stream13enqueue_eventEv
    at /local/src/cp2k-8.1/tools/toolchain/build/cosma-2.2.0/libs/Tiled-MM/src/Tiled-MM/device_stream.hpp:62
#9  0x2b12c82 in _ZN3gpu11round_robinIdEEvRNS_12tiled_matrixIT_EES4_S4_RNS_13device_bufferIS2_EES7_S7_iiiS2_S2_RNS_9mm_handleIS2_EE
    at /local/src/cp2k-8.1/tools/toolchain/build/cosma-2.2.0/libs/Tiled-MM/src/Tiled-MM/tiled_mm.cpp:248
#10  0x2b1351c in _ZN3gpu4gemmIdEEvRNS_9mm_handleIT_EEPS2_S5_S5_iiiS2_S2_b
    at /local/src/cp2k-8.1/tools/toolchain/build/cosma-2.2.0/libs/Tiled-MM/src/Tiled-MM/tiled_mm.cpp:341
#11  0x2adfdee in _ZN5cosma14local_multiplyIdEEvPN3gpu9mm_handleIT_EEPS3_S6_S6_iiiS3_S3_
    at /local/src/cp2k-8.1/tools/toolchain/build/cosma-2.2.0/src/cosma/local_multiply.cpp:86
#12  0x2ac8fb3 in _ZN5cosma8multiplyIdEEvPNS_13cosma_contextIT_EERNS_11CosmaMatrixIS2_EES7_S7_RNS_8IntervalES9_S9_S9_mRKNS_8StrategyERNS_12communicatorES2_S2_
    at /local/src/cp2k-8.1/tools/toolchain/build/cosma-2.2.0/src/cosma/multiply.cpp:355
#13  0x2ac9c26 in _ZN5cosma8multiplyIdEEvPNS_13cosma_contextIT_EERNS_11CosmaMatrixIS2_EES7_S7_RKNS_8StrategyEiS2_S2_
    at /local/src/cp2k-8.1/tools/toolchain/build/cosma-2.2.0/src/cosma/multiply.cpp:272
#14  0x2a9fc5d in ???
#15  0x250cd5c in __cp_fm_basic_linalg_MOD_cp_fm_gemm
    at /local/src/cp2k-8.1/src/fm/cp_fm_basic_linalg.F:446
#16  0xcd8744 in __cp_gemm_interface_MOD_cp_gemm
    at /local/src/cp2k-8.1/src/cp_gemm_interface.F:138
#17  0x10c794b in __qs_wf_history_methods_MOD_wfi_extrapolate
    at /local/src/cp2k-8.1/src/qs_wf_history_methods.F:912
#18  0x17a5b53 in scf_env_initial_rho_setup
    at /local/src/cp2k-8.1/src/qs_scf_initialization.F:1122
#19  0x17a5b53 in init_scf_run
    at /local/src/cp2k-8.1/src/qs_scf_initialization.F:1047
#20  0x17a79b5 in __qs_scf_initialization_MOD_qs_scf_env_initialize
    at /local/src/cp2k-8.1/src/qs_scf_initialization.F:182
#21  0xf1e341 in __qs_scf_MOD_scf
    at /local/src/cp2k-8.1/src/qs_scf.F:222
#22  0xc0e966 in __qs_energy_MOD_qs_energies
    at /local/src/cp2k-8.1/src/qs_energy.F:88
#23  0x1979f13 in qs_forces
    at /local/src/cp2k-8.1/src/qs_force.F:209
#24  0x197dc87 in __qs_force_MOD_qs_calc_energy_force
    at /local/src/cp2k-8.1/src/qs_force.F:114
#25  0x112bfe5 in __force_env_methods_MOD_force_env_calc_energy_force
    at /local/src/cp2k-8.1/src/force_env_methods.F:271
#26  0x797c55 in __integrator_MOD_nvt
    at /local/src/cp2k-8.1/src/motion/integrator.F:1103
#27  0x78ddca in __velocity_verlet_control_MOD_velocity_verlet
    at /local/src/cp2k-8.1/src/motion/velocity_verlet_control.F:77
#28  0x6c1695 in qs_mol_dyn_low
    at /local/src/cp2k-8.1/src/motion/md_run.F:481
#29  0x6c209a in __md_run_MOD_qs_mol_dyn
    at /local/src/cp2k-8.1/src/motion/md_run.F:153
#30  0x5536ae in cp2k_run
    at /local/src/cp2k-8.1/src/start/cp2k_runs.F:378
#31  0x556764 in __cp2k_runs_MOD_run_input
    at /local/src/cp2k-8.1/src/start/cp2k_runs.F:983
#32  0x534a31 in cp2k
    at /local/src/cp2k-8.1/src/start/cp2k.F:337
#33  0x4ec1cc in main
    at /local/src/cp2k-8.1/src/start/cp2k.F:44

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 14969 RUNNING AT k172
=   EXIT CODE: 134
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

The CPU version of CP2K runs without problems, and classical MD for argon.inp from the exercises also runs smoothly with 12 GPUs.

Your response is highly appreciated.

Best wishes,

Wei
SiC.inp

Alfio Lazzaro

Feb 4, 2021, 1:41:48 AM
to cp2k
Multi-GPU support is still not stable.
The error originates inside COSMA.
Could you remove this library from your CP2K installation? I assume you are using the toolchain, so just use --with-cosma=no

Also, I assume you are using the PSMP version of CP2K (the only way to use multiple GPUs). Could you confirm? Note that there must be at least one rank attached to each GPU, e.g. 12 GPUs need at least 12 ranks (or a multiple of 12).
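
A minimal sketch of the rank-to-GPU relationship described here, assuming a simple round-robin assignment by node-local rank. The function names `gpu_for_rank` and `ranks_per_gpu` are hypothetical, and this is not necessarily how CP2K or DBCSR picks devices internally; it only illustrates why fewer ranks than GPUs leaves some GPUs unused.

```python
from collections import Counter


def gpu_for_rank(local_rank: int, n_gpus: int) -> int:
    """Hypothetical round-robin mapping of a node-local MPI rank to a GPU index."""
    if n_gpus <= 0:
        raise ValueError("n_gpus must be positive")
    return local_rank % n_gpus


def ranks_per_gpu(n_ranks: int, n_gpus: int) -> Counter:
    """Count how many ranks land on each GPU under the mapping above."""
    return Counter(gpu_for_rank(r, n_gpus) for r in range(n_ranks))


# 12 ranks on 12 GPUs: every GPU gets exactly one rank.
print(dict(ranks_per_gpu(12, 12)))
# 6 ranks on 12 GPUs: only GPUs 0..5 receive a rank, the others stay idle.
print(sorted(ranks_per_gpu(6, 12)))
```

With 48 ranks and 12 GPUs the same mapping puts 4 ranks on each GPU, which is the "multiples" case.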

Alfio

singlebook

Feb 4, 2021, 3:34:52 AM
to cp2k
Thanks for your reply! Yes, I am using PSMP. I will recompile CP2K without COSMA and report back later.

singlebook

Feb 4, 2021, 8:18:43 PM
to cp2k

Hello!

I removed COSMA from CP2K. It now runs on multiple GPUs, but it is not faster:
48 cpus without gpu: each SCF step takes 0.3 s (COSMA is available in the CPU version)
48 cpus with 12 gpus: each SCF step takes 1.8 s (COSMA is not available)
12 cpus with 12 gpus: each SCF step takes 1.2 s (COSMA is not available)
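
To put these numbers side by side, here is a small sketch of the slowdown relative to the CPU-only run. The times are the per-SCF-step values listed above; the dictionary and function names are only illustrative.

```python
# Per-SCF-step wall times in seconds, as reported in this post.
step_times = {
    "48 cpus, no gpu (COSMA)": 0.3,
    "48 cpus, 12 gpus (no COSMA)": 1.8,
    "12 cpus, 12 gpus (no COSMA)": 1.2,
}


def slowdown(t: float, baseline: float) -> float:
    """Ratio of a step time to the baseline step time (> 1 means slower)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return t / baseline


baseline = step_times["48 cpus, no gpu (COSMA)"]
for label, t in step_times.items():
    print(f"{label}: {slowdown(t, baseline):.1f}x the CPU-only step time")
```

So the two GPU runs are roughly 6x and 4x slower per SCF step than the CPU-only run.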

Alfio Lazzaro

Feb 5, 2021, 2:08:22 AM
to cp2k
Hello!
I assume that by "12 cpus" you mean 12 MPI ranks. Could you confirm? How many threads per rank?

First of all, keep in mind that multi-GPU is still not well tested. That said, more GPUs don't mean faster execution if the code doesn't exploit them.
I see some possible explanations for your results:
1. The GPU-accelerated part of CP2K is DBCSR. Your benchmark likely doesn't use DBCSR a lot, so there is no speed-up. From your CPU result, it seems that you are bound by PDGEMMs, so COSMA is beneficial.
2. Multiple GPUs can share the same PCIe bus, so data movement becomes the bottleneck.
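
To illustrate point 1: if only a fraction of the runtime is spent in the GPU-accelerated part, the overall gain is bounded no matter how many GPUs are used. A minimal sketch using Amdahl's law, where the fraction `f` and the speed-up `s` are hypothetical values and not measurements from this run:

```python
def amdahl_speedup(f: float, s: float) -> float:
    """Overall speed-up when a fraction f of the runtime is accelerated by a factor s."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if s <= 0.0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - f) + f / s)


# Hypothetical case: 10% of the runtime is in the accelerated part,
# and that part becomes 10x faster on the GPU.
print(f"{amdahl_speedup(0.1, 10.0):.2f}")
# Even an arbitrarily large GPU speed-up cannot exceed 1 / (1 - f).
print(f"{amdahl_speedup(0.1, 1e9):.2f}")
```

Note that this estimate ignores host-device transfers, so with the data movement from point 2 the result can even be a slowdown.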

One way to investigate is for you to share the CP2K outputs so I can take a look.
One more question: you said that it crashed with more than 6 GPUs. Do you have a run with 4 (or 6) GPUs and COSMA? If so, please share it.
One possibility is to use COSMA on the CPU only and the GPUs for DBCSR.
However, it is also possible that 6 GPUs with COSMA are enough to speed up the execution.
For the rest, I suggest opening an issue on the COSMA page (https://github.com/eth-cscs/COSMA/issues) to understand why more than 6 GPUs do not work (this is not strictly CP2K related).

Alfio

singlebook

Feb 5, 2021, 3:24:34 AM
to cp2k
Hello,  Alfio,

Yes, there are 12 MPI ranks, each with a single thread.
The output file is too large to upload, so I have only included the header of the CPU-version output here. The GPU outputs were not saved. I will run more tests whenever the workstation is idle.

 DBCSR| CPU Multiplication driver                                           XSMM
 DBCSR| Multrec recursion limit                                              512
 DBCSR| Multiplication stack size                                           1000
 DBCSR| Maximum elements for images                                    UNLIMITED
 DBCSR| Multiplicative factor virtual images                                   1
 DBCSR| Use multiplication densification                                       T
 DBCSR| Multiplication size stacks                                             3
 DBCSR| Use memory pool for CPU allocation                                     F
 DBCSR| Number of 3D layers                                               SINGLE
 DBCSR| Use MPI memory allocation                                              F
 DBCSR| Use RMA algorithm                                                      F
 DBCSR| Use Communication thread                                               T
 DBCSR| Communication thread load                                             87
 DBCSR| MPI: My node id                                                        0
 DBCSR| MPI: Number of nodes                                                  48
 DBCSR| OMP: Current number of threads                                         1
 DBCSR| OMP: Max number of threads                                             1
 DBCSR| Split modifier for TAS multiplication algorithm                  1.0E+00


  **** **** ******  **  PROGRAM STARTED AT               2021-02-04 09:18:01.088
 ***** ** ***  *** **   PROGRAM STARTED ON                                  k172
 **    ****   ******    PROGRAM STARTED BY                               chenwei
 ***** **    ** ** **   PROGRAM PROCESS ID                                 52126
  **** **  *******  **  PROGRAM STARTED IN /ncsfs02/chenwei/Machine Learning/CP2
                                           K/SiC

 CP2K| version string:                                          CP2K version 8.1
 CP2K| source code revision number:                                  git:0b61f2f
 CP2K| cp2kflags: omp libint fftw3 libxc elpa parallel mpi3 scalapack xsmm plume
 CP2K|            d2 spglib libvori libbqb
 CP2K| is freely available from                            https://www.cp2k.org/
 CP2K| Program compiled at                          Thu Feb  4 08:49:28 CST 2021
 CP2K| Program compiled on                                                  k172
 CP2K| Program compiled for                                                local
 CP2K| Data directory path                       /home/chenwei/src/cp2k-8.1/data
 CP2K| Input file name                                                   SiC.inp

 GLOBAL| Force Environment number                                              1
 GLOBAL| Basis set file name                                           BASIS_SET
 GLOBAL| Potential file name                                      GTH_POTENTIALS
 GLOBAL| MM Potential file name                                     MM_POTENTIAL
 GLOBAL| Coordinate file name                                      __STD_INPUT__
 GLOBAL| Method name                                                        CP2K
 GLOBAL| Project name                                                   SiC_AIMD
 GLOBAL| Preferred FFT library                                             FFTW3
 GLOBAL| Preferred diagonalization lib.                                     ELPA
 GLOBAL| Run type                                                             MD
 GLOBAL| All-to-all communication in single precision                          F
 GLOBAL| FFTs using library dependent lengths                                  F
 GLOBAL| Global print level                                                  LOW
 GLOBAL| MPI I/O enabled                                                       T
 GLOBAL| Total number of message passing processes                            48
 GLOBAL| Number of threads for this process                                    1
 GLOBAL| This output is from process                                           0
 GLOBAL| CPU model name                Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
 GLOBAL| CPUID                                                              1002

 MEMORY| system memory details [Kb]
 MEMORY|                        rank 0           min           max       average
 MEMORY| MemTotal            131748504     131748504     131748504     131748504
 MEMORY| MemFree              67523260      67523260      67523260      67523260
 MEMORY| Buffers                  4712          4712          4712          4712
 MEMORY| Cached               56159648      56159648      56159648      56159648
 MEMORY| Slab                  2740508       2740508       2740508       2740508
 MEMORY| SReclaimable          2447544       2447544       2447544       2447544
 MEMORY| MemLikelyFree       126135164     126135164     126135164     126135164


 GENERATE|  Preliminary Number of Bonds generated:                             0
 GENERATE|  Achieved consistency in connectivity generation.


 SCF WAVEFUNCTION OPTIMIZATION

  Step     Update method      Time    Convergence         Total energy    Change
  ------------------------------------------------------------------------------
     1 NoMix/Diag. 0.40E+00    0.3     3.80220882      -317.7175159821 -3.18E+02
     2 Broy./Diag. 0.40E+00    0.6     0.43368094      -291.0370906460  2.67E+01
     3 Broy./Diag. 0.40E+00    0.6     0.23506554      -308.2043627628 -1.72E+01
     4 Broy./Diag. 0.40E+00    0.6     0.26390650      -309.7756477106 -1.57E+00
     5 Broy./Diag. 0.40E+00    0.6     0.00311711      -310.0196552337 -2.44E-01
     6 Broy./Diag. 0.40E+00    0.6     0.01762115      -309.8687051316  1.51E-01
     7 Broy./Diag. 0.40E+00    0.6     0.00055086      -309.8505587170  1.81E-02
     8 Broy./Diag. 0.40E+00    0.6     0.00030811      -309.8516271774 -1.07E-03
     9 Broy./Diag. 0.40E+00    0.6     0.00001506      -309.8519055144 -2.78E-04
    10 Broy./Diag. 0.40E+00    0.6     0.00000129      -309.8519255844 -2.01E-05
    11 Broy./Diag. 0.40E+00    0.6     0.00000032      -309.8519300365 -4.45E-06
    12 Broy./Diag. 0.40E+00    0.6     0.00000002      -309.8519304271 -3.91E-07

  *** SCF run converged in    12 steps ***


Best wishes,

Wei

Alfio Lazzaro

Feb 5, 2021, 4:43:48 AM
to cp2k
Well, what I need is the top (let's say up to "SCF WAVEFUNCTION OPTIMIZATION") and the bottom of the logs (starting at "DBCSR STATISTICS").
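
A minimal sketch of how to cut those two parts out of a large output file, assuming the log is available as a list of lines. The function name `head_and_tail` is hypothetical and the marker strings are the ones quoted above.

```python
def head_and_tail(lines,
                  head_end="SCF WAVEFUNCTION OPTIMIZATION",
                  tail_start="DBCSR STATISTICS"):
    """Return (head, tail) of a CP2K log given as a list of lines.

    head: all lines before the first line containing head_end
          (the whole log if the marker never appears).
    tail: all lines from the first line containing tail_start onward
          (empty if the marker never appears).
    """
    head = []
    for line in lines:
        if head_end in line:
            break
        head.append(line)

    tail = []
    for i, line in enumerate(lines):
        if tail_start in line:
            tail = list(lines[i:])
            break
    return head, tail


# Tiny stand-in for a real log; with a file one could use
# open(path).read().splitlines() instead (path chosen by the user).
demo = [
    " CP2K| version string: CP2K version 8.1",
    " SCF WAVEFUNCTION OPTIMIZATION",
    "     1 NoMix/Diag. ...",
    " -                DBCSR STATISTICS                -",
    " flops total ...",
]
head, tail = head_and_tail(demo)
print(len(head), len(tail))
```

The SCF iterations in between are dropped, which is usually what makes the file too large to attach.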

singlebook

Feb 5, 2021, 4:54:47 AM
to cp2k
 *******************************************************************************
 *******************************************************************************
 **                                                                           **
 **     #####                         ##              ##                      **
 **    ##   ##            ##          ##              ##                      **
 **   ##     ##                       ##            ######                    **
 **   ##     ##  ##   ##  ##   #####  ##  ##   ####   ##    #####    #####    **
 **   ##     ##  ##   ##  ##  ##      ## ##   ##      ##   ##   ##  ##   ##   **
 **   ##  ## ##  ##   ##  ##  ##      ####     ###    ##   ######   ######    **
 **    ##  ###   ##   ##  ##  ##      ## ##      ##   ##   ##       ##        **
 **     #######   #####   ##   #####  ##  ##  ####    ##    #####   ##        **
 **           ##                                                    ##        **
 **                                                                           **
 **                                                ... make the atoms dance   **
 **                                                                           **
 **            Copyright (C) by CP2K developers group (2000 - 2020)           **
 **                      J. Chem. Phys. 152, 194103 (2020)                    **
 **                                                                           **
 *******************************************************************************


 TOTAL NUMBERS AND MAXIMUM NUMBERS

  Total number of            - Atomic kinds:                                   2
                             - Atoms:                                         64
                             - Shell sets:                                   128
                             - Shells:                                       320
                             - Primitive Cartesian functions:                320
                             - Cartesian basis functions:                    896
                             - Spherical basis functions:                    832

  Maximum angular momentum of- Orbital basis functions:                        2
                             - Local part of the GTH pseudopotential:          2
                             - Non-local part of the GTH pseudopotential:      2


 SCF PARAMETERS         Density guess:                                    ATOMIC
                        --------------------------------------------------------
                        max_scf:                                             300
                        max_scf_history:                                       0
                        max_diis:                                              4
                        --------------------------------------------------------
                        eps_scf:                                        1.00E-07
                        eps_scf_history:                                0.00E+00
                        eps_diis:                                       1.00E-01
                        eps_eigval:                                     1.00E-05
                        --------------------------------------------------------
                        level_shift [a.u.]:                                 0.00
                        --------------------------------------------------------
                        Mixing method:                            BROYDEN_MIXING
                                                charge density mixing in g-space
                        --------------------------------------------------------
                        No outer SCF

 PW_GRID| Information for grid number                                          1
 PW_GRID| Grid distributed over                                    48 processors
 PW_GRID| Real space group dimensions                                    48    1
 PW_GRID| the grid is blocked:                                                NO
 PW_GRID| Cutoff [a.u.]                                                    150.0
 PW_GRID| spherical cutoff:                                                   NO
 PW_GRID|   Bounds   1            -48      47                Points:          96
 PW_GRID|   Bounds   2            -48      47                Points:          96
 PW_GRID|   Bounds   3            -48      47                Points:          96
 PW_GRID| Volume element (a.u.^3)  0.5016E-02     Volume (a.u.^3)      4437.6722
 PW_GRID| Grid span                                                    FULLSPACE
 PW_GRID|   Distribution                         Average         Max         Min
 PW_GRID|   G-Vectors                            18432.0       18432       18432
 PW_GRID|   G-Rays                                 192.0         192         192
 PW_GRID|   Real Space Points                    18432.0       18432       18432

 PW_GRID| Information for grid number                                          2
 PW_GRID| Number of the reference grid                                         1
 PW_GRID| Grid distributed over                                    48 processors
 PW_GRID| Real space group dimensions                                    48    1
 PW_GRID| the grid is blocked:                                                NO
 PW_GRID| Cutoff [a.u.]                                                     50.0
 PW_GRID| spherical cutoff:                                                   NO
 PW_GRID|   Bounds   1            -27      26                Points:          54
 PW_GRID|   Bounds   2            -27      26                Points:          54
 PW_GRID|   Bounds   3            -27      26                Points:          54
 PW_GRID| Volume element (a.u.^3)  0.2818E-01     Volume (a.u.^3)      4437.6722
 PW_GRID| Grid span                                                    FULLSPACE
 PW_GRID|   Distribution                         Average         Max         Min
 PW_GRID|   G-Vectors                             3280.5        3402        3186
 PW_GRID|   G-Rays                                  60.8          63          59
 PW_GRID|   Real Space Points                     3280.5        5832        2916

 PW_GRID| Information for grid number                                          3
 PW_GRID| Number of the reference grid                                         1
 PW_GRID| Grid distributed over                                    48 processors
 PW_GRID| Real space group dimensions                                     6    8
 PW_GRID| the grid is blocked:                                                NO
 PW_GRID| Cutoff [a.u.]                                                     16.7
 PW_GRID| spherical cutoff:                                                   NO
 PW_GRID|   Bounds   1            -16      15                Points:          32
 PW_GRID|   Bounds   2            -16      15                Points:          32
 PW_GRID|   Bounds   3            -16      15                Points:          32
 PW_GRID| Volume element (a.u.^3)  0.1354         Volume (a.u.^3)      4437.6722
 PW_GRID| Grid span                                                    FULLSPACE
 PW_GRID|   Distribution                         Average         Max         Min
 PW_GRID|   G-Vectors                              682.7         704         640
 PW_GRID|   G-Rays                                  21.3          22          20
 PW_GRID|   Real Space Points                      682.7         768         640

 PW_GRID| Information for grid number                                          4
 PW_GRID| Number of the reference grid                                         1
 PW_GRID| Grid distributed over                                    48 processors
 PW_GRID| Real space group dimensions                                     6    8
 PW_GRID| the grid is blocked:                                                NO
 PW_GRID| Cutoff [a.u.]                                                      5.6
 PW_GRID| spherical cutoff:                                                   NO
 PW_GRID|   Bounds   1             -9       8                Points:          18
 PW_GRID|   Bounds   2             -9       8                Points:          18
 PW_GRID|   Bounds   3             -9       8                Points:          18
 PW_GRID| Volume element (a.u.^3)  0.7609         Volume (a.u.^3)      4437.6722
 PW_GRID| Grid span                                                    FULLSPACE
 PW_GRID|   Distribution                         Average         Max         Min
 PW_GRID|   G-Vectors                              121.5         144         108
 PW_GRID|   G-Rays                                   6.8           8           6
 PW_GRID|   Real Space Points                      121.5         162         108

 POISSON| Solver                                                        PERIODIC
 POISSON| Periodicity                                                        XYZ

 RS_GRID| Information for grid number                                          1
 RS_GRID|   Bounds   1            -48      47                Points:          96
 RS_GRID|   Bounds   2            -48      47                Points:          96
 RS_GRID|   Bounds   3            -48      47                Points:          96
 RS_GRID| Real space distribution over                                  6 groups
 RS_GRID| Real space distribution along direction                              2
 RS_GRID| Border size                                                         26
 RS_GRID| Real space distribution over                                  8 groups
 RS_GRID| Real space distribution along direction                              3
 RS_GRID| Border size                                                         26
 RS_GRID|   Distribution                         Average         Max         Min
 RS_GRID|   Planes                                  68.0          68          68
 RS_GRID|   Distribution                         Average         Max         Min
 RS_GRID|   Planes                                  64.0          64          64

 RS_GRID| Information for grid number                                          2
 RS_GRID|   Bounds   1            -27      26                Points:          54
 RS_GRID|   Bounds   2            -27      26                Points:          54
 RS_GRID|   Bounds   3            -27      26                Points:          54
 RS_GRID| Real space fully replicated
 RS_GRID| Group size                                                           1

 RS_GRID| Information for grid number                                          3
 RS_GRID|   Bounds   1            -16      15                Points:          32
 RS_GRID|   Bounds   2            -16      15                Points:          32
 RS_GRID|   Bounds   3            -16      15                Points:          32
 RS_GRID| Real space fully replicated
 RS_GRID| Group size                                                           1

 RS_GRID| Information for grid number                                          4
 RS_GRID|   Bounds   1             -9       8                Points:          18
 RS_GRID|   Bounds   2             -9       8                Points:          18
 RS_GRID|   Bounds   3             -9       8                Points:          18
 RS_GRID| Real space fully replicated
 RS_GRID| Group size                                                           1

 MD_PAR| Molecular dynamics protocol (MD input parameters)
 MD_PAR| Ensemble type                                                       NVT
 MD_PAR| Number of time steps                                              10000
 MD_PAR| Time step [fs]                                                 0.500000
 MD_PAR| Temperature [K]                                              300.000000
 MD_PAR| Temperature tolerance [K]                                      0.000000
 MD_PAR| Print MD information every                                   10 step(s)
 MD_PAR| File type   Print frequency [steps]                          File names
 MD_PAR| Coordinates         10                               SiC_AIMD-pos-1.xyz
 MD_PAR| Velocities          10                               SiC_AIMD-vel-1.xyz
 MD_PAR| Energies            10                                  SiC_AIMD-1.ener
 MD_PAR| Dump                20                               SiC_AIMD-1.restart

 ROT| Rotational analysis information
 ROT| Principal axes and moments of inertia [a.u.]
 ROT|                           1                   2                   3
 ROT| Eigenvalues      9.86893119935E+07   1.19427476747E+08   1.19427476747E+08
 ROT|      x              0.577350269190     -0.408248290464      0.707106781187
 ROT|      y              0.577350269190     -0.408248290464     -0.707106781187
 ROT|      z              0.577350269190      0.816496580928      0.000000000000
 ROT| Number of rotovibrational vectors                                        6

 DOF| Calculation of degrees of freedom
 DOF| Number of atoms                                                         64
 DOF| Number of intramolecular constraints                                     0
 DOF| Number of intermolecular constraints                                     0
 DOF| Invariants (translations + rotations)                                    3
 DOF| Degrees of freedom                                                     189

 DOF| Restraints information
 DOF| Number of intramolecular restraints                                      0
 DOF| Number of intermolecular restraints                                      0

 THERMOSTAT| Thermostat information for PARTICLES
 THERMOSTAT| Type of thermostat                               Nose-Hoover-Chains
 THERMOSTAT| Nose-Hoover-Chain length                                          3
 THERMOSTAT| Nose-Hoover-Chain time constant [fs]                    1000.000000
 THERMOSTAT| Order of Yoshida integrator                                       3
 THERMOSTAT| Number of multiple time steps                                     2
 THERMOSTAT| Initial potential energy                         0.000000000000E+00
 THERMOSTAT| Initial kinetic energy                           0.475022301493E-03
 THERMOSTAT| End of thermostat information for PARTICLES

 MD_VEL| Velocities initialization
 MD_VEL| Initial temperature [K]                                      300.000000
 MD_VEL| COM velocity             0.0000000000    -0.0000000000    -0.0000000000

 Number of electrons:                                                        256
 Number of occupied orbitals:                                                128
 Number of molecular orbitals:                                               128

 Number of orbital functions:                                                832
 Number of independent orbital functions:                                    832

 Extrapolation method: initial_guess



 -------------------------------------------------------------------------------
 -                                                                             -
 -                                DBCSR STATISTICS                             -
 -                                                                             -
 -------------------------------------------------------------------------------
 COUNTER                                    TOTAL       BLAS       SMM       ACC
 flops    13 x    32 x    13        7086601666560       0.0%    100.0%      0.0%
 flops    13 x    13 x    32        9891694059520       0.0%    100.0%      0.0%
 flops inhomo. stacks                           0       0.0%      0.0%      0.0%
 flops total                        16.978296E+12       0.0%    100.0%      0.0%
 flops max/rank                    732.153860E+09       0.0%    100.0%      0.0%
 matmuls inhomo. stacks                         0       0.0%      0.0%      0.0%
 matmuls total                         1569738880       0.0%    100.0%      0.0%
 number of processed stacks              28782912       0.0%    100.0%      0.0%
 average stack size                                     0.0      54.5       0.0
 marketing flops                    26.595494E+12
 -------------------------------------------------------------------------------
 # multiplications                         149911
 max memory usage/rank             153.088000E+06
 # max total images/rank                        3
 # max 3D layers                                1
 # MPI messages exchanged               143914560
 MPI messages size (bytes):
  total size                         3.855411E+12
  min size                           0.000000E+00
  max size                         137.904000E+03
  average size                      26.789580E+03
 MPI breakdown and total messages size (bytes):
             size <=      128            81866560                        0
       128 < size <=     8192                   0                        0
      8192 < size <=    32768            21587184             383158124544
     32768 < size <=   131072            36941696            2980859518208
    131072 < size <=  4194304             3519120             485300724480
   4194304 < size <= 16777216                   0                        0
  16777216 < size                               0                        0
 -------------------------------------------------------------------------------

 *** WARNING in dbcsr_mm.F:294 :: Using a non-square number of MPI ranks ***
 *** might lead to poor performance. Used ranks: 48 Suggested: 49 100    ***

 -------------------------------------------------------------------------------
 -                                                                             -
 -                      DBCSR MESSAGE PASSING PERFORMANCE                      -
 -                                                                             -
 -------------------------------------------------------------------------------
 ROUTINE             CALLS      AVE VOLUME [Bytes]
 MP_Bcast                3                     12.
 MP_Allreduce       869441                      8.
 MP_Alltoall       3098138                  32851.
 MP_ISend          7195728                  12717.
 MP_IRecv          7195728                  11224.
 -------------------------------------------------------------------------------

 -------------------------------------------------------------------------------
 -                                                                             -
 -                                GRID STATISTICS                              -
 -                                                                             -
 -------------------------------------------------------------------------------
 LP    KERNEL             BACKEND                              COUNT     PERCENT
 2     collocate ortho    REF                             9708713949      36.60%
 4     integrate ortho    REF                              529879041       2.00%
 4     collocate ortho    REF                              221635148       0.84%
 2     integrate ortho    REF                             8736976861      32.94%
 0     collocate general  REF                               30723072       0.12%
 1     integrate general  REF                               30723072       0.12%
 5     integrate ortho    REF                               22183061       0.08%
 3     integrate ortho    REF                             3942635281      14.86%
 3     collocate ortho    REF                             3301325147      12.45%
 -------------------------------------------------------------------------------

 MEMORY| Estimated peak process memory [MiB]                                 146

 -------------------------------------------------------------------------------
 ----                             MULTIGRID INFO                            ----
 -------------------------------------------------------------------------------
 count for grid        1:      110066116          cutoff [a.u.]          150.00
 count for grid        2:      519820015          cutoff [a.u.]           50.00
 count for grid        3:      459986613          cutoff [a.u.]           16.67
 count for grid        4:      235051958          cutoff [a.u.]            5.56
 total gridlevel count  :     1324924702

 -------------------------------------------------------------------------------
 -                                                                             -
 -                         MESSAGE PASSING PERFORMANCE                         -
 -                                                                             -
 -------------------------------------------------------------------------------

 ROUTINE             CALLS      AVE VOLUME [Bytes]
 MP_Group                4
 MP_Bcast           203792                   2218.
 MP_Allreduce      1459647                    265.
 MP_Sync                 4
 MP_Alltoall       1818671                 396307.
 MP_ISendRecv     28177722                  18032.
 MP_Wait          42247738
 MP_ISend         12750952                  57626.
 MP_IRecv         12750952                  57626.
 -------------------------------------------------------------------------------


 -------------------------------------------------------------------------------
 -                                                                             -
 -                                T I M I N G                                  -
 -                                                                             -
 -------------------------------------------------------------------------------
 SUBROUTINE                       CALLS  ASD         SELF TIME        TOTAL TIME
                                MAXIMUM       AVERAGE  MAXIMUM  AVERAGE  MAXIMUM
 CP2K                                 1  1.0     0.01     0.01 66822.69 66823.04
 qs_mol_dyn_low                       1  2.0     0.34     0.37 66822.51 66822.86
 velocity_verlet                  10000  3.0     1.48     5.04 66810.62 66811.08
 qs_forces                        10001  4.0     0.98     1.02 66806.91 66807.26
 qs_energies                      10001  5.0     0.88     1.24 59685.56 59686.71
 scf_env_do_scf                   10001  6.0     0.94     1.73 54615.83 54617.31
 scf_env_do_scf_inner_loop        89920  7.0     4.83    26.14 54614.78 54616.21
 rebuild_ks_matrix                99921  8.7     0.40     0.46 25783.42 25795.09
 qs_ks_build_kohn_sham_matrix     99921  9.7    13.65    14.24 25783.02 25794.65
 qs_rho_update_rho                99921  8.1     0.53     0.65 25411.34 25412.68
 calculate_rho_elec               99921  9.1    10.26    10.68 25410.81 25412.19
 sum_up_and_integrate             99921 10.7    10.04    11.19 24320.21 24334.14
 integrate_v_rspace               99921 11.7     3.82     4.21 24309.99 24324.85
 qs_ks_update_qs_env              89920  8.0     0.78     0.91 22462.31 22473.54
 grid_collocate_task_list         99921 10.1 18451.53 18769.98 18451.53 18769.98
 grid_integrate_task_list         99921 12.7 16303.94 16394.84 16303.94 16394.84
 rs_pw_transfer                  819370 12.3    15.23    17.78 11655.48 12071.19
 qs_scf_new_mos                   89920  8.0     1.71     1.94  8270.35  8321.12
 eigensolver                      89920  9.0     5.28     7.69  7862.09  7870.32
 density_rs2pw                    99921 10.1     6.01     6.82  6836.50  7045.41
 cp_fm_diag_elpa                  89920 10.0     0.64     0.79  6757.80  6804.53
 cp_fm_diag_elpa_base             89920 11.0  6676.81  6729.03  6756.91  6803.67
 mp_waitany                     ******* 14.1  5758.84  6457.62  5758.84  6457.62
 potential_pw2rs                  99921 12.7     6.04     6.56  5839.37  5848.24
 rs_pw_transfer_RS2PW_150        109922 11.9  1068.20  1206.58  5210.54  5627.18
 rs_pw_transfer_PW2RS_150        109922 14.3  1943.71  2063.73  4455.92  4497.89
 build_core_hamiltonian_matrix_   10001  5.0     0.39     0.44  2865.88  3438.38
 qs_ks_update_qs_env_forces       10001  5.0     0.05     0.06  3365.19  3366.37
 init_scf_run                     10001  6.0     0.61     0.93  3252.05  3253.43
 scf_env_initial_rho_setup        10001  7.0     0.24     1.03  3175.29  3176.49
 wfi_extrapolate                  10001  8.0     0.91     1.00  3104.21  3104.23
 pw_transfer                    1288972 11.8    67.54    70.98  2676.70  2707.44
 fft_wrap_pw1pw2                1089130 12.8    10.61    11.18  2555.45  2585.55
 mp_alltoall_d11v               1529045 12.0  2279.64  2399.42  2279.64  2399.42
 fft_wrap_pw1pw2_150             489604 13.2   220.23   228.66  2227.19  2283.38
 rs_gather_matrices               99921 12.7    10.55    14.72  2150.73  2276.05
 build_core_ppnl_forces           10001  6.0  1724.02  2032.14  1724.02  2032.14
 fft3d_ps                       1089130 14.8   824.46   858.66  1971.84  1994.23
 mp_sum_d                        869728 10.8  1050.61  1821.39  1050.61  1821.39
 qs_energies_init_hamiltonians    10001  6.0     0.17     0.19  1767.07  1767.08
 mp_waitall_1                   ******* 14.6  1405.52  1749.18  1405.52  1749.18
 calculate_ecore_overlap          20002  6.0     0.24     0.35   885.01  1685.36
 -------------------------------------------------------------------------------

 The number of warnings for this run is : 1

Alfio Lazzaro

Feb 5, 2021, 7:50:59 AM2/5/21
to cp2k
OK, thanks for the timers.
I assume you sent me the CPU timers.
As suspected, the run is massively dominated by the non-GPU parts. I cannot even see any COSMA routines.
These are the main parts where the time goes (maximum self time in seconds, with the total run time on the last line):

fft_wrap_pw1pw2_150              228.660
fft3d_ps                         858.660
rs_pw_transfer_RS2PW_150        1206.580
mp_waitall_1                    1749.180
mp_sum_d                        1821.390
build_core_ppnl_forces          2032.140
rs_pw_transfer_PW2RS_150        2063.730
mp_alltoall_d11v                2399.420
mp_waitany                      6457.620
cp_fm_diag_elpa_base            6729.030
grid_integrate_task_list       16394.840
grid_collocate_task_list       18769.980
CP2K_Total                     66823.040

More than half of the total time (66823.040 s) is spent in the grid_* functions. BTW, for this kind of test, I suggest using fewer steps.
I suspect you are hitting the CPU and GPU performance problem reported for CP2K 8.1 (see https://github.com/cp2k/cp2k/issues/1323 ).
I suggest trying CP2K 7.1.
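
As a rough check of the "more than half" figure, here is a minimal Python sketch. It assumes the numbers in the list above are the maximum self times in seconds copied from the T I M I N G table; the variable names are only illustrative and are not part of CP2K.

```python
# Maximum self times (seconds), copied from the list in this message.
self_time_max = {
    "fft_wrap_pw1pw2_150": 228.660,
    "fft3d_ps": 858.660,
    "rs_pw_transfer_RS2PW_150": 1206.580,
    "mp_waitall_1": 1749.180,
    "mp_sum_d": 1821.390,
    "build_core_ppnl_forces": 2032.140,
    "rs_pw_transfer_PW2RS_150": 2063.730,
    "mp_alltoall_d11v": 2399.420,
    "mp_waitany": 6457.620,
    "cp_fm_diag_elpa_base": 6729.030,
    "grid_integrate_task_list": 16394.840,
    "grid_collocate_task_list": 18769.980,
}
total_time = 66823.040  # CP2K_Total, maximum total time in seconds

# Sum the two grid_* kernels and express them as a fraction of the run.
grid_time = sum(t for name, t in self_time_max.items() if name.startswith("grid_"))
grid_share = grid_time / total_time
print(f"grid_* routines: {grid_time:.2f} s ({grid_share:.1%} of total)")

# Per-routine share of the total, largest first.
for name, t in sorted(self_time_max.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<28s} {t:10.2f} s  {t / total_time:6.1%}")
```

The same kind of breakdown can be made for any other run by replacing the dictionary with the values from its own timing report.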

Alfio

Wei Chen

Feb 5, 2021, 7:25:10 PM2/5/21
to cp2k
Thank you very much for your reply.

Best wishes,

Wei