Re: TB2J: Computational cost


Xu He

Sep 28, 2021, 5:15:20 AM
to Dongzhe Li, TB2J

Dear Dongzhe,

It is possible to use process-level parallelization within one compute node, with the --np option to specify the number of processes used.
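For example, combining the --np flag with the BccFe invocation shown later in this thread (the value 4 here is just an illustration; choose it to match your node):

```
siesta2J.py --fdf_fname RUNSI.fdf --elements Fe --kmesh 9 9 9 --nz 100 --emax -0.005 --np 4
```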

With the Siesta basis set it is a bit more computationally expensive, as there are often several basis functions per orbital (e.g. DZP or larger). I am working on several ways to optimize that.

For a single-atom structure, 21 minutes certainly looks long, but don't be put off by that. For larger systems the number of k-points can be reduced, so the total time is usually still acceptable. I have tried structures of about 100 atoms, and it is possible to finish within one day on about 20 CPU cores.

There is still a lot of room for efficiency optimization; for example, using symmetry would reduce the computational time, which we are considering implementing in a future version.

Best regards,

HeXu



On 9/28/21 09:54, Dongzhe Li wrote:

Dear Dr. He,

I am Dongzhe Li, currently working as a CNRS researcher in Toulouse.

I have a question concerning the computational cost of extracting J and D using TB2J. I tested with the Fe example, starting from the Siesta Hamiltonian.

TB2J/examples/Siesta/BccFe/

siesta2J.py --fdf_fname RUNSI.fdf --elements Fe --kmesh 9 9 9  --nz 100 --emax -0.005

It costs me about 21 minutes of walltime.
 
Any suggestions to accelerate the calculations? MPI or threads? Thanks.
 
Best regards,
 
========================
Dr. Dongzhe Li
 
CNRS Research Scientist
CEMES-CNRS
29 Rue Jeanne Marvig
31055 Toulouse, France
========================

Dongzhe Li

Sep 28, 2021, 8:31:00 AM
to Xu He, TB2J

Dear Dr. He,

Thanks for your email.

I just tried running with MPI (still Fe example):

export OMP_NUM_THREADS=1

srun -n 36 siesta2J.py    ### 27 mins wall time, single node

srun -n 1 siesta2J.py     ### 21 mins wall time

It seems running more processes makes the calculation more time-consuming... Am I doing something wrong?

Best regards,

========================
Dr. Dongzhe Li
 
CNRS Research Scientist
CEMES-CNRS
29 Rue Jeanne Marvig
31055 Toulouse, France
========================

Xu He

Sep 28, 2021, 10:27:53 AM
to Dongzhe Li, TB2J

Hi,

Note that TB2J does not use MPI; it uses Python multiprocessing. When you run with srun -n 36, I am not sure whether that is equivalent to ntasks=1 with cpus-per-task=36, or to ntasks=36 with cpus-per-task=1; that depends on how the cluster is configured. It is probably necessary to request the CPU cores on one node explicitly.

Another possibility is that memory is the bottleneck. Could you try the --use_cache option?

Here is an example slurm script:
#SBATCH --ntasks=36
#SBATCH --cpus-per-task=1
#SBATCH --tasks-per-node=36
siesta2J.py  ....

BTW, the parallelization is over the number of pools (--nz; in your case, 100). So it is optimal to use a number of processes that divides that evenly, e.g. 20 or 25.
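To see why a process count that divides --nz matters, here is a minimal sketch of pool-based parallelization with Python's multiprocessing (the function and variable names are illustrative, not TB2J's internals; the per-pool "integration" is a stand-in so the example is self-contained):

```python
from multiprocessing import Pool

def integrate_pool(energies):
    # Stand-in for the per-pool contour integration TB2J performs;
    # here we just sum the points so the example runs on its own.
    return sum(energies)

def split_pools(nz, n_workers):
    """Split nz energy points into contiguous pools, one batch per worker."""
    points = list(range(nz))
    size = -(-nz // n_workers)  # ceiling division
    return [points[i:i + size] for i in range(0, nz, size)]

if __name__ == "__main__":
    nz, n_workers = 100, 25     # 100 points split evenly over 25 workers
    pools = split_pools(nz, n_workers)
    with Pool(n_workers) as p:
        partial = p.map(integrate_pool, pools)
    total = sum(partial)
    print(len(pools), total)    # 25 pools cover all 100 points
```

With n_workers = 25, every worker gets exactly 4 points; with, say, 36 workers, some sit idle while others carry a full batch, so adding processes beyond a divisor of nz gains little.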

Best regards,

HeXu
