Dear Dongzhe,
It is possible to parallelize over processes within one
compute node, using the --np option to specify the number of
processes.
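As an illustration (reusing the BccFe example command quoted later in this thread; the value 8 is arbitrary and should match the cores available on your node):

```shell
# Sketch: run the BccFe example with 8 worker processes via TB2J's --np.
# The flag value (8) is illustrative; match it to the cores on your node.
siesta2J.py --fdf_fname RUNSI.fdf --elements Fe \
    --kmesh 9 9 9 --nz 100 --emax -0.005 --np 8
```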
With Siesta the calculation is somewhat computationally expensive, as there are often several basis functions per orbital (e.g. DZP or larger). I am working on several ways to optimize that.
For a single-atom structure, 21 minutes certainly looks long.
But don't be scared by that: for larger systems the number of
k-points can be reduced, so the total time usually remains
acceptable. I have tried structures of about 100 atoms, and they
can finish within one day on about 20 CPU cores.
There is still a lot of room for efficiency improvements; for
example, exploiting symmetry would reduce the computational time,
and we are considering implementing that in a future version.
Best regards,
HeXu
Dear Dr. He,
I am Dongzhe Li, currently working as a CNRS researcher in Toulouse.
I have a question concerning the computational cost of extracting J and D with TB2J. I tested the Fe example, starting from the Siesta Hamiltonian:
TB2J/examples/Siesta/BccFe/
siesta2J.py --fdf_fname RUNSI.fdf --elements Fe --kmesh 9 9 9 --nz 100 --emax -0.005
It costs me about 21 minutes of wall time. Any suggestions to accelerate the calculation? MPI or threads? Thanks.
Best regards,
========================
Dr. Dongzhe Li
CNRS Research Scientist
CEMES-CNRS
29 Rue Jeanne Marvig
31055 Toulouse, France
Email: dongz...@cemes.fr
========================
Dear Dr. He,
Thanks for your email.
I just tried running with MPI (still Fe example):
export OMP_NUM_THREADS=1
srun -n 36 siesta2J.py ### 27 mins wall time, single node
srun -n 1 siesta2J.py ### 21 mins wall time
It seems the --np option makes the calculation more time-consuming... Am I doing something wrong?
Best regards,
Hi,
Note that TB2J does not use MPI; it uses Python multiprocessing.
Perhaps when you run with srun -n 36 it is not actually a
single-node run; I think that depends on the scheduling strategy
of the cluster being used. It is probably necessary to request
the CPU cores on one node explicitly. With srun -n 36, I am not
sure whether it is equivalent to ntasks=1 with cpus-per-task=36,
or ntasks=36 with cpus-per-task=1.
Another possibility is that memory is really the bottleneck.
Could you try the --use_cache option?
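For instance (reusing the example command from this thread; whether this helps depends on your memory situation):

```shell
# Sketch: the same BccFe run with --use_cache enabled,
# in case memory pressure is what is slowing things down.
siesta2J.py --fdf_fname RUNSI.fdf --elements Fe \
    --kmesh 9 9 9 --nz 100 --emax -0.005 --use_cache
```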
Here is an example Slurm script:
#!/bin/bash
#SBATCH --ntasks=36
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=36
siesta2J.py ....
By the way, the parallelization is over the number of pools (--nz, in your case 100), so it is optimal to use a process count that divides it evenly, e.g. 20 or 25.
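To make the "divides evenly" point concrete, here is a small hypothetical shell helper (best_np is my own name, not part of TB2J) that picks the largest divisor of --nz not exceeding your core count:

```shell
# Hypothetical helper (not part of TB2J): choose the largest divisor of
# nz that does not exceed ncores, so the --nz pools split evenly over --np.
best_np() {
  local nz=$1 ncores=$2 np=1 d
  for ((d = 1; d <= ncores; d++)); do
    if (( nz % d == 0 )); then
      np=$d
    fi
  done
  echo "$np"
}

best_np 100 36   # prints 25: divisors of 100 up to 36 are 1 2 4 5 10 20 25
```

With 36 cores and --nz 100 this suggests --np 25, matching the 20-or-25 recommendation above.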
Best regards,
HeXu