DFT is very slow

Han Yoon

Aug 31, 2020, 3:47:58 PM

to NWChem Forum

I started testing NWChem last week and I'm trying to figure out why my test case is unreasonably slow. I know it often doesn't make sense to compare performance across different software packages, but NWChem is almost 20 times slower than Jaguar for the simple system I'm running.

Can someone look at my input and see whether I set something up too rigorously? If not, I guess my compilation with OpenMPI and MKL was not correct (though I see full usage of the CPUs).

The job takes only ~7 minutes in Jaguar on one CPU, versus ~2 hours in NWChem.

echo
memory total 4000 mb
start product_0_0

geometry units angstrom autosym
  C     4.59795   0.69471 1.53822
  S     4.53829   0.02936   -0.14424
  C     2.90009   -0.77609   -0.09191
  C     1.77282   0.23839   -0.23201
  C     0.41576   -0.44265   -0.21733
  O     0.29066   -1.64469   -0.01333
  N   -0.60364   0.45485   -0.43790
  C   -1.97644   0.21442   -0.49053
  C   -2.56139   -1.03231   -0.32116
  N   -2.70014   1.33080   -0.72665
  C   -3.94973   -1.13846   -0.39786
  C   -4.04388   1.19943   -0.79648
  C   -4.70952   -0.00248   -0.64022
  H     5.57350   1.16024 1.70265
  H     4.46983   -0.10606 2.27107
  H     3.82713   1.45529 1.68225
  H     2.80999   -1.35579 0.83316
  H     2.86182   -1.49117   -0.92128
  H     1.78959   0.95880 0.59358
  H     1.87061   0.78691   -1.17690
  H   -0.38709   1.43027   -0.59386
  H   -1.97749   -1.92487   -0.13277
  H   -4.42903   -2.10496   -0.26897
  H   -4.58934   2.11911   -0.98832
  H   -5.79038   -0.05304   -0.70587
end

basis
* library 6-31G*
end
driver
maxiter 300
end

dft
xc b3lyp
iterations 300
direct
disp vdw 3
convergence energy 5e-5
convergence density 5e-6
grid xcoarse
end
task dft optimize

Edoardo Aprà

Aug 31, 2020, 5:11:29 PM

to nwchem...@googlegroups.com

The sloppy computational parameters in the posted input file make the calculation slower, since the geometry optimization then needs more steps to converge.

I am attaching the modified input file and the output I got running on five of the six cores of a 2.6 GHz six-core Intel i7 in a 2019 MacBook Pro. I used the binary distributed by Homebrew (https://brew.sh/).

Wall time for the calculation was 19 minutes.
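
The attached b3.nw is not reproduced in this thread, so the following is only a rough sketch of what tightening the parameters could look like. It assumes the "sloppy" settings are the `grid xcoarse` line and the loosened `convergence energy` threshold in the original dft block. A coarse grid and a loose SCF threshold tend to give noisier gradients, which is one plausible reason the optimizer needs more steps. Everything else is copied from the posted input:

```text
dft
  xc b3lyp
  iterations 300
  direct
  disp vdw 3
  # "convergence energy 5e-5" removed: assumed to fall back to the default (1e-6)
  # "grid xcoarse" removed: assumed to fall back to the default grid (medium)
  convergence density 5e-6
end
```

The follow-up questions in this thread indicate that the actual modified file also used quickguess, `clear` in the driver block, and semidirect; those are not shown here.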

b3.nw

b3.out.brew

Han Yoon

Aug 31, 2020, 7:29:05 PM

to NWChem Forum

Thank you. I see it is much faster now! Could you please explain a bit more about why these changes improve the run time?

What is quickguess? I couldn't find the keyword in the documentation.

And what is the difference between setting semidirect with filesize 0 and just using direct? According to the documentation, direct does not use disk, does it?

Which of the parts you added made it faster? Was it quickguess, 'clear' in the driver block, or semidirect?

Thank you!

Edoardo Aprà

Aug 31, 2020, 7:37:36 PM

to NWChem Forum

quickguess did not save more than a few seconds. I used it because it gets you to the first energy quicker and I was testing various options.

Semidirect with filesize zero and a nonzero memsize uses memory to cache integrals. More details are in the documentation at:

https://nwchemgit.github.io/Hartree-Fock-Theory-for-Molecules.html#direct-and-semidirect-recomputation-of-integrals

https://nwchemgit.github.io/Density-Functional-Theory-for-Molecules.html#direct-semidirect-and-noio-hardware-resource-control
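
As a minimal sketch of the semidirect setting described above (not the contents of the attached b3.nw), the dft block could request in-memory integral caching with no disk file as shown below. The `memsize` value is an assumption chosen for illustration; per the linked documentation it is given in 64-bit words per process, so it has to fit within the job's `memory` setting.

```text
dft
  xc b3lyp
  # filesize 0: write no integrals to disk
  # memsize 50000000: cache up to 50,000,000 64-bit words (about 400 MB)
  # per process in memory; this value is an assumed example
  semidirect filesize 0 memsize 50000000
end
```

By contrast, plain `direct` recomputes the integrals whenever they are needed and caches nothing, which is why the two settings can differ in speed even though neither touches the disk.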

I am not sure what makes my run faster than yours, since I am not using the NWChem binary you have been using.

You can try to replicate my run with the binary you compiled and the input I posted, to see whether there are any performance issues in your installation.

(OMP_NUM_THREADS with threaded MKL might be an issue, as described in https://nwchemgit.github.io/Special_AWCforum/sp/id10825.html.)

jeff.science

Sep 1, 2020, 6:24:06 PM

to NWChem Forum

At least for MKL, we could add `call mkl_set_num_threads(1)` to the top of main and then reset it with `call mkl_set_num_threads(omp_get_max_threads())` in the modules that have some hope of benefitting from threaded MKL, e.g. MP2 and CC. Linking sequential MKL is probably the right answer most of the time, but that's a more rigid choice and forces CC users to relink if they don't know what's going on.
