Performance Issue on GPU

Narsimha Reddy

Nov 12, 2014, 10:24:28 PM

to cp...@googlegroups.com

Hi all,

I have compiled CP2K both with and without GPU support and started running the sample benchmarks.

However, I have observed that the GPU build takes more time than the CPU build. Could anyone suggest how to get over this hurdle?

(I am using an Nvidia Tesla K20c with CUDA compilation tools release 6.0, V6.0.1, kernel module 331.75, and CentOS 6.4.)

Thanks,

Narsimha.

Samuel Andermatt

Nov 13, 2014, 4:14:08 AM

to cp...@googlegroups.com

Could you post your arch file, your input file and your output (especially the timing report at the end)? It would also be interesting to know what CPU you have.

Samuel Andermatt

Nov 17, 2014, 4:27:54 AM

Ok, thanks for emailing me the required data. There are a number of issues.

First, only matrix multiplications and FFTs can currently be accelerated by GPUs. Looking at the timing section, your calculation is dominated by parts that run on the CPU:
Total time:        CP2K                  609.771
Main bottlenecks:  integrate_v_rspace    268.894
                   calculate_rho_elec    205.875

This is normal for smaller calculations; GPUs become more useful for systems with 1000+ atoms.
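
To put the timings above in perspective, here is a rough back-of-the-envelope check in Python. It assumes that the two listed routines do not overlap in the timing report and that neither benefits from the GPU; both are assumptions, not something the report itself states. Under those assumptions, Amdahl's law gives an upper bound on what GPU acceleration of everything else could achieve.

```python
# Timings in seconds, copied from the timing report quoted above.
total = 609.771
cpu_only = {
    "integrate_v_rspace": 268.894,
    "calculate_rho_elec": 205.875,
}

# Assumption: these two routines are non-overlapping and stay on the CPU.
cpu_fraction = sum(cpu_only.values()) / total

# Amdahl's law: even if the rest of the run took zero time,
# the CPU-only part would still have to be executed.
max_speedup = 1.0 / cpu_fraction

print(f"CPU-only share of the run: {cpu_fraction:.1%}")
print(f"Upper bound on overall speedup: {max_speedup:.2f}x")
```

With roughly three quarters of the run in CPU-only routines, even a perfect GPU port of the remainder could only give a modest overall gain for this input.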

The second problem is that only a small part (12.4%) of your multiplication stacks is processed on the GPU:

COUNTER                                      CPU                  GPU      GPU%
number of processed stacks                179436                25344      12.4
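
The GPU% figure looks like the GPU stack count divided by the total number of stacks. A quick check in Python, assuming that definition (it is inferred from the numbers, not taken from the CP2K documentation):

```python
# Stack counts copied from the statistics table above.
cpu_stacks = 179436
gpu_stacks = 25344

# Assumption: GPU% = GPU stacks / (CPU stacks + GPU stacks).
gpu_percent = 100.0 * gpu_stacks / (cpu_stacks + gpu_stacks)

print(f"GPU share of processed stacks: {gpu_percent:.1f}%")
```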

This is because there are no kernels for the block sizes of your basis set. You will have to add them manually:

Open: src/dbcsr/libsmm_acc/libcusmm/generate.py

There is a section with triples at the top of the file. Add to it:
triples += combinations(7,9,16,22)

Best

Samuel

P.S.: The main parameter that determines the speed of the calculations you want to do is the CUTOFF parameter in CP2K_INPUT/FORCE_EVAL/DFT/MGRID.

Narsimha Reddy

Nov 17, 2014, 8:56:12 PM

to cp...@googlegroups.com, Andermatt Samuel

Dear Samuel,

Thank you for the analysis and reply.

First point:

I understand that my system has fewer than 1000 atoms, and that to see a performance gain I need to increase it to more than 1000 atoms.

Second point:

I found a generate.py file at src/dbcsr_lib/cuda/libcusmm/generate.py. Is that the file you are referring to?

I have the same file in both source trees, the one compiled with GPU support and the one compiled without. Can you tell me which one I have to modify?

What are these triples, and how did you get those values? Could you please clarify this?

Thanks,

Narsimha.

Samuel Andermatt

Nov 18, 2014, 2:22:31 AM

to cp...@googlegroups.com, samuel.a...@mat.ethz.ch

Yes, that is probably the file (I am using the development version, so my path was slightly different). You have to modify the file in the source tree built with GPU support and then recompile it. These triples generate the code that multiplies matrix blocks of given sizes with each other. The sizes of the blocks are determined by the basis set.
triples += combinations(16,22) means that the elements corresponding to the interaction between atoms with 16 and 22 basis functions (and 16 with 16 and 22 with 22) are ported to the GPU.
The GPU is especially efficient with somewhat larger blocks (higher-quality basis sets).
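
As an illustration of the idea, here is a minimal Python sketch of what such a helper might do. The name combinations_sketch is hypothetical, and the assumption that each triple is an (m, n, k) block-size combination drawn from the given sizes is mine; check the real combinations function in your copy of generate.py before relying on this.

```python
from itertools import product


def combinations_sketch(*sizes):
    """Hypothetical stand-in for the helper in generate.py.

    Assumption: one kernel is requested for every (m, n, k) triple
    that can be formed from the given block sizes.
    """
    return list(product(sizes, repeat=3))


triples = []
triples += combinations_sketch(16, 22)

for m, n, k in triples:
    print(f"kernel for blocks {m}x{k} times {k}x{n}")
```

Under this assumption, two block sizes give 2^3 = 8 triples and the four sizes 7, 9, 16, 22 give 4^3 = 64, which is why the list of kernels grows quickly as more basis-set sizes are added.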

Best

Samuel

Michael Banck

Nov 18, 2014, 4:53:45 AM

to cp...@googlegroups.com

On Mon, Nov 17, 2014 at 11:22:31PM -0800, Samuel Andermatt wrote:
> triples += combinations(16,22) means that the elements that correspond to
> the interaction between atoms with 16 and 22 basis functions (and 16 with
> 16 and 22 with 22) are ported to the GPU.
> The GPU is especially efficient with somewhat larger blocks (higher quality
> basis set).

In the long run, this would be really awesome to be generated on demand
at runtime :-/

Michael

Ole Schütt

Nov 18, 2014, 4:59:37 AM

to cp...@googlegroups.com, mba...@debian.org

> In the long run, this would be really awesome to be generated on demand at runtime :-/

Yes, that's why we are actually already working on it :-)

-Ole
