Status of matrix-free CUDA support

Stephen DeWitt

unread,

Oct 29, 2018, 10:43:48 AM10/29/18

to deal.II User Group

Hello all,

I'm interested to know what the status is for using CUDA with matrix free calculations. We have a PRISMS-PF user who is interested in GPU calculations, and I'd like to get a better idea of what would be involved in adding CUDA support on our end.

So far I've read through the "CUDA Support" issue, the "Roadmap for inclusion of GPU implementation of matrix free in Deal.II" issue, the Doxygen documentation for classes in the CUDAWrappers namespace, and the manuscript by Karl Ljungkvist. Are there any other pages I should be looking at?

My understanding from these pages is that deal.II has partial support for using CUDA with matrix free calculations. Currently, calculations can be done with scalar variables (but not vector variables) and adaptive meshes.

A few (somewhat inter-related) questions:

1). Do all of the tools exist to create a GPU version of step-48? Has anyone done so?

2). What exactly would be involved in creating a GPU version of step-48? Is it just changing the CPU Vector, MatrixFree, and FEEvaluation classes to their GPU counterparts, plus packaging some data (plus local apply functions?) into a CUDAWrappers::MatrixFree< dim, Number >::Data struct?

3). Most of the discussions seemed to revolve around linear solves. For something like step-48 with explicit updates, will the current paradigm work well? Or would that require shuttling data between the GPU and CPU every time step, causing too much overhead? (I know that in general GPUs can work very well for explicit codes.)

Thanks!

Steve

Stephen DeWitt

unread,

Nov 5, 2018, 10:39:59 AM11/5/18

to deal.II User Group

I thought that I'd re-up this thread, since it didn't get any responses.

I apologize if the questions are a bit broad. Typically, I spend a bit of time trying to get some code working myself before posting a question here, but I don't currently have a machine with a NVIDIA GPU. If the CUDA support for deal.II is at the point where our code could take advantage of it, I'd be able to justify the investment to get access to a compatible machine.

Thanks!

Steve

Bruno Turcksin

unread,

Nov 5, 2018, 3:06:41 PM11/5/18

to deal.II User Group

Steve

On Monday, October 29, 2018 at 10:43:48 AM UTC-4, Stephen DeWitt wrote:

So far I've read through the "CUDA Support" issue, the "Roadmap for inclusion of GPU implementation of matrix free in Deal.II" issue, the Doxygen documentation for classes in the CUDAWrappers namespace, and the manuscript by Karl Ljungkvist. Are there any other pages I should be looking at?

No that's pretty much it.

My understanding from these pages is that deal.II has partial support for using CUDA with matrix free calculations. Currently, calculations can be done with scalar variables (but not vector variables) and adaptive meshes.

That's correct with the weird exception that you cannot impose constraints (and thus have hanging nodes) if dim equals 2 and fe_degree is even. There is a bug in that case but we don't know why.

A few (somewhat inter-related) questions:
1). Do all of the tools exist to create a GPU version of step-48? Has anyone done so?

I think so but nobody as tried so I am not 100% sure. Note that we don't have MPI support yet.

2). What exactly would be involved in creating a GPU version of step-48? Is it just changing the CPU Vector, MatrixFree, and FEEvaluation classes to their GPU counterparts, plus packaging some data (plus local apply functions?) into a CUDAWrappers::MatrixFree< dim, Number >::Data struct?

Yes, pretty much. You can take a look at the matrix_free_matrix_vector tests here. These tests are doing a matrix-vector multiplication using an Helmholtz operator. As you can see here the main difference between the GPU and the CPU code is that the body loop over the quadrature points needs to be encapsulated in a functor when using GPU. The tests are based on the CPU tests matrix_vector here. So you should have a good idea of the modifications required to use the GPU code.

3). Most of the discussions seemed to revolve around linear solves. For something like step-48 with explicit updates, will the current paradigm work well? Or would that require shuttling data between the GPU and CPU every time step, causing too much overhead? (I know that in general GPUs can work very well for explicit codes.)

That should work fine. Most people are interested in using matrix-free with a solver but I don't see a reason why it would be slow in step 48. The communication between the CPU and the GPU should be limited in step-48.

Best,

Bruno

Martin Kronbichler

unread,

Nov 5, 2018, 3:18:25 PM11/5/18

to dea...@googlegroups.com

Steve,

Just to add on what Bruno said: Explicit time integration typically requires only a subset of what you need for running an iterative solver (no decision making like in convergence test of conjugate gradient, no preconditioner), so running that should be pretty straight-forward.

Best,
Martin

--
The deal.II project is located at http://www.dealii.org/
For mailing list/forum options, see https://groups.google.com/d/forum/dealii?hl=en
---
You received this message because you are subscribed to the Google Groups "deal.II User Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dealii+un...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Stephen DeWitt

unread,

Nov 6, 2018, 11:39:29 AM11/6/18

to deal.II User Group

Bruno and Martin,

Thank you for your responses. I'll try to get a CUDA version of step-48 up and running.