[sundials-users] Question about using CUDA with CVODE

Jonathan Mackey

Jul 22, 2024, 11:58:34 AM

to SUNDIAL...@listserv.llnl.gov

Hello,

I have been using SUNDIALS/CVODE for quite a few years now, solving
chemical kinetics as part of an astrophysical fluid-dynamics code. We
implemented hybrid OpenMP/MPI parallel execution where each OMP thread
gets a series of work-packages comprising either a 1D column of cells or
a single cell to integrate for the chemical kinetics. The userdata is
reset and the solver re-initialised for each cell at each timestep. It
works very well for up to ~20-40 OMP threads per MPI process, with a
reaction network of up to ~100 species. We are using the dense linear
solver of CVODE, with the BDF (backward differentiation formula) method.
It is difficult to implement an exact Jacobian with heating/cooling
terms, charge-exchange and photoionization reaction rates, so we have
been using the solver without a user-supplied Jacobian.
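
As a rough illustration of what running without a user-supplied Jacobian implies: a dense solver then has to fall back on a difference-quotient approximation, which costs about one extra right-hand-side (ydot) evaluation per species for every Jacobian update. The sketch below is plain Python, not SUNDIALS code. The three-species Robertson network is only a toy stand-in for the real chemistry, and the function names and the increment `rel_step` are illustrative assumptions.

```python
def robertson_rhs(y):
    """Right-hand side of the 3-species Robertson kinetics problem (toy stand-in)."""
    y1, y2, y3 = y
    return [
        -0.04 * y1 + 1.0e4 * y2 * y3,
        0.04 * y1 - 1.0e4 * y2 * y3 - 3.0e7 * y2 * y2,
        3.0e7 * y2 * y2,
    ]


def robertson_jac(y):
    """Analytic Jacobian J[i][j] = d(ydot_i)/d(y_j) of the same system."""
    _, y2, y3 = y
    return [
        [-0.04, 1.0e4 * y3, 1.0e4 * y2],
        [0.04, -1.0e4 * y3 - 6.0e7 * y2, -1.0e4 * y2],
        [0.0, 6.0e7 * y2, 0.0],
    ]


def fd_jacobian(rhs, y, rel_step=1.0e-7):
    """Forward-difference dense Jacobian; returns (J, number of extra rhs calls)."""
    n = len(y)
    f0 = rhs(y)  # an integrator normally has this value available already
    jac = [[0.0] * n for _ in range(n)]
    extra_calls = 0
    for j in range(n):
        # Assumed increment choice; real solvers scale it by tolerances and weights.
        h = rel_step * max(abs(y[j]), 1.0)
        y_pert = list(y)
        y_pert[j] += h
        f_pert = rhs(y_pert)  # one full rhs evaluation per column of J
        extra_calls += 1
        for i in range(n):
            jac[i][j] = (f_pert[i] - f0[i]) / h
    return jac, extra_calls


y = [0.9, 1.0e-4, 0.1]
jac_fd, extra_calls = fd_jacobian(robertson_rhs, y)
jac_exact = robertson_jac(y)
print("extra rhs evaluations:", extra_calls)
```

Here `extra_calls` equals the number of species, so for a network of ~100 species each Jacobian update costs on the order of 100 additional ydot evaluations, whereas an analytic routine such as `robertson_jac` fills the matrix in a single pass.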

We want to extend this to use GPUs and a student has looked into it a
bit for us this summer. The main motivation is to take advantage of the
GPU partition on a computer we have access to (each node has 4 A100
GPUs), and we expect that the chemical kinetics is the most suitable
part of the problem to send to the GPU: lots of computation, not so much
memory throughput.

We have a few questions that hopefully someone here has already thought
about:

(1) Does this programming model of having many OMP threads working in
parallel on a compute node (each with their own instance of the CVODE
solver and ydot function in memory) lend itself well to porting to GPU?
I expect we have 64 OMP threads for 4 GPUs on each node.

(2) Is the reaction network big enough for efficient GPU usage?

(3) The dense linear solver is still on the CPU and not the GPU, so
should we expect efficient use of the GPU if only the vectors are on the
GPU? I guess if this implementation exists then it should be useful,
but wanted to double check. The answer was unclear to me from looking
at the other cuda threads on the mailing list, and the student was a bit
skeptical, pointing out that a couple of the other SUNDIALS solvers have
GPU implementations.

(4) Should we expect the solver to be more efficient if we could supply
a Jacobian? (This question relates both to CPU and GPU implementation.)

(5) Is there any ongoing work to write a GPU implementation of the dense
linear solver, and would this be expected to give further performance
benefit?

All the best,
Jonathan

############################

To unsubscribe from the SUNDIALS-USERS list:
write to: mailto:SUNDIALS-USERS-...@LISTSERV.LLNL.GOV

Balos, Cody

Jul 22, 2024, 12:03:19 PM

to SUNDIAL...@listserv.llnl.gov

Hi Jonathan,

I will come back to answer your questions in further detail, but I wanted to quickly send you two relevant papers: https://www.sciencedirect.com/science/article/pii/S0167819121000831 and https://scholar.google.com/citations?view_op=view_citation&hl=en&user=wMS-K7oAAAAJ&sortby=pubdate&citation_for_view=wMS-K7oAAAAJ:9ZlFYXVOiuMC. The second paper is very relevant to your use case.

Cody

Jonathan Mackey

Jul 23, 2024, 1:12:16 PM

to SUNDIAL...@listserv.llnl.gov

Hi Cody, thanks for the quick reply. I had seen the first paper and
discussed it with the student, but we had not seen the new one from May.
It looks very interesting and we'll read through it carefully.
All the best,
Jonathan


Peles, Slaven

Jul 25, 2024, 12:58:32 AM

to SUNDIAL...@listserv.llnl.gov

Hi Jonathan,

Let me try to answer your questions and share some of my experience, in the hope that it will be helpful to you.

(1) If I understood your mail correctly, you are running a separate instance of CVODE on _each_ OpenMP thread. You could in theory accelerate each instance of SUNDIALS by running each one on its own GPU stream, but you would need to port your reaction equations to CUDA or HIP. You would also need to supply a Jacobian, because the finite-difference Jacobian approximation that SUNDIALS provides runs only on CPU hardware.

(2) If you can simulate your reaction network with a dense solver while your Jacobian is computed using finite differencing, then your network is probably too small to take advantage of GPUs. You would most likely be better off computing your Jacobian and using a sparse solver on the CPU, such as KLU. In most cases that I have seen (including ones similar to your reaction network problem), a dense solver on a GPU can rarely beat a sparse solver on a CPU. For small problems, GPU kernel launch overhead is too taxing, and for large problems, the O(N^3) complexity of a dense solver kills your performance despite all the power of the GPU.

(3) SUNDIALS does have an interface for dense linear solvers on GPUs; see the MAGMA documentation in SUNDIALS, for example. The missing piece that you need is finite-difference evaluation of the Jacobian. That computation is performed on the CPU only. It is a computationally inefficient way of evaluating the Jacobian to begin with, so there is little motivation to accelerate it. If you profile your code, you will likely find that most of your computational cost is in evaluating the Jacobian and solving the dense linear system, so porting the rest of your code to the GPU would give you little benefit.

(4) Yes, absolutely!

(5) There is a dense GPU solver available for use with SUNDIALS (see (3)), but you do need to provide a Jacobian.

Hope this helps,
Slaven
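
To put a number on the O(N^3) remark in (2), the following Python sketch counts the floating-point operations of a textbook dense LU factorisation (Gaussian elimination without pivoting). It is only a back-of-the-envelope model under that assumption, not a measurement of any SUNDIALS, MAGMA or KLU solver, and the function name is hypothetical.

```python
def dense_lu_flops(n):
    """Count floating-point operations for LU factorisation of an n x n dense matrix."""
    flops = 0
    for k in range(n - 1):
        rows_below = n - 1 - k
        flops += rows_below                    # one division per multiplier
        flops += 2 * rows_below * rows_below   # multiply and subtract per updated entry
    return flops


flops_100 = dense_lu_flops(100)
flops_200 = dense_lu_flops(200)
print("N=100:", flops_100, "flops per factorisation")
print("N=200 / N=100 cost ratio:", flops_200 / flops_100)
```

The count grows like (2/3) N^3, so doubling the number of species makes each dense factorisation roughly eight times more expensive. A sparse factorisation instead scales with the number of nonzeros and the fill-in, which depend on the structure of the particular reaction network.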
