Hello,
I have been using SUNDIALS/CVODE for quite a few years now, solving
chemical kinetics as part of an astrophysical fluid-dynamics code. We
implemented hybrid OpenMP/MPI parallel execution in which each OpenMP
thread receives a series of work packages, each comprising either a 1D
column of cells or a single cell, for which it integrates the chemical
kinetics. The user data is reset and the solver re-initialised for each
cell at each timestep. This works very well for up to ~20-40 OpenMP
threads per MPI process, with a reaction network of up to ~100 species.
We use the dense linear solver of CVODE with the BDF (backward
differentiation formula) method. It is difficult to implement an exact
Jacobian with heating/cooling terms, charge-exchange and
photoionization reaction rates, so we have been using the solver
without a user-supplied Jacobian.
We want to extend this to use GPUs and a student has looked into it a
bit for us this summer. The main motivation is to take advantage of the
GPU partition on a computer we have access to (each node has 4 A100
GPUs), and we expect that the chemical kinetics is the most suitable
part of the problem to send to the GPU: lots of computation, not so much
memory throughput.
We have a few questions that hopefully someone here has already thought
about:
(1) Does this programming model of having many OpenMP threads working
in parallel on a compute node (each with its own instance of the CVODE
solver and ydot function in memory) lend itself well to porting to
GPUs? We expect to run 64 OpenMP threads per node, sharing 4 GPUs.
(2) Is the reaction network big enough for efficient GPU usage?
(3) The dense linear solver is still on the CPU and not the GPU, so
should we expect efficient use of the GPU if only the vectors live on
the GPU? I guess if this implementation exists then it should be
useful, but I wanted to double-check. The answer was unclear to me from
the other CUDA threads on the mailing list, and the student was a bit
skeptical, pointing out that only a couple of the other SUNDIALS
solvers have GPU implementations.
(4) Should we expect the solver to be more efficient if we could supply
an analytic Jacobian? (This question applies to both the CPU and GPU
implementations.)
(5) Is there any ongoing work to write a GPU implementation of the dense
linear solver, and would this be expected to give further performance
benefit?
All the best,
Jonathan