[sundials-users] CVODE with OpenMP and cuda streams

utpal kiran

unread,

Sep 16, 2024, 11:43:25 AM9/16/24

to SUNDIAL...@listserv.llnl.gov

Hi,

I am trying to solve a chemical reactions problem having many independent ODE systems. I have implemented a batched dense linear solver from MAGMA library to find the solution on GPU. However, I want to use CudaStream to run multiples batches concurrently on the GPU, anticipating even better performance. I could launch cudastreams using single thread by creating several instances of CVODE and attaching one CVODE instance with one stream. But, this results in serialized execution of streams. As per documentation and few posts in the user group, to achieve concurrent execution of cuda streams, each stream must be called by separate CPU threads. I have a few questions regarding this approach.

1. Is there any example that demonstrates this approach? If not, Can you share a relevant example?

2. Do I need to compile my application with any special flag ? .

3. Do I need to build CVODE with OpenMP enabled along with CUDA ?

4. Is there specific vector and matrix data structures for this approach?

Thanks

Utpal Kiran

To unsubscribe from the SUNDIALS-USERS list: write to: mailto:SUNDIALS-USERS-...@LISTSERV.LLNL.GOV

Balos, Cody

unread,

Sep 16, 2024, 11:52:39 AM9/16/24

to SUNDIAL...@listserv.llnl.gov

Hi Utpal,

Your overall strategy is on the right track. You will want to have one CVODE instance per CPU thread and assign a CUDA stream to each instance/thread.

There are no simple examples that demonstrate the strategy. We have implemented this approach in the Nyx code https://github.com/amrex-astro/nyx (and talked about it these papers: https://arxiv.org/pdf/2405.01713 and https://www.sciencedirect.com/science/article/pii/S0167819121000831).
The only special flag you need is the one for OpenMP.
You do not need to build CVODE with OpenMP enabled. The OpenMP part will be outside of SUNDIALS.
There are not special data structures for this approach. The MAGMA SUNMatrix interface is setup for both batched and non-batched systems.

Cody

utpal kiran

unread,

Sep 26, 2024, 12:40:53 PM9/26/24

to SUNDIAL...@listserv.llnl.gov

Hi Cody,

Thank you for your response.

I have used eight cuda streams and attached each one with a CPU thread. I have also created eight CVODE instances to attach with each stream, such that, for each CVODE instance specific vectors and matrices have been created. I could run my application and obtain correct results. However, compared to single CPU thread launching multiple cuda stream version, the approach with OMP threads is not providing me any advantage in run time. I have profiled both the versions and attached here screenshots from nsight systems. Can you take a look and give your comment about the correctness of the implementation? I have made a few observations and would like to discuss them with you.

1. The screenshot clearly shows cuda streams executing in interleaved manner (with some level of overlapping) for version with omp threads. But it fails to provide me any advantage in wall-clock timings. Are overheads associated with CUDA API calls spoiling the advantage? Is CVODE designed to run cuda streams concurrently? The current problem size leads to 3200 independent ODE (each with 53 equations) per batch (assigned to each stream). I am using NVIDIA V100 GPU for my work.

2. In both the versions, two streams (including default streams) execute longer than other streams. Infact, only two streams determine the length of the timeline. One reason could be the presence of stiff cells in the batch assigned to those streams. What could be other possible reasons behind this?

Thanks

Utpal

streams_with_omp_threads.PNG

streams_with_single_thread.PNG

Reply all

Reply to author

Forward