Hi Cody,
Thank you for your response.
I have used eight CUDA streams and bound each one to a CPU (OpenMP) thread. I have also created eight CVODE instances, one per stream, so that each CVODE instance has its own vectors and matrices. I can run my application and obtain correct results. However, compared with the version in which a single CPU thread launches multiple CUDA streams, the OpenMP-threaded approach gives me no run-time advantage. I have profiled both versions and attached screenshots from Nsight Systems. Can you take a look and comment on the correctness of the implementation? I have made a few observations and would like to discuss them with you.
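For reference, my per-thread setup looks roughly like the sketch below (minimal and incomplete, assuming the SUNDIALS v7 CUDA NVector API; `rhs`, `neq_per_batch`, the block sizes, and the tolerances are placeholders, and error checking is omitted):

```cuda
#include <omp.h>
#include <cvode/cvode.h>
#include <nvector/nvector_cuda.h>
#include <sundials/sundials_context.h>

/* Placeholder RHS: 3200 independent 53-equation systems per batch. */
static int rhs(sunrealtype t, N_Vector y, N_Vector ydot, void* user_data);

void solve_batches(int nstreams, sunindextype neq_per_batch)
{
  #pragma omp parallel num_threads(nstreams)
  {
    int tid = omp_get_thread_num();

    /* One CUDA stream per OpenMP thread. */
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    /* Each thread owns its SUNDIALS context and CVODE instance. */
    SUNContext sunctx;
    SUNContext_Create(SUN_COMM_NULL, &sunctx);

    N_Vector y = N_VNew_Cuda(neq_per_batch, sunctx);

    /* Route this vector's kernels and reductions onto the thread's stream. */
    SUNCudaThreadDirectExecPolicy stream_policy(256, stream);
    SUNCudaBlockReduceExecPolicy  reduce_policy(256, 0, stream);
    N_VSetKernelExecPolicy_Cuda(y, &stream_policy, &reduce_policy);

    /* ... fill y with the initial conditions for batch `tid` ... */

    void* cvode_mem = CVodeCreate(CV_BDF, sunctx);
    CVodeInit(cvode_mem, rhs, 0.0, y);
    CVodeSStolerances(cvode_mem, 1e-6, 1e-10);

    /* ... attach linear solver, call CVode(), copy results back ... */

    cudaStreamSynchronize(stream);
    CVodeFree(&cvode_mem);
    N_VDestroy(y);
    SUNContext_Free(&sunctx);
    cudaStreamDestroy(stream);
  }
}
```

In particular, please check whether setting the kernel and reduction execution policies per vector like this is the intended way to make each CVODE instance run on its own stream.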
1. The screenshot clearly shows the CUDA streams executing in an interleaved manner (with some overlap) in the OpenMP-threaded version, yet this yields no advantage in wall-clock time. Are the overheads associated with CUDA API calls negating the benefit? Is CVODE designed to run CUDA streams concurrently? The current problem size gives 3200 independent ODE systems (each with 53 equations) per batch (one batch per stream). I am using an NVIDIA V100 GPU for my work.
2. In both versions, two streams (including the default stream) run longer than the others; in fact, these two streams alone determine the length of the timeline. One reason could be the presence of stiff cells in the batches assigned to those streams. What other reasons could explain this?
Thanks,
Utpal