also about multiple kernels


jizhen wei

<wish.all.healthy@gmail.com>
Sep 5, 2022, 4:22:30 AM
to accel-sim

Hi, I have a few questions about multiple kernels.

1: The main function of a CUDA program executes each line of code serially. For example:

sumMatrixOnGPU2D<<<grid, block>>>(d_MatA, d_MatB, d_MatC, nx, ny);

printf("XXXXXX");

printf("XXXXXX");

sumMatrixOnGPU3D<<<grid, block>>>(d_MatA, d_MatB, d_MatC, nx, ny);

The second kernel is launched only after the first kernel has been launched and the two printf calls have executed. In this way, only one kernel is submitted to GPGPU-Sim at a time, so how can the two kernels be launched together (assuming that the two kernels do not depend on each other)?

2: CUDA has the concept of asynchronous execution, so I assume it counts as a multiple-kernel launch if the CUDA program uses the asynchronous (stream) API. For example:

kernel<<<1, 64, 0, streams[i]>>>(data[i], N);

3: Suppose one CUDA program is run in window 1 and another CUDA program is run in window 2. Will the kernels launched from these two windows run on the same GPGPU-Sim instance? Or will they be launched on the same GPGPU-Sim thread?

4: In the papers I have read about multi-kernel execution, the Parboil and Rodinia benchmarks are used most often, but most of the programs in Parboil launch kernels as for(){kernel}, which is also a serial method. How should I understand the multi-kernel launch analysis done with these two benchmarks? Maybe my understanding is wrong, but I cannot find where it goes wrong. I hope you can help me. If you can tell me which CUDA programs in these benchmarks use the multi-kernel launch method, that would be even better.

 

I wish you success in your work!

Junrui Pan

<panjunrui100@gmail.com>
Sep 6, 2022, 10:18:14 AM
to accel-sim
  1. After GPGPU-Sim receives the first kernel, it starts the simulation. During the simulation, any CUDA API calls are blocked until the kernel has finished. Take a look at this function: https://github.com/accel-sim/gpgpu-sim_distribution/blob/e0e890a9c129c55c222dbea2fd0f15891949b1cf/src/gpgpusim_entrypoint.cc#L75. You will have to find a way to register the functions before the first kernel is launched. So in general: register func1 -> register func2 -> set up func1 arguments -> set up func2 arguments -> launch func1 -> launch func2.
  2. I don't think that specific API is supported.
  3. They will be different programs.
  4. This is a very open-ended question. You can try nvprof if you have an old GPU. I think the nsight-* tools serialize everything.
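
For reference, the stream launch from question 2 can be fleshed out into a complete program. This is only a minimal sketch and is not taken from Parboil or Rodinia: the kernel addOne, the CHECK macro, and the sizes are hypothetical names chosen for illustration. On real hardware, kernels launched into different non-default streams are allowed to overlap, although whether they actually do depends on the device and on available resources. As noted in point 2, this path may not be supported by the simulator.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical error-checking helper (not part of the CUDA API).
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical kernel: adds 1.0f to every element of data.
__global__ void addOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int kStreams = 2;  // number of independent kernels (assumption)
    const int N = 64;        // elements per buffer (assumption)
    cudaStream_t streams[kStreams];
    float *data[kStreams];

    // One stream and one zero-initialized device buffer per kernel.
    for (int i = 0; i < kStreams; ++i) {
        CHECK(cudaStreamCreate(&streams[i]));
        CHECK(cudaMalloc(&data[i], N * sizeof(float)));
        CHECK(cudaMemset(data[i], 0, N * sizeof(float)));
    }

    // Each launch returns to the host right away. Because the kernels sit in
    // different non-default streams and touch different buffers, the device
    // is free to run them concurrently.
    for (int i = 0; i < kStreams; ++i) {
        addOne<<<1, 64, 0, streams[i]>>>(data[i], N);
        CHECK(cudaGetLastError());
    }

    // Wait for each stream, read back one element, and clean up.
    for (int i = 0; i < kStreams; ++i) {
        CHECK(cudaStreamSynchronize(streams[i]));
        float first = 0.0f;
        CHECK(cudaMemcpy(&first, data[i], sizeof(float),
                         cudaMemcpyDeviceToHost));
        printf("stream %d: data[0] = %.1f\n", i, first);
        CHECK(cudaFree(data[i]));
        CHECK(cudaStreamDestroy(streams[i]));
    }
    return 0;
}
```

Assuming the file is saved as streams.cu (a hypothetical name), it can be built with nvcc streams.cu -o streams. On a GPU that nvprof still supports, running nvprof --print-gpu-trace ./streams lists the start time, duration, and stream of each kernel, which is one way to check whether two launches actually overlapped.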
Thanks