Hi, I have a few questions about multiple kernels.
1: A CUDA program's host code executes its statements serially. So, for example:
sumMatrixOnGPU2D<<<grid, block>>>(d_MatA, d_MatB, d_MatC, nx, ny);
printf("XXXXXX");
printf("XXXXXX");
sumMatrixOnGPU3D<<<grid, block>>>(d_MatA, d_MatB, d_MatC, nx, ny);
The second kernel is launched after the first kernel and the two printfs. In this way, only one kernel at a time is issued to GPGPU-Sim for execution, so how can the two kernels be launched concurrently (assuming the two kernels are not mutually dependent)?
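To make my question concrete: this is roughly how I imagine the two independent kernels could be made to overlap, by putting each launch into its own non-default stream. This is only my sketch of what I expect, not code from any benchmark; the stream names s0/s1 are made up, and I am ignoring that in my example both kernels happen to write d_MatC:

```cuda
cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);

// Each kernel goes into a different stream, so the device is
// allowed (but not required) to execute them concurrently.
sumMatrixOnGPU2D<<<grid, block, 0, s0>>>(d_MatA, d_MatB, d_MatC, nx, ny);
sumMatrixOnGPU3D<<<grid, block, 0, s1>>>(d_MatA, d_MatB, d_MatC, nx, ny);

// Wait for both streams before using the results on the host.
cudaStreamSynchronize(s0);
cudaStreamSynchronize(s1);
cudaStreamDestroy(s0);
cudaStreamDestroy(s1);
```

Is this the kind of launch pattern that would make GPGPU-Sim see two kernels at once?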
2: CUDA has the concept of asynchronous execution, so multiple kernels can be in flight if the CUDA program uses the asynchronous stream API. For example:
kernel<<<1, 64, 0, streams[i]>>>(data[i], N);
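For completeness, the surrounding loop I have in mind looks roughly like this. NUM_STREAMS, the kernel, and the data/N allocations are my assumptions, just to show the pattern I mean:

```cuda
#define NUM_STREAMS 4

cudaStream_t streams[NUM_STREAMS];
for (int i = 0; i < NUM_STREAMS; ++i)
    cudaStreamCreate(&streams[i]);

// All launches return immediately on the host; kernels in
// different streams may overlap on the device.
for (int i = 0; i < NUM_STREAMS; ++i)
    kernel<<<1, 64, 0, streams[i]>>>(data[i], N);

for (int i = 0; i < NUM_STREAMS; ++i) {
    cudaStreamSynchronize(streams[i]);
    cudaStreamDestroy(streams[i]);
}
```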
3: Suppose one CUDA program is executed in terminal window 1 and another CUDA program is executed in terminal window 2. Will the kernels launched by these two programs run on the same GPGPU-Sim instance? Or will they be launched in the same GPGPU-Sim thread?
4: In the papers I have read about multi-kernel execution, the Parboil and Rodinia benchmarks are used most often, but most programs in the Parboil benchmark launch their kernels as for(){kernel}, which is also a serial pattern. How should I understand the multi-kernel analysis of these two benchmarks? Maybe my understanding is wrong, but I cannot find my mistake, and I hope you can help me. If you could also tell me which CUDA programs in these benchmarks actually use a concurrent multi-kernel launch method, that would be even better.
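To clarify what I mean by the for(){kernel} pattern: in these benchmarks, the loop body typically looks roughly like the sketch below (the names numIterations, d_in, and d_out are hypothetical, not from any specific Parboil program). Since every launch goes into the default stream and each iteration consumes the previous iteration's output, the kernels can only execute one after another:

```cuda
// Serial iterative pattern I see in many Parboil/Rodinia programs:
// every launch is in the default stream and depends on the last one.
for (int iter = 0; iter < numIterations; ++iter) {
    kernel<<<grid, block>>>(d_in, d_out);
    // swap buffers so the next iteration reads this one's output
    float *tmp = d_in;
    d_in = d_out;
    d_out = tmp;
}
cudaDeviceSynchronize();  // host waits for the whole chain
```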
I wish you success in your work!