Hello everyone,
I am opening this issue to ask for suggestions on how to resolve the incorrect results produced by warp primitives when running under gpgpu-sim.
Code snippet: The minimal reproducer comes from the official CUDA C++ Programming Guide:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-examples-broadcast.
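For completeness, the broadcast example from that page of the guide looks like the following (reproduced here for reference; when `__shfl_sync` works correctly, the kernel should print nothing):

```cuda
#include <stdio.h>

__global__ void bcast(int arg) {
    int laneId = threadIdx.x & 0x1f;
    int value;
    if (laneId == 0)   // Only lane 0 initializes "value";
        value = arg;   // it is deliberately left unset elsewhere.
    // Broadcast "value" from lane 0 to every lane in the warp.
    value = __shfl_sync(0xffffffff, value, 0);
    if (value != arg)  // Under gpgpu-sim, this branch is taken and "failed" is printed.
        printf("Thread %d failed.\n", threadIdx.x);
}

int main() {
    bcast<<<1, 32>>>(1234);
    cudaDeviceSynchronize();
    return 0;
}
```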
Build environment:
- I used the image jonghyun1215/gpgpu:gpgpusim4 from Docker Hub, with GCC 7.5, gpgpu-sim 4.0.0 (commit 90ec33997, the latest commit on the dev branch), and CUDA 10.1.
- I also tried gpgpu-sim 4.0.0 inside the Accel-Sim Docker image, with GCC 7.5 and CUDA 11.0.
Situation: The sample code should not print any "failed" message, and on a real GPU it produces the expected results. However, when run through gpgpu-sim, the results are wrong in both performance simulation and functional simulation modes.
Investigation: I tried other warp samples from that tutorial, such as shfl_down_sync and shfl_xor_sync, and the same correctness error appears. For comparison, I also wrote a simple reduction two ways: 1) shared memory and 2) warp shuffle. The shared-memory version is exactly correct, but the warp-shuffle version is not, which confused me a lot. I therefore suspect a bug in gpgpu-sim's implementation of the warp shuffle primitives.
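To make the comparison concrete, here is a minimal sketch of the warp-shuffle reduction I mean (a hypothetical single-warp kernel along these lines, not my exact code):

```cuda
// Warp-level sum reduction using __shfl_down_sync.
// After the loop, lane 0 holds the sum of all 32 lanes' values.
__inline__ __device__ int warpReduceSum(int val) {
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

__global__ void reduceKernel(const int *in, int *out) {
    int val = in[threadIdx.x];       // assume one warp of 32 threads
    val = warpReduceSum(val);
    if ((threadIdx.x & 0x1f) == 0)   // lane 0 writes the warp's sum
        out[threadIdx.x >> 5] = val;
}
```

On real hardware this matches the shared-memory reduction; under gpgpu-sim it does not.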
Possible parts: To locate the relevant code in the gpgpu-sim codebase, I searched for the shfl PTX opcode and found the implementation here:
link. Since I am not very experienced with the gpgpu-sim code, I have been stuck at this step for a few days.
I would sincerely appreciate any help with this problem. Thank you for your consideration!