Hello everyone,
I am opening this issue to ask for suggestions on how to resolve the incorrect results produced by warp primitives when running under gpgpu-sim.
Code snippet: The minimal reproducer comes from the official CUDA C++ Programming Guide:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-examples-broadcast.
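For completeness, the broadcast example from that page of the guide looks like the following (reproduced here for reference; when `__shfl_sync` works correctly, the kernel should print nothing):

```cuda
#include <stdio.h>

__global__ void bcast(int arg) {
    int laneId = threadIdx.x & 0x1f;
    int value;
    if (laneId == 0)   // Only lane 0 initializes "value";
        value = arg;   // it is deliberately left unset elsewhere.
    // Broadcast "value" from lane 0 to every lane in the warp.
    value = __shfl_sync(0xffffffff, value, 0);
    if (value != arg)  // Under gpgpu-sim, this branch is taken and "failed" is printed.
        printf("Thread %d failed.\n", threadIdx.x);
}

int main() {
    bcast<<<1, 32>>>(1234);
    cudaDeviceSynchronize();
    return 0;
}
```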
Build environment:
- I used the image jonghyun1215/gpgpu:gpgpusim4 from Docker Hub, with GCC 7.5, gpgpu-sim 4.0.0 (commit 90ec33997, the latest commit on the dev branch), and CUDA 10.1.
- I also tried gpgpu-sim 4.0.0 inside the Accel-Sim Docker image, with GCC 7.5 and CUDA 11.0.
Situation: The sample code should not print any "failed" message, and on a real GPU it produces the expected results. However, when run through gpgpu-sim, the results are wrong in both performance simulation and functional simulation modes.
Investigation: I tried other warp samples from that tutorial, such as shfl_down_sync and shfl_xor_sync, and the same correctness error appears. For comparison, I also wrote a simple reduction two ways: 1) shared memory and 2) warp shuffle. The shared-memory version is exactly correct, but the warp-shuffle version is not, which confused me a lot. I therefore suspect a bug in gpgpu-sim's implementation of the warp shuffle primitives.
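To make the comparison concrete, here is a minimal sketch of the warp-shuffle reduction I mean (a hypothetical single-warp kernel along these lines, not my exact code):

```cuda
// Warp-level sum reduction using __shfl_down_sync.
// After the loop, lane 0 holds the sum of all 32 lanes' values.
__inline__ __device__ int warpReduceSum(int val) {
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

__global__ void reduceKernel(const int *in, int *out) {
    int val = in[threadIdx.x];       // assume one warp of 32 threads
    val = warpReduceSum(val);
    if ((threadIdx.x & 0x1f) == 0)   // lane 0 writes the warp's sum
        out[threadIdx.x >> 5] = val;
}
```

On real hardware this matches the shared-memory reduction; under gpgpu-sim it does not.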
Possible parts: To locate the relevant code in the gpgpu-sim codebase, I searched for the shfl PTX opcode and found the implementation here:
link. Since I am not very experienced with the gpgpu-sim code, I have been stuck at this step for a few days.
I would sincerely appreciate any help with this problem. Thank you for your consideration!