Scoreboard deadlock detected

Yue Chen

<c370797515@gmail.com>

Apr 2, 2022, 11:31:28 AM

to accel-sim

Hello, I am running a custom workload on GPGPU-Sim and got the following error log.

---------------------------

Release register - warp:3, reg: 1428
Reserved Register - warp:3, reg: 617
Reserved register - warp:3, reg: 617
warp_inst: Register Released - warp:3, reg: 1428
Release register - warp:3, reg: 1428
Reserved Register - warp:3, reg: 617
Reserved register - warp:3, reg: 617

GPGPU-Sim uArch: ERROR ** deadlock detected: last writeback core 2 @ gpu_sim_cycle 47478 (+ gpu_tot_sim_cycle 4294867296) (52522 cycles ago)
GPGPU-Sim uArch: DEADLOCK shader cores no longer committing instructions [core(# threads)]:
GPGPU-Sim uArch: DEADLOCK 0(128) 1(128) 2(128) 3(96)

------------------------------

Then I used gdb and the macros in shader_trace.h to look into the cause. I found that GPGPU-Sim keeps looping around the function checkCollision() and does not issue any instructions.

The instructions are as follows:

ld.const.u8 %r593, [%rd380];
ld.const.u8 %r594, [%rd372];
ld.const.u8 %r595, [%rd384];
shl.b32 %r596, %r595, 24;

The destination registers of the ld.const.u8 instructions are never released (this applies to all three ld.const instructions), so shl.b32 stalls on a RAW hazard. It looks like ld.const.u8 causes this, but I can't work out why (maybe another instruction leads to the deadlock?).

The GPGPU-Sim environment runs in Docker:

CUDA 10.1

GPGPU-Sim 4.2.0 dev branch (the bug also reproduces in 4.0.1)

Ubuntu 18.04.4 LTS

Config: SM75_RTX2060

How to reproduce it:

I set up a minimal environment with GPGPU-Sim 4.0 and can still reproduce it. You can reproduce it with this simple code; when the macro max_n is changed to 4, the bug disappears.

#include <cuda_runtime.h>
#include <cstdint>
#include <iostream>

__constant__ uint8_t const_memory[16][16];

// The deadlock reproduces with max_n = 8 and disappears with max_n = 4.
#define max_n 8

__global__
void test(uint32_t threads, uint32_t* output) {
    const uint32_t thread = (blockDim.x * blockIdx.x + threadIdx.x);
    if (thread < threads)
    {
        // Each thread reads one byte on the diagonal of const_memory.
        uint8_t temp1 = const_memory[thread % max_n][thread % max_n];
        output[thread & 0xf] = temp1;
    }
}

int main(int argc, char* argv[]) {
    uint32_t threads = 1024;
    uint32_t threadsperblock = 128;
    uint32_t* d_output = nullptr;
    cudaMalloc(&d_output, 20 * sizeof(uint32_t));

    dim3 grid((threads + threadsperblock - 1) / threadsperblock);
    dim3 block(threadsperblock);
    test<<<grid, block>>>(threads, d_output);
    return 0;
}

Thank you!

Junrui Pan

<panjunrui100@gmail.com>

Apr 2, 2022, 11:18:08 PM

to accel-sim

Hi,

Thank you for your interest. Have you tried other configs?

And I'm assuming this is PTX mode?

There could be quite a lot of reasons for this kind of issue. Could you please try turning on the trace mode? Uncomment these lines in the config file.

https://github.com/JRPan/gpgpu-sim_distribution/blob/mydev/configs/tested-cfgs/SM75_RTX2060/gpgpusim.config#L176-L177

Please let me know how it looks.

Thanks,

Junrui

Yue Chen

<c370797515@gmail.com>

Apr 3, 2022, 12:26:05 AM

to accel-sim

Thanks for your reply.

Yes, I am running in PTX mode.

I tried your config with the simple code I mentioned before.

I have also uploaded a truncated log (truncated because the scoreboard failures keep looping) and the PTX generated by GPGPU-Sim.

gpgpusim-log

ptxfile

Junrui Pan

<panjunrui100@gmail.com>

Apr 4, 2022, 12:03:51 AM

to accel-sim

It looks odd that your PTX file is sm_30 while you are running the RTX 2060 config. This line:

Extracting PTX file and ptxas options 1: vec_add.1.sm_30.ptx -arch=sm_30

It looks like you compiled the binary with only the sm_30 compute capability. Could you please check that?

Ideally, GPGPU-Sim extracts the PTX for every compute capability version in the fatbin (fat binary). Since it only extracted sm_30, I assume the binary was compiled only for sm_30. Please confirm that and try adding the correct version for the RTX 2060.

Thanks,

Junrui

Yue Chen

<c370797515@gmail.com>

Apr 5, 2022, 12:02:09 PM

to accel-sim

Extracting sm_30 by default may be related to the CUDA version. I just compile with the common flag --cudart shared.

In the last few days I moved my setup from CUDA 10.1 to CUDA 11.0. Now it extracts sm_52 by default, but the problem still occurs.

Could you tell me whether this program runs normally on your machine? I think the problem is caused by GPGPU-Sim.

In addition, I have uploaded new files from the new setup:

ptxfile

gpgpusim.config

gpgpusim-log

Junrui Pan

<panjunrui100@gmail.com>

Apr 5, 2022, 9:18:42 PM

to accel-sim

Let me see if I have time.

But to be honest, sm_52 could still cause problems.

Matching CUDA arch and CUDA gencode for various NVIDIA architectures - Arnon Shimoni

Turing should be SM_75

Yue Chen

<c370797515@gmail.com>

Apr 6, 2022, 1:34:03 AM

to accel-sim

I have fixed it. The problem was in constant_cycle(). Can I make a pull request to the dev branch?

Junrui Pan

<panjunrui100@gmail.com>

Apr 6, 2022, 9:17:51 PM

to accel-sim

Glad to hear that!

Please do so, and thank you for your contribution. We will review it.

Thanks

Ardhi Wiratama Baskara Yudha

<ardhy.yudha@gmail.com>

Dec 10, 2023, 12:24:48 AM

to accel-sim

Hello,

How did you solve this issue? I didn't see any PR related to this. Thanks!
