Hello, I am running a custom workload on GPGPU-Sim and got the following error log.
---------------------------
Release register - warp:3, reg: 1428
Reserved Register - warp:3, reg: 617
Reserved register - warp:3, reg: 617
warp_inst: Register Released - warp:3, reg: 1428
Release register - warp:3, reg: 1428
Reserved Register - warp:3, reg: 617
Reserved register - warp:3, reg: 617
GPGPU-Sim uArch: ERROR ** deadlock detected: last writeback core 2 @ gpu_sim_cycle 47478 (+ gpu_tot_sim_cycle 4294867296) (52522 cycles ago)
GPGPU-Sim uArch: DEADLOCK  shader cores no longer committing instructions [core(# threads)]:
GPGPU-Sim uArch: DEADLOCK  0(128) 1(128) 2(128) 3(96) 
------------------------------
I then used gdb and the trace macros in shader_trace.h to look into the cause. I found that GPGPU-Sim stays around the function checkCollision() and does not issue any further instructions.
The instructions involved are as follows:
ld.const.u8 %r593, [%rd380];
ld.const.u8 %r594, [%rd372];
ld.const.u8 %r595, [%rd384];
shl.b32 %r596, %r595, 24;
The destination registers of all three ld.const.u8 instructions are never released, so shl.b32 stays blocked on a RAW hazard (it reads %r595, which the third ld.const.u8 writes). It looks like ld.const.u8 causes this, but I can't work out why (maybe another instruction leads to the deadlock?).
My GPGPU-Sim environment (running in Docker):
CUDA 10.1
GPGPU-Sim 4.2.0 dev branch (the bug also reproduces in 4.0.1)
Ubuntu 18.04.4 LTS
Config: SM75_RTX2060
How to reproduce it:
I set up a trivial environment in GPGPU-Sim 4.0 and can still reproduce it. You can reproduce it with the simple code below. When the macro max_n is changed to 4, the bug disappears.
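#include <cuda_runtime.h>
#include <cstdint>
#include <iostream>
// 16x16 byte table in constant memory (zero-initialized, never written by the host).
__constant__ uint8_t const_memory[16][16];
// With 8 the deadlock occurs; with 4 it disappears.
#define max_n 8
__global__
void test(uint32_t threads, uint32_t* output) {
    const uint32_t thread = (blockDim.x * blockIdx.x + threadIdx.x);
    if(thread < threads)
    {
        // Each thread reads one byte on the diagonal of the table.
        uint8_t temp1 = const_memory[thread % max_n][thread % max_n];
        output[thread & 0xf] = temp1;
    }
}
int main(int argc, char* argv[]) {
    uint32_t threads = 1024;
    uint32_t threadsperblock = 128;
    uint32_t* d_output = nullptr;
    cudaMalloc(&d_output, 20 * sizeof(uint32_t));
    dim3 grid((threads + threadsperblock - 1) / threadsperblock);
    dim3 block(threadsperblock);
    test<<<grid, block>>>(threads, d_output);
    // Wait for the kernel to finish before the program exits.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        std::cerr << "CUDA error: " << cudaGetErrorString(err) << std::endl;
    cudaFree(d_output);
    return err == cudaSuccess ? 0 : 1;
}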
Thank you!