CUDA error cudaStreamSynchronize(stream) and CUDA error in ComputeBondedCUDA

Francesco Pietra

Nov 20, 2022, 1:10:06 PM

Hello

Main board GA-X79-UD3 with two 680 GPUs
Debian 10 Linux
kernel 5.10.0-19-amd64

OpenGL 4.6.0

nvidia driver 470.141.03

Months ago, after an update/upgrade of the amd64 system, the GPUs, while still rendering correctly, became unable to run classical molecular dynamics simulations. The failure shows up when I launch a minimization with NAMD on both GPUs or on either one alone (selected by software, or even with the other GPU physically removed):

namd2 +idlepoll +p12 +devices 0,1 min.conf

namd2 +idlepoll +p12 +devices 0 min.conf

namd2 +idlepoll +p12 +devices 1 min.conf

NAMD sets up the simulation correctly, but when the computation starts and GPU memory is accessed, it crashes with this error:

TCL: Minimizing for 3000 steps
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file src/CudaTileListKernel.cu, function buildTileLists, line 1136
on Pe 4 (gig64 device 0 pci 0:2:0): an illegal memory access was encountered
FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling 48 times over 0.005047 s on Pe 8 (gig64 device 1 pci 0:3:0): an illegal memory access was encountered
FATAL ERROR: CUDA error cudaStreamSynchronize(stream) in file src/CudaTileListKernel.cu, function buildTileLists, line 1136
on Pe 4 (gig64 device 0 pci 0:2:0): an illegal memory access was encountered
FATAL ERROR: CUDA error in ComputeBondedCUDA::forceDoneCheck after polling 48 times over 0.005047 s on Pe 8 (gig64 device 1 pci 0:3:0): an illegal memory access was encountered
[Partition 0][Node 0] End of program

The "illegal memory access" looks like a software error (the same crash occurs with either GPU used on its own, which argues against a single faulty card), and its origin has escaped all my attempts to track it down. I got no clues from the NAMD forum; I hope for better luck here.
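One quick way to check that both cards fail in the same way is to list the distinct devices that report the error in the NAMD log. This is only a sketch: the function name `devices_with_errors` and the log file name in the usage comment are made up for illustration, and the pattern is based solely on the error lines quoted above.

```shell
# List each distinct GPU that reported "illegal memory access".
# Reads a NAMD log on stdin; prints one "device N pci X:Y:Z" line per GPU.
devices_with_errors() {
  grep 'illegal memory access' \
    | grep -o 'device [0-9][0-9]* pci [0-9:]*' \
    | sort -u
}

# Usage (hypothetical log file name):
#   devices_with_errors < namd.log
```

If every installed device shows up in the output, the same fault is hitting all cards, which is what one would expect from a driver or library problem rather than from one broken GPU.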

Thanks for your kind attention

francesco pietra

Peter von Kaehne

Nov 20, 2022, 2:20:06 PM

I do not know if this would work with this kind of computation but I would suggest you try and run the programme under gdb.

This should tell you where things go wrong. You might have to recompile the programme with debugging symbols enabled.

Peter

Peter von Kaehne

Nov 21, 2022, 1:51:42 AM

Following up on my suggestion to run the programme under gdb: Nvidia recommends CUDA-GDB for this purpose.

CUDA-GDB :: CUDA Toolkit Documentation (docs.nvidia.com)

So, I think this is the way to go. Even without great knowledge of this kind of thing, you should be able to figure out where the error lies: in your main programme or in a library.
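A minimal sketch of how such a session might be started, assuming cuda-gdb from the CUDA toolkit is installed (plain gdb is used as a fallback if it is not). The helper name `debugger_cmd` is hypothetical, and it only prints the command line instead of launching anything. `--args` is the standard gdb option for handing the remaining words to the program being debugged, and cuda-gdb accepts it as well.

```shell
# Print the debugger command line for a given program and its arguments.
# Prefers cuda-gdb when it is on PATH, otherwise falls back to gdb.
debugger_cmd() {
  if command -v cuda-gdb >/dev/null 2>&1; then
    dbg=cuda-gdb
  else
    dbg=gdb
  fi
  # "$*" is the program to debug followed by its own arguments
  echo "$dbg --args $*"
}

# Dry run for the single-GPU case from the first post:
debugger_cmd namd2 +idlepoll +p12 +devices 0 min.conf

# After starting the printed command, at the debugger prompt:
#   run    start NAMD and wait for the fault
#   bt     print the backtrace once the error is reported
```

The backtrace should at least show whether the faulting frame sits in namd2 itself or in one of the CUDA libraries, even if the binary was built without debugging symbols.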

The whole thing is, I guess, so niche that you will need to narrow things down further yourself before others are likely to take an interest.

My guess is that a library was updated and that your programme, which does not check for this, now struggles (and fails).

Peter

Francesco Pietra

Nov 22, 2022, 5:50:05 AM

I am now wondering whether the nvidia driver, as built automatically by Debian during the recent update/upgrade, allows correct rendering but fails with NAMD computations.

On this point, it is not clear to me whether Debian's automatic build uses the proprietary nvidia driver. If not, I could try downloading the proprietary nvidia driver.
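Which kernel module is actually in use can be checked from the running system. The sketch below classifies `lsmod` output; the function name `gpu_kernel_module` is made up for illustration. It assumes that, for the 470 series mentioned above, a loaded module named `nvidia` is NVIDIA's proprietary one and that `nouveau` is the open-source alternative. As far as I know, Debian's non-free nvidia-driver packages rebuild that same proprietary module through DKMS on each kernel update, but that is worth confirming on the machine itself.

```shell
# Classify the GPU kernel module from lsmod-style text on stdin.
# Prints "nvidia", "nouveau", or "none".
gpu_kernel_module() {
  awk '
    $1 == "nvidia"                 { found = "nvidia" }
    $1 == "nouveau" && found == "" { found = "nouveau" }
    END { print (found == "" ? "none" : found) }
  '
}

# Check the running system, if lsmod is available
if command -v lsmod >/dev/null 2>&1; then
  lsmod | gpu_kernel_module
fi

# With the nvidia module loaded, its version string is readable here
if [ -r /proc/driver/nvidia/version ]; then
  cat /proc/driver/nvidia/version
fi
```

If this prints `nvidia` together with a version line matching 470.141.03, the proprietary driver is already in use, and installing it again from NVIDIA's own download would mostly change how it is built rather than which driver runs.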