Hi all,
In my code I used to set the GPU device to 0, since my machine always had a single GPU. Now I want to test the code on machines with multiple GPUs, starting with a platform where it runs inside a container that provides up to 5 GPUs. There, when I set the GPU device to, e.g., 1 or 2 on a node with 3 GPUs, MAGMA always crashes and the following errors are shown:
** On entry to cusparseCreate(): CUDA context cannot be initialized
** On entry to cusparseSetStream() parameter number 1 (handle) had an illegal value: NULL pointer
** On entry to cusparseCreate(): CUDA context cannot be initialized
** On entry to cusparseSetStream() parameter number 1 (handle) had an illegal value: NULL pointer
** On entry to cusparseCreate(): CUDA context cannot be initialized
** On entry to cusparseSetStream() parameter number 1 (handle) had an illegal value: NULL pointer
** On entry to cusparseCreate(): CUDA context cannot be initialized
** On entry to cusparseSetStream() parameter number 1 (handle) had an illegal value: NULL pointer
Segmentation fault (core dumped)
The crash happens exactly on the first call to a MAGMA routine. I set the CUDA device this way:
cudaError_t e = cudaSetDevice(m_default_gpu);
if (e != cudaSuccess) {
    cerr << "cudaSetDevice(" << m_default_gpu << ") failed: "
         << cudaGetErrorString(e) << "\n";
    return 1;
}

e = cudaFree(0);  // Forces context creation NOW.
if (e != cudaSuccess) {
    cerr << "Context init failed on GPU " << m_default_gpu << ": "
         << cudaGetErrorString(e) << "\n";
    return 1;
}
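For reference, this is the kind of visibility check I could add before cudaSetDevice to rule out the container simply hiding the devices (plain CUDA runtime calls; check_devices is just a name I made up for this sketch):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Sketch: report how many devices the runtime actually exposes
// inside the container before trying to select one of them.
int check_devices(int wanted)
{
    int n = 0;
    cudaError_t e = cudaGetDeviceCount(&n);
    if (e != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                     cudaGetErrorString(e));
        return -1;
    }
    std::printf("runtime sees %d device(s)\n", n);
    for (int i = 0; i < n; ++i) {
        cudaDeviceProp p;
        if (cudaGetDeviceProperties(&p, i) == cudaSuccess)
            std::printf("  device %d: %s\n", i, p.name);
    }
    // If the container sets CUDA_VISIBLE_DEVICES, indices are
    // remapped, so the requested ordinal must still be < n here.
    return (wanted < n) ? 0 : -1;
}
```

In my case this would at least tell me whether device 1 or 2 exists from the container's point of view at all.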
Does anybody know what the source of the problem could be? Does MAGMA have a dedicated routine for selecting the GPU device that should be used instead of cudaSetDevice? If my code is correct, I also wonder whether the problem lies in how the container abstracts the GPU devices, and whether the code might then work correctly on a machine with real physical GPUs. Thanks in advance.
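For what it's worth, if MAGMA does have such a routine, I would guess the intended sequence looks roughly like this (magma_setdevice is my guess from skimming the headers; I have not confirmed this is the supported replacement for cudaSetDevice):

```cpp
#include <magma_v2.h>

// Sketch (unverified): select the device through MAGMA itself and
// create the queue on that device before any other MAGMA call.
int init_magma_on(int dev)
{
    if (magma_init() != MAGMA_SUCCESS)
        return 1;
    magma_setdevice(dev);             // instead of cudaSetDevice(dev)?
    magma_queue_t queue = nullptr;
    magma_queue_create(dev, &queue);  // queue bound to this device
    // ... MAGMA routines using `queue` ...
    magma_queue_destroy(queue);
    magma_finalize();
    return 0;
}
```

If the ordering above is wrong (e.g., if the device must be set before magma_init, or the queue pins the cuSPARSE handle to whichever device was current at creation), that might explain the cusparseCreate failures I am seeing.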
Regards,
Danesh