In the last link I found the recommendation for "one context per device per process". Hence, I tried the following with two GPUs and one CPU.
1. Call select_device for both GPUs from the host thread, and check the returned context.
2. Call select_device for both GPUs via two processes from a ProcessPoolExecutor, and check the returned context.
As a result I see this:
select() from host thread:
<CUDA context c_void_p(48215136) of device 0>
<CUDA context c_void_p(48215136) of device 0>
select() from child processes:
<CUDA context c_void_p(48314656) of device 0>
<CUDA context c_void_p(61306784) of device 1>
The host thread seems to use a single context (48215136). But it also appears that select_device() silently fails and always selects device 0?
"The host thread will have one context to communicate with both GPU's." Given this simple experiment, what is required to let the host thread communicate with the second GPU?
To reproduce, I used this code:
from concurrent.futures import ProcessPoolExecutor

from numba import cuda

def select(i):
    cuda.select_device(i)
    print(cuda.current_context())

def main():
    print('select() from host thread:')
    for i in range(2):
        select(i)

    print('\nselect() from child processes:')
    ex = ProcessPoolExecutor(max_workers=2)
    list(ex.map(select, range(2)))  # consume the iterator so the tasks finish and exceptions surface

if __name__ == '__main__':
    main()