Hello,
I am trying to implement a summation-reduction kernel in CUDA (using Numba), following this example:
https://github.com/ContinuumIO/numbapro-examples/blob/master/laplace2d/laplace2d-numba-gpu-improve.py
Here is the kernel code (in a module called gpu):
import math
from numba import cuda, float64

threadsperblock = (32, 32)

@cuda.jit
def _compute_sum_of_q_on_gpu(t_sne, partial_sum_q):
    i, j = cuda.grid(2)
    n = t_sne.shape[0]
    m = t_sne.shape[0]
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    # make a shared-memory array for the threads of this block and fill it with the q values
    block_shared_mem = cuda.shared.array(threadsperblock, dtype=float64)
    block_shared_mem[tx, ty] = 0
    if 0 <= j <= n - 1 and 0 <= i <= m - 1:
        # get the distance between 2 data points
        # t = math.sqrt(math.pow((t_sne[i, 0] - t_sne[j, 0]), 2) + math.pow((t_sne[i, 1] - t_sne[j, 1]), 2))  # This breaks
        t1 = t_sne[i, 0] - t_sne[j, 0]
        t2 = t_sne[i, 1] - t_sne[j, 1]
        distance = math.sqrt((t1 * t1) + (t2 * t2))
        block_shared_mem[tx, ty] = 1 / (1 + distance)
    cuda.syncthreads()
    # sum up the values of the shared-memory array to generate a partial-sum matrix
    # (which then needs to be summed up further on the cpu)
    t = threadsperblock[0] // 2
    while t > 0:
        if tx < t:
            block_shared_mem[tx, ty] = block_shared_mem[tx, ty] + block_shared_mem[tx + t, ty]
        t //= 2
        cuda.syncthreads()
    t = threadsperblock[0] // 2
    while t > 0:
        if ty < t and tx == 0:
            block_shared_mem[tx, ty] = block_shared_mem[tx, ty] + block_shared_mem[tx, ty + t]
        t //= 2
        cuda.syncthreads()
    if tx == 0 and ty == 0:
        partial_sum_q[bx, by] = block_shared_mem[0, 0]
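(As a sanity check on the reduction logic itself, here is a plain-NumPy sketch of what I intend the two while loops to do per block: pairwise halving along x first, then along y, leaving the block total in element [0, 0]. The 4x4 tile is just a small stand-in for a 32x32 block.)

```python
import numpy as np

tile = np.arange(16, dtype=np.float64).reshape(4, 4)  # stand-in for one 32x32 block
buf = tile.copy()

# first loop: repeatedly fold the upper half of the rows onto the lower half
t = buf.shape[0] // 2
while t > 0:
    buf[:t, :] += buf[t:2 * t, :]
    t //= 2

# second loop: the same folding along the columns of row 0
t = buf.shape[1] // 2
while t > 0:
    buf[0, :t] += buf[0, t:2 * t]
    t //= 2

print(buf[0, 0], tile.sum())  # both 120.0
```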
And this is how I invoke it:
import numpy as np
import math
from TSne_Numba import gpu
from numba import cuda
from timeit import default_timer as timer

def put_array_to_device(array, array_name, verbose=True):
    s = timer()
    temp = np.array(array, np.float64)
    d_array = cuda.to_device(temp)
    e = timer()
    if verbose:
        print(' Load ' + array_name + ' to device time: ' + str(e - s))
    return d_array

n = 200000  # last working value 65611
tsne = np.array(np.random.random((n, 2)), dtype=np.float32)
verbose = True

threadsperblock = (32, 32)
blockspergrid_x = math.ceil(tsne.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(tsne.shape[0] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)
num_of_dims = tsne.shape[1]
partial_sum_q = np.zeros(blockspergrid)

c = cuda.current_context(0)
mem1 = c.get_memory_info()[0]
d_tsne = put_array_to_device(tsne, 't_sne', verbose)
d_partial_sum_q = put_array_to_device(partial_sum_q, 'partial_sum_q', verbose)
mem2 = c.get_memory_info()[0]
print(mem1 - mem2)

gpu._compute_sum_of_q_on_gpu[blockspergrid, threadsperblock](d_tsne, d_partial_sum_q)
mem3 = c.get_memory_info()[0]
print(mem2 - mem3)

partial_sum_q = d_partial_sum_q.copy_to_host()
sum_q = np.sum(partial_sum_q)
print(sum_q)
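For reference, the quantity I am trying to reproduce is just this brute-force O(n^2) sum on the CPU, which I use to check small cases (the helper name here is just for illustration):

```python
import math
import numpy as np

def sum_of_q_cpu(t_sne):
    # brute-force version of the kernel: sum 1 / (1 + distance) over all (i, j) pairs
    n = t_sne.shape[0]
    total = 0.0
    for i in range(n):
        for j in range(n):
            t1 = t_sne[i, 0] - t_sne[j, 0]
            t2 = t_sne[i, 1] - t_sne[j, 1]
            total += 1.0 / (1.0 + math.sqrt(t1 * t1 + t2 * t2))
    return total

# two points at distance 1: the four (i, j) pairs give 1 + 0.5 + 0.5 + 1 = 3.0
print(sum_of_q_cpu(np.array([[0.0, 0.0], [1.0, 0.0]])))  # -> 3.0
```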
Unfortunately I cannot get it to work. The error is thrown when I try to copy d_partial_sum_q back to the host:
numba.cuda.cudadrv.driver.CudaAPIError: [719] Call to cuMemcpyDtoH results in UNKNOWN_CUDA_ERROR
I cannot see what I am doing wrong. Although my arrays are pretty big, I have plenty of space on my Titan X for them, and I do not think I am writing to any wrong parts of memory (I have been wrong about this part before, though).
By removing lines of code from the kernel to see where it stops working, I can tell that if I completely remove the code in the two while loops, the rest of the kernel works as expected.
The code also breaks (with the same error) if I use any math. functions (see the commented-out line in my kernel).
To make things weirder, I can get the whole thing to work if I lower the value of n (to something like 60000). But as I said, I do not have a memory-size problem.
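For scale, here is the quick arithmetic on the grid size at the two values of n (plain Python, no GPU involved), since the launch gets very large:

```python
import math

tpb = 32  # threads per block along each axis

for n in (200000, 60000):
    bpg = math.ceil(n / tpb)   # blocks per grid along each axis
    print(n, bpg, bpg * bpg)   # total number of blocks launched

# n = 200000 -> 6250 blocks per axis, 39062500 blocks in total
# n = 60000  -> 1875 blocks per axis,  3515625 blocks in total
```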
Any help would be highly appreciated.