Problems creating a reduction on CUDA


gdimitr...@gmail.com

Mar 23, 2017, 2:02:24 PM
to Numba Public Discussion - Public
Hello,

I am trying to implement a summation-reduction kind of kernel in CUDA, and I am following this example:
https://github.com/ContinuumIO/numbapro-examples/blob/master/laplace2d/laplace2d-numba-gpu-improve.py

Here is the kernel code (in a module called gpu):

import math

from numba import cuda, float64

threadsperblock = (32, 32)


@cuda.jit
def _compute_sum_of_q_on_gpu(t_sne, partial_sum_q):
    i, j = cuda.grid(2)

    n = t_sne.shape[0]
    m = t_sne.shape[0]

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y

    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y

    # shared memory array for this block, to be filled with the q value of each thread
    block_shared_mem = cuda.shared.array(threadsperblock, dtype=float64)

    block_shared_mem[tx, ty] = 0
    if j >= 0 and j <= n - 1 and i >= 0 and i <= m - 1:
        # get the distance between 2 data points
        # t = math.sqrt(math.pow((t_sne[i, 0] - t_sne[j, 0]), 2) + math.pow((t_sne[i, 1] - t_sne[j, 1]), 2))  # This breaks
        t1 = (t_sne[i, 0] - t_sne[j, 0])
        t2 = (t_sne[i, 1] - t_sne[j, 1])
        distance = math.sqrt((t1 * t1) + (t2 * t2))
        block_shared_mem[tx, ty] = 1 / (1 + distance)

    cuda.syncthreads()

    # sum up the values of the shared memory array to generate a partial summation
    # matrix (that needs to be summed up further on the cpu)

    # reduce along the x dimension of the block
    t = threadsperblock[0] // 2
    while t > 0:
        if tx < t:
            block_shared_mem[tx, ty] = block_shared_mem[tx, ty] + block_shared_mem[tx + t, ty]
        t //= 2
        cuda.syncthreads()

    # reduce the remaining column along the y dimension of the block
    t = threadsperblock[1] // 2
    while t > 0:
        if ty < t and tx == 0:
            block_shared_mem[tx, ty] = block_shared_mem[tx, ty] + block_shared_mem[tx, ty + t]
        t //= 2
        cuda.syncthreads()

    # thread (0, 0) writes this block's partial sum to the output matrix
    if tx == 0 and ty == 0:
        partial_sum_q[bx, by] = block_shared_mem[0, 0]


And this is how I invoke it:


import numpy as np
import math
from TSne_Numba import gpu
from numba import cuda
from timeit import default_timer as timer


def put_array_to_device(array, array_name, verbose=True):
    s = timer()
    temp = np.array(array, np.float64)
    d_array = cuda.to_device(temp)
    e = timer()
    if verbose:
        print('     Load ' + array_name + ' to device time: ' + str(e - s))
    return d_array


n = 200000  # last working value 65611
tsne = np.array(np.random.random((n, 2)), dtype=np.float32)

verbose = True
threadsperblock = (32, 32)
blockspergrid_x = math.ceil(tsne.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(tsne.shape[0] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

num_of_dims = tsne.shape[1]
partial_sum_q = np.zeros(blockspergrid)

c = cuda.current_context(0)
mem1 = c.get_memory_info()[0]

d_tsne = put_array_to_device(tsne, 't_sne', verbose)
d_partial_sum_q = put_array_to_device(partial_sum_q, 'partial_sum_q', verbose)

mem2 = c.get_memory_info()[0]
print(mem1 - mem2)

gpu._compute_sum_of_q_on_gpu[blockspergrid, threadsperblock](d_tsne, d_partial_sum_q)

mem3 = c.get_memory_info()[0]
print(mem2 - mem3)

partial_sum_q = d_partial_sum_q.copy_to_host()

sum_q = np.sum(partial_sum_q)
print(sum_q)



Unfortunately I cannot get it to work. The error is thrown when I try to copy d_partial_sum_q back to the host:

numba.cuda.cudadrv.driver.CudaAPIError: [719] Call to cuMemcpyDtoH results in UNKNOWN_CUDA_ERROR


I cannot see what I am doing wrong. Although my arrays are pretty big, there is plenty of room for them on my Titan X, and I do not think I am writing to any wrong parts of memory (though I have been wrong about that before).

By removing lines from the kernel to see where it stops working, I can tell that if I completely remove the code in the while loops, the rest of the kernel works as expected.

The code also breaks (with the same error) if I use any math.* functions (see the commented-out line in my kernel).

To make things weirder, the whole thing works if I lower the value of n (to something like 60000). But as I said, memory size is not the problem.

Any help would be highly appreciated.

gdimitr...@gmail.com

Mar 24, 2017, 5:28:05 AM
to Numba Public Discussion - Public
To answer my own question: there is nothing wrong with that code.

The problem was that a Windows update had reset my GPU driver timeout (the TdrDelay registry value, which controls when Windows kills a "stuck" display driver) from the large value I had set it to back down to 8 seconds, so Windows was killing the kernel before it finished. Unfortunately I didn't even get the usual Windows popup saying it had very helpfully restarted the NVIDIA drivers for me. Once I put TdrDelay back up to 60 seconds, the code runs just fine.
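In case anyone else runs into this, here is a minimal sketch (my own addition, nothing to do with Numba itself) of reading the current value from Python on Windows. The key path and value name below are the standard TDR registry settings; changing them requires an elevated prompt and a reboot to take effect.

# Minimal sketch: read the TdrDelay value (in seconds) from the Windows registry.
# HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers is the standard location
# for the TDR settings; if the value is absent, the driver default timeout applies.
import winreg

key_path = r"SYSTEM\CurrentControlSet\Control\GraphicsDrivers"
with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path) as key:
    try:
        tdr_delay, _ = winreg.QueryValueEx(key, "TdrDelay")
        print("TdrDelay =", tdr_delay, "seconds")
    except FileNotFoundError:
        print("TdrDelay not set; the driver default timeout applies")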

What I also discovered is that this

math.sqrt(math.pow((t_sne[i, 0] - t_sne[j, 0]), 2) + math.pow((t_sne[i, 1] - t_sne[j, 1]), 2))
is much slower than this

t1 = (t_sne[i, 0] - t_sne[j, 0])
t2 = (t_sne[i, 1] - t_sne[j, 1])
distance = math.sqrt((t1 * t1) + (t2 * t2))
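
Separately, here is a minimal sketch (my own addition) of sanity-checking the kernel's output against a direct NumPy computation. It assumes a small n, since it builds the full O(n^2) pairwise distance matrix in host memory.

# Hypothetical check, only feasible for small n: compute the same sum of
# 1 / (1 + distance) over all pairs directly with NumPy.
import numpy as np

n_small = 2000
tsne_small = np.random.random((n_small, 2)).astype(np.float64)

diffs = tsne_small[:, None, :] - tsne_small[None, :, :]
distances = np.sqrt((diffs ** 2).sum(axis=-1))
sum_q_cpu = (1.0 / (1.0 + distances)).sum()
print(sum_q_cpu)
# Running the kernel on the same tsne_small array and summing partial_sum_q
# should give the same number up to floating point error.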


Anyway, I thought I'd post my "solution" here to save anyone else the trouble of figuring it out themselves.



