Different results from CUDA kernel when using local variable inside loop

anthony...@gmail.com

Apr 29, 2016, 11:18:08 AM
to Numba Public Discussion - Public
Hi everyone,

I have been attempting to implement a CUDA kernel and have come across some strange behaviour. The behaviour relates to the use of local variables inside loops within the kernel function. The heart of the matter is that the kernel code (see the numbered code below) gives different results from those calculated using Numpy (see the test case code at the bottom of the post). Specifically:

a) Using the kernel code as listed below (and exercising it with the test case) results in the following:
Result from CUDA kernel : -1394.3506603578408
Result from Numpy : 300.53724387519816

b) Uncommenting line 26 of the kernel code causes the Numpy and CUDA kernel results to match.

Running the kernel under the CUDA simulator produces results that match Numpy in both a) and b).
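In case anyone wants to reproduce the simulator run: as I understand it, the simulator is enabled via the NUMBA_ENABLE_CUDASIM environment variable, which has to be set before Numba is imported. A minimal sketch:

# Enable the CUDA simulator; this must happen before numba is imported.
import os
os.environ['NUMBA_ENABLE_CUDASIM'] = '1'

from numba import cuda  # kernels launched after this point run on the simulator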

I am using Python 2.7.11, Anaconda 2.5.0 and Numba 0.25.0.

I have created this example in order to demonstrate a problem I am seeing in a different kernel implementation, so my apologies for the contrivance.

Any help would be most appreciated!

==============
Kernel code:
==============
1   @cuda.jit(device=True)
2   def dev_sum(a, b):
3       return a + b
4   
5   @cuda.jit
6   def cu_sum(a, num_to_sums, b):
7       sa = cuda.shared.array(shape=(512,), dtype=float64)
8       tx = cuda.threadIdx.x
9       bx = cuda.blockIdx.x
10      bw = cuda.blockDim.x
11      i = tx + bx * bw
12  
13      sa[tx] = 0
14      if i < a.shape[0]:
15          num_choices = 200
16          num_to_sum = num_to_sums[i]
17          l_a = cuda.local.array(shape=num_choices, dtype=float64)
18  
19          l_sum = float64(0.)
20          for j in xrange(num_to_sum):
21              l_a[j] = math.exp(a[i, j])
22              l_sum += l_a[j]
23  
24          l_b = float64(0.)
25          # uncommenting line 26 means results match those expected (i.e. from Numpy)
26          #l_one = l_sum / l_sum
27          for j in xrange(num_to_sum):
28              # fails unless line 26 is uncommented
29              l_b += math.log(l_a[j]) * l_sum / l_sum
30  
31          sa[tx] = l_b
32          cuda.syncthreads()
33          if tx == 0:
34              # Uses the first thread of each block to perform the actual reduction
35              s = sa[tx]
36              for j in range(1, bw):
37                  s = dev_sum(s, sa[j])
38              b[bx] = s
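For completeness, the numbered listing above assumes the usual imports (omitted so the line references stay intact); something like:

import math
from numba import cuda, float64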

==============
Test case to exercise the kernel above and generate results for comparison:
==============
import math
import numpy as np
from numba import cuda

def test_cu_sum(self):
    n = 1000  # total length of vector to be summed
    np.random.seed(seed=1112)
    a = np.random.uniform(low=-100, high=100, size=(n, 200)).astype(np.float64)
    num_to_sums = np.random.randint(1, 200, n)

    # add up a different number of items from each dimension of n
    n_dim = 0
    numpy_sum_a = np.float64(0.)
    for num in num_to_sums:
        for i in xrange(num):
            numpy_sum_a += np.log(np.exp(a[n_dim, i]))
        n_dim += 1

    d_a = cuda.to_device(a)
    grid_size = n
    threads = 512
    threads_per_block = (threads, 1)
    blocks_per_grid = int(math.ceil(grid_size / float(threads)))
    b = np.zeros((blocks_per_grid,), dtype=np.float64)
    d_b = cuda.to_device(b)
    d_num_to_sums = cuda.to_device(num_to_sums)

    cu_sum[blocks_per_grid, threads_per_block](d_a, d_num_to_sums, d_b)
    d_b.to_host()  # copy the per-block results back into b
    sum_a = 0
    for result_index in xrange(blocks_per_grid):
        sum_a += b[result_index]

    self.assertAlmostEqual(sum_a, numpy_sum_a)

Anthony Senyard

May 10, 2016, 10:19:17 PM
to Numba Public Discussion - Public
For reference (in case anyone is interested), I've created an issue on GitHub: https://github.com/numba/numba/issues/1891.