Device array memory allocation


Chris

Sep 25, 2017, 3:11:09 AM
to Numba Public Discussion - Public
I am working on some basic stencil code to show as an example. The stencil code isn't 100% finished, but I noticed something strange happening each time the code is run. The code is split into two files, main and shared. At the bottom of shared I allocate a device memory array d_deriv to fill with the derivative values calculated from the stencil. If I run this and plot the output, the plotted derivative keeps growing with each run of the code. It seems like the memory locations from the previous run are being reused, so each run adds on top of all the previous ones. If, instead, I do cuda.to_device(np.zeros(n, dtype = np.float32)), the memory locations are re-zeroed and this isn't an issue. Is this intended, or am I supposed to re-zero the d_deriv array within the kernel each time it runs if I use a device array?


########main.py########

import numpy as np
import matplotlib.pyplot as plt
import math

N = 128*2

def main():
    x = np.linspace(0, 1, N, dtype = np.float32)
    #from parallel import pArray, derivativeArray
    from shared import pArray, derivativeArray
    f = pArray(x)
    plt.plot(x, f, 'o', label = 'f(x) = x^2')

    dfdx = derivativeArray(f, 1)
    plt.plot(x, dfdx, 'o', label = 'First derivative')
    #print dfdx

    #d2fdx2 = derivativeArray(f, 2)
    #plt.plot(x, d2fdx2, 'o', label = 'Second derivative')
    plt.legend()
    plt.show()

if __name__ == '__main__':
    main()


########shared.py########

import numpy as np
from numba import cuda, float32

TPB = 128
NSHARED = 130 # This value must agree with TPB + 2*RAD

@cuda.jit(device = True)
def parab(x0):
    return x0**2

@cuda.jit
def pKernel(d_f, d_x):
    i = cuda.grid(1)
    n = d_x.size

    if i < n:
        d_f[i] = parab(d_x[i])

def pArray(x):
    n = x.size

    d_x = cuda.to_device(x)
    d_f = cuda.device_array(n, dtype = np.float32)
    pKernel[(n + TPB - 1)//TPB, TPB](d_f, d_x)

    return d_f.copy_to_host()

@cuda.jit
def derivativeKernel(d_deriv, d_f, d_stencil):
    i      = cuda.grid(1)
    n      = d_f.size
    radius = len(d_stencil)//2
    sh_f   = cuda.shared.array(NSHARED, dtype = float32)

    if i >= n:
        return

    tIdx  = cuda.threadIdx.x
    shIdx = tIdx + radius

    #Load Regular Cells
    sh_f[shIdx] = d_f[i]

    #Load Left Halo Cells
    if tIdx < radius:
        if i == 0:
            sh_f[tIdx] = 0
        else:
            sh_f[tIdx] = d_f[i - radius]

    #Load Right Halo Cells
    elif tIdx > cuda.blockDim.x - 1 - radius:
        if i == n - 1:
            sh_f[tIdx + 2*radius] = 0
        else:
            sh_f[tIdx + 2*radius] = d_f[i + 2*radius]

    cuda.syncthreads()

    #Only write values where the full stencil is "in bounds"
    if radius <= i <= n - 1 - radius:
        for j in range(len(d_stencil)):
            d_deriv[i] += 1 #sh_f[shIdx + j - radius]*d_stencil[j]

def derivativeArray(f, order):
    n = f.size
    h = 1./(n - 1.)

    if order == 1:
        stencil = (1./(2.*h))*np.array([-1., 0., 1.], dtype = np.float32)
    elif order == 2:
        stencil = (1./h**2)*np.array([1., -2., 1.], dtype = np.float32)

    d_f       = cuda.to_device(f)
    #d_deriv   = cuda.to_device(np.zeros(n, dtype = np.float32))
    d_deriv   = cuda.device_array(n, dtype = np.float32)
    d_stencil = cuda.to_device(stencil)
    derivativeKernel[(n + TPB - 1)//TPB, TPB](d_deriv, d_f, d_stencil)

    return d_deriv.copy_to_host()

Chris

Sep 25, 2017, 1:14:56 PM
to Numba Public Discussion - Public
After some thought, what is happening makes sense. I am using += instead of =, so I am incrementing garbage values. I added d_deriv[i] = 0 at the start of the kernel to zero out the device array before anything else touches it. Is there an easier way to generate a zeroed device array that doesn't involve this, or copying a zeroed NumPy array over to the GPU?
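
For reference, a minimal sketch of that change, using a simplified global-memory variant of the kernel rather than the shared-memory version above (the kernel name and the direct global-memory indexing are only illustrative, not the original code):

import numpy as np
from numba import cuda

@cuda.jit
def derivativeKernelZeroed(d_deriv, d_f, d_stencil):
    i = cuda.grid(1)
    n = d_f.size
    if i >= n:
        return
    # Clear this thread's output slot first; the memory returned by
    # cuda.device_array() is uninitialized, so += would accumulate garbage.
    d_deriv[i] = 0
    radius = len(d_stencil)//2
    # Only write values where the full stencil is in bounds.
    if radius <= i <= n - 1 - radius:
        for j in range(len(d_stencil)):
            d_deriv[i] += d_f[i + j - radius]*d_stencil[j]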

Stanley Seibert

Sep 25, 2017, 1:27:24 PM
to Numba Public Discussion - Public
We don't currently have a GPU equivalent of np.zeros(), so you either have to copy a zeroed NumPy array over (slower, but short to write) or write a CUDA kernel to zero the memory.
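
As an illustration of the second option, a small kernel can zero the array on the device; the kernel and helper names below are just for the example, not part of Numba's API:

import numpy as np
from numba import cuda

@cuda.jit
def zeroKernel(d_arr):
    i = cuda.grid(1)
    if i < d_arr.size:
        d_arr[i] = 0

def zeroed_device_array(n, tpb=128):
    # Allocate uninitialized device memory, then launch a kernel to
    # clear it on the GPU instead of copying np.zeros(n) from the host.
    d_arr = cuda.device_array(n, dtype=np.float32)
    zeroKernel[(n + tpb - 1)//tpb, tpb](d_arr)
    return d_arr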

