I'm trying to understand how to use local memory. In the code below, if N is small enough, the code runs perfectly; however, if N>32, I get the following error:
pyopencl._cl.LogicError: clEnqueueNDRangeKernel failed: INVALID_WORK_GROUP_SIZE
I've read that this happens because local_size exceeds the max work group size allowed by my hardware. How can the code be modified to use local memory when global_size exceeds my max work group size?

    import numpy as np
    import reikna.cluda as cluda

    N = 128

    api = cluda.ocl_api()
    thr = api.Thread.create()

    program = thr.compile("""
    KERNEL void multiply_them(
        GLOBAL_MEM float *dest,
        GLOBAL_MEM float *a,
        GLOBAL_MEM float *b)
    {
        const SIZE_T i = get_local_id(0);
        dest[i] = a[i] * b[i];
    }
    """)

    multiply_them = program.multiply_them

    a = np.random.randn(N, N).astype(np.float32)
    b = np.random.randn(N, N).astype(np.float32)
    a_dev = thr.to_device(a)
    b_dev = thr.to_device(b)
    dest_dev = thr.empty_like(a_dev)

    multiply_them(dest_dev, a_dev, b_dev, local_size=N*N, global_size=N*N)
    print(dest_dev.get() - a * b)

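For anyone hitting the same error, here is a rough sketch of the indexing involved. It is a pure-Python emulation, not Reikna or OpenCL code: `run_kernel`, `use_global_id`, and the `MAX_WORK_GROUP_SIZE` value of 256 are made up for illustration. It only relies on the fact that, with a zero global offset, `get_global_id(0)` equals `get_group_id(0) * get_local_size(0) + get_local_id(0)`.

```python
# Pure-Python emulation of 1-D OpenCL work-item indexing (no GPU or Reikna needed).
# MAX_WORK_GROUP_SIZE is a hypothetical device limit chosen for illustration.
MAX_WORK_GROUP_SIZE = 256


def run_kernel(a, b, global_size, local_size, use_global_id):
    """Emulate dest[i] = a[i] * b[i] over a 1-D range of work-items."""
    if local_size > MAX_WORK_GROUP_SIZE:
        # Mirrors the INVALID_WORK_GROUP_SIZE failure from the question.
        raise ValueError("INVALID_WORK_GROUP_SIZE")
    if global_size % local_size != 0:
        # Simplifying assumption of this sketch: work-groups tile the range evenly.
        raise ValueError("global_size must be a multiple of local_size")

    dest = [None] * global_size
    for group_id in range(global_size // local_size):
        for local_id in range(local_size):
            # Global id of this work-item, assuming a zero global offset.
            global_id = group_id * local_size + local_id
            i = global_id if use_global_id else local_id
            dest[i] = a[i] * b[i]
    return dest


N = 128
a = [float(k) for k in range(N * N)]
b = [2.0] * (N * N)

# Indexing by local id only ever touches the first local_size elements.
partial = run_kernel(a, b, global_size=N * N, local_size=64, use_global_id=False)
written_with_local_id = sum(x is not None for x in partial)

# Indexing by global id covers the whole array with the same small work-groups.
full = run_kernel(a, b, global_size=N * N, local_size=64, use_global_id=True)
written_with_global_id = sum(x is not None for x in full)
```

If this picture is right, the corresponding change in the kernel above would be to index with `get_global_id(0)` and to pass a `local_size` no larger than the device limit. As far as I can tell Reikna also picks a local size itself when `local_size` is omitted, but that is worth checking against its documentation.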
Thanks for the overview on local memory and the reference links. I've been looking for ways to optimize my code, which integrates the Kuramoto-Sivashinsky equation (a nonlinear PDE), and the links should help with that. Do you have any Reikna-specific tips for getting optimal performance?
I'll take a look at the integrator code you linked -- it should at least give me an idea of how to join computations in Reikna. Thanks for the suggestions!