using local memory

dan89p...@gmail.com

unread,

Mar 14, 2019, 1:06:07 PM3/14/19

to reikna

Hi!

I'm trying to understand how to use local memory. In the code below, if N is small enough, the code runs perfectly; however, if N>32, I get the following error:

pyopencl._cl.LogicError: clEnqueueNDRangeKernel failed: INVALID_WORK_GROUP_SIZE

I've read that this happens because local_size exceeds the max work group size allowed by my hardware. How can the code be modified to use local memory when global_size exceeds my max work group size?

import numpy as np
import reikna.cluda as cluda
N = 128
api = cluda.ocl_api()
thr = api.Thread.create()
program = thr.compile("""
KERNEL void multiply_them(
GLOBAL_MEM float *dest,
GLOBAL_MEM float *a,
GLOBAL_MEM float *b)
{
const SIZE_T i = get_local_id(0);
dest[i] = a[i] * b[i];
}
""")
multiply_them = program.multiply_them
a = np.random.randn(N,N).astype(np.float32)
b = np.random.randn(N,N).astype(np.float32)
a_dev = thr.to_device(a)
b_dev = thr.to_device(b)
dest_dev = thr.empty_like(a_dev)
multiply_them(dest_dev, a_dev, b_dev, local_size=N*N, global_size=N*N)
print(dest_dev.get() - a * b )

Bogdan Opanchuk

unread,

Mar 14, 2019, 9:20:32 PM3/14/19

to reikna

Local size means how many work items are executed on a single compute unit (AMD terminology, nVidia calls them multiprocessors). This effectively means that all these work items (a workgroup in terms of OpenCL) can exchange data (through local memory, which has a much smaller latency than global memory, but is also quite limited in size) and synchronise their execution (using barriers). The size of a workgroup is limited, depending on the specific card it's usually 1024 or 2048 - that's why you get an error when you try to request a local size greater than 32x32. For larger kernels it can also be limited by the number of registers your code uses.

Since you don't do any data exchange or synchronisation between work items, you don't have to specify the local size - OpenCL driver will do that for you, and you only need to call `get_global_id()` in the kernel. You may want to set it manually, if you want to try and optimise occupancy yourself instead of letting the driver do it (occupancy means, basically, how many work items are actually being concurrently executed on a compute unit as compared to the total supported amount), but often it is not necessary.

If you plan to use local memory, you will have to set the local size that is different from the global size (namely, in range between 1 and max_work_group_size), and use both `get_local_id()` to see where you are in the workgroup and access the local memory, and `get_global_id()` to guide the access to global memory.

You can have a look at

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#programming-model

https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#occupancy

They describe CUDA, but OpenCL programming model is pretty much the same, just with a different terminology.

dan89p...@gmail.com

unread,

Mar 15, 2019, 3:43:31 PM3/15/19

to reikna

Hi Bogdan,

Thanks for the overview on local memory and the reference links. I've been looking for ways to optimize my code which integrates the Kuramoto-Sivashinsky equation (a nonlinear PDE), and the links should help with that. Do you have any Reikna specific tips to getting optimal performance?

Bogdan Opanchuk

unread,

Mar 16, 2019, 9:45:27 AM3/16/19

to reikna

By the looks of it, it's probably solved in the momentum space, so you won't even need FFT? If that's the case, you shouldn't be concerned about local memory.

As a matter of fact, Reikna was originally created and mainly developed for the purpose of integrating nonlinear Shroedinger equation (including its stochastic modifications), and the actual integrator resides in https://github.com/fjarri/reikna-integrator . Not sure how helpful it is, since, unfortunately, there are no docs (there are docstrings though, and a usage example). There are only fixed-step methods there though, because I was mostly interested in equations with stochastic terms, and that's what worked best (well, there is some ad-hoc adaptive step system, but it's rather wonky).

As for the optimization, most of it is the low-level stuff (global memory coalescing, local memory banks, avoiding branches in a single warp etc) you can find in the Best Practices guide I linked earlier. I guess the coalescing bit is mostly what you will have to worry about, and it's usually trivial to achieve - just access sequential words in the global memory from sequential work items.

The Reikna-specific things are joining computations together into bigger ones, so that you could take advantage of the automatic temporary memory (http://reikna.publicfields.net/en/latest/api/core.html#reikna.core.computation.ComputationPlan.temp_array) and preparing all you can (in particular, compiling all the computations) before running the main integration loop.

dan89p...@gmail.com

unread,

Mar 16, 2019, 1:16:35 PM3/16/19

to reikna

It can be solved in momentum space, but that's not the method I'm using, because it cannot be extended to other PDEs I study. The integration method I'm using is called Exponential Time Differencing with RK4. Essentially, it alternates between real space nonlinear steps and Fourier space steps which involve element-wise multiplication/addition of matrices. So roughly half of the lines of code in the stepper are FFTs. Is local memory something I should think in this situation?

I'll take a look at the integrator code you linked -- it should at least give me an idea of how to join computations in Reikna. Thanks for the suggestions!

Reply all

Reply to author

Forward