CUDA/numba: kernel/thread organization for complex problems -- is concept understanding proper?


Brian Merchant

Apr 27, 2016, 5:52:49 PM4/27/16
to Numba Public Discussion - Public
Hi all,

CUDA's API natively allows threads and blocks to be referenced by up to three indices, allowing one to easily think of them in 1D, 2D, and 3D.

Of course, this type of indexing is just fancy indexing over a "1D list" of threads, so all threads can be addressed by a single integer. 

So, if one wanted to work with a problem where 4 indices are natural (4D), then all one has to do is go from the 4D index to a 1D index. Let me illustrate with an example. Say that you have P polygons (where P is a natural number), where each polygon has V vertices. We want to calculate the distance between every pair of vertices (across all polygons) and store the results in a distance matrix that has 4 indices: r, s, t, u; so that if I wanted the distance from Polygon 0, Vertex 0 to Polygon 10, Vertex 5, I would look up the (0, 0, 10, 5) element of the distance matrix. In this case, one could convert from a 4D (r, s, t, u) index to a 1D index i as follows: i = r*(V*P*V) + s*(P*V) + t*V + u.

I could give the problem of calculating entry (r, s, t, u) of the distance matrix to thread i.
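To make that concrete, here is a rough numba sketch of the mapping I have in mind (assuming 2D vertices stored in a (P, V, 2) array and the distance matrix flattened into a 1D device array of length P*V*P*V; the kernel and argument names are just for illustration):

import math
from numba import cuda

@cuda.jit
def pairwise_vertex_distances(vertices, distances, P, V):
    i = cuda.grid(1)                 # flat 1D thread index
    if i >= P * V * P * V:
        return
    # invert i = r*(V*P*V) + s*(P*V) + t*V + u
    r = i // (V * P * V)
    rem = i % (V * P * V)
    s = rem // (P * V)
    rem = rem % (P * V)
    t = rem // V
    u = rem % V
    dx = vertices[r, s, 0] - vertices[t, u, 0]
    dy = vertices[r, s, 1] - vertices[t, u, 1]
    distances[i] = math.sqrt(dx * dx + dy * dy)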

(any conceptual errors so far?) 

========================================================

Now say that we had a 4D matrix, where each element required more complex calculations than simply figuring out a distance. Would I then want to let more than one thread work on calculating each element, that is, assign each element to a "block" of threads? I am not sure if that question makes sense, because even if I could have more than one thread working on each element, how would the threads figure out how to divvy up the task of calculating the element between them? Clearly, that has to come from me somehow, in the form of kernels.

Okay, let's say that this more complicated task can be written as the composition of 2 kernels, A and B. That is, A is executed first, then B is executed. Kernel B would be able to use the results of kernel A, but kernel A executes independently of other instances of kernel A. I can achieve this type of dependence by having a "sync thread" call between the execution of kernel A and kernel B---right? No, wrong! You can't call one kernel from inside another kernel, this gets into a whole other thing called "dynamic parallelism" (see: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-dynamic-parallelism), which numba doesn't support yet (https://groups.google.com/a/continuum.io/forum/#!search/dynamic$20parallelism/numba-users/BRW0sWZkpeQ/DtpjNPOtnhYJ). 

Instead, what I should do is something like this (bpg = blocks per grid, tpb = threads per block; numba launch configurations take the grid dimensions first, i.e. kernel[bpg, tpb]):

import numpy as np
from numba import cuda

result_A_matrix = np.zeros(...)  # shapes/dtypes elided
result_B_matrix = np.zeros(...)

gpu_result_A = cuda.device_array_like(result_A_matrix)
gpu_result_B = cuda.device_array_like(result_B_matrix)

kernelA[bpg, tpb](arg1, arg2, gpu_result_A)                # gpu_result_A stores the results of the kernelA calculations
kernelB[bpg, tpb](arg1, arg2, gpu_result_A, gpu_result_B)  # gpu_result_A is an input to kernelB; kernelB writes its results into gpu_result_B
gpu_result_B.copy_to_host(result_B_matrix)                 # copy kernelB's results back into result_B_matrix

An alternative is to write a single kernel that does what kernelA and kernelB do, but this kernel might be super long, and require "sync thread" calls to be placed strategically.

=====================================================================

Okay, great. So what's the point of blocks, or grids? Why care about blocks at all, unless you want to do some shared memory stuff? And doesn't the GPU already do caching, making that unnecessary to handle at our level? Why care about grids, which don't even allow "block synchronization" in the way that blocks allow "thread synchronization"?

(thanks for reading thus far)

Brian

Diogo Silva

Apr 28, 2016, 4:04:49 AM4/28/16
to Numba Public Discussion - Public
This is an interesting case. I haven't checked the math of the 4D index. Let me just address some points:

First, I'm not so sure there isn't something more to the indexing, because "for GPUs with CUDA compute capability 2.x the maximum number of threads is 1024 for the x- or y-dimensions, 64 for the z-dimension, and an overall maximum of 1024 threads per block". The different caps for the different dimensions are what make me uncertain.

Concerning blocks, you seem to be looking at them from a purely logical level, but you have to go lower than that to understand blocks. GPUs (usually) have multiple streaming multiprocessors (SMs), which in turn have multiple simpler cores that execute the same "line of code" at the same time. Each of these cores executes a thread. A block of threads is assigned to an SM, but not all threads in a block are executed at the same time (unless it's a very small block). Blocks are further divided into warps, and the threads inside a warp are the ones that actually execute in parallel. When you do a thread sync, the cores drop the current warp and execute the other warps in the block up to the same point.

> each element required more complex calculations than simply figuring out a distance. Would I then want to let more than one thread work on calculating each element, that is, I would assign each element to a "block" of threads?

I'd say no, especially if the tasks are to be computed sequentially. You'll want all threads to execute mostly the same code (they always will anyway; what I mean here is that you don't want many branches), so that a whole kernel is not hanging because some small set of threads runs for a really long time. But you mostly figured that out:

> how would the threads figure out how to divvy up the task of calculating the element between them? Clearly, that has to come from me somehow, in the form of kernels.

> but kernel A executes independently of other instances of kernel A

What do you mean by this? Without dynamic parallelism, the host is the one responsible for control. You may only execute one kernel at a time and the host is the one that issues the kernel.


Brian Merchant

Apr 28, 2016, 3:26:14 PM4/28/16
to Numba Public Discussion - Public
Hi Diogo,
 
> First, I'm not so sure there isn't something more to the indexing, because "for GPUs with CUDA compute capability 2.x the maximum number of threads is 1024 for the x- or y-dimensions, 64 for the z-dimension, and an overall maximum of 1024 threads per block". The different caps for the different dimensions are what make me uncertain.

Hmm. Maybe I misunderstood, but I thought I could have an arbitrary number of "thread IDs", since each thread ID is calculated based on the block and thread it is in, and I can have an arbitrary number of blocks[1] in a grid. Is it the number of threads per block that is limited in the manner you suggested? I suppose a side effect of this system is that a thread ID may not refer to a unique "hardware thread" over the lifetime of the computation, since if a thread completes a calculation and there are more calculations scheduled, it may get a new task with a new thread ID?

[1] http://docs.nvidia.com/cuda/cuda-c-programming-guide/#thread-hierarchy : "The number of thread blocks in a grid is usually dictated by the size of the data being processed or the number of processors in the system, which it can greatly exceed."
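(For reference, the "thread ID" I'm talking about is just the usual combination of block and thread indices computed inside the kernel; numba's cuda.grid(1) is shorthand for the same expression.)

tid = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x  # inside a @cuda.jit kernel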


>> but kernel A executes independently of other instances of kernel A
>
> What do you mean by this? Without dynamic parallelism, the host is the one responsible for control. You may only execute one kernel at a time and the host is the one that issues the kernel.

Ah, by this I mean that when I "execute" a kernel numba-side, what really happens is that many instances of that kernel body are executed GPU-side, one per thread?

Kind regards,
Brian

Diogo Silva

Apr 29, 2016, 8:01:06 AM4/29/16
to Numba Public Discussion - Public
Both the number of threads per block and the number of blocks per grid (in each dimension) have caps. The caps I gave above are for the number of threads in a block, for a certain CUDA compute capability. But the total number of blocks is also capped - most likely you'll never hit the overall cap, but you may well hit the cap of some dimension if you're working with very big datasets, depending on how you're organizing your grid - this happened to me, and I had to devise a scheme to use the block grid in a different way than I had originally planned.
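(If I remember correctly, you can query the caps for your GPU from numba via the device attributes; the attribute names below are the ones numba exposes, and the values depend on the device.)

from numba import cuda

dev = cuda.get_current_device()
print(dev.MAX_THREADS_PER_BLOCK)
print(dev.MAX_BLOCK_DIM_X, dev.MAX_BLOCK_DIM_Y, dev.MAX_BLOCK_DIM_Z)
print(dev.MAX_GRID_DIM_X, dev.MAX_GRID_DIM_Y, dev.MAX_GRID_DIM_Z)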

Diogo Silva

Apr 29, 2016, 8:02:36 AM4/29/16
to Numba Public Discussion - Public
> Ah, by this I mean that when I "execute" a kernel numba-side, what really happens is that many instances of that kernel are executed per thread, GPU-side?

And yes, this is correct.

Stanley Seibert

May 1, 2016, 7:12:18 PM5/1/16
to Numba Public Discussion - Public
Hi, I wanted to jump in and give some high level thoughts, which might help:

Achieving high performance in parallel architectures is mostly about managing data locality (in a variety of forms).  Poor locality costs time and energy, and idles compute resources while the memory system does all the work.  Viewing the CUDA architecture (and your own code) through this lens helps explain a lot about the design.

Fundamentally, CUDA forces you to think about locality by organizing threads in a two-layer block structure:
  • Threads within a block can communicate easily with each other during kernel execution, and are guaranteed to be simultaneously active (available for scheduling) and co-located on the same multiprocessor.
  • Threads in different blocks cannot communicate easily (only global atomic operations are safe), may be running on different multiprocessors, and may be active at different times.

The primary benefit of shared memory (with all the caching) at this point is to act as a place where threads in a block can exchange data.  The __syncthreads() function exists to prevent race conditions when accessing shared memory.  The shared memory abstraction allows some threads to communicate much more efficiently than if any thread (regardless of block) could synchronize with any other thread.  Organizing this communication can be the deciding factor in selecting a block size (along with shared memory constraints).

(Note: people often ask where the __syncthreads equivalent for blocks is. This is the kernel execution boundary itself. If you need to exchange data between threads in different blocks, then they need to write it to global memory and allow the kernel to terminate. The next kernel can assume the data from the previous kernel is synchronized and available. This sounds a little clunky, but when you launch CUDA kernels, they are queued by the driver asynchronously into a single CUDA stream and run sequentially with low overhead. This is why you never need to do a device synchronize between kernels on the same stream.)

However, a lot of parallel algorithms (like ufuncs) don't need any data exchange between threads in shared memory at all.  In that case, the selection of block size is somewhat arbitrary.  In these cases, the block size with the highest throughput depends frequently on micro-architectural details that are basically impossible to predict ahead of time.  When I really care, I have my code scan through a range of reasonable block sizes (usually multiples of 32 or 64 from 64 up through 1024) in an "auto-tuning" phase before selecting a value to use for the rest of program execution.
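(Something like the following sketch captures that auto-tuning idea; the function name, the candidate list, and the assumption of a 1D launch over n elements are all just illustrative.)

import time
from numba import cuda

def autotune_block_size(kernel, args, n, candidates=(64, 128, 256, 512, 1024)):
    best_tpb, best_time = None, float("inf")
    for tpb in candidates:
        bpg = (n + tpb - 1) // tpb         # enough blocks to cover n elements
        kernel[bpg, tpb](*args)            # warm-up launch (also triggers JIT for this config)
        cuda.synchronize()
        start = time.perf_counter()
        kernel[bpg, tpb](*args)            # timed launch
        cuda.synchronize()
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_tpb, best_time = tpb, elapsed
    return best_tpb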

Another aspect of locality is ensuring you maximize the "coalescing" of reads and writes. Whenever the threads in a warp perform a read or write, the load/store units on the multiprocessor attempt to "coalesce" the target addresses into memory transactions that access consecutive sequences of memory locations. The size of these transactions varies between CUDA architectures, but for global memory, I believe they are still 32 bytes. If you have an entire warp accessing consecutive float32 values in memory, then this can be serviced by 4 memory transactions. However, if the warp is accessing random locations, then it can take up to 32 memory transactions to service them all, dramatically lowering memory bandwidth. So this is another thing to consider when deciding how to map work to threads: the mapping which maximizes coalescing of memory access is generally the best.

Shared memory can also play a role in cases where you can confine your scatter/gather operations to a block that fits within the shared memory size.  A common pattern seen in CUDA algorithms is:

  • All the threads in the block load some data into a shared memory array in a coalesced fashion
  • __syncthreads()
  • All the threads in the block then random access read the shared memory data as needed.

This is generally much faster than having threads random access arrays in global memory.  Again, the needs here often determine the block size.
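In numba, a minimal sketch of that pattern (assuming a fixed block size of 128, float32 data whose length is a multiple of the block size, and a "random" access that is just a reversal within the tile) might look like:

from numba import cuda, float32

TPB = 128  # block size; the shared array below must be sized with a compile-time constant

@cuda.jit
def block_reverse(data, out):
    tile = cuda.shared.array(TPB, dtype=float32)
    i = cuda.grid(1)
    tx = cuda.threadIdx.x
    if i < data.size:
        tile[tx] = data[i]            # coalesced load: consecutive threads, consecutive addresses
    cuda.syncthreads()                # make the whole tile visible to every thread in the block
    if i < out.size:
        out[i] = tile[TPB - 1 - tx]   # arbitrary (here: reversed) access served from shared memory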

The multidimensional grid and block indexing (and limits therein) do seem to be a bit of an anachronism from the GPU's origin as a processor for raster algorithms.  Personally, I've found it much easier to think about CUDA algorithms by always using 1D block and grid dimensions, and then dynamically mapping those indices to the work elements.  Even when dealing with 1D data, I find it convenient to not require the size of my grid to match the size of my input or output array.  In older versions of CUDA, I found that kernel launch time seemed to scale with the number of blocks, so for simple kernels it was better to pick a number of threads that oversubscribed the GPU by some amount (say, number of CUDA cores times 4 to 10) and then to use for loops and strided indexing to ensure that all the data was processed.  It is still important for warps to access consecutive memory locations, but entire warps can jump by large strides in memory as long as they do it together.
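For example, a 1D grid-stride kernel in numba looks roughly like this (the kernel itself is just illustrative):

from numba import cuda

@cuda.jit
def scale_in_place(data, factor):
    start = cuda.grid(1)       # this thread's global 1D index
    stride = cuda.gridsize(1)  # total number of threads launched
    # consecutive threads touch consecutive elements, so each warp's accesses stay coalesced
    for i in range(start, data.size, stride):
        data[i] *= factor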


OK, maybe that was a bit of a firehose brain dump, but hopefully that helps a bit.   :)


Diogo Silva

May 2, 2016, 1:24:42 AM5/2/16
to Numba Public Discussion - Public
This is great - I wish someone had told me this when I was just starting. From personal experience, I can say that memory coalescing can have a big impact on performance: as a consequence of not transposing a matrix before sending it to the GPU, I got much worse results when I was working on an implementation of K-Means.
