> First, I'm not so sure there isn't something more to the indexing, because "for GPUs with CUDA compute capability 2.x the maximum number of threads is 1024 for the x- or y-dimensions, 64 for the z-dimension, and an overall maximum number of threads of 1024". The different cap sizes for the different dimensions are the thing that is making me uncertain.
>> but kernel A executes independently of other instances of kernel A
>
> What do you mean by this? Without dynamic parallelism, the host is the one responsible for control. You may only execute one kernel at a time, and the host is the one that issues the kernel.
With modern GPUs doing so much caching automatically, the primary benefit of shared memory at this point is to act as a place where threads in a block can exchange data. The __syncthreads() function exists to prevent race conditions when accessing that shared memory. The shared memory abstraction lets the threads within a block communicate much more efficiently than would be possible if any thread (regardless of block) could synchronize with any other thread. Organizing this communication can be the deciding factor in selecting a block size (along with shared memory size constraints).
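For example, here is a rough sketch of that kind of block-level exchange in Numba (the kernel, the block size of 128, and the "neighbor difference" operation are all just illustrative choices, nothing canonical):

    import numpy as np
    from numba import cuda, float32

    TPB = 128  # threads per block; an arbitrary choice, any multiple of 32 works

    @cuda.jit
    def neighbor_diff(x, out):
        # Stage this block's slice of x in shared memory so threads can read
        # each other's values without extra trips to global memory.
        tile = cuda.shared.array(TPB, dtype=float32)
        i = cuda.grid(1)          # global thread index
        t = cuda.threadIdx.x      # index within the block
        if i < x.size:
            tile[t] = x[i]
        cuda.syncthreads()        # all writes above are now visible block-wide
        if i < x.size:
            if t + 1 < TPB and i + 1 < x.size:
                out[i] = tile[t + 1] - tile[t]   # read a neighboring thread's value
            else:
                out[i] = 0.0

    x = np.arange(1 << 20, dtype=np.float32)
    out = np.zeros_like(x)
    blocks = (x.size + TPB - 1) // TPB
    neighbor_diff[blocks, TPB](x, out)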
(Note: people often ask where the __syncthreads equivalent for blocks is. It is the kernel execution boundary itself. If you need to exchange data between threads in different blocks, then they need to write it to global memory and allow the kernel to terminate. The next kernel can assume the data from the previous kernel is synchronized and available. This sounds a little clunky, but when you launch CUDA kernels, they are queued by the driver asynchronously into a single CUDA stream and run sequentially with low overhead. This is why you never need to do a device synchronize between kernels on the same stream.)
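As a concrete (if contrived) Numba illustration, with two made-up kernels `square` and `add_one`: the second launch can safely consume what the first one wrote, with no explicit synchronize in between, because both go onto the default stream:

    import numpy as np
    from numba import cuda

    @cuda.jit
    def square(x, y):
        i = cuda.grid(1)
        if i < x.size:
            y[i] = x[i] * x[i]

    @cuda.jit
    def add_one(y, z):
        i = cuda.grid(1)
        if i < y.size:
            z[i] = y[i] + 1.0

    n = 1 << 20
    d_x = cuda.to_device(np.arange(n, dtype=np.float32))
    d_y = cuda.device_array(n, dtype=np.float32)
    d_z = cuda.device_array(n, dtype=np.float32)

    tpb = 128
    blocks = (n + tpb - 1) // tpb
    # Both launches are queued on the same (default) stream, so add_one will
    # not start until square has finished -- no cuda.synchronize() is needed
    # in between for d_y to be complete.
    square[blocks, tpb](d_x, d_y)
    add_one[blocks, tpb](d_y, d_z)
    result = d_z.copy_to_host()   # the copy back does synchronize with the stream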
However, a lot of parallel algorithms (like ufuncs) don't need any data exchange between threads in shared memory at all. In that case, the selection of block size is somewhat arbitrary. In these cases, the block size with the highest throughput depends frequently on micro-architectural details that are basically impossible to predict ahead of time. When I really care, I have my code scan through a range of reasonable block sizes (usually multiples of 32 or 64 from 64 up through 1024) in an "auto-tuning" phase before selecting a value to use for the rest of program execution.
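A minimal version of that auto-tuning scan might look something like this in Numba (the candidate sizes, repetition count, and the toy `scale` kernel are just placeholders):

    import time
    import numpy as np
    from numba import cuda

    @cuda.jit
    def scale(x, y, a):
        i = cuda.grid(1)
        if i < x.size:
            y[i] = a * x[i]

    def pick_block_size(d_x, d_y,
                        candidates=(64, 128, 192, 256, 384, 512, 768, 1024),
                        reps=20):
        best, best_time = None, float("inf")
        for tpb in candidates:
            blocks = (d_x.size + tpb - 1) // tpb
            scale[blocks, tpb](d_x, d_y, 2.0)      # warm-up / JIT compile
            cuda.synchronize()
            t0 = time.perf_counter()
            for _ in range(reps):
                scale[blocks, tpb](d_x, d_y, 2.0)
            cuda.synchronize()
            dt = time.perf_counter() - t0
            if dt < best_time:
                best, best_time = tpb, dt
        return best

    d_x = cuda.to_device(np.random.rand(1 << 22).astype(np.float32))
    d_y = cuda.device_array_like(d_x)
    print("fastest block size:", pick_block_size(d_x, d_y))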
Another aspect of locality is ensuring you maximize the "coalescing" of reads and writes. Whenever the threads in a warp perform a read or write, the load/store units on the multiprocessor attempt to "coalesce" the target addresses into memory transactions that access consecutive sequences of memory locations. The size of these transactions varies between CUDA architectures, but for global memory, I believe they are still 32 bytes. If you have an entire warp accessing consecutive float32 values in memory, then the whole warp can be serviced by 4 memory transactions. However, if the threads in a warp are accessing random locations, then it can take up to 32 memory transactions to service them all, dramatically lowering effective memory bandwidth. So this is another thing to consider when deciding how to map work to threads: the mapping which maximizes coalescing of memory accesses is generally the best.
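To make that concrete, here are two toy Numba kernels doing the same copy; only the indexing differs, and only the first keeps a warp's addresses consecutive:

    import numpy as np
    from numba import cuda

    @cuda.jit
    def copy_coalesced(src, dst):
        i = cuda.grid(1)
        if i < src.size:
            # Adjacent threads in a warp touch adjacent float32s, so a 32-thread
            # warp covers one contiguous 128-byte span -> a handful of transactions.
            dst[i] = src[i]

    @cuda.jit
    def copy_strided(src, dst, stride):
        i = cuda.grid(1)
        if i < src.size:
            # Adjacent threads now touch addresses `stride` elements apart, so each
            # thread in the warp can land in its own memory transaction.
            j = (i * stride) % src.size
            dst[i] = src[j]

    n = 1 << 22
    src = np.random.rand(n).astype(np.float32)
    dst = np.empty_like(src)
    copy_coalesced[(n + 255) // 256, 256](src, dst)
    copy_strided[(n + 255) // 256, 256](src, dst, 33)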
Shared memory can also play a role in cases where you can confine your scatter/gather operations to a block of data that fits within the shared memory size. A common pattern seen in CUDA algorithms is:

1. Each thread in the block loads one (or a few) elements from global memory into shared memory.
2. __syncthreads()
3. The threads do their scatter/gather or other random-access work entirely within shared memory.
4. __syncthreads() (if results need to be exchanged again)
5. Each thread writes its results back out to global memory.

This is generally much faster than having threads randomly access arrays in global memory. Again, the needs here often determine the block size.
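Roughly, in Numba, that pattern looks like this (the gather-by-index operation and the 256-element tile are just stand-ins; the sketch also assumes the array length is a multiple of the block size so every tile is fully populated):

    import numpy as np
    from numba import cuda, float32

    TPB = 256   # tile size == threads per block (an assumed choice)

    @cuda.jit
    def gather_within_block(src, idx, dst):
        # 1) each thread loads one element of its block's tile into shared memory
        tile = cuda.shared.array(TPB, dtype=float32)
        i = cuda.grid(1)
        t = cuda.threadIdx.x
        if i < src.size:
            tile[t] = src[i]
        # 2) wait until the whole tile is populated
        cuda.syncthreads()
        # 3) the random access happens in shared memory; idx holds offsets 0..TPB-1
        if i < src.size:
            dst[i] = tile[idx[i]]

    n = 1 << 20                      # a multiple of TPB, so every tile is full
    src = np.random.rand(n).astype(np.float32)
    idx = np.random.randint(0, TPB, size=n).astype(np.int32)
    dst = np.zeros_like(src)
    gather_within_block[n // TPB, TPB](src, idx, dst)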
The multidimensional grid and block indexing (and limits therein) do seem to be a bit of an anachronism from the GPU's origin as a processor for raster algorithms. Personally, I've found it much easier to think about CUDA algorithms by always using 1D block and grid dimensions, and then dynamically mapping those indices to the work elements. Even when dealing with 1D data, I find it convenient to not require the size of my grid to match the size of my input or output array. In older versions of CUDA, I found that kernel launch time seemed to scale with the number of blocks, so for simple kernels it was better to pick a number of threads that oversubscribed the GPU by some amount (say, number of CUDA cores times 4 to 10) and then to use for loops and strided indexing to ensure that all the data was processed. It is still important for warps to access consecutive memory locations, but entire warps can jump by large strides in memory as long as they do it together.
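In Numba that style looks roughly like the following (the block count of 128 is a stand-in for "number of multiprocessors times some small factor"; the kernel itself is just a toy):

    import numpy as np
    from numba import cuda

    @cuda.jit
    def scale_strided(x, y, a):
        start = cuda.grid(1)
        stride = cuda.gridsize(1)    # total number of threads in the launch
        # Each thread handles every stride-th element.  Within one iteration a
        # warp still touches 32 consecutive elements, so reads and writes stay
        # coalesced even though the whole warp jumps by `stride` between iterations.
        for i in range(start, x.size, stride):
            y[i] = a * x[i]

    x = np.random.rand(10_000_000).astype(np.float32)
    y = np.empty_like(x)
    tpb = 256
    blocks = 128   # oversubscribe the GPU rather than launching one thread per element
    scale_strided[blocks, tpb](x, y, 3.0)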
OK, maybe that was a bit of a firehose brain dump, but hopefully that helps a bit. :)