My idea is to split the xy planes into 16x16 blocks, and in each GPU block there are 18x18 threads, so all threads can read data into shared memory within one GPU clock. However, the code below does not work as I expected. Can a TA help with this Lab 1?
__global__ void block2D_opt_2(float fac, float *A0, float *Anext, int nx, int ny, int nz)
{
    // For optimization 2
    int inx = threadIdx.x;
    int iny = threadIdx.y;

    // coordinates at original nx * ny matrix
    int i = blockIdx.x*(blockDim.x-2)+inx;
    int j = blockIdx.y*(blockDim.y-2)+iny;
OK, in this code you set the block size to 18x18 threads, but each block only computes a 16x16 tile of output. That means you need to add an if condition before you do the computation, so that only the threads that own an output element compute and write.
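To make the if condition concrete, here is a minimal sketch of such a guard. The helper `computes_output` and the kernel `guard_sketch` are hypothetical names for illustration, not part of the lab code, and the kernel only records which threads would compute rather than applying any stencil.

```cuda
#include <cuda_runtime.h>

// True for the threads that own an output cell: the block minus its
// one-thread border. With an 18x18 block this is tx, ty in 1..16.
__host__ __device__ inline bool computes_output(int tx, int ty, int bdx, int bdy) {
    return tx >= 1 && tx <= bdx - 2 && ty >= 1 && ty <= bdy - 2;
}

// Hypothetical skeleton showing where the guard goes. `computed` is an
// assumed per-thread flag array with blockDim.x * blockDim.y entries.
__global__ void guard_sketch(int *computed) {
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int t = ty * blockDim.x + tx;

    // Every thread, border included, would load its cell into shared
    // memory here, so the barrier must stay outside the guard.
    __syncthreads();

    computed[t] = 0;
    if (computes_output(tx, ty, blockDim.x, blockDim.y)) {
        computed[t] = 1;  // only the inner 16x16 threads reach this point
    }
}
```

Launched with one 18x18 block, this would set 256 of the 324 flags. Keeping `__syncthreads()` outside the if matters, because a barrier that only some threads of a block reach is undefined behavior.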
On Fri, Aug 6, 2010 at 2:38 PM, Xiaowei Zhan <zha...@gmail.com> wrote:
> Hello,
>
> My idea is to split the xy planes by 16x16 blocks, and
> in each GPU block, there are 18x18 threads, so all threads can read in
> data to the shared memory within one GPU clock.
> However, the code below does not work as I expected.
> Can TA help with this lab 1 ?
Remember that your threads overlap the 16x16 tile by one cell on each side. So the input index that thread (0,0) should access isn't blockIdx*16 + tid, but blockIdx*16 + tid - 1.
What Li-wen said also applies. When computing out of shared memory or writing output, you need to check not just that the output is inside the grid, but that the input is as well. More than that, you only want threads inside the centered 16x16 tile (threads 1 to 16 in x and y) to compute and write output.
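Putting the two points together, a kernel for a single xy plane might look roughly like the sketch below. Several things here are assumptions rather than the lab's actual code: the names `halo_index` and `tile_plane_sketch`, the x-fastest memory layout `A0[(k*ny + j)*nx + i]`, the extra plane index `k`, and the 5-point sum, which is only a placeholder stencil.

```cuda
#include <cuda_runtime.h>

#define TILE 16            // output tile width per block
#define BLOCK (TILE + 2)   // threads per block edge: tile plus one halo cell per side

// Global x (or y) index read by thread `tid` of block `block`.
// The -1 shifts thread 0 onto the halo cell just before the tile.
__host__ __device__ inline int halo_index(int block, int tid) {
    return block * TILE + tid - 1;
}

// Hypothetical kernel for one xy plane at z index k.
__global__ void tile_plane_sketch(float fac, const float *A0, float *Anext,
                                  int nx, int ny, int k) {
    __shared__ float tile[BLOCK][BLOCK];

    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int i = halo_index(blockIdx.x, tx);
    int j = halo_index(blockIdx.y, ty);

    // Load: every thread reads one cell, but only if that cell is in the grid.
    bool in_grid = (i >= 0 && i < nx && j >= 0 && j < ny);
    tile[ty][tx] = in_grid ? A0[(k * ny + j) * nx + i] : 0.0f;
    __syncthreads();

    // Compute: only the centered 16x16 threads, and only where all four
    // in-plane neighbours are inside the grid as well.
    bool interior_thread = (tx >= 1 && tx <= TILE && ty >= 1 && ty <= TILE);
    bool interior_cell = (i >= 1 && i < nx - 1 && j >= 1 && j < ny - 1);
    if (interior_thread && interior_cell) {
        float sum = tile[ty][tx - 1] + tile[ty][tx + 1]
                  + tile[ty - 1][tx] + tile[ty + 1][tx];
        Anext[(k * ny + j) * nx + i] = fac * sum;  // placeholder stencil
    }
}
```

A matching launch would use `dim3 block(BLOCK, BLOCK)` and `dim3 grid((nx + TILE - 1) / TILE, (ny + TILE - 1) / TILE)`, so that the 16x16 output tiles cover the plane while neighbouring blocks overlap only in their halo reads.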
--John Stratton