I am trying to be the first one to discuss the lab exercise with some real code. The reason is that some bugs are really annoying and subtle, and they are easier to fix when the code is shown.
My idea for implementing the kernel is attached in this thread. Currently it compiles but produces wrong results. Using a small failing case (5-by-5-by-5), I found that everything seemed fine except that the shared memory variable holding the current frame, sh_current[(j + 1) % BLOCK_SIZE_TOTAL][i % BLOCK_SIZE_TOTAL], was not computed correctly for a certain (j,k). However, I got the right value by removing the i < nx - 1 condition from w_core_boundary. This is really weird, and I do not even understand why that modification would affect the result. Is anyone interested in looking at it?
Btw, I also changed tx and ty in main.cu to BLOCK_SIZE_TOTAL for my kernel. I can attach my source code and the failing test case if anyone requests them.
On Tue, Aug 3, 2010 at 2:38 PM, Wei Lu <learza2...@gmail.com> wrote:
> Hey everyone,
You seem to have forgotten to copy the data from current to bottom, and from top to current.
Liwen
Another issue I saw is the __syncthreads() call inside the if-statement. You must make sure that all threads of the block go into the if-statement; otherwise, your kernel will probably hang there.
Xiao-Long
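To make the barrier point concrete, here is a minimal sketch. It is not the attached kernel (which is not reproduced in this thread); the kernel shift_right, the helper run_shift_right, and the block size TILE are hypothetical names chosen for illustration. It only shows the usual pattern: guard the loads and stores with the bounds check, but keep __syncthreads() outside the branch so that every thread of the block reaches it. The sketch assumes a non-empty input.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define TILE 8  // hypothetical block size, not the lab's BLOCK_SIZE_TOTAL

// Each thread stages one element in shared memory, then reads its left
// neighbour from the tile. n does not have to be a multiple of TILE.
__global__ void shift_right(const float *in, float *out, int n)
{
    __shared__ float tile[TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Only the load is guarded; out-of-range threads store a dummy value.
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;

    // The barrier sits at block level, so every thread reaches it.
    // Writing "if (i < n) { ...; __syncthreads(); }" instead would let
    // threads with i >= n skip the barrier, and the block may hang.
    __syncthreads();

    if (i < n) {
        // The first thread of a block has no left neighbour in the tile.
        out[i] = (threadIdx.x > 0) ? tile[threadIdx.x - 1] : in[i];
    }
}

// Hypothetical host helper: copies the input over, runs the kernel,
// and returns the result. Assumes h_in is not empty.
std::vector<float> run_shift_right(const std::vector<float> &h_in)
{
    int n = (int)h_in.size();
    std::vector<float> h_out(n);
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    int blocks = (n + TILE - 1) / TILE;
    shift_right<<<blocks, TILE>>>(d_in, d_out, n);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

With five input elements and TILE set to 8, three threads of the single block are out of range. They still take part in the barrier and simply skip the guarded store.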
On 08/03/2010 02:45 PM, Liwen Chang wrote:
> You seem to have forgotten to copy the data from current to bottom, and from top to current.
On Tue, Aug 3, 2010 at 2:57 PM, Rayne <learza2...@gmail.com> wrote:
> I think it has been implemented in the "Update shared memory frames" block.
But in each iteration, you actually read three planes of data from global memory.
In optimization 2, we want you to practice register tiling along the z direction. With it, you only need to read one plane of data from global memory per iteration.
Liwen
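The register tiling idea can be sketched roughly as follows. This is an illustration under assumptions, not the lab's reference solution: the stencil formula (six neighbours minus six times the centre), the names stencil_ztile, run_stencil, idx3 and TILE are all hypothetical, and only sh_current is taken from the discussion above. Each thread walks one (i,j) column along z and keeps bottom, current and top in registers, so only plane k+1 is newly fetched per iteration; the frames are then rotated (current to bottom, top to current). For brevity, threads on a tile edge fetch their halo neighbours of the current plane from global memory. The sketch assumes nz >= 3.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define TILE 8  // hypothetical tile width, not the lab's BLOCK_SIZE_TOTAL

__host__ __device__ inline int idx3(int nx, int ny, int i, int j, int k)
{
    return i + nx * (j + ny * k);
}

// Illustrative 7-point stencil (assumed formula): six neighbours minus
// six times the centre. Requires nz >= 3.
__global__ void stencil_ztile(const float *in, float *out,
                              int nx, int ny, int nz)
{
    __shared__ float sh_current[TILE][TILE];
    const int tx = threadIdx.x, ty = threadIdx.y;
    const int i = blockIdx.x * TILE + tx;
    const int j = blockIdx.y * TILE + ty;
    const bool inside = (i < nx && j < ny);
    const bool interior = inside && i > 0 && i < nx - 1 && j > 0 && j < ny - 1;

    // Registers for this thread's column at planes k-1 and k.
    float bottom  = inside ? in[idx3(nx, ny, i, j, 0)] : 0.0f;
    float current = inside ? in[idx3(nx, ny, i, j, 1)] : 0.0f;

    for (int k = 1; k < nz - 1; ++k) {
        // Plane k+1 is the only new plane fetched in this iteration.
        float top = inside ? in[idx3(nx, ny, i, j, k + 1)] : 0.0f;

        sh_current[ty][tx] = current;
        __syncthreads();  // reached by every thread of the block

        if (interior) {
            // In-tile neighbours come from shared memory; tile-edge
            // threads fall back to global memory for the halo.
            float w = (tx > 0)        ? sh_current[ty][tx - 1] : in[idx3(nx, ny, i - 1, j, k)];
            float e = (tx < TILE - 1) ? sh_current[ty][tx + 1] : in[idx3(nx, ny, i + 1, j, k)];
            float s = (ty > 0)        ? sh_current[ty - 1][tx] : in[idx3(nx, ny, i, j - 1, k)];
            float n = (ty < TILE - 1) ? sh_current[ty + 1][tx] : in[idx3(nx, ny, i, j + 1, k)];
            out[idx3(nx, ny, i, j, k)] = bottom + top + w + e + s + n - 6.0f * current;
        }
        __syncthreads();  // finish reading the tile before it is overwritten

        // Rotate the frames: current -> bottom, top -> current.
        bottom = current;
        current = top;
    }
}

// Hypothetical host helper; boundary cells of the output stay zero.
std::vector<float> run_stencil(const std::vector<float> &h_in,
                               int nx, int ny, int nz)
{
    size_t bytes = h_in.size() * sizeof(float);
    std::vector<float> h_out(h_in.size());
    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, bytes);
    dim3 block(TILE, TILE);
    dim3 grid((nx + TILE - 1) / TILE, (ny + TILE - 1) / TILE);
    stencil_ztile<<<grid, block>>>(d_in, d_out, nx, ny, nz);
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Note that the loop bounds depend only on nz, so all threads of a block execute the same number of iterations and both barriers stay safe even when some threads fall outside the grid.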