I'd like to be the first to discuss the lab exercise with some real code.
Some bugs are subtle and really annoying, and they
are easier to fix when the code is on the table.
My kernel implementation is attached to this thread.
Currently it compiles but produces wrong results. Using a small
failing case (5-by-5-by-5), I found that everything seemed fine except that the shared-memory
variable that holds the current frame,
sh_current[(j + 1) % BLOCK_SIZE_TOTAL][i % BLOCK_SIZE_TOTAL]
was not computed correctly for a certain (j,k). However, I got the right value after removing
the i < nx - 1 condition from w_core_boundary.
This is really weird, and I don't even understand why that change would affect the result.
Is anyone interested in taking a look?
Btw, I also changed tx and ty in
main.cu to BLOCK_SIZE_TOTAL for my kernel.
I can attach my source code and the failing test case if anyone asks for them.
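
For anyone who wants to narrow a failure down to a specific (i, j, k) on a small case like the 5-by-5-by-5 one, a host-side comparison against a reference result can be sketched roughly as below. This is only an illustration, not code from the lab: the names Mismatch and first_mismatch are made up, and it assumes the output is stored with i fastest, i.e. idx = i + nx * (j + ny * k), which may not match the lab's layout.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of the lab code.
struct Mismatch {
    bool found;   // false if every element agrees within tol
    int i, j, k;  // coordinates of the first mismatch (valid only if found)
};

// Scans the volume in k, j, i order and returns the first point where
// `got` and `expected` differ by more than `tol`.
// ASSUMPTION: linear index is idx = i + nx * (j + ny * k).
Mismatch first_mismatch(const std::vector<float>& got,
                        const std::vector<float>& expected,
                        int nx, int ny, int nz, float tol = 1e-5f) {
    for (int k = 0; k < nz; ++k) {
        for (int j = 0; j < ny; ++j) {
            for (int i = 0; i < nx; ++i) {
                std::size_t idx = static_cast<std::size_t>(i) +
                                  static_cast<std::size_t>(nx) *
                                      (static_cast<std::size_t>(j) +
                                       static_cast<std::size_t>(ny) * k);
                if (std::fabs(got[idx] - expected[idx]) > tol) {
                    return {true, i, j, k};
                }
            }
        }
    }
    return {false, -1, -1, -1};
}
```

Calling first_mismatch(gpu_result, cpu_reference, 5, 5, 5) after copying the kernel output back to the host gives the first offending point, which can then be matched to the thread and shared-memory slot that produced it.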