My implementation on Lab1.2, with question and code attached


Wei Lu

Aug 3, 2010, 3:38:20 PM
to VSCSE Many-core Processors 2010
Hey everyone,

I am trying to be the first one to discuss the lab exercise with some real code.
The reason for doing this is that some bugs are really annoying and subtle, and they
can be fixed more easily by showing the code.

My idea for implementing the kernel is attached in this thread.
Currently it compiles but produces wrong results. Using a small
failing case (5-by-5-by-5), I found that everything seemed fine except that the shared memory
variable which holds the current frame,
sh_current[(j + 1) % BLOCK_SIZE_TOTAL][i % BLOCK_SIZE_TOTAL]
was not computed correctly for a certain (j,k). However, I got the right value by removing
the i < nx - 1 condition from w_core_boundary.
This is really weird, and I do not even understand why the modification would affect the result.
Is anyone interested in looking at it?

Btw, I also changed tx and ty in main.cu to BLOCK_SIZE_TOTAL for my kernel.
I can attach my source code and the failing test case if anyone requests them.

Thanks,
Wei

Liwen Chang

Aug 3, 2010, 3:45:04 PM
to vscse-many-core...@googlegroups.com
You seem to have forgotten to copy data from current to bottom, and from top to current.

Liwen
block2D_opt_2.png

Rayne

Aug 3, 2010, 3:57:07 PM
to VSCSE Many-core Processors 2010
I think it has been implemented in the "Update shared memory frames"
block.


Xiao-Long Wu

Aug 3, 2010, 4:01:30 PM
to vscse-many-core...@googlegroups.com, Liwen Chang
Another issue I saw is the __syncthreads() call inside the if-statement. You must make sure that all threads in the block reach that call. Otherwise, your kernel will probably hang there.
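
To illustrate the point, here is a minimal sketch of keeping the barrier outside the boundary check. It is not the kernel from this thread: the kernel name, the 1D layout and BLOCK are all hypothetical. The CUDA rule is that a barrier in conditional code is only safe when the condition evaluates identically for every thread of the block; otherwise the kernel may hang or give unintended results. So the sketch guards only the memory accesses and lets every thread reach the barrier.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK 64  // hypothetical block size, not taken from the lab code

// Each thread stores one element in shared memory, then reads the element
// of its right-hand neighbour within the same block (or its own at an edge).
__global__ void read_right_neighbor(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK];
    int i = blockIdx.x * BLOCK + threadIdx.x;
    bool in_range = i < n;

    // Guard only the memory access, not the barrier.
    tile[threadIdx.x] = in_range ? in[i] : 0.0f;

    // Every thread of the block reaches this call, in range or not.
    __syncthreads();

    if (in_range) {
        bool has_right = (threadIdx.x + 1 < BLOCK) && (i + 1 < n);
        out[i] = tile[has_right ? threadIdx.x + 1 : threadIdx.x];
    }
}

// Host-side check against the same rule evaluated on the CPU.
bool check_read_right_neighbor(int n)
{
    std::vector<float> h_in(n), h_out(n, -1.0f);
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    read_right_neighbor<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(d_in, d_out, n);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    for (int i = 0; i < n; ++i) {
        bool has_right = (i % BLOCK + 1 < BLOCK) && (i + 1 < n);
        if (h_out[i] != h_in[has_right ? i + 1 : i]) return false;
    }
    return true;
}
```

Calling check_read_right_neighbor(100) exercises a last block in which some threads are out of range but still take part in the barrier.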

Xiao-Long

Liwen Chang

Aug 3, 2010, 4:01:35 PM
to vscse-many-core...@googlegroups.com
OK, I was wrong.

But in each iteration, you actually read three planes of data from global memory.

In optimization 2, we want you to practice register tiling along the z direction.
With it, you only need to read one plane of data from global memory in each iteration.
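
A rough sketch of what register tiling along z can look like for a 7-point stencil follows. It rests on assumptions and is not the lab solution: the kernel name, TILE, CORE and the coefficients c0 and c1 are hypothetical, and nx and ny are assumed to be at least 3. Each thread keeps bottom, current and top in registers, shares only the current plane through shared memory, and issues a single global read per iteration.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

#define TILE 16           // hypothetical block edge, halo included
#define CORE (TILE - 2)   // cells per block edge that produce output

// 7-point stencil with register tiling along z. c0 and c1 are placeholder
// coefficients, not the ones from the lab handout.
__global__ void stencil7_ztile(const float *in, float *out,
                               int nx, int ny, int nz, float c0, float c1)
{
    __shared__ float sh_current[TILE][TILE];
    if (nz < 3) return;   // uniform for all threads, so no barrier is skipped

    int tx = threadIdx.x, ty = threadIdx.y;
    int i = blockIdx.x * CORE + tx;   // neighbouring tiles overlap by the halo
    int j = blockIdx.y * CORE + ty;
    bool in_range = (i < nx) && (j < ny);
    bool is_core = tx > 0 && tx < TILE - 1 && ty > 0 && ty < TILE - 1 &&
                   i < nx - 1 && j < ny - 1;
    int plane = nx * ny;
    int idx = j * nx + i;             // offset inside one xy-plane

    float bottom  = in_range ? in[idx] : 0.0f;          // plane 0
    float current = in_range ? in[idx + plane] : 0.0f;  // plane 1

    for (int k = 1; k < nz - 1; ++k) {
        // The only global read of this iteration: plane k + 1.
        float top = in_range ? in[idx + (k + 1) * plane] : 0.0f;

        sh_current[ty][tx] = current;
        __syncthreads();              // reached by every thread of the block

        if (is_core) {
            out[idx + k * plane] =
                c1 * (sh_current[ty][tx - 1] + sh_current[ty][tx + 1] +
                      sh_current[ty - 1][tx] + sh_current[ty + 1][tx] +
                      bottom + top) - c0 * current;
        }
        __syncthreads();              // tile is overwritten in the next iteration

        bottom = current;             // slide the register window up one plane
        current = top;
    }
}

// Host-side check against a straightforward CPU loop (assumes nx, ny >= 3).
bool check_stencil7(int nx, int ny, int nz)
{
    const float c0 = 6.0f, c1 = 1.0f;
    size_t n = (size_t)nx * ny * nz;
    std::vector<float> h_in(n), h_out(n, 0.0f);
    for (size_t p = 0; p < n; ++p) h_in[p] = (float)(p % 7);

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, n * sizeof(float));

    dim3 block(TILE, TILE);
    dim3 grid((nx - 2 + CORE - 1) / CORE, (ny - 2 + CORE - 1) / CORE);
    stencil7_ztile<<<grid, block>>>(d_in, d_out, nx, ny, nz, c0, c1);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    for (int k = 1; k < nz - 1; ++k)
        for (int j = 1; j < ny - 1; ++j)
            for (int i = 1; i < nx - 1; ++i) {
                size_t p = ((size_t)k * ny + j) * nx + i;
                float ref = c1 * (h_in[p - 1] + h_in[p + 1] +
                                  h_in[p - nx] + h_in[p + nx] +
                                  h_in[p - nx * ny] + h_in[p + nx * ny]) -
                            c0 * h_in[p];
                if (std::fabs(h_out[p] - ref) > 1e-4f) return false;
            }
    return true;
}
```

The copies from current to bottom and from top to current happen in registers at the end of each iteration, and both barriers sit outside the is_core check so that the halo threads reach them too.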

Liwen

Wei Lu

Aug 3, 2010, 4:54:50 PM
to vscse-many-core...@googlegroups.com
Yes, the program passes my previously failing case now. Thanks, Xiao-Long.


Wei Lu

Aug 3, 2010, 4:58:16 PM
to vscse-many-core...@googlegroups.com
Thanks, Liwen.
Now I see that your suggestion is better, because reading from registers is faster than reading from shared memory.