My implementation on Lab1.2, with question and code attached

Wei Lu

Aug 3, 2010, 3:38:20 PM

to VSCSE Many-core Processors 2010

Hey everyone,

I am trying to be the first one to discuss the lab exercise with some real code. The reason is that some bugs are really annoying and subtle, and they are easier to fix when the code is shown.

My implementation of the kernel is attached to this thread. Currently it compiles but produces wrong results. Using a small failing case (5-by-5-by-5), I found that everything seemed fine except that the shared memory variable which holds the current frame,

sh_current[(j + 1) % BLOCK_SIZE_TOTAL][i % BLOCK_SIZE_TOTAL]

was not computed correctly for a certain (j,k). However, I got the right value after removing the i < nx - 1 condition from w_core_boundary.

This is really weird, and I do not even understand why that modification would affect the result. Is anyone interested in taking a look?

Btw, I also changed tx and ty in main.cu to BLOCK_SIZE_TOTAL for my kernel. I can attach my source code and the failing test case if anyone asks.

Thanks,

Wei

Liwen Chang

Aug 3, 2010, 3:45:04 PM

to vscse-many-core...@googlegroups.com

You seem to have forgotten to copy the data from current to bottom, and from top to current.

Liwen

block2D_opt_2.png

Rayne

Aug 3, 2010, 3:57:07 PM

to VSCSE Many-core Processors 2010

I think that is already implemented in the "Update shared memory frames" block.

Xiao-Long Wu

Aug 3, 2010, 4:01:30 PM

to vscse-many-core...@googlegroups.com, Liwen Chang

Another issue I saw is the __syncthreads() call inside the if-statement. You must make sure that all threads in the block enter the if-statement; otherwise, your kernel will probably hang there.
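
A minimal sketch of the usual way around this, not taken from the attached lab code: keep the bounds checks around the memory accesses only, and place the barrier where every thread in the block reaches it. The kernel name `shift_within_tile`, the host helper `run_shift`, and the tile width `TILE` are hypothetical and exist only for illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 8  // hypothetical tile width, not the lab's BLOCK_SIZE_TOTAL

// Each thread stages one element in shared memory, then reads the element
// staged by its left neighbour in the same tile. The barrier sits outside
// every bounds check, so threads past the end of the array still reach it.
__global__ void shift_within_tile(const float *in, float *out, int n)
{
    __shared__ float tile[TILE];
    int i = blockIdx.x * TILE + threadIdx.x;

    if (i < n)                       // guard only the load
        tile[threadIdx.x] = in[i];

    __syncthreads();                 // executed by all threads in the block

    if (i < n) {                     // guard only the store
        if (threadIdx.x > 0)
            out[i] = tile[threadIdx.x - 1];
        else
            out[i] = tile[0];        // first thread of a tile keeps its own value
    }
}

// Hypothetical host helper: runs the kernel on a non-empty std::vector.
std::vector<float> run_shift(const std::vector<float> &in)
{
    int n = (int)in.size();
    size_t bytes = n * sizeof(float);
    std::vector<float> out(n);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);
    shift_within_tile<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n);
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

With an input length that is not a multiple of `TILE` (13, say), the last block contains threads with `i >= n`. They skip the load and the store but still execute `__syncthreads()`, which is the property the barrier needs.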

Xiao-Long

Liwen Chang

Aug 3, 2010, 4:01:35 PM

to vscse-many-core...@googlegroups.com

OK, I was wrong.

But in each iteration, you actually read three planes of data from global memory.

In optimization 2, we want you to practice register tiling along the z direction. With it, you only need to read one plane of data from global memory per iteration.
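
A rough sketch of what register tiling along z can look like, assuming a 7-point stencil on a row-major nx * ny * nz volume with nz >= 3. The coefficients `c0` and `c1`, the tile size `TILE`, the helper `idx`, the kernel name `stencil_z_tiled`, and the host helper `run_stencil` are all assumptions made for illustration; this is not the lab's reference solution.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 8  // hypothetical tile edge; one thread per (i, j) column

// Flat index into a row-major nx * ny * nz volume (layout is an assumption).
__host__ __device__ inline int idx(int i, int j, int k, int nx, int ny)
{
    return (k * ny + j) * nx + i;
}

__global__ void stencil_z_tiled(const float *in, float *out,
                                int nx, int ny, int nz, float c0, float c1)
{
    __shared__ float sh_current[TILE + 2][TILE + 2];  // tile plus one-cell halo
    int i = blockIdx.x * TILE + threadIdx.x;
    int j = blockIdx.y * TILE + threadIdx.y;
    int tx = threadIdx.x + 1, ty = threadIdx.y + 1;
    bool inside = (i < nx && j < ny);
    bool interior = inside && i > 0 && i < nx - 1 && j > 0 && j < ny - 1;

    // Registers hold this thread's column at planes k-1, k and k+1.
    float bottom = 0.0f;
    float current = inside ? in[idx(i, j, 0, nx, ny)] : 0.0f;
    float top = inside ? in[idx(i, j, 1, nx, ny)] : 0.0f;

    for (int k = 1; k < nz - 1; ++k) {
        // Slide the window: each thread reads one new value from its column.
        bottom = current;
        current = top;
        top = inside ? in[idx(i, j, k + 1, nx, ny)] : 0.0f;

        __syncthreads();  // last iteration's reads of sh_current are finished
        sh_current[ty][tx] = current;
        if (inside) {     // edge threads also fetch the halo cells of plane k
            if (threadIdx.x == 0 && i > 0)
                sh_current[ty][0] = in[idx(i - 1, j, k, nx, ny)];
            if (threadIdx.x == TILE - 1 && i < nx - 1)
                sh_current[ty][TILE + 1] = in[idx(i + 1, j, k, nx, ny)];
            if (threadIdx.y == 0 && j > 0)
                sh_current[0][tx] = in[idx(i, j - 1, k, nx, ny)];
            if (threadIdx.y == TILE - 1 && j < ny - 1)
                sh_current[TILE + 1][tx] = in[idx(i, j + 1, k, nx, ny)];
        }
        __syncthreads();  // the whole tile of plane k is now visible

        if (interior)
            out[idx(i, j, k, nx, ny)] = c0 * current
                + c1 * (bottom + top
                        + sh_current[ty][tx - 1] + sh_current[ty][tx + 1]
                        + sh_current[ty - 1][tx] + sh_current[ty + 1][tx]);
    }
}

// Hypothetical host helper; boundary cells keep their input values.
std::vector<float> run_stencil(const std::vector<float> &in,
                               int nx, int ny, int nz, float c0, float c1)
{
    size_t bytes = in.size() * sizeof(float);
    std::vector<float> out(in);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_out, in.data(), bytes, cudaMemcpyHostToDevice);
    dim3 block(TILE, TILE);
    dim3 grid((nx + TILE - 1) / TILE, (ny + TILE - 1) / TILE);
    stencil_z_tiled<<<grid, block>>>(d_in, d_out, nx, ny, nz, c0, c1);
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}
```

In this sketch the z neighbours never touch shared memory: `bottom` and `top` live in registers, and only the x and y neighbours come from `sh_current`. Both barriers sit outside the bounds checks, so every thread in the block reaches them on every iteration.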

Liwen

Wei Lu

Aug 3, 2010, 4:54:50 PM

to vscse-many-core...@googlegroups.com

Yes, the program now passes my previously failing case. Thanks Xiao-Long.

Wei Lu

Aug 3, 2010, 4:58:16 PM

to vscse-many-core...@googlegroups.com

Thanks Liwen.

Now I see that your suggestion is better, because reading from registers is faster than reading from shared memory.
