Linearscan or nvidia problem ?


Diogo Sampaio

Feb 3, 2012, 3:02:25 PM
to gpuo...@googlegroups.com
Hi Oceloters,

I'm trying to fix some bugs in the linear scan register allocation.

One that I just discovered: it seems that values converted with cvta can't
be stored in local memory. Does anyone know anything about this? I didn't find
anything helpful in the ptx_isa pdf. Testing showed that the error
is caused by the "st.local.u64 [ocelot_ls_stack + 4], %r3;" instruction. If it
is commented out, the code works, so the stack declaration isn't the
cause.

The original PTX code, printed by PTXOptimizer without any passes,
works and looks like this:

.reg .s64 %r3;
-------------
1st basic block:
ld.param.u64 %r2, [_Z11kmeansPointPfiiiPiS_S_S0__param_4];
cvta.to.global.u64 %r3, %r2;
-------------
on a further basic block:
add.s64 %r58, %r3, %r57;
==============
Now, after linear scan runs over it, it generates the loads and
stores, so that the code looks like this:

.reg .s64 %r3;
.reg .s64 %r152;
.local .u8 ocelot_ls_stack[40];
-------------
1st basic block:
ld.param.u64 %r2, [_Z11kmeansPointPfiiiPiS_S_S0__param_4];
cvta.to.global.u64 %r3, %r2;
st.local.u64 [ocelot_ls_stack + 4], %r3;
-------------
on a further basic block:
ld.local.s64 %r151, [ocelot_ls_stack + 4];
add.s64 %r58, %r151, %r57;
====================
Any ideas?

If it helps, I'm using CUDA toolkit and SDK 4.1 with gcc 4.6. The example
is taken from rodinia/kmeans.

Attached is the application. The files
compute_20_linearscan_b7e5553193ccfe01 and
compute_20_unchanged_b7e5553193ccfe01 are the linear scan and
unoptimized files generated by PTXOptimizer.

Warm regards,
--
Diogo Nunes Sampaio

kmeans.tar.bz2

Gregory Diamos

Feb 3, 2012, 4:36:58 PM
to gpuo...@googlegroups.com
On 02/03/2012 12:02 PM, Diogo Sampaio wrote:
> st.local.u64 [ocelot_ls_stack + 4], %r3;

I think your issue may be due to incorrect alignment. u64 stores need
to be 8-byte aligned, and (ocelot_ls_stack + 4) probably isn't. Memory
ops to unaligned addresses are undefined, and the low level optimizer
might remove your code or the hardware might do anything from trapping
to ignoring the instruction.
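
As a rough illustration of the alignment rule, here is a minimal C++ sketch
(not Ocelot code) of the check an allocator could run before emitting a
spill. The helper names isAligned and alignUp are hypothetical, and the
sketch assumes access sizes are powers of two. Note also that, as far as I
know, a PTX .u8 array is only guaranteed the alignment of its element type
by default, so the stack declaration itself may need an explicit .align 8
for the offsets to be meaningful.

```cpp
#include <cassert>
#include <cstddef>

// True when a byte offset is a multiple of the access size.
// accessSize is assumed to be a non-zero power of two (1, 2, 4, 8, ...).
inline bool isAligned(std::size_t offset, std::size_t accessSize) {
    assert(accessSize != 0 && (accessSize & (accessSize - 1)) == 0);
    return (offset & (accessSize - 1)) == 0;
}

// Rounds an offset up to the next multiple of accessSize
// (same power-of-two assumption as above).
inline std::size_t alignUp(std::size_t offset, std::size_t accessSize) {
    assert(accessSize != 0 && (accessSize & (accessSize - 1)) == 0);
    return (offset + accessSize - 1) & ~(accessSize - 1);
}
```

With these helpers, isAligned(4, 8) is false for the store in question, and
alignUp(4, 8) gives 8 as the nearest usable slot.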

Greg

Diogo Sampaio

Feb 3, 2012, 4:48:22 PM
to gpuo...@googlegroups.com
OK, using 8 bytes for every type of variable solves it, thanks. Now I need
to order the spills so that they are always aligned.
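
One possible way to order the spills, sketched in C++ under stated
assumptions: the Spill struct and assignSpillOffsets are hypothetical names,
not Ocelot types, and every spill size is assumed to be a power of two
(1, 2, 4 or 8 bytes). Sorting from largest to smallest and packing back to
back keeps each offset a multiple of its own size without padding, provided
the stack base itself is at least 8-byte aligned.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one spilled register (not an Ocelot type).
struct Spill {
    unsigned reg;        // virtual register number, e.g. 3 for %r3
    std::size_t size;    // size in bytes; assumed to be 1, 2, 4 or 8
    std::size_t offset;  // byte offset in the spill stack, filled in below
};

// Sorts spills from largest to smallest and packs them back to back.
// Every earlier slot has a size that is a multiple of the current size,
// so each offset is naturally aligned. Returns the total stack bytes used.
std::size_t assignSpillOffsets(std::vector<Spill>& spills) {
    std::stable_sort(spills.begin(), spills.end(),
                     [](const Spill& a, const Spill& b) {
                         return a.size > b.size;
                     });
    std::size_t next = 0;
    for (Spill& s : spills) {
        s.offset = next;
        assert(s.offset % s.size == 0);  // sanity check on the invariant
        next += s.size;
    }
    return next;
}
```

For example, spills of sizes 4, 8, 4 and 8 bytes end up at offsets 16, 0, 20
and 8 respectively, using 24 bytes instead of the 32 needed when every slot
is padded to 8 bytes.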


--
Diogo Nunes Sampaio
