Good news!

Sun HuanHuan

unread,

Mar 27, 2012, 12:13:48 AM3/27/12

to asf...@googlegroups.com

Hi all,

Kepler shared much a lot of instructions with fermi. only a few new
instruction were added.

HuanHuan

Hou Yunqing

unread,

Mar 27, 2012, 12:36:36 AM3/27/12

to asf...@googlegroups.com

Hi HuanHuan,

That is... well, interesting. Kepler seems to have some sort of compiler/ISA-assisted scheduling. Wouldn't that require a new ISA? How much information do you have regarding this? Could it be that a few new instructions are introduced to support new features, but the underlying ISA has been totally reworked?

Thanks,

Yunqing

Sun HuanHuan

unread,

Mar 27, 2012, 12:54:23 AM3/27/12

to asf...@googlegroups.com

Hi, Yunqing,

I don't have much information. you can try the new 301 driver (4.2
driver) and 4.2 toolkit from
http://developer.download.nvidia.com/compute/cuda/4_2/rc/toolkit/cudatoolkit_4.2.6_win_64.msi
(RC) in which some information for sm_30 may be revealed.

Compiler assisted scheduling is always there: the cubin for our good old
sm_21 fermi have always two independent instructions (address end with
0x*0 and the next with 0x*8.

Some one from NV told me kepler shared much a same instruciton base
while I actually didn't test for it yet...

Huan

Sun HuanHuan

unread,

Mar 27, 2012, 1:29:10 AM3/27/12

to asf...@googlegroups.com

Hi, YQ,

Would you please do some ptxas jobs and then use cuobjdump to see if the
ISA or the binary encodings changed or not?

Thank you!

Huan

Hou Yunqing

unread,

Mar 27, 2012, 1:42:54 AM3/27/12

to asf...@googlegroups.com

Hmm... I haven't downloaded 4.2 toolkit yet. Do you mean that 4.2 already has full support for sm_30? Where do I find the 4.2 download page for linux distros? I checked the old NV developer site and didn't see 4.2. I'm rarely on Windows these days.

Yunqing

Sun HuanHuan

unread,

Mar 27, 2012, 1:52:30 AM3/27/12

to asf...@googlegroups.com

Hi,

They say programms run well under 4.2 RC, and,
please try the followings,

http://developer.download.nvidia.com/compute/cuda/4_2/rc/toolkit/cudatoolkit_4.2.6_linux_32_rhel5.5.run

http://developer.download.nvidia.com/compute/cuda/4_2/rc/toolkit/cudatoolkit_4.2.6_linux_32_sles11.0.run

http://developer.download.nvidia.com/compute/cuda/4_2/rc/toolkit/cudatoolkit_4.2.6_linux_32_suse11.2.run

http://developer.download.nvidia.com/compute/cuda/4_2/rc/toolkit/cudatoolkit_4.2.6_linux_32_ubuntu10.04.run

http://developer.download.nvidia.com/compute/cuda/4_2/rc/toolkit/cudatoolkit_4.2.6_linux_32_ubuntu11.04.run

http://developer.download.nvidia.com/compute/cuda/4_2/rc/toolkit/cudatoolkit_4.2.6_linux_64_fedora14.run

http://developer.download.nvidia.com/compute/cuda/4_2/rc/toolkit/cudatoolkit_4.2.6_linux_64_rhel5.5.run

http://developer.download.nvidia.com/compute/cuda/4_2/rc/toolkit/cudatoolkit_4.2.6_linux_64_rhel6.0.run

http://developer.download.nvidia.com/compute/cuda/4_2/rc/toolkit/cudatoolkit_4.2.6_linux_64_sles11.0.run

http://developer.download.nvidia.com/compute/cuda/4_2/rc/toolkit/cudatoolkit_4.2.6_linux_64_suse11.2.run

http://developer.download.nvidia.com/compute/cuda/4_2/rc/toolkit/cudatoolkit_4.2.6_linux_64_ubuntu10.04.run

http://developer.download.nvidia.com/compute/cuda/4_2/rc/toolkit/cudatoolkit_4.2.6_linux_64_ubuntu11.04.run

Hou Yunqing

unread,

Mar 27, 2012, 4:51:50 AM3/27/12

to asf...@googlegroups.com

Hi HuanHuan,

I just took a look at the code for sm_30 generated by nvcc/ptxas. The binary encoding has not changed for the few instructions that I checked (MOV, MOV32I, EXIT, ST). There are some changes in the content of c[0x0], though.

Here's the code generated for this function:

__global__ void k(int* output){*output=0x123;}

code for sm_30

Function : _Z1kPi

//notice that 0x0000 is not disassembled. I haven't checked what instruction it is.

/*0008*/ /*0x10005de428004001*/ MOV R1, c [0x0] [0x44];

/*0010*/ /*0x00001de428004005*/ MOV R0, c [0x0] [0x140]; //launch argument starts at 0x140, it seems

/*0018*/ /*0x8c009de218000004*/ MOV32I R2, 0x123;

/*0020*/ /*0x00009c8590000000*/ ST [R0], R2;

/*0028*/ /*0x00001de780000000*/ EXIT;

/*0030*/ /*0xe0001de74003ffff*/ BRA 0x30;

/*0038*/ /*0x00001de440000000*/ NOP CC.T;

.......................

So... sm_30 really is Kepler? This is weird because nvcc doesn't work with sm_31... The released Kepler is GK104, which should be sm_31... but nvcc doesn't support sm_31 and supports sm_30 instead... well, that's as far as my speculation should go. asfermi needs to be updated before it can support Kepler, but I don't imagine that I can do it in any time soon.

Yunqing

Sun HuanHuan

unread,

Mar 27, 2012, 6:07:29 AM3/27/12

to asf...@googlegroups.com

On 3/27/2012 4:51 PM, Hou Yunqing wrote:
> Hi HuanHuan,
>
> I just took a look at the code for sm_30 generated by nvcc/ptxas. The
> binary encoding has not changed for the few instructions that I checked
> (MOV, MOV32I, EXIT, ST). There are some changes in the content of c[0x0],
> though.

This may be the ABI change. Can you test for "ptxas --abi-compile no"?
this will like remove the mov r1, c[0x0][0x44] (stack pointer set up).
Thus having the non abi-related instructions left, which should remain
unchanged.

only some new SIMD instructions added. see the new PTX 3.0 manual (which
is different from the same version 3.0 manual from cuda 4.1)

>
> Here's the code generated for this function:
> __global__ void k(int* output){*output=0x123;}
>
> code for sm_30
> Function : _Z1kPi
> //notice that 0x0000 is not disassembled. I haven't checked what
> instruction it is.
> /*0008*/ /*0x10005de428004001*/ MOV R1, c [0x0] [0x44];
> /*0010*/ /*0x00001de428004005*/ MOV R0, c [0x0] [0x140]; //launch
> argument starts at 0x140, it seems
> /*0018*/ /*0x8c009de218000004*/ MOV32I R2, 0x123;
> /*0020*/ /*0x00009c8590000000*/ ST [R0], R2;
> /*0028*/ /*0x00001de780000000*/ EXIT;
> /*0030*/ /*0xe0001de74003ffff*/ BRA 0x30;
> /*0038*/ /*0x00001de440000000*/ NOP CC.T;
> .......................
>
> So... sm_30 really is Kepler? This is weird because nvcc doesn't work with
> sm_31... The released Kepler is GK104, which should be sm_31... but nvcc
> doesn't support sm_31 and supports sm_30 instead... well, that's as far as
> my speculation should go. asfermi needs to be updated before it can support
> Kepler, but I don't imagine that I can do it in any time soon.

It's beause sm_31 (the former sm_23) is powerful than sm_30 (the former
sm_22). sm_30 remove much support for 32-bit int and double (reduced to
4% performance) and logical opertions. the 32-bit mad may be implemented
via float in micro-code (it took 24 cycles!) (like the good old days we
have no 8087 so we patch them by ALU?).

Only the float performance remained unchanged making sm_30 the great
rubbish for general non-float processing and yet a good card for gaming.

So it should be sm_30, with his big brother sm_31 on the way.

ptx_isa_3.0.pdf

Hou Yunqing

unread,

Mar 27, 2012, 7:19:39 AM3/27/12

to asf...@googlegroups.com

Thanks for so much info! I really wish someone could publish a somewhat comprehensive instruction latency list for sm_20... well, just a wish.

Sun HuanHuan

unread,

Mar 27, 2012, 9:48:50 AM3/27/12

to asf...@googlegroups.com

well. latency is nothing. what's really important is SP organization,
and what each group of SP's can do, and how well they do it (the
throughput).

Sun HuanHuan

unread,

Mar 27, 2012, 9:52:54 AM3/27/12

to asf...@googlegroups.com

Corrections.

As many instructions have a throught put of 32*n + 8 form.

So i guess integer mul/mad is NOT implment in micro-code. we have only 8
of 192SP's which are fully implemented. while the reset 184 are lame ones...

On 3/27/2012 7:19 PM, Hou Yunqing wrote:

Reply all

Reply to author

Forward