Fwd: GTX 680 (sm_30)


Hou Yunqing

Apr 11, 2012, 1:18:14 AM
to asfermi Google Group


---------- Forwarded message ----------
From: <EFom...@cso.com>
Date: Wed, Apr 11, 2012 at 1:04 PM
Subject: Re: GTX 680 (sm_30)
To: Hou Yunqing <hyq.n...@gmail.com>



Hello Yunqing,

I do have a moderate understanding of the ELF file format, and please do forward this conversation to the group. Also, if you do come across your notes for the sm_20 cubin format, it would be totally awesome if you could send them my way.

Thanks, Everett
EFom...@cso.com



Hou Yunqing <hyq.n...@gmail.com>

04/10/2012 09:32 PM

To
EFom...@cso.com
cc
Subject
Re: GTX 680 (sm_30)





Hi Everett,

I won't have the time to do that, though if you wish to do so I can try to give some help. Are you familiar with the ELF object format? I think the changes (from sm_20 to sm_30) mostly take place at this level, as it seems that the sm_30 ISA is a minor extension of sm_20's. You'd have to spend some time investigating the cubins for sm_30 to confirm that not too many things have been changed. I made some notes on the sm_20 cubin format and I'll probably have to dig them out from my Windows machine...

Also, a few other people on asfermi's mailing list (asf...@googlegroups.com) seem to be interested in sm_30 support. They might be able to help. Would you like me to forward this conversation to the mailing group?

Thanks,
Yunqing

On Wed, Apr 11, 2012 at 1:23 AM, <EFom...@cso.com> wrote:

Hello,


Would it be possible for you to add support for sm_30? Or could you guide me on how to do so? Your help is greatly appreciated.


Thanks Everett

EFom...@cso.com



Hou Yunqing

Apr 14, 2012, 11:07:05 PM
to asf...@googlegroups.com
Hi Everett,

I've found my notes, though something that I wrote on paper is perhaps lost. I have attached the notes here. 

Also, I just realised that my code is better documentation of the cubin format than my own notes, which I took before I started coding. The most relevant parts of my code include:
1. WriteToCubinDirectOutput() in asfermi.cpp
2. Cubin.h/cpp
3. helper/helperCubin.cpp

Things that surely have changed in sm_30:
1. ELFH32.Flags defines which architecture the cubin is for. You can find out the value for sm_30 using cueditor_v5 on the download page, or with any ELF utility you like.
2. The content of .nv.info.kernelname is most likely to have changed somewhat. You can first get a rough understanding of what that section is using cuobjdump -elf, then look at the parts of my code that produce such sections (hpCubinStage3() in helperCubin.cpp). You'll need a hex editor to find out what has changed.
3. Constant0Section has most likely changed somewhat as well: in sm_2x kernel arguments start at c[0x0][0x20], while now they start at c[0x0][0x140]... at least there's a change in the size of Constant0Section for sm_30 (see the small sketch below).
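
For illustration only (this is just my guess at how the difference would look, based on the offsets above; I haven't verified it against real sm_30 output): a kernel whose first two arguments are 32-bit pointers would load them roughly like

sm_20:  MOV R2, c [0x0] [0x20];    // first argument
        MOV R3, c [0x0] [0x24];    // second argument
sm_30:  MOV R2, c [0x0] [0x140];   // first argument
        MOV R3, c [0x0] [0x144];   // second argument

And for point 1, any ELF tool will do: for example, readelf -h file.cubin prints the Flags field of the ELF header.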

Anyway, things may appear messy at first, but after you've spent a few hours on it things should become clearer. If there's anything you can't make out on your own, feel free to ask here :)

Cheers
Yunqing
cubin_notes.txt

Everett Fominyen

Apr 15, 2012, 2:34:44 AM
to asf...@googlegroups.com

Hello Yunqing,

Thank you very much for the notes and pointers on the Fermi architecture; they will be of great value for understanding the Kepler architecture. I will have some time off from work within the next couple of months and will thoroughly examine the asfermi source code. From there, I will make an attempt to figure out how much effort will be needed to add sm_30 support.
 
Also, I have a couple of questions for you:

1). On average, how much of a performance gain have you been able to achieve using asfermi versus using Nvidia's nvcc/ptxas compiler 4.x? Of course, assuming that the critical portions of your code were written using in-line PTX assembly.

2). Are there any tools such as "qhasm-cudasm" to help with register allocation when using asfermi? I have a project with over 5,000 lines of PTX assembly code, and it will be very difficult to keep track of the registers manually.

Thanks, Everett

Hou Yunqing

Apr 15, 2012, 4:03:27 AM
to asf...@googlegroups.com
Hi Everett,

Before I start, I should be honest about one thing: I stopped working on asfermi before I could finish some important parts of it (Stage 2 is something I've always wanted to do but never had the time). Some of the other guys here in this mailing group probably have used asfermi more than I have.

1) I can't really say what the "average" is. Sometimes I do ridiculous things (things that you rarely need/use in practice) without algorithmic changes and get a 2x speedup. Sometimes it's more like 1.2x to 1.3x. Honestly, since Stage 2 is not done and not much information is available, more advanced optimisations are pretty hard. There's still a lot of room for improvement if all the things I listed under Stage 2 (http://code.google.com/p/asfermi/wiki/Plan#Microbenchmarking) can be done.

2) I just googled qhasm-cudasm, and it seems to me that it's not a register allocator... maybe I'm wrong. If you just want to use variable names, you can consider using asfermi along with a good text preprocessor... cpp might be sufficient (see the small sketch below). But if you are looking for a real allocator... I don't know of any. Surely it won't be hard for someone to create one, though.
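
For example (just a sketch of the preprocessor idea, not something asfermi itself provides), you could name registers with cpp macros and run the source through cpp before feeding it to asfermi:

#define rAddr R0
#define rIdx  R2

MOV rAddr, c [0x0] [0x20];
IADD rAddr, rAddr, rIdx;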

So far, from what I've seen, sm_30's cubin seems to involve very little change. Maybe it will just take you 10 hours to familiarise yourself with asfermi (reduced to 2 if you focus on the cubin part without looking into the assembly process) and 10 hours to code/test/debug, or maybe another 5 if you need to revise a bit of ELF first. But if you want to add support for the new instructions, it might take a bit longer.

BTW, what has made you want to add sm_30 support to asfermi?

Cheers,
Yunqing

JLai

Apr 19, 2012, 11:00:52 AM
to asfermi

Hi Everett & Yunqing,

I want to add some comments to the first question.

1) I agree with Yunqing that normally algorithm-level optimization (coalescing, eliminating bank conflicts, loop tiling, etc.) gives you the best speedup. Using asfermi is more like draining the last bit of the hardware's capability. With careful register allocation (less spilling) and instruction scheduling (better overlapping and latency hiding), we may get something around ~10% speedup, but it really depends on your application.

I am actually working on adding sm_30 support to asfermi now. Too sad that Yunqing doesn't have to do this, and also I am not familiar with the ELF format...
JLai

Apr 19, 2012, 11:02:44 AM
to asfermi

Sorry, I meant "Yunqing doesn't have time to do this" :)

Dmitry N. Mikushin

Apr 19, 2012, 1:24:11 PM
to asf...@googlegroups.com
Hi JLai,

> I am not familiar with ELF format...

Last summer I used "libelf by example" by Joseph Koshy. It should be
enough to get familiar with ELF basics.

- D.


JLai

Apr 20, 2012, 10:47:59 AM
to asfermi

Thanks~ Dmitry.

I have some results, and folks, I need some comments here :)

I tried a small kernel like this:
extern "C" __global__ void f(float *C, const float *A) {   // extern "C" so the kernel symbol stays "f"
    const unsigned int idt = threadIdx.y*32 + threadIdx.x;
    C[idt] = A[idt] + 2.0;
}

1) I used nvcc from 4.2RC to generate a .cubin with sm_30 and -m32.
2) Then I used cuobjdump (also from 4.2RC) to disassemble the binary.
3) I used asfermi (modified a little to add sm_30 support) to assemble the assembly code from step 2), and then reused cuobjdump to disassemble the asfermi-generated binary (see the command sketch below).
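
(Roughly, the commands look like this; the nvcc/cuobjdump flags are the standard ones, but I'm writing the asfermi invocation from memory, so check the Usage wiki page for the exact syntax:)

nvcc -arch=sm_30 -m32 -cubin f.cu -o f.cubin     (step 1)
cuobjdump -sass f.cubin > f.sass                 (step 2)
asfermi f.sass -o f_asfermi.cubin                (step 3; exact options may differ)
cuobjdump -sass f_asfermi.cubin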

Comparing the results from steps 2) and 3) (the output is shown below):
A. The instructions at 0x0000 and 0x0040 (from step 2) are missing?? The text section size from step 2) is 16 bytes larger than from step 3) (128 vs 112).
B. Most instructions are well processed by asfermi except BRA:
step 2) /*0xe0001de74003ffff*/ BRA 0x60;
step 3) /*0x20001de740000000*/ BRA 0x60;
C. sm_30 code always adds this piece of code after EXIT:
/*0060*/ BRA 0x60;
/*0068*/ NOP CC.T;
/*0070*/ NOP CC.T;
/*0078*/ NOP CC.T;
D. Other ELF information is similar too, except for this (I don't understand the meaning of these attributes, any comments?):
step 2)
.nv.info.f PROGBITS
<0x1>
Attribute: EIATTR_PARAM_CBANK
Format: EIFMT_SVAL
Value: 0x9 0x80140
step 3)
.nv.info.f PROGBITS
<0x1>
Attribute: EIATTR_CBANK_PARAM_OFFSETS
Format: EIFMT_SVAL
Value: 0x0 0x4
<0x2>
Attribute: EIATTR_PARAM_CBANK
Format: EIFMT_SVAL
Value: 0x7 0x80020



Here is the result from step 2)
/*0008*/ /*0x10005de428004001*/ MOV R1, c [0x0] [0x44];
/*0010*/ /*0x84009c042c000000*/ S2R R2, SR_Tid_X;
/*0018*/ /*0x88001c042c000000*/ S2R R0, SR_Tid_Y;
/*0020*/ /*0x08009ca340000000*/ ISCADD R2, R0, R2, 0x5;
/*0028*/ /*0x10201c4340004005*/ ISCADD R0, R2, c [0x0]
[0x144], 0x2;
/*0030*/ /*0x00001c8580000000*/ LD R0, [R0];
/*0038*/ /*0x00209c4340004005*/ ISCADD R2, R2, c [0x0]
[0x140], 0x2;
/*0048*/ /*0x00001c005000d000*/ FADD R0, R0, 0x40000;
/*0050*/ /*0x00201c8590000000*/ ST [R2], R0;
/*0058*/ /*0x00001de780000000*/ EXIT;
/*0060*/ /*0xe0001de74003ffff*/ BRA 0x60;
/*0068*/ /*0x00001de440000000*/ NOP CC.T;
/*0070*/ /*0x00001de440000000*/ NOP CC.T;
/*0078*/ /*0x00001de440000000*/ NOP CC.T;

Then the result from step 3)
/*0000*/ /*0x10005de428004001*/ MOV R1, c [0x0] [0x44];
/*0008*/ /*0x84009c042c000000*/ S2R R2, SR_Tid_X;
/*0010*/ /*0x88001c042c000000*/ S2R R0, SR_Tid_Y;
/*0018*/ /*0x08009ca340000000*/ ISCADD R2, R0, R2, 0x5;
/*0020*/ /*0x10201c4340004005*/ ISCADD R0, R2, c [0x0]
[0x144], 0x2;
/*0028*/ /*0x00001c8580000000*/ LD R0, [R0];
/*0030*/ /*0x00209c4340004005*/ ISCADD R2, R2, c [0x0]
[0x140], 0x2;
/*0038*/ /*0x00001c005000d000*/ FADD R0, R0, 0x40000;
/*0040*/ /*0x00201c8590000000*/ ST [R2], R0;
/*0048*/ /*0x00001de780000000*/ EXIT;
/*0050*/ /*0x20001de740000000*/ BRA 0x60;
/*0058*/ /*0x00001de440000000*/ NOP CC.T;
/*0060*/ /*0x00001de440000000*/ NOP CC.T;
/*0068*/ /*0x00001de440000000*/ NOP CC.T;



Dmitry N. Mikushin

Apr 20, 2012, 11:16:11 AM
to asf...@googlegroups.com
Hi JLai,

Nice study! Do you have a 680 on hand? It would be interesting to know if the code works with the lines from point C omitted. Maybe this is just an endless loop at 0x60 to prevent the kernel from executing random code after the kernel body, i.e. a safety block?

The EIATTR_CBANK_PARAM_OFFSETS should trivially be the byte offset of each individual kernel argument (with your two-pointer kernel compiled with -m32, that would presumably be 0x0 and 0x4, which matches the Value: 0x0 0x4 in your step-3 output). Could you try to experiment and change the kernel argument list to make sure?

Are you familiar with gdb / cuda-gdb? You can launch the kernel in the debugger and look around in its assembly: what specific memory addresses contain, etc. This way, knowing the input data, you might be able to guess the meaning of the second attribute.

- D.


JLai

Apr 20, 2012, 11:30:56 AM
to asfermi

To answer my own question...

The Instruction "BRA 0x60" is actually a new instruction that jumps to
its own position.

/*0048*/ /*0xe0001de74003ffff*/ BRA 0x48; //
Jumps to its own PC
/*0050*/ /*0x20001de740000000*/ BRA 0x60; //This is
the BRA instruction
/*0058*/ /*0x00001de780000000*/ EXIT;
/*0060*/ /*0xe0001de74003ffff*/ BRA 0x60; //
Jumps to its own PC

JLai

Apr 20, 2012, 11:39:06 AM
to asfermi

I don't have a 680 now... hopefully soon. So at this moment I still don't know whether the binary generated by the modified asfermi will actually work on a 680.
Yes, it seems that the BRA pointing to its own PC is just an infinite loop.

OK, I'll try some experiments on the kernel arguments.
Thanks D.



Everett Fominyen

Apr 20, 2012, 1:29:45 PM
to asf...@googlegroups.com
Hi Yunqing,

I am working on implementing integer factorization and discrete logarithm (ECDLP) algorithms using elliptic curves, which heavily depend on modular multiplication. At present, I can perform approx. (2.231*10^9) 192-bit modular multiplications, with error checking routines, on an MSI GTX 570 OC (Graphics Clock 786 MHz / Shader Clock 1572 MHz). Also, I have an MSI GTX 680 (Graphics Clock 1006 MHz) and it is about 1.36x slower when compiled with the released version of the Nvidia compiler (version 4.2). According to this documentation on Nvidia's website (http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf), section 5.4.1 on pages 74-76, my code should be running at least 1.37x faster on the GTX 680 than on my GTX 570, not the other way around. I have already taken into account that on my GTX 680, "__syncthreads()" and "64-bit type conversions" will be 6x and 3x slower than on my GTX 570, so the Nvidia compiler is a suspect for the poor performance on Kepler.

In a nutshell, I would like to bypass Nvidia's ptxas whenever possible for Kepler. If I could use asfermi to increase performance on the GTX 570 by 10 percent, and take that times 1.37 on the GTX 680, I would be very close to 3.5 billion 192-bit modular multiplications per second, with error checking, at stock clocks. That would be astonishing :-)

Thanks, Everett



 

Hou Yunqing

Apr 21, 2012, 12:00:48 AM
to asf...@googlegroups.com
@Everett

Perhaps you should wait a while longer for the real compute card to come out. The GTX 680 is a GK104 card, which is positioned like the GTX 460 (GF104). But surely you can try to get asfermi to work with sm_30 first. I highly doubt sm_30 will be the architecture of the real compute cards, but I suppose going from sm_30 to the real compute architecture will be trivial.

@JLai

It is unlikely to work on sm_30. cuobjdump is a big liar and it cheats. It has omitted two instructions in its output for sm_30. It used to give you the wrong hex code for instructions when you fed it stuff it couldn't recognise (if you feed it an instruction it can't recognise, the hex code of the following instructions, which it recognises without problem, will be wrong), though it looks like this time it's behaving slightly better. The instruction at 0x0, which cuobjdump ignores, is probably non-trivial. It's different almost every time I compile a different kernel. I'm not sure what the thing at 0x40 is.

The instructions at 0x0 and 0x40 are probably of the same type. They use the same opcode, do not use a predicate, and... seem to have different modifiers and operands... I'm not sure if Kepler can run without those instructions... some new info: these instructions are control flow instructions. They have na=1110 and nb=000100 (see http://code.google.com/p/asfermi/wiki/nanb). I doubt Kepler can run without those instructions. So if you want sm_30 to work, you'd probably have to familiarise yourself with the control flow instructions that asfermi has already documented, and then try to comprehend this new instruction. I wrote a small tool (http://code.google.com/p/asfermi/source/browse/asfermiOld/utilities/cuprocess.cpp) to recover the correct binary encoding from the output of cuobjdump (because cuobjdump lies all the time). The input file to that cuprocess utility must be formatted in the same way as the output of cuobjdump... so I first get the hex code of the new instructions using cuobjdump -elf xxx.cubin (in the hex dump), then I copy the hex code and paste it over the long 0xabcdabcdabcdabcd inside a line generated by cuobjdump. For example, the hex dump for 0x0 is:
0x82823007  0x23704282
So I just get any line generated by cuobjdump:
/*0008*/     /*0x42804007200002e0*/ MOV R1, c [0x0] [0x44];
And paste it over
/*0008*/     /*0x8282300723704282*/ MOV R1, c [0x0] [0x44];
Then run cuprocess to get:
/*0008*/     /*0x8282300723704282*/ MOV R1, c [0x0] [0x44];          1110000000001100010000010100000101000001010000100000111011000100

Alternatively you may use the cubinEditor (written in C#.NET) that I made to do the job, but it's not so stable when run with Mono: http://code.google.com/p/asfermi/wiki/Utilities
================


   step 2)     /*0xe0001de74003ffff*/     BRA 0x60;

   step 3)     /*0x20001de740000000*/     BRA 0x60;
There's no new instruction here. BRA uses a relative offset in its binary encoding, though cuobjdump prints the absolute offset from the beginning of the kernel instead. If the BRA in step 3 were at 0x60 (right now it's at 0x50), you would get 0xe0001de74003ffff instead of 0x20001de740000000.

=================

As for the EIATTR stuff, it is documented (without much comment :] ) in trunk/helper/helperCubin.cpp, hpCubinStage3().
The EIATTR_CBANK_PARAM_OFFSETS that nvcc used to generate was really redundant, because that info is contained in EIATTR_KPARAM_INFO. Now it looks like EIATTR_CBANK_PARAM_OFFSETS has finally been stripped.

(The line numbers I'm talking about are in r745, helper/helperCubin.cpp.)
To match this change, you'll have to adjust line 347 to correctly calculate the size of the new .nv.info.kernelname section. Then you'll just have to put lines 350~354 into an if construct. Then at line 359 the OR will have to be done with 0x0140 instead of 0x0020 if we're dealing with sm_30; 0x0140 is the new starting location of parameters in the c0 bank. The size of the per-kernel Constant0 section would also have to be adjusted accordingly.

Hope the above isn't too much info to digest :)

Yunqing

JLai

May 9, 2012, 8:53:03 AM
to asfermi

Thanks Yunqing,

Now I have some insights from some experiments to share with you guys.
First, the binary code for Kepler GPUs is separated into 7-instruction groups. Before each group of 7 instructions there is always the special control-flow instruction (so this new kind of instruction will always appear at 0x0, 0x40, 0x80, ...). I speculate that the instruction's bits can be separated into 8 fields: the identifier bits and 7 fields holding latency-related information for the remaining 7 instructions.

I still cannot figure out how the 7 fields are encoded. I don't have the card... if anyone can provide a remote-access account with a GTX 680, that would be wonderful.
And as Yunqing said, it is probable that the binary cannot run on Kepler without this special instruction, since the dependence-checking hardware has been simplified on Kepler to save space and power.

Junjie


Everett Fominyen

May 9, 2012, 9:21:39 AM
to asf...@googlegroups.com
Junjie,

What platform do you need an account on, Linux or Windows? I can set you up with a test account on either.

Thanks, Everett

Sent from my iPhone

Junjie Lai

May 9, 2012, 9:28:15 AM
to asf...@googlegroups.com

Hi Everett,

That's great!
Linux is good for me.

Thanks!

Junjie

Everett Fominyen

May 9, 2012, 9:56:36 AM
to asf...@googlegroups.com
I am currently on my way to work and I will have your test account set up in less than an hour from now. I will also send you your connection and login info.

Specs for the test machine:
OS: Redhat Linux FC17
CPU: AMD 1090T
Memory: 16GB
GPU 0: GTX 680
GPU 2: GTX 570

Thanks, Everett
Sent from my iPhone

Junjie Lai

May 9, 2012, 10:06:51 AM
to asf...@googlegroups.com

Super!
Thanks Everett.

My email: lai...@gmail.com
Junjie

Hou Yunqing

May 16, 2012, 2:08:10 AM
to asf...@googlegroups.com
Hi guys,

I've worked for a while on the encoding of the new control-flow instruction... I've got something to share with you.

I say, for the sake of convenience, we should give it a name; here I'll call this instruction SCHI, for "scheduling info".
It's encoded like this:
1110 AAAAAA BBBBBBBB CCCCCCCC DDDDDDDD EEEEEEEE FFFFFFFF GGGGGGGG 000100
Each same-lettered series of bits is the information for one instruction.
So there's 1 group of 6 bits (AAAAAA) and 6 groups of 8 bits (the rest).
I haven't seen the lowest 2 bits being used in those 8-bit groups, so I suppose the 6-bit group contains everything it needs to contain, with the two lowest bits (always 0) truncated.
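
(A small decoder sketch, purely for illustration, assuming the bit string above is written most-significant-bit first; if the real bit order turns out to be different, the shifts need adjusting:)

#include <stdint.h>
/* Split a 64-bit SCHI word into the per-instruction fields described above.      */
/* fields[0] is the 6-bit group (AAAAAA); fields[1..6] are the 8-bit groups B..G. */
static int decode_schi(uint64_t w, unsigned fields[7])
{
    if ((w >> 60) != 0xE || (w & 0x3F) != 0x04)   /* not of the form 1110....000100 */
        return 0;
    fields[0] = (unsigned)((w >> 54) & 0x3F);
    for (int i = 0; i < 6; ++i)
        fields[1 + i] = (unsigned)((w >> (46 - 8 * i)) & 0xFF);
    return 1;
}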

Here's my data:

organised in this way:
SCHI binary code
next 7 instructions

1110 011101 00000000 00000000 00000000 00000000 00000000 00000000 000100 
/*0048*/     /*0x00001de780000000*/ EXIT;
/*0050*/     /*0xe0001de74003ffff*/ BRA 0x50;
/*0058*/     /*0x00001de440000000*/ NOP CC.T;
/*0060*/     /*0x00001de440000000*/ NOP CC.T;
/*0068*/     /*0x00001de440000000*/ NOP CC.T;
/*0070*/     /*0x00001de440000000*/ NOP CC.T;
/*0078*/     /*0x00001de440000000*/ NOP CC.T;

1110 001000 00011101 00000000 00000000 00000000 00000000 00000000 000100 
/*0048*/     /*0x00009c8590000000*/ ST [R0], R2;
/*0050*/     /*0x00001de780000000*/ EXIT;
/*0058*/     /*0xe0001de74003ffff*/ BRA 0x58;
/*0060*/     /*0x00001de440000000*/ NOP CC.T;
/*0068*/     /*0x00001de440000000*/ NOP CC.T;
/*0070*/     /*0x00001de440000000*/ NOP CC.T;
/*0078*/     /*0x00001de440000000*/ NOP CC.T;

1110 000101 00001000 00011101 00000000 00000000 00000000 00000000 000100 
/*0048*/     /*0x10309c0348004005*/ IADD R2, R3, c [0x0] [0x144];
/*0050*/     /*0x00009c8590000000*/ ST [R0], R2;
/*0058*/     /*0x00001de780000000*/ EXIT;
/*0060*/     /*0xe0001de74003ffff*/ BRA 0x60;
/*0068*/     /*0x00001de440000000*/ NOP CC.T;
/*0070*/     /*0x00001de440000000*/ NOP CC.T;
/*0078*/     /*0x00001de440000000*/ NOP CC.T;

1110 001000 00000101 00001000 00011101 00000000 00000000 00000000 000100 
/*0048*/     /*0x10201c0350004005*/ IMUL.U32.U32 R0, R2, c [0x0] [0x144];
/*0050*/     /*0x10309c0348004005*/ IADD R2, R3, c [0x0] [0x144];
/*0058*/     /*0x00009c8590000000*/ ST [R0], R2;
/*0060*/     /*0x00001de780000000*/ EXIT;
/*0068*/     /*0xe0001de74003ffff*/ BRA 0x68;
/*0070*/     /*0x00001de440000000*/ NOP CC.T;
/*0078*/     /*0x00001de440000000*/ NOP CC.T;

1110 000101 00001000 00000101 00001000 00011101 00000000 00000000 000100 
/*0048*/     /*0x4000dc034800c000*/ IADD R3, R0, 0x10;
/*0050*/     /*0x08201c0350000000*/ IMUL.U32.U32 R0, R2, R2;
/*0058*/     /*0x10309c0348004005*/ IADD R2, R3, c [0x0] [0x144];
/*0060*/     /*0x00009c8590000000*/ ST [R0], R2;
/*0068*/     /*0x00001de780000000*/ EXIT;
/*0070*/     /*0xe0001de74003ffff*/ BRA 0x70;
/*0078*/     /*0x00001de440000000*/ NOP CC.T;

1110 000101 00001000 00000101 00000101 00001000 00011101 00000000 000100 
/*0048*/     /*0x4000dc034800c000*/ IADD R3, R0, 0x10;
/*0050*/     /*0x08201c0350000000*/ IMUL.U32.U32 R0, R2, R2;
/*0058*/     /*0x10309c0348004005*/ IADD R2, R3, c [0x0] [0x144];
/*0060*/     /*0x10001c0350004005*/ IMUL.U32.U32 R0, R0, c [0x0] [0x144];
/*0068*/     /*0x00009c8590000000*/ ST [R0], R2;
/*0070*/     /*0x00001de780000000*/ EXIT;
/*0078*/     /*0xe0001de74003ffff*/ BRA 0x78;

1110 101001 00000101 00001000 00000101 00000101 00001000 00011101 000100 
/*0048*/     /*0x08209c0350000000*/ IMUL.U32.U32 R2, R2, R2;
/*0050*/     /*0x4000dc034800c000*/ IADD R3, R0, 0x10;
/*0058*/     /*0x10201c0350004005*/ IMUL.U32.U32 R0, R2, c [0x0] [0x144];
/*0060*/     /*0x10309c0348004005*/ IADD R2, R3, c [0x0] [0x144];
/*0068*/     /*0x00001c0350000000*/ IMUL.U32.U32 R0, R0, R0;
/*0070*/     /*0x00009c8590000000*/ ST [R0], R2;
/*0078*/     /*0x00001de780000000*/ EXIT;
/*0080*/     /*0xe0001de74003ffff*/ BRA 0x80;
/*0088*/     /*0x00001de440000000*/ NOP CC.T;
/*0090*/     /*0x00001de440000000*/ NOP CC.T;
/*0098*/     /*0x00001de440000000*/ NOP CC.T;
/*00a0*/     /*0x00001de440000000*/ NOP CC.T;
/*00a8*/     /*0x00001de440000000*/ NOP CC.T;
/*00b0*/     /*0x00001de440000000*/ NOP CC.T;
/*00b8*/     /*0x00001de440000000*/ NOP CC.T;

My observation: NOP CC.T uses no scheduling info (8 straight 0s). Same goes for the self-pointing BRA, though EXIT has scheduling info.

Hope the above info can help you get a bit further.

Best wishes,
Yunqing

Hou Yunqing

May 16, 2012, 12:01:39 PM
to asf...@googlegroups.com
So GK110 has emerged... looks like it's still gonna be sm_30 with the same SMX stuff, but perhaps with full throughput for some instructions.

Everett Fominyen

May 16, 2012, 2:40:28 PM
to asf...@googlegroups.com
It seems like, for now, the only major GK110 (sm_35) improvement will be the throughput of the FP64 arithmetic instructions: 8x GK104's (sm_30), per SMX, clock for clock. It will be a big deal for those who use the FP64 arithmetic instructions, such as the supercomputing folks. For the rest of us on the Fermi/Kepler architectures, the GeForce GPUs will completely put the Tesla GPUs to shame on the other arithmetic instructions based on price/performance; oops, don't tell Nvidia I said that :-)

Hou Yunqing

May 16, 2012, 10:20:12 PM
to asf...@googlegroups.com
So it's called sm_35? Confirmed? BTW I've heard from a few people that GTX 680 has crippled integer units as well, so perhaps the benefit is more than just for FP64? Not sure if it's GK104 that got crippled integer units or if it's just GTX 680... I don't expect the Tesla K10 board to come with reduced integer instruction throughput, though it may... 

The new SMX design, with an 8:1 core-to-LD/ST ratio, is perfect for global memory, but kills a lot of shared memory throughput...

Sent from my iPod

Everett Fominyen

May 16, 2012, 11:49:54 PM
to asf...@googlegroups.com
Nvidia added support for GPUs with Compute Capability 3.5 in CUDA 5.0 (PTX ISA version 3.1), and I assume this will be for the GK110-based GPUs. As for the 32-bit integer arithmetic throughput on the GTX 680 (sm_30), my code now runs 1.45x faster than on my factory-overclocked GTX 570 (Graphics Clock 786 MHz / Shader Clock 1572 MHz). So the rumors about the 32-bit integer arithmetic instructions being crippled are completely unjustified. However, I had to mess a lot with my code to reach that level of performance on the GTX 680.

JLai

Jun 27, 2012, 9:35:50 AM
to asf...@googlegroups.com
Hi everyone,
I patched asfermi a little; now it supports the .cubin file format on Kepler.
But there is more...
1.
However, as we discussed before, on Kepler there is something new before each group of 7 instructions, with the format 0xAABBCCD7 0x2DEEFFGG.
According to my experiments so far, this new information is not an instruction that can be executed; it's more like something the scheduler utilizes to better schedule the instructions' execution.
(I tested the throughput of LDS and LDS.64, and the throughput can reach 32 operations per clock per SM.
 If the new information were executed, it would be 32 * (7/8) = 28 operations per clock per SM.)

2.
If we change the Fermi assembly code a little bit (change the parameter start address) and use the patched asfermi, we can run the cubin without problem on Kepler.
But the performance is very poor, something like 20%~30% (not an accurate number).

3.
I contacted a friend at NVIDIA, and that friend asked some other guy in the compiler group.
Apparently this information is hidden intentionally and NVIDIA doesn't want to disclose it.

4.
For some programs (with few instruction types), a tradeoff is to generate this information (I call them controlling bits) ourselves.
For example, for all the IMADs I use 0x25 as their controlling bits.
By doing this, much better performance can be obtained.
I didn't test many programs, but for one the result is even better than the cubin nvcc generated (the actual instructions executed are the same).
That is not guaranteed of course, and I think this is only a lucky exception; I believe that for most programs this approach would be worse than the cubin that nvcc generates.

5.
However, asfermi can still help us understand the underlying architecture of NVIDIA GPUs, since we can control almost everything.
I hope that someday NVIDIA will make their assembly toolset public, like cuobjdump...

Dmitry N. Mikushin

Jun 27, 2012, 9:57:44 AM
to asf...@googlegroups.com
Dear Junjie,

Thank you for the great work!

Many of us would be very interested in investigating the control-bits notation (well, let's call it a static latency scheduler) for complete and efficient Kepler support.

Did you try to consult the nouveau/envy-tools guys? They might also be looking into Kepler and have some insight.

Also, what resources/support do you need to continue with this effort?

Thanks,
- Dima.


JLai

Jun 27, 2012, 10:19:48 AM
to asf...@googlegroups.com

Hi D.
I am quite busy this week. I suppose I can post the patched asfermi some time next week if you guys are interested.
So this is what I did:
1. Modify the old assembly for Fermi (change all the c[0x0][0x20] to c[0x0][0x140], ...).
2. Use asfermi to generate a .cubin; this .cubin should run on Kepler without problem.
3. Use some scripts to process the assembly and add !RawInstruction 0xXXXXXXX7 0x2XXXXXXX before each group of 7 instructions.
    Within this script we can control how the control-bits notation is manipulated.
    Use this processed assembly code to generate a new .cubin and test the performance.
    If we use the same controlling information as the one generated by nvcc, we don't lose performance (that's nice).

I don't know much about the work of the nouveau/envy-tool guys, and actually I am quite comfortable using the cuobjdump-output-like assembly that asfermi uses...
I have the 680 card now (I worked on Everett's machine for several days, thanks again!).
It would be nice if we all dug a little deeper into how the controlling information is encoded. My imagination for solving this puzzle is almost exhausted...

best,

Junjie

Dmitry N. Mikushin

Jun 28, 2012, 3:51:21 PM
to asf...@googlegroups.com
Junjie,

Am I understanding correctly, that starting with Kepler, even knowing
the control bits, we cannot write arbitrary ISA code anymore, because
it must be suitably scheduled?

Hou Yunqing

Jun 28, 2012, 9:27:00 PM
to asf...@googlegroups.com
Hi Junjie,

Yes please send us the patch! I'll give another try to understand those bits, in perhaps a few days.

Yunqing

JLai

Jul 3, 2012, 8:14:47 AM
to asf...@googlegroups.com

Hi, very sorry for the late reply (I was really busy).
On Kepler GPUs we can still write arbitrary assembly code, no problem.
Even without the embedded scheduling information, we can still run the .cubin correctly, but to get reasonable performance, proper control information has to be added to the assembly code afterwards.
Is my explanation clear?

Junjie


JLai

Jul 3, 2012, 8:15:55 AM
to asf...@googlegroups.com

Hi Yunqing,
Very sorry for the late reply. 
OK. I'll upload the patch asap.
thanks.

junjie

Dmitry N. Mikushin

Jul 3, 2012, 8:32:04 AM
to asf...@googlegroups.com
Hi Junjie,

Yes, thanks a lot!

- D.


JLai

Jul 3, 2012, 8:50:11 AM
to asf...@googlegroups.com

Hi, Yunqing,

I just cleaned up the source.

How do I check the code in, or should I send it to you?
Thanks.

Junjie

JLai

Jul 3, 2012, 8:50:32 AM
to asf...@googlegroups.com

no problem

junjie

Everett Fominyen

Jul 3, 2012, 9:26:59 AM
to asf...@googlegroups.com
Yes it makes sense, thank you.

Sent from my iPhone

Hou Yunqing

Jul 4, 2012, 10:31:27 AM
to asf...@googlegroups.com
Hi junjie,

I just added you as a committer on the project. If your modification doesn't break anything, just check it in to the trunk. If you're not so sure, you can also create a branch and check in to that.

Thanks,
Yunqing

Junjie Lai

Jul 4, 2012, 11:11:59 AM
to asf...@googlegroups.com

Hi Yunqing,

I put the code under /branches/asfermi-p.
Please tell me if there is any problem.

Thanks.
Junjie

Dmitry N. Mikushin

Jul 4, 2012, 11:33:34 AM
to asf...@googlegroups.com
Hi Junjie,

Thank you, may I ask one small question on your branch:

--- trunk//RulesInstruction/RulesInstructionDataMovement.cpp      2012-07-04 17:13:23.899209814 +0200
+++ branches/asfermi-p//RulesInstruction/RulesInstructionDataMovement.cpp      2012-07-04 17:13:24.391209803 +0200
@@ -87,8 +87,9 @@
     SetOperands(2,
         &OPRRegister0ForMemory,
         &OPRGlobalMemoryWithImmediate24);
-    ModifierGroups[0].Initialize(true, 3,
+    ModifierGroups[0].Initialize(true, 4,
         &MRLDCopCG,
+        &MRLDCopCS,
         &MRLDCopLU,
         &MRLDCopCV);
     ModifierGroups[1].Initialize(true, 6,

- What changed here, is it portable?

- D.


Junjie Lai

Jul 4, 2012, 12:40:20 PM
to asf...@googlegroups.com

Hi D,

I added support for one more modifier to the LDL instruction in order to test a real program on the 680
(to see whether the .cubin that asfermi generates gives the same performance; the program uses LDL.CS).
I didn't test whether it works on a 580.
Thanks.

Junjie
--
Junjie Lai

JLai

unread,
Jul 5, 2012, 5:23:54 AM7/5/12
to asf...@googlegroups.com

Hi guys,

Have you tried the modified asfermi?
Does it work OK on your 680 card?

thanks.
junjie

Hou Yunqing

Jul 22, 2012, 4:01:52 AM
to asf...@googlegroups.com
Hi everyone,

I'm looking into the scheduling bits of sm_30 right now. The encoding of those bits does seem very confusing, especially when I have no information regarding the instruction latencies. I think I need a GTX 680 to do a bit of testing. Does anyone have a server to share? It would be even better if the server also had an sm_20 device.

Btw, I made a little tool that puts the cubin disassembly and its scheduling bits together. The source file is attached and usage is included in the file.

Thanks,
Yunqing
1.cpp

Hou Yunqing

Jul 24, 2012, 12:47:10 AM
to asf...@googlegroups.com
Hi Junjie and all,

I did a short test on Everett's server and some progress has been made. It seems that:
1. An instruction's control bits have an effect upon the execution time of the next instruction,
e.g.
CTRL BITS  Inst
010001     MOV xxxxxx
000101     IADD xxxxxx   // 000101 affects the dispatch interval between IADD and IMUL, not between MOV and IADD
001101     IMUL xxxxxx

2. (a) When the highest bit is set to 1, (the lower 5 bits + 1) = dispatch interval;
(b) 000000 corresponds to an interval of 32 clocks;
(c) so far it seems the minimal interval is 6 (with the lower 5 bits being 10100, and when the lower 5 bits < 5 the interval would still be 6).

000000 S2R R0, SR_ClockLo;
000000 S2R R1, SR_ClockLo;   // difference between R1 and R0 will be 32
000101 S2R R2, SR_ClockLo;   // difference will be 32
000000 S2R R3, SR_ClockLo;   // difference between R3 and R2 will be 8+1=9

So far I've only tested the above with a single kernel:
extern "C"{
__global__ void k(unsigned *output)
{
register unsigned k1 = clock();
register unsigned k2 = clock();
register unsigned k3 = clock();
register unsigned k4 = clock();
output[0] = k2-k1;
output[1] = k3-k2;
output[2] = k4-k3;
}
}
I modified the control bits of the cubin generated using CUBINEditor
(http://code.google.com/p/asfermi/wiki/Utilities), and launched the
cubin kernel using something similar to this file:
http://code.google.com/p/asft/source/browse/trunk/tests/others/block_order/main.cu
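
(For reference, a minimal driver-API launcher of that kind might look roughly like the sketch below; this is just an illustration, not the actual file at the link, and error checking is omitted. The cubin/kernel names are the ones from the example above.)

#include <cuda.h>
#include <cstdio>

int main()
{
    unsigned out[3] = {0, 0, 0};
    CUdevice dev; CUcontext ctx; CUmodule mod; CUfunction func; CUdeviceptr dOut;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuModuleLoad(&mod, "k.cubin");           // the (control-bit-edited) cubin
    cuModuleGetFunction(&func, mod, "k");    // kernel is extern "C", so the name is just "k"
    cuMemAlloc(&dOut, sizeof(out));

    void *args[] = { &dOut };
    cuLaunchKernel(func, 1, 1, 1, 1, 1, 1, 0, 0, args, 0);   // one block, one thread
    cuCtxSynchronize();
    cuMemcpyDtoH(out, dOut, sizeof(out));
    printf("%u %u %u\n", out[0], out[1], out[2]);
    return 0;
}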

The results here are rather preliminary; the effect of the highest bit is still unclear, as are the things involved in dual-scheduling and so on. I'm just putting it here in case you are still working on it too.

Cheers,
Yunqing



Junjie Lai

Jul 24, 2012, 5:54:14 AM
to asf...@googlegroups.com

Hi Yunqing,

Good work!

Here is some control info collected from a real program. (The bit order is different from your notation.)
Actually, most of them are 0x2X or 0x3X, with very few 0x04, 0x00.
I think you are right. Most of the time, with a high bit of 1, the rest of the bits are waiting intervals.

I have a question. Does anybody know the math pipeline latency of Kepler?

BAR ['31']
IMUL ['04', '24', '2d']
FMUL ['28', '20', '2d', '23', '2e', '22', '04', '27', '26']
LOP ['20']
IADD ['23', '28', '20', '04', '2e', '37']
S2R ['04', '28', '20', '30']
LD ['22', '24', '28', '2e', '27', '04', '23', '20', '37', '2f', '2a']
SHL ['2e', '20', '04']
LDL ['37', '33', '23', '04', '20', '24']
LDS ['23', '37', '04', '33', '20', '21', '2b', '27', '25', '22', '26', '2f', '32', '2e']
SHR ['28', '27']
MOV32I ['04']
STL ['28', '04', '20']
IMAD ['04', '25', '20', '2e', '2d', '23', '28']
FFMA ['23', '04', '33', '22', '2f', '20', '21', '24', '34', '2b', '35', '2d', '27', '31', '32', '26', '28', '37', '25', '30', '2e', '2c', '2a']
STS ['2e', '04', '25']
ISETP ['04']
ISCADD ['28', '2e', '04', '20']
SSY ['22']
MOV ['00', '20', '04', '21', '28', '25']
ST ['2e', '27', '20', '04']
EXIT ['2e']
NOP ['00']
BRA ['2e', '00']
--
Junjie Lai

Hou Yunqing

Jul 24, 2012, 7:11:43 AM
to asf...@googlegroups.com
Hi Junjie,

We have limited information on the pipeline latencies of Fermi, but I think in this aspect there may be a good deal of similarity between Fermi and Kepler, given the great similarity between the two ISAs. I've noticed that Kepler's IADD seems to have a latency of 9 clocks, which is the same as the latency of Fermi's IADD (if I remember correctly).

Are you still working on those control bits? If not, I'll continue looking into those bits and the scheduling pattern of ptxas, merge sm_30 support to trunk, and add an SCHI pseudo-instruction to allow explicit editing of the control bits, in perhaps one month (I would like to say "a few days", but last time I said that it actually took me almost a month, so I guess the pattern continues :[ )

Cheers,
Yunqing

Hou Yunqing

unread,
Jul 25, 2012, 11:59:19 PM7/25/12
to asf...@googlegroups.com
Hi guys, I've already merged sm_30 support to trunk and added support for SCHI (http://code.google.com/p/asfermi/wiki/OpcodeExecution#SCHI).

Yunqing

JLai

unread,
Jul 26, 2012, 3:16:16 AM7/26/12
to asf...@googlegroups.com

Hi Yunqing,

I don't have much time to work on that at the moment :(
But I'll post whatever I find here from time to time.

And good work on the SCHI. That is very useful.

Junjie

Dmitry N. Mikushin

Jul 27, 2012, 7:00:31 AM
to asf...@googlegroups.com
Dear colleagues,

I've just started testing libasfermi on sm_30 (GTX 680M). The cubin creation seems to be fine, and cuModuleGetFunction is able to get function entries from it. However, cuModuleGetGlobal returns 500 (CUDA_ERROR_NOT_FOUND) when trying to load the address of a constant variable. The same works with sm_20 without any problems. Can you see any Kepler-related differences that might be responsible?

Thanks,
- Dima.


Dmitry N. Mikushin

Jul 27, 2012, 12:21:11 PM
to asf...@googlegroups.com
Sorry, I forgot that the support for named constants only exists in the libasfermi branch and was never reviewed for merging into trunk. So it's logical that it has issues with Kepler. I need to recover the details in my mind...

- D.


Dmitry N. Mikushin

Jul 27, 2012, 3:39:15 PM
to asf...@googlegroups.com
Haha, apparently in sm_30 .nv.constant2 is renamed to .nv.constant3 :) Does it really make sense for them, or is it just for fun?

Hou Yunqing

Jul 22, 2013, 10:40:01 AM
to asfermi Google Group
Since we have so much info about SCHI in this thread, I'll continue in this thread:

Junjie noted that apart from the 0x2x and 0x3x that we more or less understand, there're also some 0x0 and 0x4.
My tests show that
0x0 = 32 clocks.
0x4 = independence (can be co-issued with preceding instruction)
But there's still some hardware-level checking in place. For example, co-issuing doesn't work with S2R, but does with MOV

While sm_20 used to have a minimal scheduling interval of 3, on sm_30 there is no such limit. I managed to dispatch 10 instructions from a single warp within a 7-clock window (9 of them are MOV, all producing the correct result). So the previously reported minimal interval of 6 is only applicable to S2R, or maybe some other instructions.

Yunqing

Dmitry N. Mikushin

Jul 22, 2013, 10:50:24 AM
to asf...@googlegroups.com
The CUDA math libs are a large mass of valid sample sm_30/sm_35 code. What if we used them to collect some broad statistics on the correlation between SCHI values and the instructions involved?

- D.


Hou Yunqing

Jul 22, 2013, 10:56:21 AM
to asfermi Google Group
Yep. Will do that tomorrow.

Yunqing

Hou Yunqing

Jul 23, 2013, 12:49:56 AM
to asfermi Google Group
I've looked through a part of the 32-bit libcublas (the entire library took too long to disassemble).

The scheduling info for each instruction is 8 bits, not 6 bits as I previously reported.
Bit 7 is only set by TEXDEPBAR.
Bit 6 is to be investigated.

I've attached the result.

Yunqing
libcublas.out3.txt