Thanks Yunqing,
Now I have some insights from some experiments to share with you guys.
First, the binary code of Kepler GPUs is organised into groups of 7
instructions. Each group of 7 is preceded by a special control-flow
instruction (so this new kind of instruction always appears at 0x0,
0x40, 0x80, ...).
I speculate that the bits of this instruction can be divided into 8
fields: the identifier bits, plus 7 fields holding latency-related
information for the remaining 7 instructions.
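A minimal sketch of this speculated layout (an assumption drawn from the experiments above, not a confirmed spec):

```python
# Speculative layout of a Kepler .text section, as described above:
# 64-byte groups, each an 8-byte control word followed by seven
# 8-byte instructions (control words at 0x0, 0x40, 0x80, ...).

GROUP_SIZE = 8 * 8   # one control word + 7 instructions, 8 bytes each
WORD_SIZE = 8

def is_control_slot(offset):
    """True if the 8-byte word at `offset` would be a control word."""
    return offset % GROUP_SIZE == 0

def split_groups(text_section):
    """Split raw .text bytes into (control_word, [7 instructions]) pairs."""
    groups = []
    for base in range(0, len(text_section), GROUP_SIZE):
        chunk = text_section[base:base + GROUP_SIZE]
        words = [chunk[i:i + WORD_SIZE] for i in range(0, len(chunk), WORD_SIZE)]
        groups.append((words[0], words[1:]))
    return groups
```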
I still cannot figure out how the 7 fields are encoded, since I don't
have the card. If anyone can provide a remote-access account to a
machine with a GTX680, that would be wonderful.
And as Yunqing said, the binary probably cannot run on Kepler without
this special instruction, since the dependence-checking hardware was
simplified on Kepler to save space and power.
Junjie
On Apr 21, 6:00 am, Hou Yunqing <hyq.neu...@gmail.com> wrote:
> @Everett
>
> Perhaps you should wait still longer for the real compute card to come out.
> GTX680 is a GK104 card which is designed like a GTX 460 (GF104). But surely
> you can try to get asfermi to work with sm_30 first. I highly doubt sm_30
> will be the architecture for the real compute cards, but I suppose going
> from sm_30 to the real compute architecture will be trivial.
>
> @JLai
>
> It is unlikely to work on sm_30. cuobjdump is a big liar and it cheats. It
> has omitted two instructions in its output for sm_30. It used to give you
> wrong hex code for instructions when you fed it stuff it couldn't
> recognise (if you feed it an instruction it can't recognise, the hex code
> of the following instructions, which it recognises without problem, will
> be wrong), though it looks like this time it's behaving slightly better.
> The instruction at 0x0, which cuobjdump ignores, is probably non-trivial:
> it is different almost every time I compile a different kernel. I'm not
> sure what the thing at 0x40 is.
>
> The instructions at 0x0 and 0x40 are probably of the same type. They use
> the same opcode, do not use predicate, and... seem to have different
> modifiers and operands... I'm not sure if Kepler can run without those
> instructions... some new info: these instructions are control flow
> instructions. They have na=1110 and nb=000100 (see
> http://code.google.com/p/asfermi/wiki/nanb). I doubt Kepler can run without
> those instructions. So if you want sm_30 to work, you'd probably have to
> familiarise yourself with the control flow instructions that asfermi has
> already documented, and then try to comprehend this new instruction. I
> wrote a small tool (
> http://code.google.com/p/asfermi/source/browse/asfermiOld/utilities/c...)
> to recover the correct binary encoding from the output of cuobjdump
> (because cuobjdump lies all the time). The input file to that cuprocess
> utility must be formatted in the same way as the output of cuobjdump... so
> I first get the hex code of the new instructions using cuobjdump -elf
> xxx.cubin (in the hex dump), then I copy the hex code and paste it over the
> long 0xabcdabcdabcdabcd inside the line generated by cuobjdump. For
> example, the hex dump for 0x0 is:
> 0x82823007 0x23704282
> So I just get any line generated by cuobjdump:
> /*0008*/ /*0x42804007200002e0*/ MOV R1, c [0x0] [0x44];
> And paste it over
> /*0008*/ /*0x8282300723704282*/ MOV R1, c [0x0] [0x44];
> Then run cuprocess to get:
> /*0008*/ /*0x8282300723704282*/ MOV R1, c [0x0] [0x44];
> 1110000000001100010000010100000101000001010000100000111011000100
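Following the nanb convention referenced above (na as the first four bits and nb as the last six bits of the recovered bit string, consistent with the example, which begins 1110 and ends 000100), a small extraction sketch:

```python
def na_nb(bits):
    """Split a 64-bit instruction bit string (as recovered by cuprocess)
    into its na (first 4 bits) and nb (last 6 bits) fields."""
    assert len(bits) == 64
    return bits[:4], bits[-6:]

# The recovered encoding of the control flow instruction above:
bits = "1110000000001100010000010100000101000001010000100000111011000100"
na, nb = na_nb(bits)   # na == "1110", nb == "000100"
```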
>
> Alternatively you may use the cubinEditor (written in C#.NET) I did to do
> the job, but it's not so stable when it's run with Mono:
> http://code.google.com/p/asfermi/wiki/Utilities
> ================
>
> step 2) /*0xe0001de74003ffff*/ BRA 0x60;
> step 3) /*0x20001de740000000*/ BRA 0x60;
> There's no new instruction here. BRA uses relative offset in its binary
> encoding, though cuobjdump outputs the absolute offset from the beginning
> of the kernel instead. If the BRA in step 3 is at 0x60 (right now it's at
> 0x50), then you'll get 0xe0001de74003ffff instead of 0x20001de740000000.
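A sketch of the offset arithmetic described above. That the offset is taken relative to the instruction following the BRA is a guess, but it is consistent with the sign flip between the two encodings shown (positive at 0x50, negative at 0x60):

```python
def bra_relative_offset(bra_addr, target_addr):
    """Byte offset a BRA would encode, assuming (a guess consistent with
    the sign flip between the two encodings above) that the offset is
    taken relative to the instruction following the BRA."""
    return target_addr - (bra_addr + 8)

# bra_relative_offset(0x50, 0x60) ->  8  (the 0x20001de740000000 case)
# bra_relative_offset(0x60, 0x60) -> -8  (the 0xe0001de74003ffff case,
#                                         note the set sign bits)
```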
>
> =================
>
> As for the EIATTR stuff, it is documented (without much comment :] ) in
> trunk/helper/helperCubin.cpp, hpCubinStage3()
> The EIATTR_CBANK_PARAM_OFFSETS that nvcc used to generate was really
> redundant, because that info is contained in EIATTR_KPARAM_INFO. Now looks
> like EIATTR_CBANK_PARAM_OFFSETS finally got stripped.
>
> (line numbers I'm talking about reside in r745, helper/helperCubin.cpp)
> To match this change, you'll have to adjust line 347 accordingly to
> correctly calculate the size for the new .nv.info.kernelname section. Then
> you'll just have to put line 350~354 into an if construct. Then at line 359
> the OR op will have to be done with 0x0140 instead of 0x0020 if we're
> dealing with sm_30. 0x0140 is the new starting location of parameters in c0
> bank. Also the size of the per-kernel Constant0 section would have to be
> adjusted accordingly.
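A sketch of the parameter-base change described above; `param_c0_offset` is a hypothetical helper name, and the OR with the base mirrors the line-359 change:

```python
def param_c0_offset(param_offset, sm_30):
    """Hypothetical helper: absolute constant-bank-0 offset of a kernel
    parameter. Parameters start at 0x0140 in c0 on sm_30, versus 0x0020
    on earlier architectures, matching the OR constants mentioned above."""
    base = 0x0140 if sm_30 else 0x0020
    return base | param_offset
```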
>
> Hope the above isn't too much info to digest :)
>
> Yunqing
>
> On Sat, Apr 21, 2012 at 1:29 AM, Everett Fominyen <fomin...@gmail.com> wrote:
>
> > Hi Yunqing,
>
> > I am working on implementing integer factorization and discrete logarithm
> > (ECDLP) algorithms using elliptic curves which heavily depend on modular
> > multiplication. At present, I can perform approx. (2.231*10^9) 192-bit
> > modular multiplications, with error checking routines, on an MSI GTX 570 OC
> > (Graphics Clock 786 MHz / Shader Clock 1572 MHz). Also, I have an MSI GTX
> > 680 (Graphics Clock 1006 MHz) and it is about 1.36x slower when compiled
> > with the released version of the Nvidia compiler (version 4.2). According
> > to this documentation on Nvidia's website (
> > http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/...),
> > section 5.4.1 on pages 74-76, my code should be running at least 1.37x
> > faster on the GTX 680 than on my GTX 570, not the other way around. I have
> > already taken into account that on my GTX 680, "__syncthreads()" and
> > "64-bit type conversions" will be 6x and 3x slower than on my GTX 570, so
> > the Nvidia compiler is a suspect for the poor performance on Kepler.
>
> > In a nutshell, I would like to bypass Nvidia's PTXAS whenever possible for
> > Kepler. If I could use asfermi to gain 10 percent on the GTX 570, and then
> > multiply that by 1.37 on the GTX 680, I would be very close to 3.5 billion
> > 192-bit modular multiplications per second, with error checking, at stock
> > clocks. That would be astonishing :-)
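A quick back-of-the-envelope check of the projection in the paragraph above (the 10% and 1.37x figures are the hopes stated there, not measurements):

```python
# Figures as given above (hypothetical projection, not a measurement):
current = 2.231e9     # 192-bit modular multiplications/s on the GTX 570
asfermi_gain = 1.10   # hoped-for 10% improvement from asfermi
kepler_ratio = 1.37   # expected GTX 680 over GTX 570 speedup

projected = current * asfermi_gain * kepler_ratio
print(projected)      # roughly 3.36e9, close to the 3.5 billion figure
```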
>
> > Thanks, Everett
>
> >> On Sun, Apr 15, 2012 at 2:34 PM, Everett Fominyen <fomin...@gmail.com> wrote:
>
> >>> Hello Yunqing,
> >>> Thank you very much for the notes and pointers on the Fermi
> >>> architecture; they will be of great value for understanding the Kepler
> >>> architecture. I will have some time off from work within the next couple
> >>> of months and will thoroughly examine the asfermi source code. From there,
> >>> I will make an attempt to figure out how much effort will be needed to add
> >>> sm_30 support.
>
> >>> Also, I have a couple of questions for you:
>
> >>> 1). On average, how much of a performance gain have you been able to
> >>> achieve using asfermi versus using Nvidia's nvcc/ptxas compiler 4.x? Of
> >>> course, assuming that the critical portions of your code were written using
> >>> in-line PTX assembly.
>
> >>> 2). Are there any tools such as "qhasm-cudasm" to help with register
> >>> allocation when using asfermi? I have a project with over 5000 lines of PTX
> >>> assembly code and it will be very difficult to keep track of the registers
> >>> manually.
>
> >>> Thanks, Everett
> >>> On Sat, Apr 14, 2012 at 10:07 PM, Hou Yunqing <hyq.neu...@gmail.com> wrote:
>
> >>>> Hi Everett,
>
> >>>> I've found my notes, though something that I wrote on paper is perhaps
> >>>> lost. I have attached the notes here.
>
> >>>> Also, I just realised that my code is a better documentation of the
> >>>> cubin format than my own notes which I took before I started coding. The
> >>>> most relevant parts in my code include:
> >>>> 1. WriteToCubinDirectOutput() in asfermi.cpp
> >>>> 2. Cubin.h/cpp
> >>>> 3. helper/helperCubin.cpp
>
> >>>> Things that surely have changed in sm_30:
> >>>> 1. ELFH32.Flags defines which architecture the cubin is for. You can
> >>>> find out this for sm_30 using cueditor_v5 in the download page, or you
> >>>> could just use any of your favourite elf utility.
> >>>> 2. The content in .nv.info.kernelname is most likely to have changed
> >>>> somewhat. You can first get a rough understanding of what that section is
> >>>> using cuobjdump -elf, then you can look at the parts of my code that
> >>>> produce such sections (hpCubinStage3() in helperCubin.cpp). You'll need a
> >>>> hex editor to find out what has changed.
> >>>> 3. Constant0Section is most likely to have changed somewhat as
>
> ...