I'm a 100% total newbie at writing assembly. But I figured it would
be a good exercise. And besides, this tiny chunk of code is
definitely in the critical path of something I'm working on. Any and
all advice would be appreciated.
I'm trying to rewrite the following function in x86 assembly:
This compiles, runs and produces the correct answers. But I have a
few issues with it:
1) If I declare this function inline, it gives me garbage (like
10^-304)
2) If I compile with -Wall, I get a warning that the function doesn't
return a value, which is absolutely true, but I don't know how to fix
it.
3) I don't like how TWO_PI and NEG_TWO_PI are defined. I had to steal
it from some generated assembly. It would be nice to use M_PI,
4*atan(1) or something like that.
> This compiles, runs and produces the correct answers. But I have a
> few issues with it:
> 1) If I declare this function inline, it gives me garbage (like
> 10^-304)
> 2) If I compile with -Wall, I get a warning that the function doesn't
> return a value, which is absolutely true, but I don't know how to fix
> it.
> 3) I don't like how TWO_PI and NEG_TWO_PI are defined. I had to steal
> it from some generated assembly. It would be nice to use M_PI,
> 4*atan(1) or something like that.
I can't help you with your questions because I would always write
something like this in assembly rather than C, but is there some
reason that you can't use SSE2 rather than x87 here? SSE2 should
be much faster if available in the context of your problem.
-- write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end
>> This compiles, runs and produces the correct answers. But I have a
>> few issues with it:
>> 1) If I declare this function inline, it gives me garbage (like
>> 10^-304)
>> 2) If I compile with -Wall, I get a warning that the function doesn't
>> return a value, which is absolutely true, but I don't know how to fix
>> it.
>> 3) I don't like how TWO_PI and NEG_TWO_PI are defined. I had to steal
>> it from some generated assembly. It would be nice to use M_PI,
>> 4*atan(1) or something like that.
> I can't help you with your questions because I would always write
> something like this in assembly rather than C, but is there some
> reason that you can't use SSE2 rather than x87 here? SSE2 should
> be much faster if available in the context of your problem.
I'll second James' suggestion about SSE2!
Anyway, it seem that what you are trying to do is to take the difference between two angles and then make sure that said difference will be in the [-pi .. pi> range, right?
I.e. what is the rotation angle to get from theta2 to theta1?
Let's start by looking at the various alternatives:
if the signs of th1 and th2 are the same, then the difference _must_ be in range:
0 - pi = -pi
pi - 0 = pi
-0 - -pi = pi
-pi - 0 = -pi
It is only when the signs differ that you might need to add or subtract 2pi to bring it into range:
pi - -pi = 2pi
-pi - pi = -2pi
I don't see immediately how I can use this to speed it up though...
Anyway, trying your original algorithm:
movq xmm0,[theta1]
subsd xmm0,[theta2] ;; Result in [-2pi to 2pi]
movq xmm2,[plus_mask] ;; 0x7fffff...
andpd xmm2,xmm0 ;; ABS(diff), [0 to 2pi]
movq xmm3,[pi]
cmplesd xmm3,xmm2 ;; -1 mask if diff > pi
andpd xmm3,[twopi] ;; 0 or 2pi
subsd xmm2,xmm3 ;; [-pi to pi]
If the original subtraction sign was negative, then we must invert the sign of the result:
The code can be rescheduled a bit, and the mixture of 64-bit scalar and 128-bit packed operations must be checked that they don't introduce forwarding problems, but it seems like it should run in 10-15 cycles, depending upon the latency of the FP operations (SUBSD, CMPLESD, SUBSD)
I tried to figure out a way to use scaling and integer math, but that is likely to be slower.
Terje
-- - <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
<woess...@nospicedham.gmail.com> wrote:
>I'm a 100% total newbie at writing assembly. But I figured it would
>be a good exercise. And besides, this tiny chunk of code is
>definitely in the critical path of something I'm working on. Any and
>all advice would be appreciated.
>I'm trying to rewrite the following function in x86 assembly:
>This compiles, runs and produces the correct answers. But I have a
>few issues with it:
>1) If I declare this function inline, it gives me garbage (like
>10^-304)
>2) If I compile with -Wall, I get a warning that the function doesn't
>return a value, which is absolutely true, but I don't know how to fix
>it.
>3) I don't like how TWO_PI and NEG_TWO_PI are defined. I had to steal
>it from some generated assembly. It would be nice to use M_PI,
>4*atan(1) or something like that.
Can't help with your compiler interface questions, but if
you are going to stick with the FPU, you can use FLDPI to
load the value of pi.
Best regards,
Bob Masta
DAQARTA v6.02
Data AcQuisition And Real-Time Analysis
www.daqarta.com Scope, Spectrum, Spectrogram, Sound Level Meter
Frequency Counter, FREE Signal Generator
Pitch Track, Pitch-to-MIDI Science with your sound card!
On Jan 13, 4:59 am, Terje Mathisen <"terje.mathisen at
tmsw.no"@giganews.com> wrote:
> I'll second James' suggestion about SSE2!
I'm open to using SSE2. The only reason I used x87 is that I started
with the assembly code that g++ generated. By default, it generates
x87 instructions. But I'm certainly willing to try with SSE2
instructions.
> Anyway, it seem that what you are trying to do is to take the
> difference between two angles and then make sure that said
> difference will be in the [-pi .. pi> range, right?
That's it exactly. It's such a simple thing, but I can't come up with
a really elegant way to do it. The code generated by g++ involves a
jump. But I think this should be possible without a jump by using a
conditional move.
> Anyway, trying your original algorithm:
I'll give your implementation a try. A big part of the challenge (for
me, at least) is figuring out how to get this in to a form that g++
will understand. I don't really care where or how this is
implemented, but I need to be able to call it from C++. And it really
should be inline, as well. Otherwise, all the efficiency gained by
tweaking the assembly will be lost. :-p
> On Jan 13, 4:59 am, Terje Mathisen <"terje.mathisen at
> tmsw.no"@giganews.com> wrote:
> > I'll second James' suggestion about SSE2!
> I'm open to using SSE2. The only reason I used x87 is that I started
> with the assembly code that g++ generated. By default, it generates
> x87 instructions. But I'm certainly willing to try with SSE2
> instructions.
> > Anyway, it seem that what you are trying to do is to take the
> > difference between two angles and then make sure that said
> > difference will be in the [-pi .. pi> range, right?
> That's it exactly. It's such a simple thing, but I can't come up with
> a really elegant way to do it. The code generated by g++ involves a
> jump. But I think this should be possible without a jump by using a
> conditional move.
> > Anyway, trying your original algorithm:
> I'll give your implementation a try. A big part of the challenge (for
> me, at least) is figuring out how to get this in to a form that g++
> will understand. I don't really care where or how this is
> implemented, but I need to be able to call it from C++. And it really
> should be inline, as well. Otherwise, all the efficiency gained by
> tweaking the assembly will be lost. :-p
> Thanks,
> Bill
There is a straight-forward algorithm using the fact that only one of
the bounds can be crossed...
Something like this:
(Inputs in %xmm0, and %xmm1, output in %xmm0)
I probably have some of the comparisons reversed by mistake... but you
get the idea. You can do both comparisons in parallel. Using sign
tricks doesn't seem to be profitable, as that increases the length of
the critical path.
Bill Woessner wrote:
> On Jan 13, 4:59 am, Terje Mathisen<"terje.mathisen at
> tmsw.no"@giganews.com> wrote:
>> I'll second James' suggestion about SSE2!
> I'm open to using SSE2. The only reason I used x87 is that I started
> with the assembly code that g++ generated. By default, it generates
> x87 instructions. But I'm certainly willing to try with SSE2
> instructions.
>> Anyway, it seem that what you are trying to do is to take the
>> difference between two angles and then make sure that said
>> difference will be in the [-pi .. pi> range, right?
> That's it exactly. It's such a simple thing, but I can't come up with
> a really elegant way to do it. The code generated by g++ involves a
> jump. But I think this should be possible without a jump by using a
> conditional move.
>> Anyway, trying your original algorithm:
> I'll give your implementation a try. A big part of the challenge (for
> me, at least) is figuring out how to get this in to a form that g++
> will understand. I don't really care where or how this is
> implemented, but I need to be able to call it from C++. And it really
> should be inline, as well. Otherwise, all the efficiency gained by
> tweaking the assembly will be lost. :-p
> I probably have some of the comparisons reversed by mistake... but you
> get the idea. You can do both comparisons in parallel. Using sign
> tricks doesn't seem to be profitable, as that increases the length of
> the critical path.
Very nice, and definitely much better than my approach!
:-)
Terje
-- - <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
> sfuerst wrote:
>> There is a straight-forward algorithm using the fact that only one of
>> the bounds can be crossed...
>> Something like this:
>> (Inputs in %xmm0, and %xmm1, output in %xmm0)
>> subsd %xmm1,%xmm0
>> movsd plusM_PI(%rip), %xmm1
>> movsd minusM_PI(%rip), %xmm2
>> cmpgtsd %xmm0, %xmm1
>> cmpltsd %xmm0, %xmm2
>> andpd minus2M_PI(%rip), %xmm1
>> andpd plus2M_PI(%rip), %xmm2
>> addsd %xmm1, %xmm0
>> addsd %xmm2, %xmm0
>> I probably have some of the comparisons reversed by mistake... but you
>> get the idea. You can do both comparisons in parallel. Using sign
>> tricks doesn't seem to be profitable, as that increases the length of
>> the critical path.
> Very nice, and definitely much better than my approach!
> :-)
I really liked your approach more because it doesn't involve as many
loads nor as many long-latency operations like ADDSD and CMPccSD.
Looking at the above code we see four such long-latency instructions
in the path and I think we can do better with:
On Jan 12, 9:38 pm, Bill Woessner <woess...@nospicedham.gmail.com>
wrote:
> I'm a 100% total newbie at writing assembly. But I figured it would
> be a good exercise. And besides, this tiny chunk of code is
> definitely in the critical path of something I'm working on. Any and
> all advice would be appreciated.
> I'm trying to rewrite the following function in x86 assembly:
The gas assembler format is ghastly so I converted your code to Nasm
creating the file DiffAngle.nasm as follows. I think it's easier to
read and despite not being inlined it seemed to be faster. Time for
100 million calls:
c function: 2.8 seconds
asm func: 1.2 seconds
I've not written floating point assembly code or linked C++ with
assembly before so there's a chance something is not right. I'll list
the steps I took so it can be recreated/challenged/corrected.
Hopefully this will help with the things you asked.
First the assembly code.
;
;DiffAngle
;
;Build with such as
; nasm -f elf32 DiffAngle.nasm -l DiffAngle.list
;
bits 32
cpu ppro
%define PI 3.1415926535897932384626433832795
%define TWO_PI 6.283185307179586476925286766559
%define NEG_TWO_PI -6.283185307179586476925286766559
> This compiles, runs and produces the correct answers. But I have a
> few issues with it:
I'm not sure your code is right. The constants seem to be the other
way round from what they are intended to be. I had to swap them over
to get the same results as your C program but it is late.... If
someone points out some faults I'll respond another day.
> 1) If I declare this function inline, it gives me garbage (like
> 10^-304)
To try it out I made a test routine called dtest1.c. Build steps on
Linux for the whole thing were
> 2) If I compile with -Wall, I get a warning that the function doesn't
> return a value, which is absolutely true, but I don't know how to fix
> it.
In dtest1.c I had to include the following prototype
extern "C" {
double DiffAngle(double, double);
}
so that g++ didn't expect a mangled routine name.
> 3) I don't like how TWO_PI and NEG_TWO_PI are defined. I had to steal
> it from some generated assembly.
These can be defined much more easily in the assembly code. The "dq"
code defines what the assembler calls a quadword, 8 bytes, in
the .data section. For example,
two_pi: dq 6.283185307179586476925286766559
> It would be nice to use M_PI,
> 4*atan(1) or something like that.
I know 4*atan(1) is pi but I don't know what M_PI is supposed to be.
I've made no attempt to understand the maths of your solution; I just
copied your code. So both the blame for faults and the credit for
increased performance go to you. :-)
> I'm a 100% total newbie at writing assembly. But I figured it would
> be a good exercise. And besides, this tiny chunk of code is
> definitely in the critical path of something I'm working on. Any and
> all advice would be appreciated.
> I'm trying to rewrite the following function in x86 assembly:
Use two percent signs to create a percent sign.
Add a \n\t at the end, trust me, it helps if you want to read the generated asm.
Do not simply reference some stack location.
The basic problem is:
This inline asm is missing inputs and output
> );
> }
> This compiles, runs and produces the correct answers. But I have a
> few issues with it:
> 1) If I declare this function inline, it gives me garbage (like
> 10^-304)
Which is no wonder, when the compiler inlines the function, your stack
references are totally bogus.
> 2) If I compile with -Wall, I get a warning that the function doesn't
> return a value, which is absolutely true, but I don't know how to fix
> it.
By creating an output from the inline asm and returning it from the function.
> 3) I don't like how TWO_PI and NEG_TWO_PI are defined. I had to steal
> it from some generated assembly. It would be nice to use M_PI,
> 4*atan(1) or something like that.
The x87 has a instruction to load pi, two pi is two loads and an add+pop. Neg is
mul -1. The trick is to create the constants, and keep them in the lower stack
register. For that you better work on an array of values (where SSE gets also
handy) because you have to leave the x87 stack the way you found it.
You have to understand how GCC handles inline ASM.
For GCC, inline ASM is just a bunch of text where GCC makes certain
substitutions (that's also why you should write every literal % as %%, % is a
substitution like in printf). It does not grok or understand your ASM. It can
make no deductions from what you write there.
To interface with the rest of the code, you have to tell GCC what you are doing
there, by means of inputs and outputs to your ASM, because the state of the
machine before and after the ASM is all GCC cares for.
Let's start simple:
asm (
"add %%eax, %%edx"
);
Is a valid inline ASM, but has the same problem as yours. The compiler will put
the txt literately into the output. If something meaningful is in eax or edx,
there is no guaranty. That GCC does not throw your ASM away (yes, GCC can
optimize ASM a little bit according to their input and output) comes from the
special rule that empty input/output/clobber means "clobbers everything".
From GCC points of view this is the same as
asm (
"yadda yadda yadda"
);
The canonical format of an inline ASM in GCC is:
asm {volatile} (
"instructions"
: {outputs}
: {inputs}
: {clobbers}
);
You do not always have to write all parts, you can leave empty last parts out.
A volatile can be used to give the compiler a little nudge, but the exact
semantic is complicated.
So for our example above one should write:
int a, b;
asm (
"add %1, %0"
: "=r" (a)
: "r" (b), "0" (a)
);
Besides of remaining problems (the compiler may see that the values are not
initialized when passed into the asm, so he may remove the asm), let's see what
we have here.
First, in the instruction section, we refer to our registers by using the number
of the operands. Counting starts at zero and goes from outputs to inputs.
Then there are the outputs. Here we have one output. with "=r" we say the
compiler he should expect an result (the =, all outputs must have an =, except
if you use special stuff like +) in an general purpose register (the r is the
code for that). In parenthesis we say that the compiler should arrange for the
variable a to be living in that register.
After that there are the inputs. First we say we want b to be in a general
purpose register. Then we say that a should be in the same spot as operand 0.
Normally you should also say GCC that you clobbered the eflags register (the
condition codes), by adding a cc clobber like so:
int a, b;
asm (
"add %1, %0"
: "=r" (a)
: "r" (b), "0" (a)
: "cc"
);
But this is implicit on x86, because it is a cc0 target.
So, now to your code.
First, you are entering a world of pain with x87. The stack based nature of the
x87 does not play well with inline ASM.
But more important: never return premature from your inline ASM...
I did not check the calc stuff. We tell the compiler that ret will be on the top
of the x87 stack, we put a temporary into the eax register because we clobber
it, we pass in all inputs as memory operands.
This is just a little crash introduction, there are a lot of details and caveats
to inline ASM, esp. the freedoms the compiler has with your inline ASM.
For all the nitty gritty details, look into the GCC handbook under inline assembly:
http://gcc.gnu.org/onlinedocs/gcc-4.6.2/gcc/Extended-Asm.html and the chapters after that.
Whatever you do, don't forget the basic rule: GCC does not understand your ASM,
he goes strictly by the input/output/clobber. I you lie there to GCC what your
ASM is about, don't be disappointed when GCC does not obey your wish (and
resorting to giganto "whack-over-the-head" like volatile and a memory clobber
may bite you later).
In face of a inline ASM, GCC does not drop everything on the floor, crawls into
a hole and turns on dumb mode, it can make optimization decisions based on the
inputs/outputs.
NB: GCC views that inline ASM as one single instruction, this can have some
implications.
(the whole decision to implement inline ASMs the way they are in GCC is to
prevent problems other compiler which support inline ASM have with inline ASM,
for example has to understand every single instruction (also the seldom used
system level instructions) and falls over if a new instruction is encountered,
or turn to totally dumb mode. One can use inline ASM for fancy stuff (and i like
to use it lavishly because it has some upsides), but there are limits to it.
Basically it was intended to get that single special instruction the compiler
can not generate, mainly system level stuff)
> Thanks in advance,
If i would be you, i would forget about the whole thing.
This code is still "miserable", but in this case the compiler can do nothing
about it, because he does not understand ASM. (and esp. can not do constant
elimination). In loops more optimizations would be possible, but now you have to
do them by hand. And one big bummer is that you have to write two versions, one
for i386, and one for x86_4, because the later passes float args in SSE
registers, so the compiler would copy the functions arguments first to the stack
so you can load them from there to the x87.
You not being happy with the floating point code the compiler generates probably
stems from the fact that the compiler has to be very careful when it comes to
floats (NaNs, Inf, loss of precision, exes precision, etc.). So he prop. passes
all things of to the proper lib calls when he can not proof that emitting the
direct instructions can safely be done.
Additionally i expect that you are simply compiling for an generic x86 machine,
try to play with -march=pentium3 or -msse.
You may want to try -ffast-math, this unleashes GCC to get aggressive with your
floats. If -ffast-math generates wrong results in your program (they are not
really wrong, but simply effects from "imprecise" float arithmetic covered up
when not optimizing so aggressive), with a new enough GCC you can try:
__attribute__((optimize("fast-math")))
inline double DiffAngle_o(double theta1, double theta2)
{
double delta(theta1 - theta2);
N.B: you do have set -O? Prop. to -O2? Maybe even to -O3?
The only reason i would look into making this an asm is do diff a complete array
of angles with SSE vector instructions to calc several values at once, because,
while i
...
> I'm a 100% total newbie at writing assembly. But I figured it would
> be a good exercise. And besides, this tiny chunk of code is
> definitely in the critical path of something I'm working on. Any and
> all advice would be appreciated.
> I'm trying to rewrite the following function in x86 assembly:
> This compiles, runs and produces the correct answers. But I have a
> few issues with it:
> 1) If I declare this function inline, it gives me garbage (like
> 10^-304)
That is because you actually require a real call to the function. If the
above assembly is inlined, the compiler doesn't really know where to put
the input and output variables.
I'm rewriting your C++ first, so I can put it into assembly more easily:
> 3) I don't like how TWO_PI and NEG_TWO_PI are defined. I had to steal
> it from some generated assembly. It would be nice to use M_PI,
> 4*atan(1) or something like that.
Just define it as new inputs and let the compiler worry. Like:
That doesn't do what you think it does. That cast converts the constant to
floating point -- it doesn't just copy the bits. There are only 56 bits of
mantissa available in a double, so some of your bits get lost.
> That doesn't do what you think it does. That cast converts the constant to
> floating point -- it doesn't just copy the bits. There are only 56 bits of
> mantissa available in a double, so some of your bits get lost.
I do know that,I was in a hurry and tried to get away with psudocode, but I should have stated that clearly.
Thanks for correcting my omissions!
Terje--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
Success! Thanks to everyone who helped me out. This was a great
little project to learn about inline assembly in GCC. For those of
you who are interested, here's the final product:
One final question - is it possible to declare TWO_PI and NEG_TWO_PI
as const? And, if so, what constraint would I use? Right now, if I
declare them const, I get the error "memory input 3 is not directly
addressable".
> Success! Thanks to everyone who helped me out. This was a great
> little project to learn about inline assembly in GCC. For those of
> you who are interested, here's the final product:
> One final question - is it possible to declare TWO_PI and NEG_TWO_PI
> as const? And, if so, what constraint would I use? Right now, if I
> declare them const, I get the error "memory input 3 is not directly
> addressable".
It appears to be a C vs. C++ issue. If I compile the code with gcc
(and a .c extension), it works fine. If I compile with g++, it fails
with this error:
AngleDiff.cc: In function ‘double DiffAngle(double, double)’:
AngleDiff.cc:20:8: error: memory input 3 is not directly addressable
However, it DOES work if TWO_PI is declared static (but not static
const). I also tried compiling with -fkeep-static-consts, which
didn't help either.
> "Terje Mathisen" <"terje.mathisen at tmsw.no"@giganews.com> ha scritto nel
> messaggio news:iqe6u8-os52.ln1@ntp6.tmsw.no...
>>> inline double delta(double th1, th2)
> why not 2 loops [one for normalize delta >pi and one for < -pi]?
> where i32 is int signed 32 bit
> and d64 is double 64 bits
> i check every possible error with this last?
> how many error do someone of you find in that?
> how many UB do you find if the last is one C program?