Help with Hand-Optimized Assembly

Bill Woessner

unread,

Jan 12, 2012, 4:38:44 PM1/12/12

to

I'm a 100% total newbie at writing assembly. But I figured it would
be a good exercise. And besides, this tiny chunk of code is
definitely in the critical path of something I'm working on. Any and
all advice would be appreciated.

I'm trying to rewrite the following function in x86 assembly:

inline double DiffAngle(double theta1, double theta2)
{
double delta(theta1 - theta2);

return std::abs(delta) <= M_PI ? delta : delta - copysign(2 * M_PI,
delta);
}

To my great surprise, I've actually been somewhat successful. Here's
what I have so far:

double DiffAngle(double theta1, double theta2)
{
asm(
"fldl 4(%esp);"
"fsubl 12(%esp);"
"fxam;"
"fnstsw %ax;"
"fldl TWO_PI;"
"testb $2, %ah;"
"fldl NEG_TWO_PI;"
"fcmovne %st(1), %st;"
"fstp %st(1);"
"fsubr %st(1), %st;"
"fldpi;"
"fld %st(2);"
"fabs;"
"fcomip %st(1), %st;"
"fstp %st(0);"
"fcmovbe %st(1), %st;"
"fstp %st(1);"
"rep;"
"ret;"
"NEG_TWO_PI:;"
".long 1413754136;"
".long 1075388923;"
"TWO_PI:;"
".long 1413754136;"
".long -1072094725;"
);
}

This compiles, runs and produces the correct answers. But I have a
few issues with it:

1) If I declare this function inline, it gives me garbage (like
10^-304)
2) If I compile with -Wall, I get a warning that the function doesn't
return a value, which is absolutely true, but I don't know how to fix
it.
3) I don't like how TWO_PI and NEG_TWO_PI are defined. I had to steal
it from some generated assembly. It would be nice to use M_PI,
4*atan(1) or something like that.

Thanks in advance,
Bill

James Van Buskirk

unread,

Jan 12, 2012, 5:37:49 PM1/12/12

to

"Bill Woessner" <woes...@nospicedham.gmail.com> wrote in message
news:67ddafac-ae03-4ef1...@i26g2000vbt.googlegroups.com...

> This compiles, runs and produces the correct answers. But I have a
> few issues with it:

> 1) If I declare this function inline, it gives me garbage (like
> 10^-304)
> 2) If I compile with -Wall, I get a warning that the function doesn't
> return a value, which is absolutely true, but I don't know how to fix
> it.
> 3) I don't like how TWO_PI and NEG_TWO_PI are defined. I had to steal
> it from some generated assembly. It would be nice to use M_PI,
> 4*atan(1) or something like that.

I can't help you with your questions because I would always write
something like this in assembly rather than C, but is there some
reason that you can't use SSE2 rather than x87 here? SSE2 should
be much faster if available in the context of your problem.

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end

Terje Mathisen

unread,

Jan 13, 2012, 4:59:50 AM1/13/12

to

James Van Buskirk wrote:
> "Bill Woessner"<woes...@nospicedham.gmail.com> wrote in message
> news:67ddafac-ae03-4ef1...@i26g2000vbt.googlegroups.com...
>
>> This compiles, runs and produces the correct answers. But I have a
>> few issues with it:
>
>> 1) If I declare this function inline, it gives me garbage (like
>> 10^-304)
>> 2) If I compile with -Wall, I get a warning that the function doesn't
>> return a value, which is absolutely true, but I don't know how to fix
>> it.
>> 3) I don't like how TWO_PI and NEG_TWO_PI are defined. I had to steal
>> it from some generated assembly. It would be nice to use M_PI,
>> 4*atan(1) or something like that.
>
> I can't help you with your questions because I would always write
> something like this in assembly rather than C, but is there some
> reason that you can't use SSE2 rather than x87 here? SSE2 should
> be much faster if available in the context of your problem.
>

I'll second James' suggestion about SSE2!

Anyway, it seem that what you are trying to do is to take the difference
between two angles and then make sure that said difference will be in
the [-pi .. pi> range, right?

I.e. what is the rotation angle to get from theta2 to theta1?

Let's start by looking at the various alternatives:

if the signs of th1 and th2 are the same, then the difference _must_ be
in range:

0 - pi = -pi
pi - 0 = pi

-0 - -pi = pi
-pi - 0 = -pi

It is only when the signs differ that you might need to add or subtract
2pi to bring it into range:

pi - -pi = 2pi
-pi - pi = -2pi

I don't see immediately how I can use this to speed it up though...

Anyway, trying your original algorithm:

movq xmm0,[theta1]

subsd xmm0,[theta2] ;; Result in [-2pi to 2pi]
movq xmm2,[plus_mask] ;; 0x7fffff...

andpd xmm2,xmm0 ;; ABS(diff), [0 to 2pi]
movq xmm3,[pi]

cmplesd xmm3,xmm2 ;; -1 mask if diff > pi
andpd xmm3,[twopi] ;; 0 or 2pi
subsd xmm2,xmm3 ;; [-pi to pi]

If the original subtraction sign was negative, then we must invert the
sign of the result:

andpd xmm0,[signbits] ;; (-0.0 , -0.0)
xorpd xmm0, xmm2

The code can be rescheduled a bit, and the mixture of 64-bit scalar and
128-bit packed operations must be checked that they don't introduce
forwarding problems, but it seems like it should run in 10-15 cycles,
depending upon the latency of the FP operations (SUBSD, CMPLESD, SUBSD)

I tried to figure out a way to use scaling and integer math, but that is
likely to be slower.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Bob Masta

unread,

Jan 13, 2012, 8:30:56 AM1/13/12

to

Can't help with your compiler interface questions, but if
you are going to stick with the FPU, you can use FLDPI to
load the value of pi.

Best regards,

Bob Masta

DAQARTA v6.02
Data AcQuisition And Real-Time Analysis
www.daqarta.com
Scope, Spectrum, Spectrogram, Sound Level Meter
Frequency Counter, FREE Signal Generator
Pitch Track, Pitch-to-MIDI
Science with your sound card!

Bill Woessner

unread,

Jan 13, 2012, 12:45:51 PM1/13/12

to

On Jan 13, 4:59 am, Terje Mathisen <"terje.mathisen at

tmsw.no"@giganews.com> wrote:
> I'll second James' suggestion about SSE2!

I'm open to using SSE2. The only reason I used x87 is that I started
with the assembly code that g++ generated. By default, it generates
x87 instructions. But I'm certainly willing to try with SSE2
instructions.

> Anyway, it seem that what you are trying to do is to take the
> difference between two angles and then make sure that said
> difference will be in the [-pi .. pi> range, right?

That's it exactly. It's such a simple thing, but I can't come up with
a really elegant way to do it. The code generated by g++ involves a
jump. But I think this should be possible without a jump by using a
conditional move.

> Anyway, trying your original algorithm:

I'll give your implementation a try. A big part of the challenge (for
me, at least) is figuring out how to get this in to a form that g++
will understand. I don't really care where or how this is
implemented, but I need to be able to call it from C++. And it really
should be inline, as well. Otherwise, all the efficiency gained by
tweaking the assembly will be lost. :-p

Thanks,
Bill

sfuerst

unread,

Jan 13, 2012, 2:43:38 PM1/13/12

to

On Jan 13, 9:45 am, Bill Woessner <woess...@nospicedham.gmail.com>
wrote:

There is a straight-forward algorithm using the fact that only one of
the bounds can be crossed...

Something like this:
(Inputs in %xmm0, and %xmm1, output in %xmm0)

subsd %xmm1,%xmm0
movsd plusM_PI(%rip), %xmm1
movsd minusM_PI(%rip), %xmm2

cmpgtsd %xmm0, %xmm1
cmpltsd %xmm0, %xmm2

andpd minus2M_PI(%rip), %xmm1
andpd plus2M_PI(%rip), %xmm2

addsd %xmm1, %xmm0
addsd %xmm2, %xmm0

I probably have some of the comparisons reversed by mistake... but you
get the idea. You can do both comparisons in parallel. Using sign
tricks doesn't seem to be profitable, as that increases the length of
the critical path.

Steven

Terje Mathisen

unread,

Jan 13, 2012, 2:55:28 PM1/13/12

to

Inline C isn't too hard to write:

inline double delta(double th1, th2)
{
static double pi = 3.14159265357989;
static double zero_or_twopi[2] = {0, 3.14159265357989*2};
static double absmask = (double) 0x7fffffffffffffff;
static double signmask = -0.0;

double diff = th1-th2;
double adiff = (double) ((int64_t) diff & (int64_t) absmask);
int toolarge = adiff > pi;

adiff -= zero_or_twopi[toolarge];

adiff = (double) ((int64_t) adiff ^
((int64_t) signmask & (int64_t) diff));

return adiff;

Terje Mathisen

unread,

Jan 13, 2012, 3:41:07 PM1/13/12

to

sfuerst wrote:
> There is a straight-forward algorithm using the fact that only one of
> the bounds can be crossed...
>
> Something like this:
> (Inputs in %xmm0, and %xmm1, output in %xmm0)
>
> subsd %xmm1,%xmm0
> movsd plusM_PI(%rip), %xmm1
> movsd minusM_PI(%rip), %xmm2
>
> cmpgtsd %xmm0, %xmm1
> cmpltsd %xmm0, %xmm2
>
> andpd minus2M_PI(%rip), %xmm1
> andpd plus2M_PI(%rip), %xmm2
>
> addsd %xmm1, %xmm0
> addsd %xmm2, %xmm0
>
> I probably have some of the comparisons reversed by mistake... but you
> get the idea. You can do both comparisons in parallel. Using sign
> tricks doesn't seem to be profitable, as that increases the length of
> the critical path.

Very nice, and definitely much better than my approach!
:-)

James Van Buskirk

unread,

Jan 13, 2012, 6:30:46 PM1/13/12

to

"Terje Mathisen" <"terje.mathisen at tmsw.no"@giganews.com> wrote in message
news:5gh6u8-...@ntp6.tmsw.no...

> sfuerst wrote:
>> There is a straight-forward algorithm using the fact that only one of
>> the bounds can be crossed...

>> Something like this:
>> (Inputs in %xmm0, and %xmm1, output in %xmm0)

>> subsd %xmm1,%xmm0
>> movsd plusM_PI(%rip), %xmm1
>> movsd minusM_PI(%rip), %xmm2

>> cmpgtsd %xmm0, %xmm1
>> cmpltsd %xmm0, %xmm2

>> andpd minus2M_PI(%rip), %xmm1
>> andpd plus2M_PI(%rip), %xmm2

>> addsd %xmm1, %xmm0
>> addsd %xmm2, %xmm0

>> I probably have some of the comparisons reversed by mistake... but you
>> get the idea. You can do both comparisons in parallel. Using sign
>> tricks doesn't seem to be profitable, as that increases the length of
>> the critical path.

> Very nice, and definitely much better than my approach!
> :-)

I really liked your approach more because it doesn't involve as many
loads nor as many long-latency operations like ADDSD and CMPccSD.
Looking at the above code we see four such long-latency instructions
in the path and I think we can do better with:

subsd xmm0, xmm1 ; {clock 1}
movsd xmm2, [signbits] ; -0.0 {asynchronous}
movaps xmm3, xmm2
andps xmm2, xmm0 ; sign(0.0,delta) {clock 4}
andnps xmm3, xmm0 ; abs(delta) {clock 4}
xorps xmm2, [minustwopi] ; -sign(2*pi,delta) {clock 5}
cmplesd xmm3, [pi] ; -1 or 0 {clock 5}
addsd xmm2, xmm0 ; delta-sign(2*pi,delta) {clock 6}
andps xmm0, xmm3 ; delta or 0 {clock 8}
andnps xmm3, xmm2 ; 0 or delta-sign(2*pi,delta) {clock 9}
orps xmm0, xmm3 ; delta or delta-sign(2*pi,delta) {clock 10}

James Harris

unread,

Jan 13, 2012, 7:03:48 PM1/13/12

to

On Jan 12, 9:38 pm, Bill Woessner <woess...@nospicedham.gmail.com>
wrote:

> I'm a 100% total newbie at writing assembly. But I figured it would
> be a good exercise. And besides, this tiny chunk of code is
> definitely in the critical path of something I'm working on. Any and
> all advice would be appreciated.
>
> I'm trying to rewrite the following function in x86 assembly:
>
> inline double DiffAngle(double theta1, double theta2)
> {
> double delta(theta1 - theta2);
>
> return std::abs(delta) <= M_PI ? delta : delta - copysign(2 * M_PI,
> delta);
>
> }

The gas assembler format is ghastly so I converted your code to Nasm
creating the file DiffAngle.nasm as follows. I think it's easier to
read and despite not being inlined it seemed to be faster. Time for
100 million calls:

c function: 2.8 seconds
asm func: 1.2 seconds

I've not written floating point assembly code or linked C++ with
assembly before so there's a chance something is not right. I'll list
the steps I took so it can be recreated/challenged/corrected.
Hopefully this will help with the things you asked.

First the assembly code.

;
;DiffAngle
;
;Build with such as
; nasm -f elf32 DiffAngle.nasm -l DiffAngle.list
;

bits 32
cpu ppro

%define PI 3.1415926535897932384626433832795
%define TWO_PI 6.283185307179586476925286766559
%define NEG_TWO_PI -6.283185307179586476925286766559

global DiffAngle

section .code

DiffAngle:
fld qword [esp + 4]
fsub qword [esp + 12]
fxam
fnstsw ax
fld qword [two_pi]
test ah, 2
fld qword [neg_two_pi]
fcmovne st0, st1
fstp st1
fsubr st0, st1
fldpi
fld st2
fabs
fcomip st0, st1
fstp st0
fcmovbe st0, st1
fstp st1
ret 0

section .data

two_pi: dq NEG_TWO_PI ;NB wrong value
neg_two_pi: dq TWO_PI ;NB wrong value

I'm not sure your code is right. The constants seem to be the other
way round from what they are intended to be. I had to swap them over
to get the same results as your C program but it is late.... If
someone points out some faults I'll respond another day.

> 1) If I declare this function inline, it gives me garbage (like
> 10^-304)

To try it out I made a test routine called dtest1.c. Build steps on
Linux for the whole thing were

nasm -f elf32 DiffAngle.nasm -l DiffAngle.list
g++ dtest1.c -c
g++ dtest1.o DiffAngle.o -o dtest1

> 2) If I compile with -Wall, I get a warning that the function doesn't
> return a value, which is absolutely true, but I don't know how to fix
> it.

In dtest1.c I had to include the following prototype

extern "C" {
double DiffAngle(double, double);
}

so that g++ didn't expect a mangled routine name.

> 3) I don't like how TWO_PI and NEG_TWO_PI are defined. I had to steal
> it from some generated assembly.

These can be defined much more easily in the assembly code. The "dq"
code defines what the assembler calls a quadword, 8 bytes, in
the .data section. For example,

two_pi: dq 6.283185307179586476925286766559

> It would be nice to use M_PI,
> 4*atan(1) or something like that.

I know 4*atan(1) is pi but I don't know what M_PI is supposed to be.
I've made no attempt to understand the maths of your solution; I just
copied your code. So both the blame for faults and the credit for
increased performance go to you. :-)

James

Jan Seiffert

unread,

Jan 13, 2012, 6:14:47 PM1/13/12

to

Bill Woessner schrieb:

> I'm a 100% total newbie at writing assembly. But I figured it would
> be a good exercise. And besides, this tiny chunk of code is
> definitely in the critical path of something I'm working on. Any and
> all advice would be appreciated.
>
> I'm trying to rewrite the following function in x86 assembly:
>
> inline double DiffAngle(double theta1, double theta2)
> {
> double delta(theta1 - theta2);
>
> return std::abs(delta) <= M_PI ? delta : delta - copysign(2 * M_PI,
> delta);
> }
>
> To my great surprise, I've actually been somewhat successful. Here's
> what I have so far:
>
> double DiffAngle(double theta1, double theta2)
> {
> asm(
> "fldl 4(%esp);"

Use two percent signs to create a percent sign.
Add a \n\t at the end, trust me, it helps if you want to read the generated asm.
Do not simply reference some stack location.

> "fsubl 12(%esp);"
> "fxam;"
> "fnstsw %ax;"
> "fldl TWO_PI;"
> "testb $2, %ah;"
> "fldl NEG_TWO_PI;"
> "fcmovne %st(1), %st;"
> "fstp %st(1);"
> "fsubr %st(1), %st;"
> "fldpi;"
> "fld %st(2);"
> "fabs;"
> "fcomip %st(1), %st;"
> "fstp %st(0);"
> "fcmovbe %st(1), %st;"
> "fstp %st(1);"
> "rep;"
> "ret;"

YIKES!!!
do not simply ret from your inline asm

> "NEG_TWO_PI:;"
> ".long 1413754136;"
> ".long 1075388923;"
> "TWO_PI:;"
> ".long 1413754136;"
> ".long -1072094725;"

There are other ways to get your constants.

The basic problem is:
This inline asm is missing inputs and output

> );
> }
>
> This compiles, runs and produces the correct answers. But I have a
> few issues with it:
>
> 1) If I declare this function inline, it gives me garbage (like
> 10^-304)

Which is no wonder, when the compiler inlines the function, your stack
references are totally bogus.

> 2) If I compile with -Wall, I get a warning that the function doesn't
> return a value, which is absolutely true, but I don't know how to fix
> it.

By creating an output from the inline asm and returning it from the function.

> 3) I don't like how TWO_PI and NEG_TWO_PI are defined. I had to steal
> it from some generated assembly. It would be nice to use M_PI,
> 4*atan(1) or something like that.
>

The x87 has a instruction to load pi, two pi is two loads and an add+pop. Neg is
mul -1. The trick is to create the constants, and keep them in the lower stack
register. For that you better work on an array of values (where SSE gets also
handy) because you have to leave the x87 stack the way you found it.

You have to understand how GCC handles inline ASM.
For GCC, inline ASM is just a bunch of text where GCC makes certain
substitutions (that's also why you should write every literal % as %%, % is a
substitution like in printf). It does not grok or understand your ASM. It can
make no deductions from what you write there.

To interface with the rest of the code, you have to tell GCC what you are doing
there, by means of inputs and outputs to your ASM, because the state of the
machine before and after the ASM is all GCC cares for.

Let's start simple:

asm (
"add %%eax, %%edx"
);

Is a valid inline ASM, but has the same problem as yours. The compiler will put
the txt literately into the output. If something meaningful is in eax or edx,
there is no guaranty. That GCC does not throw your ASM away (yes, GCC can
optimize ASM a little bit according to their input and output) comes from the
special rule that empty input/output/clobber means "clobbers everything".
From GCC points of view this is the same as

asm (
"yadda yadda yadda"
);

The canonical format of an inline ASM in GCC is:
asm {volatile} (
"instructions"
: {outputs}
: {inputs}
: {clobbers}
);

You do not always have to write all parts, you can leave empty last parts out.
A volatile can be used to give the compiler a little nudge, but the exact
semantic is complicated.

So for our example above one should write:
int a, b;
asm (
"add %1, %0"
: "=r" (a)
: "r" (b), "0" (a)
);

Besides of remaining problems (the compiler may see that the values are not
initialized when passed into the asm, so he may remove the asm), let's see what
we have here.

First, in the instruction section, we refer to our registers by using the number
of the operands. Counting starts at zero and goes from outputs to inputs.

Then there are the outputs. Here we have one output. with "=r" we say the
compiler he should expect an result (the =, all outputs must have an =, except
if you use special stuff like +) in an general purpose register (the r is the
code for that). In parenthesis we say that the compiler should arrange for the
variable a to be living in that register.

After that there are the inputs. First we say we want b to be in a general
purpose register. Then we say that a should be in the same spot as operand 0.

Normally you should also say GCC that you clobbered the eflags register (the
condition codes), by adding a cc clobber like so:
int a, b;
asm (
"add %1, %0"
: "=r" (a)
: "r" (b), "0" (a)
: "cc"
);
But this is implicit on x86, because it is a cc0 target.

So, now to your code.
First, you are entering a world of pain with x87. The stack based nature of the
x87 does not play well with inline ASM.
But more important: never return premature from your inline ASM...

What you want to do does more look like this:

double DiffAngle(double theta1, double theta2)
{

static double NEG_TWO_PI = -(2*3.14);
static double TWO_PI = 2*3.14;
double ret;
int t;

asm(
"fldl %2\n\t"
"fsubl %3\n\t"
"fxam\n\t"
"fnstsw %w1\n\t"
"fldl %4\n\t"
"testb $2, %h1\n\t"
"fldl %5\n\t"
"fcmovne %%st(1), %%st\n\t"
"fstp %%st(1)\n\t"
"fsubr %%st(1), %%st\n\t"
"fldpi\n\t"
"fld %%st(2)\n\t"
"fabs\n\t"
"fcomip %%st(1), %%st\n\t"
"fstp %%st(0)\n\t"
"fcmovbe %%st(1), %%st\n\t"
"fstp %%st(1)"
: /* %0 */ "=t" (ret),
/* %1 */ "=a" (t)
: /* %2 */ "m" (theta1),
/* %3 */ "m" (theta2),
/* %4 */ "m" (TWO_PI),
/* %5 */ "m" (NEG_TWO_PI)
);
return ret;
}

I did not check the calc stuff. We tell the compiler that ret will be on the top
of the x87 stack, we put a temporary into the eax register because we clobber
it, we pass in all inputs as memory operands.

This is just a little crash introduction, there are a lot of details and caveats
to inline ASM, esp. the freedoms the compiler has with your inline ASM.
For all the nitty gritty details, look into the GCC handbook under inline assembly:
http://gcc.gnu.org/onlinedocs/gcc-4.6.2/gcc/Extended-Asm.html
and the chapters after that.

Whatever you do, don't forget the basic rule: GCC does not understand your ASM,
he goes strictly by the input/output/clobber. I you lie there to GCC what your
ASM is about, don't be disappointed when GCC does not obey your wish (and
resorting to giganto "whack-over-the-head" like volatile and a memory clobber
may bite you later).
In face of a inline ASM, GCC does not drop everything on the floor, crawls into
a hole and turns on dumb mode, it can make optimization decisions based on the
inputs/outputs.
NB: GCC views that inline ASM as one single instruction, this can have some
implications.

(the whole decision to implement inline ASMs the way they are in GCC is to
prevent problems other compiler which support inline ASM have with inline ASM,
for example has to understand every single instruction (also the seldom used
system level instructions) and falls over if a new instruction is encountered,
or turn to totally dumb mode. One can use inline ASM for fancy stuff (and i like
to use it lavishly because it has some upsides), but there are limits to it.
Basically it was intended to get that single special instruction the compiler
can not generate, mainly system level stuff)

> Thanks in advance,

If i would be you, i would forget about the whole thing.
This code is still "miserable", but in this case the compiler can do nothing
about it, because he does not understand ASM. (and esp. can not do constant
elimination). In loops more optimizations would be possible, but now you have to
do them by hand. And one big bummer is that you have to write two versions, one
for i386, and one for x86_4, because the later passes float args in SSE
registers, so the compiler would copy the functions arguments first to the stack
so you can load them from there to the x87.

You not being happy with the floating point code the compiler generates probably
stems from the fact that the compiler has to be very careful when it comes to
floats (NaNs, Inf, loss of precision, exes precision, etc.). So he prop. passes
all things of to the proper lib calls when he can not proof that emitting the
direct instructions can safely be done.

Additionally i expect that you are simply compiling for an generic x86 machine,
try to play with -march=pentium3 or -msse.

You may want to try -ffast-math, this unleashes GCC to get aggressive with your
floats. If -ffast-math generates wrong results in your program (they are not
really wrong, but simply effects from "imprecise" float arithmetic covered up
when not optimizing so aggressive), with a new enough GCC you can try:
__attribute__((optimize("fast-math")))
inline double DiffAngle_o(double theta1, double theta2)

{
double delta(theta1 - theta2);

return std::abs(delta) <= M_PI ?
delta : delta - copysign(2 * M_PI,delta);
}

to only make this single function fast math.

N.B: you do have set -O? Prop. to -O2? Maybe even to -O3?

The only reason i would look into making this an asm is do diff a complete array
of angles with SSE vector instructions to calc several values at once, because,
while i love GCC, i do not trust him with vectorization.

> Bill

Greetings
Jan

Markus Wichmann

unread,

Jan 13, 2012, 7:27:12 AM1/13/12

to

That is because you actually require a real call to the function. If the
above assembly is inlined, the compiler doesn't really know where to put
the input and output variables.

I'm rewriting your C++ first, so I can put it into assembly more easily:

double DiffAngle(double theta1, double theta2)
{

double diff = theta1 - theta2;

if (abs(diff) <= M_PI)
return diff;
else if (diff < 0)
return diff + 2 * M_PI;
else
return diff - 2 * M_PI;

//Or, in a more SSE-like manor:
double subtract;
subtract = copysign(2*M_PI, diff);
if (abs(diff) <= M_PI) subtract = 0;
return diff - subtract;
}

Because you might want to rewrite the stuff anyway in SSE2, I'd change
it to something like:

double DiffAngle(double theta1, double theta2)
{

double res;
const uint64_t no_sign_mask = 0x7fffffffffffffff;
asm("movsd %1, %xmm0": : "m" (theta1) : );
asm("subsd %1, %xmm0": : "m" (theta2) : );
asm("movsd %xmm0, %xmm1" : : );
asm("movq %1, %xmm2" : : "m" (no_sign_mask) : );
asm("andpd %xmm2, %xmm0" : : ); //xmm0 = abs
asm("cmpgtsd %1, %xmm0": : "m" (M_PI) : ); //if abs(diff) <= M_PI
// %xmm0 = 0, else %xmm0 == 0xffff...
asm("movsd %1, %xmm3": : "m" (2 * M_PI) : );
asm("movsd %1, %xmm2": : "m" (~no_sign_mask) : );
asm("movsd %xmm1, %xmm4" : : : );
asm("andpd %xmm2, %xmm4" : : : );
asm("orpd %xmm4, %xmm3" : : : );
asm("andpd %xmm0, %xmm3" : : : );
asm("subsd %xmm3, %xmm1" : : :);
asm("movsd %xmm1, %0" : "=m" (res) : : );
return res;
}

Does that work for you? It's untested!

> 2) If I compile with -Wall, I get a warning that the function doesn't
> return a value, which is absolutely true, but I don't know how to fix
> it.

double ret;
asm("fld %1; fld %2; blablabla; fstp %0"
: "m" (theta1), "m" (theta2)
: "=m" (ret)
: );
return ret;

This should also clear your previous question.

> 3) I don't like how TWO_PI and NEG_TWO_PI are defined. I had to steal
> it from some generated assembly. It would be nice to use M_PI,
> 4*atan(1) or something like that.
>

Just define it as new inputs and let the compiler worry. Like:

double ret;
asm("fld %1; fld %2; blabla; fld %3; blabli; fld %4; bla; fstp %0"
: "=m" (ret)
: "m" (theta1), "m" (theta2), "m" (2*M_PI), "m" (-2*M_PI)
: );
return ret;

The "m" means "memory operand" (let the compiler worry about the
addresses!), the "=" means "write only operand".

> Thanks in advance,
> Bill

HTH,
Markus

Tim Roberts

unread,

Jan 14, 2012, 7:10:22 PM1/14/12

to

Terje Mathisen <"terje.mathisen at tmsw.no"@giganews.com> wrote:
>
>Inline C isn't too hard to write:

Have you tried this code?

>inline double delta(double th1, th2)
>{
> static double pi = 3.14159265357989;
> static double zero_or_twopi[2] = {0, 3.14159265357989*2};
> static double absmask = (double) 0x7fffffffffffffff;

That doesn't do what you think it does. That cast converts the constant to
floating point -- it doesn't just copy the bits. There are only 56 bits of
mantissa available in a double, so some of your bits get lost.

> static double signmask = -0.0;
>
> double diff = th1-th2;
> double adiff = (double) ((int64_t) diff & (int64_t) absmask);

Again here, this is going to convert diff to an integer, which means you'll
lose any bits to the right of the decimal point. It's not a bitwise copy.

To do bitwise manipulation, you need to fool the compiler:
double adiff = *(double*)(*(int64_t*)&diff & *(int64_t*)&absmask);

> adiff = (double) ((int64_t) adiff ^
> ((int64_t) signmask & (int64_t) diff));

Same problem occurs here.
--
Tim Roberts, ti...@probo.com
Providenza & Boekelheide, Inc.

Terje Mathisen

unread,

Jan 15, 2012, 8:21:55 AM1/15/12

to

Tim Roberts wrote:
> Terje Mathisen<"terje.mathisen at tmsw.no"@giganews.com> wrote:
>>
>> Inline C isn't too hard to write:
>
> Have you tried this code?
>
>> inline double delta(double th1, th2)
>> {
>> static double pi = 3.14159265357989;
>> static double zero_or_twopi[2] = {0, 3.14159265357989*2};
>> static double absmask = (double) 0x7fffffffffffffff;
>
> That doesn't do what you think it does. That cast converts the constant to
> floating point -- it doesn't just copy the bits. There are only 56 bits of
> mantissa available in a double, so some of your bits get lost.

I do know that,I was in a hurry and tried to get away with psudocode,
but I should have stated that clearly.

Thanks for correcting my omissions!

Terje--

Bill Woessner

unread,

Jan 19, 2012, 4:08:37 PM1/19/12

to

Success! Thanks to everyone who helped me out. This was a great
little project to learn about inline assembly in GCC. For those of
you who are interested, here's the final product:

inline double DiffAngle(double theta1, double theta2)
{

double TWO_PI __attribute__ ((aligned (16)))(2 * M_PI),
NEG_TWO_PI __attribute__ ((aligned (16)))(-TWO_PI);
double delta(theta1 - theta2);

asm("cmpnlesd %0, %1\n\t"
"cmpltsd %0, %2\n\t"
"andpd %3, %1\n\t"
"andpd %4, %2\n\t"
"addsd %1, %0\n\t"
"addsd %2, %0\n\t"
: "+x" (delta)
: "x" (M_PI),
"x" (-M_PI),
"m" (TWO_PI),
"m" (NEG_TWO_PI)
:
);

return delta;
}

One final question - is it possible to declare TWO_PI and NEG_TWO_PI
as const? And, if so, what constraint would I use? Right now, if I
declare them const, I get the error "memory input 3 is not directly
addressable".

Thanks,
Bill

io_x

unread,

Jan 20, 2012, 4:48:34 AM1/20/12

to

"Terje Mathisen" <"terje.mathisen at tmsw.no"@giganews.com> ha scritto nel
messaggio news:iqe6u8-...@ntp6.tmsw.no...

>> inline double delta(double th1, th2)

why not 2 loops [one for normalize delta >pi and one for < -pi]?

difPimPi(double x, y)
x-=y |>?#.2
.1: x>=-pi#.3|x+=2*pi|#.1
.2: x< +pi#.3|x-=2*pi|#.2
.3:
ret

or someting in one 'pseudocode' good enought for
represent the problem...

i32 difPimPi(d64 *r, x, y)
{ if(MemIsOut(r)) goto .e;
if(IsNanOrErr(x)||IsNanOrErr(y))
.e: return -1;
x-=y
if(CF()) goto .e;
if(x>0) goto .2; // or better if(SignFlag()>0) goto .2
.1: if(x>=-pi) goto .3; x+=2*pi; goto .1;
.2: if(x< pi) goto .3; x-=2*pi; goto .2;
.3: *r=x; return 0;
}

where i32 is int signed 32 bit
and d64 is double 64 bits

i check every possible error with this last?
how many error do someone of you find in that?
how many UB do you find if the last is one C program?

sfuerst

unread,

Jan 20, 2012, 1:28:04 PM1/20/12

to

On Jan 19, 1:08 pm, Bill Woessner <woess...@nospicedham.gmail.com>
wrote:

This works for me:

#include <math.h>

double DiffAngle(double theta1, double theta2)
{

static const double TWO_PI __attribute__ ((aligned (16))) = 2 * M_PI;
double delta = theta1 - theta2;

asm("cmpnlesd %0, %1\n"

"cmpltsd %0, %2\n"
"andpd %3, %1\n"
"andpd %3, %2\n"
"addsd %1, %0\n"
"subsd %2, %0\n"

: "+x" (delta)
: "x" (M_PI), "x" (-M_PI), "m" (TWO_PI)

:
);

return delta;
}

(Using another trick, where the variable containing -2pi can be
replaced by a subtraction of +2pi.)

Steven

Bill Woessner

unread,

Jan 20, 2012, 2:50:49 PM1/20/12

to

It appears to be a C vs. C++ issue. If I compile the code with gcc
(and a .c extension), it works fine. If I compile with g++, it fails
with this error:

AngleDiff.cc: In function ‘double DiffAngle(double, double)’:
AngleDiff.cc:20:8: error: memory input 3 is not directly addressable

However, it DOES work if TWO_PI is declared static (but not static
const). I also tried compiling with -fkeep-static-consts, which
didn't help either.

Thanks,
Bill

io_x

unread,

Jan 20, 2012, 4:51:21 PM1/20/12

to

"io_x" <a...@b.c.invalid> ha scritto nel messaggio
news:4f193789$0$1389$4faf...@reader1.news.tin.it...

>
> "Terje Mathisen" <"terje.mathisen at tmsw.no"@giganews.com> ha scritto nel
> messaggio news:iqe6u8-...@ntp6.tmsw.no...
>>> inline double delta(double th1, th2)
>
> why not 2 loops [one for normalize delta >pi and one for < -pi]?
>
> difPimPi(double x, y)
> x-=y |>?#.2
> .1: x>=-pi#.3|x+=2*pi|#.1
> .2: x< +pi#.3|x-=2*pi|#.2
> .3:
> ret
>
> or someting in one 'pseudocode' good enought for
> represent the problem...
>
> i32 difPimPi(d64 *r, x, y)
> { if(MemIsOut(r)) goto .e;

better than above
if(MemIsOut(r, sizeof( d64))) goto .e;
is difficult to know if a range is in the process?