routines using MMX / XMM - save/restore MMX/XMM registers

Lars Erdmann

Mar 10, 2012, 10:39:13 AM

Hello,

if I want to write a routine that uses MMX registers:
is it enough to execute an "emms" on return, or do I also need an "fsave" on
entry and an "frstor" on return from the function?

Likewise for XMM registers:
do I need an "fxsave" on entry and an "fxrstor" on return from the
function?

Or is it sufficient that the OS does the save/restore on context switch
(only)?

James Van Buskirk

Mar 10, 2012, 11:02:29 AM

"Lars Erdmann" <lars.e...@nospicedham.arcor.de> wrote in message
news:4f5b75a1$0$7611$9b4e...@newsspool1.arcor-online.net...

This kind of stuff depends on the calling convention you are trying
to follow. I recommend:

http://www.agner.org/optimize/calling_conventions.pdf

--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end

Markus Wichmann

Mar 11, 2012, 5:02:55 AM

OK, you are mixing up two things here:

- For one thing the OS preserves your registers for you. You do not need
to think about it, because you have no way to know a context switch is
going to happen, or has happened. Just hope that the OS recognizes that
your program is using those registers and saves them. Look up your OS's
ABI for more info (there could be OSes that require you to call finit or
emms before using MMX, XMM or the FPU, but I haven't seen those before).

- For another thing, preservation of registers across function calls:
The System V ABI for x86_64 specifies that no FPU, MMX or XMM register
is preserved across function calls. That is, all registers _but_ the
control words (control part of mxcsr and the x87 CW). However, the
kernel is not allowed to alter the state of those units in system calls,
so syscalls do preserve FPU, MMX and XMM state.
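To make the "control words are preserved" part concrete, here is a minimal C sketch, assuming an x86 compiler that provides the SSE intrinsics `_mm_getcsr` and `_mm_setcsr` from `<xmmintrin.h>`. The helper names `read_mxcsr`, `write_mxcsr` and `run_with_ftz` are made up for illustration, and the non-SSE branch only simulates the register so the file still builds elsewhere.

```c
#if defined(__SSE__)
#include <xmmintrin.h>
static unsigned read_mxcsr(void)    { return _mm_getcsr(); }
static void write_mxcsr(unsigned v) { _mm_setcsr(v); }
#else
/* Assumption: no SSE available, so simulate MXCSR with its reset value. */
static unsigned fake_mxcsr = 0x1F80u;
static unsigned read_mxcsr(void)    { return fake_mxcsr; }
static void write_mxcsr(unsigned v) { fake_mxcsr = v; }
#endif

#define MXCSR_FTZ 0x8000u   /* flush-to-zero control bit (bit 15) */

/* Hypothetical callee: changes a control bit for its own work and puts the
   caller's MXCSR back before returning, as the ABI expects of a callee.
   Returns the MXCSR value that was active inside the routine. */
unsigned run_with_ftz(void)
{
    unsigned saved = read_mxcsr();      /* remember the caller's settings */
    write_mxcsr(saved | MXCSR_FTZ);     /* private setting for this routine */
    unsigned inside = read_mxcsr();     /* ... real work would go here ... */
    write_mxcsr(saved);                 /* restore before returning */
    return inside;
}
```

The data registers (XMM0-XMM15, the x87 stack) need no such treatment under this ABI; only the control settings have to look untouched to the caller.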

Also, the System V ABI says that the first 8 floats and doubles passed to
a function as arguments go into xmm0-xmm7, that xmm0 is used to return a
float or double value, and that st(0) is used to return a long double.
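As I read the AMD64 supplement, it is xmm0 through xmm7 that carry floating-point arguments, and anything beyond that goes on the stack. A tiny C sketch of that rule follows; `sse_arg_location` is a made-up helper, and it only counts float/double arguments (integer arguments are assigned to general purpose registers independently).

```c
#include <stddef.h>

/* Where the n-th (0-based) float/double argument is passed under the
   System V x86-64 convention: xmm0..xmm7 first, then the stack.
   Returns a pointer to a static string. */
const char *sse_arg_location(size_t n)
{
    static const char *const regs[8] = {
        "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7"
    };
    return (n < 8) ? regs[n] : "stack";
}
```

So for a hypothetical `double f(double a, double b)`, `a` arrives in xmm0, `b` in xmm1, and the result comes back in xmm0.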

HTH,
Markus

Lars Erdmann

Mar 11, 2012, 10:19:41 AM

Hello,

"Markus Wichmann" <null...@nospicedham.gmx.net> wrote in message
news:0b6u29-...@voyager.wichi.de.vu...

> On 10.03.2012 16:39, Lars Erdmann wrote:
>> Hello,
>>
>> if I wanted to write a routine using MMX registers:
>> is it enough to do an "emms" on return or do I also need to do an "fsave"
>> on
>> entry and an "frstor" on return from the function ?
>>
>> likewise for XMM registers:
>> do I need to do an "fxsave" on entry and "fxrstor" on return from the
>> function ?
>>
>> Or is it sufficient if the OS does the save/restore on context switch
>> (only)
>> ?
>>
>>
>>
>

> - For another thing, preservation of registers across function calls:
> The System V ABI for x86_64 specifies that no FPU, MMX or XMM register
> is preserved across function calls. That is, all registers _but_ the
> control words (control part of mxcsr and the x87 CW). However, the
> kernel is not allowed to alter the state of those units in system calls,
> so syscalls do preserve FPU, MMX and XMM state.

What is the "System V ABI"? Does it depend on any specific compiler (I'd
think so)?

Anyway, this is not what:
http://www.agner.org/optimize/calling_conventions.pdf
says.

In fact it says, depending on the compiler used, that with the Microsoft
64-bit compiler the callee has to save XMM6-XMM15.
And Microsoft apparently changed its mind along the way: at first you only
needed to preserve the lower 64 bits of XMM6-XMM15; with the newer
compilers you need to preserve all 128 bits of XMM6-XMM15.

For 64-bit Linux (gcc compiler) this is different: for that compiler,
XMM0-XMM15 are treated as scratch registers (the compiler does not expect
them to be preserved across function calls).

I am beginning to believe that either I have to write the routines to
fulfill the "most restrictive preserve rule" of all the compilers they are
to be used with, or I restrict myself to a single compiler.

Fortunately I only care about 32-bit compilers, and all of the listed ones
treat ST(0)-ST(7), XMM0-XMM7 and YMM0-YMM7 as scratch registers (no need to
preserve them).

In short: for me, there is no need to do FSAVE/FRSTOR (MMX),
FXSAVE/FXRSTOR (SSE) or XSAVE/XRSTOR (AVX) in an assembly routine.
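What still remains for MMX is the "emms" itself: the MM registers alias the x87 stack, and the callers expect that stack to be empty again on return. A minimal sketch in C, assuming a compiler that ships the MMX intrinsics in `<mmintrin.h>` (`_mm_add_pi16`, `_mm_empty`); the routine name `add4_i16` is made up, and the scalar branch is only there so the file builds on targets without MMX.

```c
#include <stdint.h>
#include <string.h>
#if defined(__MMX__)
#include <mmintrin.h>
#endif

/* Hypothetical routine: adds four 16-bit lanes, out[i] = a[i] + b[i]. */
void add4_i16(const int16_t a[4], const int16_t b[4], int16_t out[4])
{
#if defined(__MMX__)
    __m64 va, vb, vr;
    memcpy(&va, a, sizeof va);      /* load 8 bytes into an MMX value */
    memcpy(&vb, b, sizeof vb);
    vr = _mm_add_pi16(va, vb);      /* PADDW: wrapping 16-bit adds */
    memcpy(out, &vr, sizeof vr);    /* store the result before EMMS */
    _mm_empty();                    /* EMMS: hand the x87 stack back empty */
#else
    /* Fallback without MMX: plain scalar adds. */
    for (int i = 0; i < 4; i++)
        out[i] = (int16_t)(a[i] + b[i]);
#endif
}
```

No FSAVE/FRSTOR pair appears anywhere: nothing is saved on entry, and the only cleanup is the single EMMS before returning.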

Lars

Markus Wichmann

Mar 11, 2012, 4:57:05 PM

On 11.03.2012 15:19, Lars Erdmann wrote:
> Hello,
>
> "Markus Wichmann" <null...@nospicedham.gmx.net> wrote in message
> news:0b6u29-...@voyager.wichi.de.vu...

>> - For another thing, preservation of registers across function calls:
>> The System V ABI for x86_64 specifies that no FPU, MMX or XMM register
>> is preserved across function calls. That is, all registers _but_ the
>> control words (control part of mxcsr and the x87 CW). However, the
>> kernel is not allowed to alter the state of those units in system calls,
>> so syscalls do preserve FPU, MMX and XMM state.
>
> What is the "System V ABI" ? Does it depend on any specific compiler (I'd
> think so) ?
>

It is exactly what it says on the tin! It is an ABI several OSes use, in
particular Unix-like OSes such as Linux, BSD, and I think Mac OS X, too.

There is a main ABI document which describes ELF and the basic premises,
and it has been amended for specific processor architectures, including,
but not limited to, x86 and x86_64.

I should not think it is compiler dependent. Rather it is architecture
dependent.

The main document as well as its i386 and MIPS supplements can be found
at SCO (better download them while you still can):
<http://www.sco.com/developers/devspecs/>

The AMD64 supplement is available at <http://www.x86-64.org>.

> Anyway, this is not what:
> http://www.agner.org/optimize/calling_conventions.pdf
> says.
>
> In fact, it says, obviously depending on compiler used, that in conjunction
> with
> the Microsoft 64-bit compiler the callee has to save XMM6-XMM15.
> And for that case Microsoft obviously changed its mind along the way: first
> you only needed to save
> the lower 64 bits of XMM6-XMM15, with the newer compilers you now need to
> preserve all
> 128 bits of XMM6-XMM15.
>

Well, yeah, that's M$ for you. That's why the Unix people had a public
discussion about the ABI before releasing the new architecture.

> For 64-bit Linux (gcc compiler) this is obviously different as for that
> compiler,
> XMM0 - XMM15 are treated as scratch registers (the compiler does not expect
> them to be preserved across function calls).
>

It shouldn't, as the ABI states that those registers are caller-saved.

>[...]
> Lars
>

Ciao,
Markus

Bernhard Schornak

Mar 12, 2012, 8:57:47 AM

Markus Wichmann wrote:

<snip>

>> In fact, it says, obviously depending on compiler used, that in conjunction
>> with
>> the Microsoft 64-bit compiler the callee has to save XMM6-XMM15.
>> And for that case Microsoft obviously changed its mind along the way: first
>> you only needed to save
>> the lower 64 bits of XMM6-XMM15, with the newer compilers you now need to
>> preserve all
>> 128 bits of XMM6-XMM15.
>>
>
> Well, yeah, that's M$ for you. That's why the Unix people had a public
> discussion about the ABI before releasing the new architecture.
>
>> For 64-bit Linux (gcc compiler) this is obviously different as for that
>> compiler,
>> XMM0 - XMM15 are treated as scratch registers (the compiler does not expect
>> them to be preserved across function calls).
>>
>
> It shouldn't, as the ABI states that those registers are caller-saved.

What a great improvement!

...
movdqa %xmm8, %xmm9
save ALL registers # except the four callee saved ones...
call ABI_conforming_function # returning XMM0 as result
RESTORE ALL registers
pand %xmm0, %xmm8
save ALL registers # except the four callee saved ones...
call ABI_conforming_function # returning XMM0 as result
RESTORE ALL registers
por %xmm0, %xmm9
paddb %xmm8, %xmm9
...

Due to forced saving/reloading, System V compliant code
is ~220 percent slower than its Windows-64 counterpart,
where only 12 (rather than 28) registers must be saved/
restored. Fortunately, these save/restore orgies can be
'hidden' in wrappers... ;)

Greetings from Augsburg

Bernhard Schornak

Markus Wichmann

Mar 12, 2012, 3:44:51 PM

On 12.03.2012 13:57, Bernhard Schornak wrote:
>
> What a great improvement!
>

Over what exactly? I mean, at least it's stable, which is more than can
be said about the Win64 ABI.

>
> ...
> movdqa %xmm8, %xmm9
> save ALL registers # except the four callee saved ones...
> call ABI_conforming_function # returning XMM0 as result
> RESTORE ALL registers
> pand %xmm0, %xmm8
> save ALL registers # except the four callee saved ones...
> call ABI_conforming_function # returning XMM0 as result
> RESTORE ALL registers
> por %xmm0, %xmm9
> paddb %xmm8, %xmm9
> ...
>
> Due to forced saving/reloading, System V compliant code
> is ~220 percent slower than its Windows-64 counterpart,
> where only 12 (rather than 28) registers must be saved/
> restored. Fortunately, these save/restore orgies can be
> 'hidden' in wrappers... ;)
>

How often do you encounter a situation in which you have 28, or even 12,
floating-point registers allocated at the same time? And _then_ have to
call an external function! (I can understand heavy usage when doing some
heavy calculations, but usually you want to call external functions only
before or after those.)

Also, which 28 registers? There are only 16 XMM registers and code which
uses both XMM and FPU registers that intensively is _really_ rare. (I
know in the past I have written code for libm that moves its input from
xmm0 to st0 and the result back again. It looked like this in the end:

global sin, sinl, sinf
extern __abi_wrap_1arg_d, __abi_wrap_1arg_f
%ifdef PIC
%define WRAPD __abi_wrap_1arg_d wrt ..plt
%define WRAPF __abi_wrap_1arg_f wrt ..plt
%else
%define WRAPD __abi_wrap_1arg_d
%define WRAPF __abi_wrap_1arg_f
%endif
section .text

sin:
        lea rax, [rel sinm]     ;RIP-relative address of the x87 core
        jmp WRAPD

sinf:
        lea rax, [rel sinm]
        jmp WRAPF

sinl:
        fld tword [rsp+8]       ;long double argument lives on the stack
        ;FALL THROUGH

sinm:
        fsin
        fnstsw ax
        sahf                    ;C2 ends up in PF
        jnp .r                  ;C2 clear: fsin produced a result
;if we got here, the input magnitude is too big to calculate a sine
;no need to do the fprem1 trick, because the necessary information
;isn't in the input any more. I just return zero now.
        fldz
        fstp st1
.r:     ret

And in another file:

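global __abi_wrap_1arg_d, __abi_wrap_1arg_f
section .text
__abi_wrap_1arg_d:
        sub rsp, 8              ;scratch slot, also realigns the stack
        movsd [rsp], xmm0       ;argument: xmm0 -> memory -> st0
        fld qword [rsp]
        call rax                ;rax = x87 routine, result in st0
        fstp qword [rsp]        ;result: st0 -> memory -> xmm0
        movsd xmm0, [rsp]
        add rsp, 8
        ret

__abi_wrap_1arg_f:
        sub rsp, 8              ;8 rather than 4 keeps rsp 16-byte aligned
        movss [rsp], xmm0
        fld dword [rsp]
        call rax
        fstp dword [rsp]
        movss xmm0, [rsp]
        add rsp, 8
        ret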

See? Both XMM and FPU registers in use, but not really in abundance.)

>
> Greetings from Augsburg
>
> Bernhard Schornak

Ciao,
Markus

Bernhard Schornak

Mar 12, 2012, 7:49:00 PM

Markus Wichmann wrote:

> On 12.03.2012 13:57, Bernhard Schornak wrote:
>>
>> What a great improvement!
>>
>
> Over what exactly? I mean, at least it's stable, which is more than can
> be said about the Win64 ABI.

Which I do not like, either. On the other hand, it is
(against my own expectation) almost as stable as OS/2
(IMHO the best OS ever - until IBM gave it up...).

16 GPRs + 16 XMMs = 32. As you listed, 28 of them are
declared as 'volatile' in *nix systems, while only 12
are declared as 'volatile' in Win-64. It surely is no
issue in functions like those you posted, but some of
my functions (e.g. my DBE core) have a thousand or more
lines of contiguous code. The DBE automatically resizes
memory blocks if new dynamic strings exceed the
currently allocated block size. Allocation requires a
call to a 'dirty' API function, where six GPRs and six
XMM registers are overwritten with garbage. Without a
wrapper (doing some more work than just calling dirty
API functions), I would have to reload eight of those
twelve registers at that point. This topic has many more
ill side effects, sufficient to fill entire books...

...
shrq $0x08, %r14 # r14 = sig
shrq $0x19, %r15 # r15 = sep
movl $0x0D, %ebx # RBX = loop_cnt
movq %rdi, %rcx # RCX = HNWD dlg
movl $0x1500, %edx # RDX = ID
xorq %r8, %r8 # R08 = FALSE
andq $0x01, %r14 # r14 = sig BOOL
andq $0x03, %r15 # r15 = sep INDEX
0:call _SBtn
incl %edx
decl %ebx
jns 0b
xorl %eax, %eax # RAX = 0
decq %r15 # R15 = -1, 1, 2
cmovs %eax, %r15d # R15 = 0, 1, 2
movl $0x03, %ebx # RBX = loop_cnt
movl $0x1515, %edx # RDX = ID
1:call _SBtn
incl %edx
decl %ebx
jne 1b
...

A snippet out of a dialog procedure, 'clicking' radio
buttons. Without a wrapper, the Win-64 version would
look like this:

...
0:pushq %rcx
pushq %rdx
pushq %r8
pushq %r9
pushq %r10
pushq %r11
call *__imp__SendDlgItemMessageA(%rip)
popq %r11
popq %r10
popq %r9
popq %r8
popq %rdx
popq %rcx
...

For System V, R12...R15 would have to be saved and
restored, as well.

Just have a look at what GCC emits as output for AS to
get a clue how much -unnecessary- code could be saved
with 'clean' ABIs. You can reduce GCC's code by about
40 percent (running at least twice as fast) with some
simple changes, providing a 'clean' environment...

Markus Wichmann

Mar 13, 2012, 8:29:10 AM

I take it that's compiled code?

In that case, the compiler decided that saving and restoring everything
is easier than just not using callee-saved registers. That, or the
compiler writers became sloppy. As I recall, there are patches for the
GCC to provide a code generator for AVR32. Those patches were made by
Atmel, the manufacturer of AVR32s. Since AVR32s are RISC processors,
they have a shitload of registers, but those patches look like they were
made on the spot, so the support for interrupt handlers had to be hacked
in. Now any interrupt handler made by GCC starts with

push r0
push r1
push r2
.
.
.

And ends in

.
.
.
pop r2
pop r1
pop r0

Regardless of whether those registers were used in the routine.

Of course the professor in my microcontroller class misunderstood that
and flagged GCC as a bad product for doing it that way, showing off
some commercial compiler only the Uni could afford. Which then displayed
a remarkable lack of optimization later, when it would generate more
efficient code for

if (x & 1)
x &= ~1;
else
x |= 1;

than for

x ^= 1;
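The two forms really are equivalent, which is what makes that compiler's preference so odd. A small C sketch for anyone who wants to check it; the function names are mine, and unsigned is used so the bit operations are well defined for every input.

```c
/* Branchy form from above: clear bit 0 if it is set, otherwise set it. */
unsigned toggle_bit0_branchy(unsigned x)
{
    if (x & 1u)
        x &= ~1u;
    else
        x |= 1u;
    return x;
}

/* Single XOR that flips bit 0 and leaves all other bits alone. */
unsigned toggle_bit0_xor(unsigned x)
{
    return x ^ 1u;
}
```

Both functions flip only the lowest bit, so an optimizer would ideally emit the same single instruction for either.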

>
> For System V, R12...R15 had to be saved and restored,
> as well.
>

For SysV, a callee has to save rbx, rsp, rbp, r12-r15, part of mxcsr and
the x87 CW. That's it. As always, all that stuff only has to be saved if
it could be changed within the callee. Now that leaves the compiler
wiggle room when compiling a leaf function, since rax, rcx, rdx, rsi,
rdi and r8-r11 can be used without having to be saved.
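For comparison, here is the same list written down as a small C lookup, next to the Win64 rules as I understand them from the published conventions (callee-saved there: rbx, rbp, rsi, rdi, rsp, r12-r15 and xmm6-xmm15). All type and function names are made up for this sketch, and it covers only the 16 GPRs and 16 XMM registers.

```c
#include <stdbool.h>

enum abi { ABI_SYSV_AMD64, ABI_WIN64 };

/* GPRs in hardware encoding order, so R8..R15 have the values 8..15. */
enum gpr { RAX, RCX, RDX, RBX, RSP, RBP, RSI, RDI,
           R8, R9, R10, R11, R12, R13, R14, R15 };

bool gpr_is_callee_saved(enum abi abi, enum gpr r)
{
    switch (r) {
    case RBX: case RSP: case RBP:
    case R12: case R13: case R14: case R15:
        return true;                  /* preserved under both conventions */
    case RSI: case RDI:
        return abi == ABI_WIN64;      /* argument registers under System V */
    default:
        return false;                 /* rax, rcx, rdx, r8-r11: scratch */
    }
}

/* n is the XMM register number, 0..15. */
bool xmm_is_callee_saved(enum abi abi, int n)
{
    return abi == ABI_WIN64 && n >= 6 && n <= 15;
}

/* Count scratch (caller-saved) registers among 16 GPRs and 16 XMMs. */
int count_scratch(enum abi abi)
{
    int count = 0;
    for (int i = 0; i < 16; i++) {
        if (!gpr_is_callee_saved(abi, (enum gpr)i)) count++;
        if (!xmm_is_callee_saved(abi, i)) count++;
    }
    return count;
}
```

By this count System V leaves 25 of the 32 registers as scratch and Win64 leaves 13, which is close to, but not exactly, the 28 and 12 quoted earlier in the thread.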

> Just have a look at what GCC emits as output for AS to
> get a clue how much -unnecessary- code could be saved
> with 'clean' ABIs. You can reduce GCC's code by about
> 40 percent (running at least twice as fast) with some
> simple changes, providing a 'clean' environment...
>

GCC usually generates more streamlined code than I would given the same
task. Moreover, it manages to create code for the FPU! I routinely give
up once I have to manage more than four stack registers in an iterative
algorithm...

Did you try -Os or -O3? GCC with explicitly disabled optimization (-O0)
generates horrible code, I know (last I checked, it was usually littered
with

mov eax, eax

and the like), but you can't fault a compiler with optimizers off for
generating suboptimal code, now, can you?

>
> Greetings from Augsburg
>
> Bernhard Schornak

Ciao,
Markus

Bernhard Schornak

Mar 14, 2012, 7:35:43 AM

No. It is just a worst case assumption. The pushes are
faster with write combining, but the pops (or reads) are
time consuming if the contents of those registers have
to be present immediately.

Actually, GCC preserves required registers -somewhere-
on the stack and reloads them (over and over, again).

> In that case, the compiler decided that saving and restoring everything
> is easier than just not using callee-saved registers. That, or the
> compiler writers became sloppy. As I recall, there are patches for the
> GCC to provide a code generator for AVR32. Those patches were made by
> Atmel, the manufacturer of AVR32s. Since AVR32s are RISC processors,
> they have a shitload of registers, but those patches look like they were
> made on the spot, so the support for interrupt handlers had to be hacked
> in. Now any interrupt handler made by GCC starts with
>
> push r0
> push r1
> push r2
> .
> .
> .
>
> And ends in
>
> .
> .
> .
> pop r2
> pop r1
> pop r0
>
> Regardless of whether those registers were used in the routine.

Depends on the AVR memory subsystem how fast reads and
writes are performed. AMD's optimisation guide lists 1
clock latency for PUSH / POP for family 15 processors,
but the memory subsystem can't access more than two 64
bit memory locations per clock.

However, the question is: Why do ABIs force this dirty
programming style? It would save tons of redundant
preserve and restore orgies if each function had to
return used registers as they were passed to it.
Permanent reloads cost much more time than preserving
and restoring used resources.

> Of course the professor in my microcontroller class misunderstood that
> and flagged GCC as a bad product for doing it that way, showing off
> some commercial compiler only the Uni could afford. Which then displayed
> a remarkable lack of optimization later, when it would generate more
> efficient code for
>
> if (x & 1)
> x &= ~1;
> else
> x |= 1;
>
> than for
>
> x ^= 1;
>
>>
>> For System V, R12...R15 had to be saved and restored,
>> as well.
>>
>
> For SysV, a callee has to save rbx, rsp, rbp, r12-r15, part of mxcsr and
> the x87 CW. That's it. As always, all that stuff only has to be saved if
> it could be changed within the callee. Now that leaves the compiler
> wiggle room when compiling a leaf function, since rax, rcx, rdx, rsi,
> rdi and r8-r11 can be used without having to be saved.

Sorry, I recalled that one wrong.

>> Just have a look at what GCC emits as output for AS to
>> get a clue how much -unnecessary- code could be saved
>> with 'clean' ABIs. You can reduce GCC's code by about
>> 40 percent (running at least twice as fast) with some
>> simple changes, providing a 'clean' environment...
>>
>
> GCC usually generates more streamlined code than I would given the same
> task. Moreover, it manages to create code for the FPU! I routinely give
> up once I have to manage more than four stack registers in an iterative
> algorithm...
>
> Did you try -Os or -O3? GCC with explicitly disabled optimization (-O0)
> generates horrible code, I know (last I checked, it was usually littered
> with
>
> mov eax, eax
>
> and the like), but you can't fault a compiler with optimizers off for
> generating suboptimal code, now, can you?

I use GCC just as a 'frontend' for AS - all of my code
is AT&T style assembler. But I compile example code if
I don't understand how something works, so I have seen
-some- GCC output throughout the years. Switching from
DOS to OS/2 in 1993, good old A86 didn't work anymore,
so I had to switch to GCC/2, as well...