aarch64 - 64-Bit ARM - return by value - GNU g++ inline assembler

Frederick Virchanza Gotham

Jul 17, 2023, 10:24:26 AM
I'm trying to write a very simple function in two or three aarch64 instructions as 'inline assembler' inside a C++ source file.

With the aarch64 calling convention on Linux, if a function returns a very large struct by value, then the address of where to store the return value is passed in the X8 register. This is unusual as far as calling conventions go. Most other calling conventions, for example System V x86_64, Microsoft x64, cdecl, stdcall and arm32, pass the address of the return value as a hidden first parameter. So for example with x86_64 on Linux, the RDI register contains the address of where to store the very large struct.
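
As a rough illustration of the hidden return address described above, the following sketch makes it observable from plain C++. It assumes C++17 or later, and 'Big', 'make_big' and 'check_return_slot' are hypothetical names that do not come from the program in this post. Because 'Big' can be neither copied nor moved, the callee has to construct the object directly in the caller's storage, which is the address the ABI hands over in X8 on aarch64 or in RDI on x86_64 Linux.

```cpp
#include <cassert>

// Hypothetical type for illustration: large, and neither copyable nor
// movable, so it can only be returned via guaranteed copy elision (C++17).
struct Big {
    const Big* constructed_at;  // address the constructor saw as 'this'
    char payload[256];          // big enough to be returned in memory

    Big() : constructed_at(this), payload{} {}
    Big(const Big&) = delete;
    Big(Big&&) = delete;
};

// Returns a prvalue: the constructor runs directly on the caller's storage.
Big make_big() { return Big(); }

// Checks that the callee built the object at the caller's address.
void check_return_slot() {
    Big result = make_big();
    assert(result.constructed_at == &result);
}
```

Calling check_return_slot() from main should pass on any conforming C++17 compiler, whichever register the platform uses to carry the address.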

I want to try to emulate this behaviour on aarch64 on Linux. When my assembler function is entered, I want it to do two things:
(1) Put the address of the indirect return object into the first parameter register, i.e. move X8 to X0
(2) Jump to a location specified by a global function pointer

So here's how I think my assembler function should look:

__asm("Invoke: \n"
" mov x0, x8 \n" // move return value address into 1st parameter
" mov x9, f \n" // Load address of code into register
" br x9 \n" // Jump to code
);

I don't know what's wrong here but it doesn't work. In the following complete C++ program, I use the class 'std::mutex' as it's a good example of a class that can't be copied or moved (I am relying on mandatory Return Value Optimisation).

Here is my entire program in one C++ file. Could someone please help me write the assembler function properly? Am I supposed to be using the ADRP and LDR instructions instead of MOV?

#include <mutex> // mutex
#include <iostream> // cout, endl
using std::cout, std::endl;

void (*f)(void) = nullptr;

extern "C" void Invoke(void);

__asm("Invoke: \n"
" mov x0, x8 \n" // move return value address into 1st parameter
" mov x9, f \n" // Load address of code into register
" br x9 \n" // Jump to code
);

void Func(std::mutex *const p)
{
    cout << "Address of return value: " << p << endl;
}

int main(void)
{
    f = (void(*)(void))Func;

    auto const p = reinterpret_cast<std::mutex (*)(void)>(Invoke);

    auto retval = p();

    cout << "Address of return value: " << &retval << endl;
}

Bonita Montero

Jul 17, 2023, 11:27:49 AM
Why not simply accept the ARM ABI for that ?
For me it doesn't matter where the return value pointer is stored.

Frederick Virchanza Gotham

Jul 17, 2023, 12:14:10 PM
On Monday, July 17, 2023 at 4:27:49 PM UTC+1, Bonita Montero wrote:
>
> Why not simply accept the ARM ABI for that ?
> For me it doesn't matter where the return value pointer is stored.


I'm writing a universal header file for all operating systems, all processors and all calling conventions, to invoke functions with guaranteed elision of copy/move operations when dealing with Named Return Value Optimisation. Currently the C++ Standard only mandates elision with RVO -- but not with NRVO.
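
To make the RVO/NRVO distinction concrete, here is a minimal sketch, assuming C++17 or later. 'Tracked', 'make_unnamed' and 'make_named' are hypothetical names used only for this illustration. Returning a prvalue may not invoke the move constructor at all, whereas returning a named local is allowed to invoke it once, depending on the compiler and its flags.

```cpp
#include <cassert>

// Hypothetical type that counts how many times it has been moved.
struct Tracked {
    static inline int moves = 0;
    Tracked() = default;
    Tracked(Tracked&&) noexcept { ++moves; }
    Tracked(const Tracked&) = delete;
};

// RVO: a prvalue is returned, so since C++17 no move can take place.
Tracked make_unnamed() { return Tracked(); }

// NRVO: a named local is returned. The compiler may elide the move,
// but the Standard does not require it to.
Tracked make_named() {
    Tracked t;
    return t;
}

// Number of moves observed when initialising from make_unnamed().
int moves_for_unnamed() {
    Tracked::moves = 0;
    Tracked r = make_unnamed();
    (void)r;
    return Tracked::moves;
}

// Number of moves observed when initialising from make_named().
int moves_for_named() {
    Tracked::moves = 0;
    Tracked r = make_named();
    (void)r;
    return Tracked::moves;
}

void check_elision() {
    assert(moves_for_unnamed() == 0);  // mandated
    assert(moves_for_named() <= 1);    // 0 with NRVO, 1 without
}
```

With GCC or Clang, compiling with -fno-elide-constructors is one way to see the second count change while the first stays at zero.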

Since 99% of calling conventions work the same way, it makes sense to model my system on 99% of systems and to make aarch64 be the exception. So I will try to get the aarch64 implementation to behave like all the others. I started a thread on the Standard Proposals mailing list:

https://lists.isocpp.org/std-proposals/2023/07/7269.php

But anyway I need to figure out how to use inline assembler to get the value from the X8 register. It should be really simple but I can't get it to work.

Bonita Montero

Jul 17, 2023, 12:21:22 PM
On 17.07.2023 at 18:13, Frederick Virchanza Gotham wrote:

> I'm writing a universal header file for all operating systems, all processors and all calling conventions, to invoke functions with guaranteed elision of copy/move operations when dealing with Named Return Value Optimisation. Currently the C++ Standard only mandates elision with RVO -- but not with NRVO.

The amount of assembler code is always very small because what
the compiler can do is almost always good. So it shouldn't be
a problem to write the assembler part for each platform separately.
I do it like this, too.

> Since 99% of calling conventions work the same way, ...

This may be so from a core principles perspective, but the
implementations are very different when you look at it in
detail. Somehow what you are doing seems to me like you want
to make circles square.

Scott Lurndal

Jul 17, 2023, 12:25:42 PM
Bonita Montero <Bonita....@gmail.com> writes:
>Why not simply accept the ARM ABI for that ?
>For me it doesn't matter where the return value pointer is stored.
>
>On 17.07.2023 at 16:24, Frederick Virchanza Gotham wrote:
>
>>
>> Here is my entire program in one C++ file, could someone please help me write the assembler function properly? Am I supposed to be using the ADRP and LDR instructions instead of MOV?


Messing around like you have been in the low-level stuff using
inline assembler is a recipe for disaster. It's not recommended.

AArch64 is a typical load-store RISC architecture. The MOV instruction
supports either register to register moves or immediate to register moves.

Only LDR/STR (and variants) provide access to memory.
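
The point about MOV versus LDR can be sketched in GNU extended inline assembler. This is only an illustration under stated assumptions: 'g_target', 'sample_callee' and 'load_target' are hypothetical names, the ADRP/LDR pair assumes a GNU-compatible toolchain targeting AArch64 ELF with the variable defined in the same executable, and on any other target the function falls back to an ordinary C++ read so that the example still builds.

```cpp
#include <cassert>

// Hypothetical global function pointer standing in for 'f'.
// extern "C" keeps the symbol names plain so the assembler can refer to them.
extern "C" {
    void (*g_target)(void) = nullptr;
    void sample_callee(void) {}  // only here to provide an address to store
}

// Returns the current value of g_target.
void (*load_target())(void) {
#if defined(__aarch64__) && defined(__ELF__) && defined(__GNUC__)
    void (*p)(void);
    // MOV cannot read memory. ADRP forms the address of the 4 KiB page
    // containing g_target; LDR then adds the low 12 bits of the address
    // as an offset and performs the actual load.
    __asm__ volatile(
        "adrp %0, g_target\n\t"
        "ldr  %0, [%0, #:lo12:g_target]"
        : "=r"(p)
        :
        : "memory");
    return p;
#else
    // Assumption: on other targets we simply read the variable in C++.
    return g_target;
#endif
}

void check_load_target() {
    g_target = sample_callee;
    assert(load_target() == sample_callee);
    g_target = nullptr;
    assert(load_target() == nullptr);
}
```

The "memory" clobber is there so the compiler writes g_target to memory before the assembler reads it.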

Frederick Virchanza Gotham

Jul 17, 2023, 1:01:01 PM
On Monday, July 17, 2023, Frederick Gotham wrote:

> __asm("Invoke: \n"
> " mov x0, x8 \n" // move return value address into 1st parameter
> " mov x9, f \n" // Load address of code into register
> " br x9 \n" // Jump to code
> );


Instead of using "inline assembler" inside a C++ source file, I tried making a separate assembler file.

Here's what I have, but it still doesn't work; it still segfaults inside 'detail_Invoke':

.text

.global tl_p
.Addr_tl_p:
.xword tl_p

.global detail_Invoke
detail_Invoke:
adrp x9, [.Addr_tl_p]
ldr x9, [x9]
mov x10, x9
br x10

Scott Lurndal

Jul 17, 2023, 1:07:33 PM
Frederick Virchanza Gotham <cauldwel...@gmail.com> writes:
>On Monday, July 17, 2023, Frederick Gotham wrote:
>
>> __asm("Invoke: \n"
>> " mov x0, x8 \n" // move return value address into 1st parameter
>> " mov x9, f \n" // Load address of code into register
>> " br x9 \n" // Jump to code
>> );
>
>
>Instead of using "inline assembler" inside a C++ source file, I instead tried to make a separate assembler file.
>
>Here's what I have, but it still doesn't work, it's still segfaulting inside 'detail_Invoke':
>

Execute your application using gdb, then when it faults, examine
the faulting instruction:

(gdb) x/i $pc

Then look at the registers:

(gdb) info reg

The cause of the fault should be evident from that data.

Frederick Virchanza Gotham

Jul 17, 2023, 1:31:28 PM
On Monday, July 17, 2023 at 6:07:33 PM UTC+1, Scott Lurndal wrote:
>
> Execute your application using gdb, then when it faults, examine
> the faulting instruction:
>
> (gdb) x/i $pc
>
> Then look at the registers:
>
> (gdb) info reg
>
> The cause of the fault should be evident from that data.


My laptop has an x86_64 CPU. I'm using a cross-compiler and then running the executable in a CPU emulator. Hence can't debug.

Scott Lurndal

Jul 17, 2023, 1:51:24 PM
Frederick Virchanza Gotham <cauldwel...@gmail.com> writes:
>On Monday, July 17, 2023 at 6:07:33 PM UTC+1, Scott Lurndal wrote:
>>
>> Execute your application using gdb, then when it faults, examine
>> the faulting instruction:
>>
>> (gdb) x/i $pc
>>
>> Then look at the registers:
>>
>> (gdb) info reg
>>
>> The cause of the fault should be evident from that data.
>
>
>My laptop has an x86_64 CPU. I'm using a cross-compiler and then running the
>executable in a CPU emulator. Hence can't debug.

My day job is writing SoC emulators for AArch64 CPUs. The emulator has
the same debug capabilities as any application debugger - breakpoints,
instruction disassembly, register and memory access. I've never encountered
an emulator (AMD's SimNow!, various in-house proprietary emulators, Imperas
ARM emulators, Synopsys Virtualizer, QEMU) which didn't offer such capabilities
as key features.

Given that we boot Linux on our emulator, running gdb itself under Linux
on the emulator is another option.

Bonita Montero

Jul 17, 2023, 1:56:07 PM
I guess you aren't writing any practical code with that, just
some experiments.

Chris M. Thomasson

Jul 17, 2023, 2:30:20 PM
Make sure to have Intel syntax and GAS syntax versions. Not sure if that is
right for you; however, I needed to do that to handle different
assemblers, MASM and GAS primarily. Generally, I stayed away from inline
assembler and used externally assembled files.

Chris M. Thomasson

Jul 17, 2023, 2:34:35 PM
On 7/17/2023 10:00 AM, Frederick Virchanza Gotham wrote:
> On Monday, July 17, 2023, Frederick Gotham wrote:
>
>> __asm("Invoke: \n"
>> " mov x0, x8 \n" // move return value address into 1st parameter
>> " mov x9, f \n" // Load address of code into register
>> " br x9 \n" // Jump to code
>> );
>
>
> Instead of using "inline assembler" inside a C++ source file, I instead tried to make a separate assembler file.
[...]

Imvho, this is a better route. Worked fine for me:

GAS AT&T syntax

http://web.archive.org/web/20060214112345/http://appcore.home.comcast.net/appcore/src/cpu/i686/ac_i686_gcc_asm.html


MASM Intel syntax

http://web.archive.org/web/20060214112539/http://appcore.home.comcast.net/appcore/src/cpu/i686/ac_i686_masm_asm.html

:^)



Frederick Virchanza Gotham

Jul 18, 2023, 4:35:05 AM

Over on comp.unix.programmer, Adam Sampson gave me inline assembler that works:

__asm(".text\n"
"Invoke:\n"
" mov x1, x8\n"
" adr x9, f\n"
" ldr x9, [x9]\n"
" br x9\n"
);

For the sake of posting to this newsgroup, I simplified my problem just a tiny bit. Previously I told you that 'f' was a global variable defined as follows:

void (*f)(void);

but in actual fact it's:

thread_local void (*f)(void);

If I change it to thread_local then try to re-compile the inline assembler, I get a linker error:

R_AARCH64_ADR_PREL_LO21 used with TLS symbol f

Do you know what syntax I use to access the thread_local variable from assembler? Will I need to write a separate function as follows?

void (*getf(void))(void)
{
    return f;
}

and then call that function from my assembler? I'm worried about corrupting the caller-saved registers, because I perform a 'br' rather than a 'blr' (i.e. I perform a jump rather than a function call). I suppose I could push all the caller-saved registers before invoking 'getf' and then pop afterward... which I know how to do on x86_64 but I'm new to all this aarch64 stuff.
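
The accessor idea can be sketched in portable C++ as follows. This is an assumption-laden illustration, not a finished trampoline: 'sample_target', 'calls' and 'invoke_through_accessor' are hypothetical names, and the last function is only the C++ equivalent of "fetch the pointer, then jump to it". Giving 'getf' C linkage means assembler could reach it under the plain symbol name 'getf', while the compiler emits whatever thread-local access sequence the platform needs.

```cpp
#include <cassert>

// One function pointer per thread, as in the real code described above.
thread_local void (*f)(void) = nullptr;

// Accessor with C linkage: the symbol is simply 'getf', and the compiler
// generates the platform's TLS access sequence inside it.
extern "C" void (*getf(void))(void) {
    return f;
}

int calls = 0;                       // hypothetical counter for the demo
void sample_target(void) { ++calls; }

// C++ stand-in for the trampoline: fetch the pointer, then transfer control.
// Unlike a 'br' in assembler, this is an ordinary call that returns here.
void invoke_through_accessor() {
    void (*target)(void) = getf();
    assert(target != nullptr);
    target();
}
```

If 'getf' were called from aarch64 assembler with 'bl', the call would overwrite x30 and is free to clobber x8 along with the other caller-saved registers, so at least those two would need to be saved before the call and restored before the final 'br'.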

Frederick Virchanza Gotham

Jul 18, 2023, 6:56:08 AM
On Monday 17 July 2023, Frederick Gotham wrote:
>
> If I change it to thread_local then try to re-compile, I get a linker error:
>
> R_AARCH64_ADR_PREL_LO21 used with TLS symbol f
>
> Do you know what syntax I use to access the thread_local variable from assembler? Will I need to write a separate function as follows?


In order to try to understand how thread_local variables are accessed from aarch64 assembler, I wrote the following dynamic shared library in C:

__thread void (*f)(void);

void (*g)(void);

void Func(void)
{
    g = f;
}

I compiled this to 'libtest.so' and then used 'objdump' on it to see:

<Func>:
Line 01: stp x29, x30, [sp, #-16]!
Line 02: mrs x1, tpidr_el0
Line 03: mov x29, sp
Line 04: adrp x0, 20000 <__cxa_finalize>
Line 05: ldr x2, [x0, #16]
Line 06: add x0, x0, #0x10
Line 07: blr x2
Line 08: adrp x2, 1f000 <__FRAME_END__+0x1e8c8>
Line 09: ldr x2, [x2, #4032]
Line 10: ldr x0, [x1, x0]
Line 11: str x0, [x2]
Line 12: ldp x29, x30, [sp], #16
Line 13: ret

Line #2 appears to put the address of "thread local storage" inside the x1 register.
Lines #4-7 at first glance seem to call the function "__cxa_finalize" (which is the one that gets called at the end of a program to invoke all the destructors of global objects)... but really I just think that the number 0x20000 is being used as a base address to apply offsets to.
Line #7 definitely is calling some function though, I don't know which one.
Lines #8-12, I'm not sure here... but I think they're moving the value of the thread_local variable 'f' into the global variable 'g'.

Can anyone please help me understand this? And show me how I would go about writing aarch64 assembler to access a thread_local variable called 'f'?
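
One plausible reading of the listing, assuming the compiler used the TLS descriptor scheme that is common for AArch64 shared libraries: Line 02 reads the thread pointer into x1, Lines 04 to 07 call a resolver that returns in x0 the offset of 'f' from that pointer, and Line 10 loads 'f' from thread pointer plus offset. The sketch below models only that final addressing step in portable C++. 'ThreadBlock', 'load_tls_slot' and 'store_tls_slot' are hypothetical names, and the offset of 16 is arbitrary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

using Fn = void (*)();

// Hypothetical stand-in for one thread's block of thread-local storage.
struct ThreadBlock {
    alignas(Fn) unsigned char bytes[64];
};

// Models Line 10, 'ldr x0, [x1, x0]': x1 is the thread pointer read from
// TPIDR_EL0 on Line 02 and x0 is the variable's offset from that pointer.
Fn load_tls_slot(const ThreadBlock& block, std::size_t offset) {
    Fn value;
    std::memcpy(&value, block.bytes + offset, sizeof value);
    return value;
}

// Helper for the demonstration: writes a pointer into a block at an offset.
void store_tls_slot(ThreadBlock& block, std::size_t offset, Fn value) {
    std::memcpy(block.bytes + offset, &value, sizeof value);
}

void first() {}
void second() {}

// Two "threads" share the same offset but have different base addresses,
// so each one sees its own value of the variable.
bool slots_are_per_thread() {
    ThreadBlock a{}, b{};
    const std::size_t offset_of_f = 16;  // arbitrary offset for the sketch
    store_tls_slot(a, offset_of_f, first);
    store_tls_slot(b, offset_of_f, second);
    return load_tls_slot(a, offset_of_f) == first &&
           load_tls_slot(b, offset_of_f) == second;
}
```

Under this reading, the remaining instructions (Lines 08, 09 and 11) would locate the global 'g' and store the loaded value into it, which matches the guess made above.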