[RFC] Add thunk support for x86

Farid Zakaria

Feb 9, 2026, 2:36:54 PM
to X86-64 System V Application Binary Interface
Hi!

This is a more specific discussion of the general "Making medium code-model handle large binaries" discussion [1].

I had begun trying to upstream thunk support for x86-64 in lld via [2].

I am hoping we can reach consensus on what thunks for x86 may look like so that we can implement them. I think the main points of discussion are which register to reserve for use in the thunk, thunk spacing, and how to handle thunks that may still not reach their final target (thunk to a thunk?).

Fangrui has already given me some good feedback on the PR, which I plan on addressing in the interim.


Florian Weimer

Feb 9, 2026, 4:20:35 PM
to Farid Zakaria, X86-64 System V Application Binary Interface
* Farid Zakaria:
A couple of random comments:

The long thunk sequences may need a NOTRACK prefix on the JMP
instruction if we ever turn on IBT. Hopefully it won't need
an ENDBR64 marker, so the whole sequence stays below 16 bytes
for alignment purposes.

We should enable optional rewriting of long thunks to JMPABS at run
time, by some suitable markup.

Using r11 as temporary register appears to be the right choice. It's
desirable to define STO_X86_64_VARIANT_CC, so that the toolchain can
detect calling convention mismatches (if r11 is not usable for
procedure linkage, STO_X86_64_VARIANT_CC must be set on the function
symbol).

Farid Zakaria

Mar 2, 2026, 7:10:49 PM
to X86-64 System V Application Binary Interface
Sorry for the late reply, I have been working on the PR and thinking through the proposal.
Thank you for the feedback Florian. I will keep this advice in mind going forward.

* NOTRACK prefix: I will look into this.
* Rewriting the long thunks: something I think we can look into with BOLT as well.

Fangrui (MaskRay) gave me advice to flesh out the design here a little more. I plan on also putting forward a PR to https://gitlab.com/x86-psABIs/x86-64-ABI

Proposal:  Add range extension thunk implementation for x86-64 in the LLD linker.  Thunks allow calls and jumps to reach targets beyond the ±2GiB range of R_X86_64_PLT32 relocations.
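
For illustration, a minimal Python sketch of the range check that decides whether a branch needs a thunk at all. The function names are hypothetical, and the addend of -4 is an assumption (the usual value for a `call`/`jmp rel32` whose 4-byte field is the last part of the instruction):

```python
INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1


def plt32_value(target: int, place: int, addend: int = -4) -> int:
    """Value a 32-bit PC-relative relocation would store: target + addend - place."""
    return target + addend - place


def needs_thunk(target: int, place: int, addend: int = -4) -> bool:
    """True when the displacement does not fit in a signed 32-bit field."""
    return not (INT32_MIN <= plt32_value(target, place, addend) <= INT32_MAX)


print(needs_thunk(target=0x1000, place=0x2000))         # nearby target
print(needs_thunk(target=0x1_0000_0000, place=0x1000))  # roughly 4GiB away
```

Here `place` is the address of the 4-byte displacement field, not of the start of the instruction.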

ABI:

Short thunk (5 bytes): when the thunk itself is within ±2GiB of the target, a short thunk is used:
jmp rel32

Long thunk (23 bytes): when the target is beyond ±2GiB from the thunk, a position-independent sequence is used:
movabsq $offset, %r11
leaq (%rip), %r10
addq %r10, %r11
jmpq *%r11
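
As a sanity check on the 23-byte figure, here is a rough Python sketch that assembles the long thunk by hand and replays its arithmetic. The helper names are hypothetical, and it assumes `leaq (%rip)` is encoded as `leaq 0(%rip)`, so `$offset` is taken relative to the end of the `leaq` (thunk start + 17):

```python
import struct


def long_thunk(thunk_addr: int, target: int) -> bytes:
    # leaq 0(%rip) yields the address of the next instruction:
    # thunk_addr + 10 (movabsq) + 7 (leaq).
    rip_after_lea = thunk_addr + 17
    offset = target - rip_after_lea
    return (
        b"\x49\xbb" + struct.pack("<q", offset)    # movabsq $offset, %r11
        + b"\x4c\x8d\x15" + struct.pack("<i", 0)   # leaq 0(%rip), %r10
        + b"\x4d\x01\xd3"                          # addq %r10, %r11
        + b"\x41\xff\xe3"                          # jmpq *%r11
    )


def resolve(thunk_addr: int, code: bytes) -> int:
    """Replay the thunk's arithmetic to find where it would jump."""
    r11 = struct.unpack_from("<q", code, 2)[0]  # immediate of movabsq
    r10 = thunk_addr + 17                       # result of leaq 0(%rip)
    return (r10 + r11) % 2**64


code = long_thunk(0x400000, 0x2_0000_0000)
print(len(code), hex(resolve(0x400000, code)))
```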


%r10 and %r11 are caller-saved scratch registers in the x86-64 SysV ABI

Named symbol: __X86_64LongThunk_<symbol_name>
STT_SECTION symbol (local): __X86_64LongThunk_<section_name>_<filename>_<hex_offset>
Anonymous/empty name: __X86_64LongThunk__<hex_offset>

examples:
  • __X86_64LongThunk_high_target — thunk to symbol high_target
  • __X86_64LongThunk_.text.far_main.o_100 — thunk to offset 0x100 in .text.far section from main.o

* symbol names from STT_SECTION may still have duplicate names if filenames are repeated.
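
The naming rules above could be sketched roughly as follows in Python. The function name and parameters are hypothetical, and the lower-case, unprefixed hex offset is an assumption based on the `_100` example:

```python
PREFIX = "__X86_64LongThunk_"


def thunk_name(symbol=None, section=None, filename=None, offset=0):
    """Build a thunk symbol name following the three cases listed above."""
    if symbol:
        # Named symbol
        return PREFIX + symbol
    if section is not None and filename is not None:
        # STT_SECTION symbol (local)
        return f"{PREFIX}{section}_{filename}_{offset:x}"
    # Anonymous/empty name
    return f"{PREFIX}_{offset:x}"


print(thunk_name(symbol="high_target"))
print(thunk_name(section=".text.far", filename="main.o", offset=0x100))

# Two different objects that share a file name produce the same thunk name.
a = thunk_name(section=".text.far", filename="main.o", offset=0x100)  # e.g. from dir1/main.o
b = thunk_name(section=".text.far", filename="main.o", offset=0x100)  # e.g. from dir2/main.o
print(a == b)
```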

Implementation:
The linker will detect out-of-range branches during relocation processing and insert a thunk that the original call/jump targets instead.
The linker specifically partitions the output into 1GiB regions and inserts thunks at regular intervals as needed. This ensures all branches can reach the thunks. Thunks targeting the same destination are deduplicated within each partition. Thunks may be duplicated across partitions as needed.
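
A toy Python sketch of the deduplication rule, not of the actual lld implementation. It assumes every branch passed in already needs a thunk and that the partition index is simply the branch address divided by 1GiB; `plan_thunks` is a hypothetical name:

```python
PARTITION_SIZE = 1 << 30  # 1GiB


def plan_thunks(branches):
    """Group out-of-range branches by (partition, target).

    branches: iterable of (branch_address, target_address) pairs.
    Returns a dict with one entry per thunk that must be emitted,
    mapping to the branch addresses that share it.
    """
    thunks = {}
    for place, target in branches:
        key = (place // PARTITION_SIZE, target)
        thunks.setdefault(key, []).append(place)
    return thunks


FAR = 0x3_0000_0000
plan = plan_thunks([(0x100, FAR), (0x200, FAR), (0x4000_0100, FAR)])
for (partition, target), users in plan.items():
    print(partition, hex(target), [hex(u) for u in users])
```

The first two branches share one thunk in partition 0, while the third gets its own copy in partition 1.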

Given a 1GiB partition, the theoretical maximum per partition is:
Long thunks: 1 GiB / 23 bytes = ~46 million thunks
Short thunks: 1 GiB / 5 bytes = ~214 million thunks
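
The two figures can be reproduced with a few lines of Python (the helper name is hypothetical; the bound assumes a partition filled with nothing but thunks):

```python
GIB = 1 << 30


def max_thunks(thunk_size: int, partition_size: int = GIB) -> int:
    """Upper bound on thunks if a partition held nothing but thunks."""
    return partition_size // thunk_size


for name, size in (("long", 23), ("short", 5)):
    print(f"{name}: {max_thunks(size):,} thunks")
```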

Jan Beulich

Mar 3, 2026, 3:22:04 AM
to Farid Zakaria, X86-64 System V Application Binary Interface
On 03.03.2026 01:10, Farid Zakaria wrote:
> Sorry for the late reply, I have been working on the PR and thinking
> through the proposal.
> Thank you for the feedback Florian. I will keep this advice going forward.
>
> * NOTRACK prefix: I will look into this.
> * Rewriting the long thunks, something I think we can look into with either
> Bolt as well.
>
> Fangrui (MaskRay) gave me advice to flesh out the design here a little
> more. I plan on also putting forward a PR
> to https://gitlab.com/x86-psABIs/x86-64-ABI
>
> *Proposal*: Add range extension thunk implementation for x86-64 in the LLD
> linker. Thunks allow calls and jumps to reach targets beyond the ±2GiB
> range of R_X86_64_PLT32 relocations.
>
> *ABI*:
>
> Short Thunk (5 bytes) for when the thunk itself is within ±2GiB of the
> target, a short thunk is used:
> jmp rel32
>
> Long Thunk (23 bytes) for when the target is beyond ±2GiB from the thunk, a
> position-independent sequence is used:
> movabsq $offset, %r11
> leaq (%rip), %r10
> addq %r10, %r11
> jmpq *%r11
>
> *%r10 and %r11 are caller-saved scratch registers in the x86-64 SysV ABI*

Is it a goal to avoid using the stack? Otherwise with a 24-byte thunk we
could get away with using just %r11 (seeing that %r10 has a psABI designated
purpose):

movabs $target - 1f, %r11
call 1f
1:
add %r11, (%rsp)
pop %r11
jmp *%r11

(Obviously without CET-SS it would then in principle also be possible to
use RET to shrink the size by 4 bytes, but of course this comes with its
own downsides.)
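
To double-check the byte counts, a rough Python sketch that hand-assembles the stack-based sequence. The encodings are written out manually and the helper name is hypothetical:

```python
import struct

MOVABS_R11 = b"\x49\xbb"           # movabs $imm64, %r11 (opcode bytes only)
CALL_NEXT = b"\xe8" + bytes(4)     # call 1f, with rel32 = 0
ADD_TO_RET = b"\x4c\x01\x1c\x24"   # add %r11, (%rsp)
POP_R11 = b"\x41\x5b"              # pop %r11
JMP_R11 = b"\x41\xff\xe3"          # jmp *%r11
RET = b"\xc3"                      # ret (variant without CET-SS)


def stack_thunk(thunk_addr: int, target: int, use_ret: bool = False) -> bytes:
    # Label 1 follows the 10-byte movabs and the 5-byte call; its address
    # is also the return address pushed by the call.
    label_1 = thunk_addr + 10 + 5
    imm = struct.pack("<q", target - label_1)
    tail = RET if use_ret else POP_R11 + JMP_R11
    return MOVABS_R11 + imm + CALL_NEXT + ADD_TO_RET + tail


print(len(stack_thunk(0x1000, 0x2_0000_0000)))
print(len(stack_thunk(0x1000, 0x2_0000_0000, use_ret=True)))
```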

Thinking about it, if mixing code and data is acceptable, how about this
21-byte form clobbering only %r11:

1:
.quad target - 1b
thunk:
lea 1b(%rip), %r11
add (%r11), %r11
jmp *%r11

If mixing code and data wants avoiding, some / all .quad-s can be grouped
together, followed by some / all of the thunks (with suitable cache line
alignment arranged for at the transition boundaries).
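
A rough Python sketch of the 21-byte form, with the `.quad` placed directly in front of the code as in the listing above. The helper names are hypothetical and the encodings are written out by hand:

```python
import struct


def quad_thunk(base: int, target: int) -> bytes:
    """base is the address of label 1 (the .quad); code starts at base + 8."""
    quad = struct.pack("<q", target - base)          # .quad target - 1b
    disp = base - (base + 8 + 7)                     # 1b relative to end of lea
    lea = b"\x4c\x8d\x1d" + struct.pack("<i", disp)  # lea 1b(%rip), %r11
    add = b"\x4d\x03\x1b"                            # add (%r11), %r11
    jmp = b"\x41\xff\xe3"                            # jmp *%r11
    return quad + lea + add + jmp


def resolve(base: int, blob: bytes) -> int:
    """Replay the thunk's arithmetic to find where it would jump."""
    disp = struct.unpack_from("<i", blob, 11)[0]
    r11 = base + 15 + disp                                # lea 1b(%rip), %r11
    r11 += struct.unpack_from("<q", blob, r11 - base)[0]  # add (%r11), %r11
    return r11 % 2**64


blob = quad_thunk(0x7000_0000, 0x5_0000_0000)
print(len(blob), hex(resolve(0x7000_0000, blob)))
```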

(If position independence wasn't a requirement, things would of course
get yet easier / smaller.)

> Named symbol: __X86_64LongThunk_<symbol_name>
> STT_SECTION symbol (local):
> __X86_64LongThunk_<section_name>_<filename>_<hex_offset>
> Anonymous/empty name: __X86_64LongThunk__<hex_offset>
>
> examples:
>
> - *__X86_64LongThunk_high_target* — thunk to symbol high_target
> - *__X86_64LongThunk_.text.far_main.o_100 *— thunk to offset 0x100 in
> .text.far section from main.o
>
>
> ** symbol names from *STT_SECTION may still have duplicate names if
> filenames are repeated.

Do the thunks need names in the first place? Duplication can result not only
from STT_SECTION symbols, but also from other STB_LOCAL ones afaict.

> *Implementation:*
> The linker will detect this during relocation processing and insert a thunk
> that the original call/jump targets instead.
> The linker specifically partitions the output into 1GiB regions and inserts
> thunks at regular intervals as needed. This ensures all branches can reach
> the thunks. Thunks targeting the same destination are deduplicated within
> each partition. Thunks may be duplicated across partitions as needed.
>
> Given a 1GiB partition, the theoretical maximum per partition is:
> Long thunks: 1 GiB / 23 bytes = ~46 million thunks

Which may not be enough, as ...

> Short thunks: 1 GiB / 5 bytes = ~214 million thunks

... like this theoretical upper bound there can also be as many CALLs/JMPs
in a 1GiB region. To me, setting a fixed partition size looks inflexible. By
using as big a region as possible, the number of thunks needed may be reduced.

Jan

Michael Matz

Mar 3, 2026, 8:34:32 AM
to Farid Zakaria, X86-64 System V Application Binary Interface
Hello,

On Mon, 2 Mar 2026, Farid Zakaria wrote:

> Long Thunk (23 bytes) for when the target is beyond ±2GiB from the thunk, a
> position-independent sequence is used:
> movabsq $offset, %r11
> leaq (%rip), %r10
> addq %r10, %r11
> jmpq *%r11
>
> *%r10 and %r11 are caller-saved scratch registers in the x86-64 SysV ABI*

Hmm? %r10 is caller-saved like the argument registers, but not scratch,
also like the argument regs. It holds the static chain for languages that
need one. %r11 can be used like you indicate, and _only_ it.


Ciao,
Michael.

Florian Weimer

Mar 3, 2026, 8:42:07 AM
to 'Jan Beulich' via X86-64 System V Application Binary Interface, Farid Zakaria, Jan Beulich
* Jan Beulich:

> Thinking about it, if mixing code and data is acceptable, how about this
> 21-byte form clobbering only %r11:
>
> 1:
> .quad target - 1b
> thunk:
> lea 1b(%rip), %r11
> add (%r11), %r11
> jmp *%r11
>
> If mixing code and data wants avoiding, some / all .quad-s can be grouped
> together, followed by some / all of the thunks (with suitable cache line
> alignment arranged for at the transition boundaries).

If those .quad-s are grouped, how far would this be away from multi-GOT,
implementation-wise?

Thanks,
Florian

Jan Beulich

Mar 3, 2026, 9:24:01 AM
to Florian Weimer, Farid Zakaria, 'Jan Beulich' via X86-64 System V Application Binary Interface
I was in fact wondering the same while writing the earlier reply (also the
similarity to PLT). GOT would have absolute addresses though, wouldn't it?

Jan

Michael Matz

Mar 3, 2026, 11:57:45 AM
to Florian Weimer, 'Jan Beulich' via X86-64 System V Application Binary Interface, Farid Zakaria, Jan Beulich
Hello,

On Tue, 3 Mar 2026, 'Florian Weimer' via X86-64 System V Application Binary Interface wrote:

> > .quad target - 1b
> > thunk:
> > lea 1b(%rip), %r11
> > add (%r11), %r11
> > jmp *%r11
> >
> > If mixing code and data wants avoiding, some / all .quad-s can be grouped
> > together, followed by some / all of the thunks (with suitable cache line
> > alignment arranged for at the transition boundaries).
>
> If those .quad-s are grouped, how far would this be away from multi-GOT,
> implementation-wise?

Potentially far, depending on how the link editor is structured. What
singles out the contents of a GOT is that it contains absolute addresses
and hence requires runtime relocs (and therefore, with multi-GOT,
appropriate means to convey those relocs in the final ELF file; though in
standard ELF reloc entries for a single GOT or multiple GOTs are of course
part of the same runtime RELA(SZ) blob). The link editor hence needs
appropriate tracking for these disjoint output pieces that require output
relocs.

The above quads (if grouped or not) are simply fully relocated data blobs
in the output (that happen to contain self-relative offsets) and so, from
the link editor perspective, are just random input sections not
contributing to output relocs. Handling them would simply fall into
place once they're generated.
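
A small Python illustration of that difference, using hypothetical helper names and made-up offsets: an absolute slot (GOT-style) changes with the load address, while a self-relative slot does not and so needs no runtime relocation.

```python
def absolute_entry(load_base: int, target_offset: int) -> int:
    """GOT-style slot: holds the run-time address of the target."""
    return load_base + target_offset


def self_relative_entry(load_base: int, slot_offset: int, target_offset: int) -> int:
    """Thunk-style slot: holds target address minus the slot's own address."""
    return (load_base + target_offset) - (load_base + slot_offset)


SLOT, TARGET = 0x1000, 0x1_8000_0000  # offsets within the image (made up)
for base in (0x5555_5555_4000, 0x7F00_0000_0000):
    print(hex(absolute_entry(base, TARGET)),
          hex(self_relative_entry(base, SLOT, TARGET)))
```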


Ciao,
Michael.

H.J. Lu

Mar 3, 2026, 6:51:30 PM
to Michael Matz, Florian Weimer, 'Jan Beulich' via X86-64 System V Application Binary Interface, Farid Zakaria, Jan Beulich
Can we use multiple GOTs:

jmp *.L1@GOTPCREL(%rip)

The linker can optimize it to a direct jump if the target is reachable.
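
A hedged Python sketch of that optimization, modelled on the existing GOTPCRELX-style rewrite of `jmp *foo@GOTPCREL(%rip)` into `jmp foo; nop`. The function name is hypothetical, and it assumes the 6-byte indirect jump is replaced in place:

```python
import struct


def relax_indirect_jmp(place: int, target: int):
    """place: address of the 6-byte `jmp *sym@GOTPCREL(%rip)` (ff 25 disp32).

    Returns replacement bytes when the target is directly reachable,
    or None when the indirect jump through the GOT has to stay.
    """
    rel = target - (place + 5)  # rel32 is relative to the end of `jmp rel32`
    if -(2**31) <= rel < 2**31:
        return b"\xe9" + struct.pack("<i", rel) + b"\x90"  # jmp target; nop
    return None


print(relax_indirect_jmp(0x1000, 0x2000).hex())
print(relax_indirect_jmp(0x1000, 0x2_0000_0000))
```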

--
H.J.

Farid Zakaria

Mar 3, 2026, 11:57:06 PM
to X86-64 System V Application Binary Interface
(This is the original author, just with my work account)

All very good insights!
I was unfortunately misinformed that %r10 could also be a scratch register :(
Thank you for pointing that out Michael.

I liked the suggestion you offered:
.align 8
1:  .quad target - 1b
thunk:
    lea 1b(%rip), %r11
    addq (%r11), %r11
    jmpq *%r11


If we want to use dynamic relocations I think we can also use a non-PIC approach.
thunk:
    movabsq $target, %r11  # Load the absolute 64-bit address directly into r11
    jmpq *%r11             # Jump to it
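
For size comparison, a short Python sketch of the non-PIC thunk bytes (hypothetical helper, hand-written encodings). The 8-byte immediate at offset 2 is the part that would have to be fixed up at load time in a position-independent output:

```python
import struct


def abs_thunk(target: int) -> bytes:
    return (
        b"\x49\xbb" + struct.pack("<Q", target)  # movabsq $target, %r11
        + b"\x41\xff\xe3"                        # jmpq *%r11
    )


code = abs_thunk(0x2_0000_0000)
print(len(code), hex(struct.unpack_from("<Q", code, 2)[0]))
```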

> Do the thunks need names in the first place? Duplication can result not only
> from STT_SECTION symbols, but also from other STB_LOCAL ones afaict.

I don't think they need names as part of the ABI, but having them definitely helps with testing and validation.

I also see a lot of mentions of multiple GOT. This is definitely the direction we plan on heading. In fact, Grigory and I are looking at a partitioning scheme that involves multiple GOT. I am working on making a public version of our design document (a pre-RFC to the ABI PR) that I hope to share with you all.

With respect to thunks, I'm hoping to chip away at the design in pieces by first getting agreement on thunks and upstreaming that.

> ... like this theoretical upper bound there can also be as many CALLs/JMPs
> in a 1GiB region. To me, setting a fixed partition size looks inflexible. By
> using as big a region as possible, the number of thunks needed may be reduced.

I am open to alternate layouts. I largely plan to take inspiration from what ARM has already done in lld but I don't think it needs to be cemented at all in the ABI.

Michael Matz

Mar 4, 2026, 9:37:08 AM
to Farid Zakaria, X86-64 System V Application Binary Interface
Hello,

On Tue, 3 Mar 2026, 'Farid Zakaria' via X86-64 System V Application Binary Interface wrote:

> All very good insights!
> I was unfortunately misinformed that %r10 could also be a scratch register
> :(
> Thank you for pointing that out Michael.
>
> I liked the suggestion you offered:
> .align 8
> 1: .quad target - 1b
> thunk:
> lea 1b(%rip), %r11
> addq (%r11), %r11
> jmpq *%r11
>
> If we want to use dynamic relocations I think we can also use a non-PIC
> approach.
> thunk:
> movabsq $target, %r11 # Load the absolute 64-bit address directly into
> r11
> jmpq *%r11 # Jump to it

I think it would be good to avoid dynamic relocs for what you want.
_Especially_ if the places to relocate are not grouped in a table like a
GOT. Your movabsq thunk for instance intersperses places to reloc with
instruction bytes and hence wastes space in unshared CoW pages (not much,
of course). See also below about some more things related to dynamic
relocs.

Actually I think the specific layout and contents of these thunks doesn't
really need specification in the psABI, what matters is that they can be
jumped to and then magically transfer control to the wanted destination
with clobbering only r11. The specific instruction sequence doesn't
matter psABI-wise. Even the existence of these thunks is borderline
implementation detail. Obviously it's good to discuss the details in
this forum, but not all of those then need to go into the document, or
only into an informative section.

Into the psABI need to go only the things that matter for
interoperability. E.g. if we need new ELF things: new flag constants, new
tags, new section types, new relocations, etc. Right now it looks like
you can extend the linker such that it takes .o files produced by current
compilers/assemblers, combine them into a final ELF file that uses
the thunks and that's loadable by current ld.so's. Even when they need
text segments larger than 2GB. That's without dynamic relocs and if
that's correct it's an indication that no normative extensions to the
psABI are needed.

When they need dynamic relocs we may need to think how that interworks
with current practice of having the read-exec segment not be writable
(outside TEXTREL, boo!). We then would have multiple disjoint ranges that
are writable during relocation processing, and then therefore a need to
somehow mark them, or handle them with (perhaps multiple) GNU_RELRO
segments or suchlike. Either way we'd need to do _something_ for
interoperability, and that would need specifying. I'd prefer to not have
to think about that :-)

> > Do the thunks need names in the first place? Duplication can result not
> only
> > from STT_SECTION symbols, but also from other STB_LOCAL ones afaict.
>
> I don't think they need names as the ABI but having them definitely helps
> with testing and validation.

Similar to names for PLT slots (those can even be synthesized on the fly
by disassemblers). No psABI material but helpful.

> > ... like this theoretical upper bound there can also be as many CALLs/JMPs
> > in a 1GiB region. To me, setting a fixed partition size looks inflexible.
> > By
> > using as big a region as possible, the number of thunks needed may be reduced.
>
> I am open to alternate layouts. I largely plan to take inspiration from
> what ARM has already done in lld but I don't think it needs to be cemented
> at all in the ABI.

I think concrete sizes and layouts of, and strategies how to create such
partitions are no psABI material, they are implementation details.


Ciao,
Michael.

Farid Zakaria

Mar 4, 2026, 12:32:40 PM
to X86-64 System V Application Binary Interface
That makes a lot of sense.
I am a little green to this ABI process and was taking cues from Fangrui (MaskRay) via

"this patch cannot land until the ABI process is initiated"

I think since r11 is already considered scratch, that is a strong case that the thunks are implementation details.
Seems like that PR can be modified to use the correct sequence without modifications to the ABI.

This would be a good first step, while I think through multiple GOT & partitioning.
(One problem here will be definitely supporting multiple RELRO segments and adding support for this in glibc)

Maskray, how does that sound?
(with respect to making progress on the PR)

As for strategies for layout and size: I am focused on allowing us to expand beyond 2GiB, as at Meta we've hit this limit.
Improvements here would be extremely beneficial while we work to fine-tune a strategy that allows maximum scaling.
(tl;dr; I am probably motivated by a simpler design first that allows larger than 2GiB with some upper bound that we can continue to refine).

Arthur Eubanks

Mar 4, 2026, 4:54:49 PM
to X86-64 System V Application Binary Interface
I just want to provide general support for this, thanks for pushing this through! This is a huge part of the effort to redefine some code models to prevent relocation overflows in large binaries without huge performance hits with the large code model call sequence when the binary actually ends up being small enough.

Agreed that we don't need to put all the implementation details into the psABI docs, but we should change the psABI portion that shows the large code model function call instruction sequence to be the same as the small/medium instruction sequence and mention that the linker will add thunks if necessary (I'm happy to make this change if you'd like).

Michael Matz

Mar 5, 2026, 8:58:20 AM
to Arthur Eubanks, X86-64 System V Application Binary Interface
Hello,

On Wed, 4 Mar 2026, 'Arthur Eubanks' via X86-64 System V Application Binary Interface wrote:

> I just want to provide general support for this, thanks for pushing this
> through! This is a huge part of the effort to redefine some code models to
> prevent relocation overflows in large binaries without huge performance
> hits with the large code model call sequence when the binary actually ends
> up being small enough.
>
> Agreed that we don't need to put all the implementation details into the
> psABI docs, but we should change psABI portion that shows the large code
> model function call instruction sequence to be the same as the small/medium
> instruction sequence and mention that the linker will add thunks if
> necessary (I'm happy to make this change if you'd like).

"will"? :-) I think such changes would want doing once there is a linker
that indeed does so. Even so, I would not change the current definition
of the large code model (we can't really, though we could probably
discourage it). I'd rather add another one, or maybe just mention that
the small code model doesn't preclude >2GB .text with linker support and
hint towards examples for that (i.e. what we discuss here) in some
non-normative part.


Ciao,
Michael.

Farid Zakaria

Mar 7, 2026, 4:08:30 PM
to X86-64 System V Application Binary Interface
I have opened up https://gitlab.com/x86-psABIs/x86-64-ABI/-/merge_requests/67

I kept the changes hopefully concise. It includes the PIC example.

Please let me know what you think.

James Y Knight

Mar 9, 2026, 7:37:17 PM
to Farid Zakaria, X86-64 System V Application Binary Interface
I've just realized there's a problem with using r11: we might create a range-extension thunk on any call or jump, including those to non-global symbols. This means that any such jump or call must understand that r11 can now be clobbered.

There were already assumptions made about the calling convention on call/jump to global symbols which might go through a PLT. But, as the spec says "The standard calling sequence requirements apply only to global functions. Local functions that are not reachable from other compilation units may use different conventions." (Sidenote: that description could be improved, since users do use non-standard calling conventions for global functions, and that's fine as long as they meet certain, undocumented, constraints).

Despite emitting the same R_X86_64_PLT32 relocations in either case, jumps to local symbols (even in another section) are currently assumed to be transparent and to not clobber anything or have any calling-convention restrictions. This proposal effectively requires a novel change (for X86): it constrains the "calling-convention" used for any call or jump made through a symbolic relocation.

I believe this is relevant for some real-world compiler-generated code. For example, code generated by LLVM's -fbasic-block-sections option can potentially place every basic-block into its own section, and therefore the jumps between blocks will all be emitted with a relocation to a local symbol. Those branches are currently presumed not to clobber any registers...and now could unexpectedly clobber %r11, if they're placed far apart (as they would likely be, given that parts of the function will be hot and other parts cold).

I don't think this completely dooms the idea of using r11, but since it's an ABI change, the rollout will be significantly more complicated...

I'd note that because AArch64 always used range-extension thunks, its ABI reserved x16 and x17 for this purpose from the get-go, which sure is nice for them. :)


Michael Matz

Mar 10, 2026, 10:12:05 AM
to James Y Knight, Farid Zakaria, X86-64 System V Application Binary Interface
Hello,

On Mon, 9 Mar 2026, 'James Y Knight' via X86-64 System V Application Binary Interface wrote:

> There were already assumptions made about the calling convention on
> call/jump to *global symbols* which might go through a PLT. But, as the
> spec says "The standard calling sequence requirements apply only to global
> functions. Local functions that are not reachable from other compilation
> units may use different conventions." (Sidenote: that description could be
> improved, since users *do use* non-standard calling conventions for global
> functions, and that's fine as long as they meet certain, undocumented,
> constraints).

Under the usual as-if rules when reading specs or standards I don't think
it appropriate to spell out all kinds of potential exceptions a user of
those specs might consider and perhaps use. If you're outside of the spec
it's assumed you know what you're doing and do it only in places where it
can be done :)

> Despite emitting the same R_X86_64_PLT32 relocations in either case, jumps
> to local symbols (even in another section) are currently assumed to be
> transparent and to not clobber anything or have any calling-convention
> restrictions.

If the compiler/assembler knows the target symbol is definitely local and
defined (like in your case for inter-function jumps) you could emit a PC32
reloc, not a PLT32 one, and while at it make it not symbol but section
based. The ranges can't be extended then, but that's fine because, as you
say, the thunk wouldn't work anyway.

It is something to be aware of, though, so thanks for bringing it up.
Ideally we'd devise a scheme where the linker could determine if a given
reloc-place<->symbol pair can be thunked or not. It could be the
relocation type or some indication on the symbol, but for the existing
use-cases it seems fine to start without. People do use different calling
conventions for some functions, as you say, but most of them are rather
special-case. So IMHO such scheme should be opt-out, i.e. by default
the linker would assume psABI conventions even for local functions (not
temporary .L symbols!) and hence thunkability.


Ciao,
Michael.

Jan Beulich

Mar 10, 2026, 10:20:37 AM
to Michael Matz, James Y Knight, Farid Zakaria, X86-64 System V Application Binary Interface
On 10.03.2026 15:12, Michael Matz wrote:
> On Mon, 9 Mar 2026, 'James Y Knight' via X86-64 System V Application Binary Interface wrote:
>> There were already assumptions made about the calling convention on
>> call/jump to *global symbols* which might go through a PLT. But, as the
>> spec says "The standard calling sequence requirements apply only to global
>> functions. Local functions that are not reachable from other compilation
>> units may use different conventions." (Sidenote: that description could be
>> improved, since users *do use* non-standard calling conventions for global
>> functions, and that's fine as long as they meet certain, undocumented,
>> constraints).
>
> Under the usual as-if rules when reading specs or standards I don't think
> it appropriate to spell out all kinds of potential exceptions a user of
> those specs might consider and perhaps use. If you're outside of the spec
> it's assumed you know what you're doing and do it only in places where it
> can be done :)
>
>> Despite emitting the same R_X86_64_PLT32 relocations in either case, jumps
>> to local symbols (even in another section) are currently assumed to be
>> transparent and to not clobber anything or have any calling-convention
>> restrictions.
>
> If the compiler/assembler knows the target symbol is definitely local and
> defined (like in your case for inter-function jumps) you could emit a PC32
> reloc, not a PLT32 one, and while at it make it not symbol but section
> based. The ranges can't be extended then, but that's fine because, as you
> say, the thunk wouldn't work anyway.

Another option being to make the thunk work. I already asked in an earlier
reply whether it is a requirement to avoid using the stack. If it can't be
used (seems likely when intra-function branches may need thunking), next
best thing I can think of is scratch space in TLS.

Jan

Florian Weimer

Mar 10, 2026, 10:57:24 AM
to 'Jan Beulich' via X86-64 System V Application Binary Interface, Michael Matz, James Y Knight, Jan Beulich, Farid Zakaria
* Jan Beulich:
That's not going to be async-signal-safe, I think.

Thanks,
Florian

James Y Knight

Mar 10, 2026, 12:10:43 PM
to Michael Matz, Farid Zakaria, X86-64 System V Application Binary Interface
On Tue, Mar 10, 2026 at 10:12 AM Michael Matz <ma...@suse.de> wrote:
> Hello,
>
> On Mon, 9 Mar 2026, 'James Y Knight' via X86-64 System V Application Binary Interface wrote:
> > (Sidenote: that description could be
> > improved, since users *do use* non-standard calling conventions for global
> > functions, and that's fine as long as they meet certain, undocumented,
> > constraints).
>
> Under the usual as-if rules when reading specs or standards I don't think
> it appropriate to spell out all kinds of potential exceptions a user of
> those specs might consider and perhaps use.  If you're outside of the spec
> it's assumed you know what you're doing and do it only in places where it
> can be done :)

I would like to see the spec describe what the actual minimal requirements are when calling through a PLT. For example, it might say something like the following, in place of the current paragraph:
====
It is recommended that all functions use the standard calling sequence described below, but a non-standard calling convention may be used for a given function if the callers and callees agree. However, the following restrictions apply to any jump or call to a destination found via a symbolic relocation to a global symbol, as the symbol may resolve to a PLT or linker stub, instead of directly to the user-defined function:
- %rsp: upon jump/call, must point to correctly-aligned stack space, which may be written to.
- %r11: may be clobbered between caller and first instruction of callee.
- %rFlags DF flag: upon jump/call, must be 0 ("forward").
- %rFlags Status Flags: may be clobbered between caller and first instruction of callee.

All other integer, x87/MMX, and SSE/AVX registers may be used by a non-standard calling convention in whatever manner is desired.
====

> If the compiler/assembler knows the target symbol is definitely local and
> defined (like in your case for inter-function jumps) you could emit a PC32
> reloc, not a PLT32 one, and while at it make it not symbol but section
> based.  The ranges can't be extended then, but that's fine because, as you
> say, the thunk wouldn't work anyway.

That's true, but I don't think this would actually be a viable solution. Firstly, that's not what is emitted by compilers today, so it doesn't help with compatibility with current compilers and object files -- it's still an ABI break. Secondly, teaching the linker how to do section layout such that PC32 text relocations all live within 2GB of their target, while simultaneously allowing the total size to grow beyond 2GB, seems like it would be extremely difficult. At the least, it would be completely different than any section layout algorithms used today.

If we need to take an ABI break anyhow, I think the easier answer would be to teach compilers that r11 can be clobbered by any calls/jumps which involve a symbolic relocation, e.g. jumps to outside of the current section. Compilers already need to do this on AArch64, so this is not a new situation they'd have to deal with.

We'd want to add a statement to the ABI saying so, as well. I don't know how we'd want to caveat the change such that only code built in the "new" mode has this restriction applied, but maybe something like:
====
The following restrictions apply to all jumps or calls which use an R_X86_64_PLT32 relocation in the XXXnewnameXXX code model, as the relocation may resolve to a linker-generated range-extension thunk instead of directly to the user-defined function:
- %r11: may be clobbered between caller and first instruction of callee.
====


H.J. Lu

Mar 10, 2026, 12:14:19 PM
to James Y Knight, Michael Matz, Farid Zakaria, X86-64 System V Application Binary Interface
Can

call *func@GOTPCREL(%rip)

work with multiple GOTs?


--
H.J.

Farid Zakaria

Mar 10, 2026, 1:06:18 PM
to X86-64 System V Application Binary Interface
Sending again, I think I replied to Author instead of all :( 

I created an example for myself to better understand James's point: https://godbolt.org/z/1j5jK77GP

Looking at the problem I think we have a few options for https://gitlab.com/x86-psABIs/x86-64-ABI/-/merge_requests/67 :
1. Augment the ABI for small/medium to restrict %r11 usage even for intra-function branches (e.g. those produced by -fbasic-block-sections=all).
2. Keep the draft as is. The ABI changes merely suggest that text need not be restricted to 2GiB and that the linker is free to employ strategies such as thunks (eventually this works with multiple GOT).
3. Keep the draft and the r11 example as is, but highlight that this works only if r11 is not otherwise in use, and spell out the conditions under which it may be.
4. Update the example to a register-less demo using GOTPCREL, but this then requires multiple-GOT to be folded in now.

What do we think?
I like the idea of the ABI not being overly prescriptive. I kind of like the idea of (3).
I could not find any use of `-fbasic-block-sections` through my cursory search in our codebase, so the required restriction may be enough.

I also wanted to share two other branches I have been working on:

https://github.com/llvm/llvm-project/compare/main...fzakaria:llvm-project:new-code-model
This is the WIP branch of all the changes to a "new code-model". It doesn't incorporate a lot of the recent recommendations outlined here but it explores what multiple-GOT would look like.

https://github.com/llvm/llvm-project/compare/main...fzakaria:llvm-project:eh-frame-relax-reverse
Auto-expand gcc_except_table and eh_frame similar to https://github.com/llvm/llvm-project/pull/179089 in spirit.
This gets rid of a lot of relocation overflow from these sections in a way that works with precompiled code.
We probably don't want to partition gcc_except_table, otherwise we would then have to change libunwind.