[RFC] Redefining the Large Code Model

Arthur Eubanks

May 7, 2026, 1:57:58 PM

to X86-64 System V Application Binary Interface

This RFC is in a Google Doc (https://docs.google.com/document/d/1F-ok3vCRVXuLoRBoM45UZ3eHpNsOZr4K_8AjEfEDPjA) and also pasted here for convenience. Preferably high level comments are posted in this thread and comments about specific details go in the doc, but any way is fine.

For background on the various x86-64 code models, https://eli.thegreenplace.net/2012/01/03/understanding-the-x64-code-models does a great job describing them.

At Google we tend to link our C++ binaries statically, and these keep getting larger over time for various reasons. This means we run into more and more relocation overflows. We've used the medium code model to add some headroom, but that's only a stopgap and we'd like a more permanent solution.

The current x86-64 large code model prevents relocation overflows, but has pretty drastic performance implications. We'd like to change the large code model to produce more efficient instruction sequences both for smaller binaries where relocation overflows don't happen (only pay for what you use) and also larger binaries that have performance hotspots. It seems that there are very few, if any, users of the large code model, so we'd like to improve upon it instead of creating a new code model.

This is an alternative to Meta's proposal that aims to solve the same issue (https://docs.google.com/document/d/1UspcVqzPNg99IDWkLlkVp5NdIYtNk0TENr3kmR_w8uQ), but we believe this is simpler to implement.

Function calls

The thunk discussion in https://groups.google.com/g/x86-64-abi/c/hz28LNnlBEc is exactly what we want for these proposed large code model changes. We can simply emit a normal call instruction that the linker will fix up if necessary, which doesn't have any overhead if the binary is small.
In lld there are some unresolved issues with AArch64 thunk creation, e.g. https://github.com/llvm/llvm-project/issues/61250; we'll make sure to address these.
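
As a rough illustration of the "fix up if necessary" decision, here is a minimal sketch of the reach check a linker would perform for a direct call. The function names are hypothetical and this is not lld's actual thunk logic; it only assumes the standard 5-byte `call rel32` encoding, whose displacement is relative to the end of the instruction.

```python
# Hypothetical sketch of the range check behind range-extension thunks.
# Not lld's implementation; names are made up for illustration.

REL32_MIN = -(1 << 31)
REL32_MAX = (1 << 31) - 1
CALL_REL32_SIZE = 5  # opcode E8 plus a 4-byte signed displacement


def call_displacement(call_addr: int, target_addr: int) -> int:
    """Displacement encoded in `call rel32`, relative to the next instruction."""
    return target_addr - (call_addr + CALL_REL32_SIZE)


def needs_thunk(call_addr: int, target_addr: int) -> bool:
    """True if the target cannot be reached by a direct `call rel32`."""
    disp = call_displacement(call_addr, target_addr)
    return not (REL32_MIN <= disp <= REL32_MAX)
```

In a binary where every call site is within 2GB of its callee, `needs_thunk` is false everywhere and no thunks are emitted, which is the "no overhead if the binary is small" property.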

Data references

In position independent code, instead of the long instruction sequence like https://godbolt.org/z/jcobvaocK, assume that there is a GOT within 2GB of any code in .ltext and emit the small code model extern global instruction sequence
mov rax, qword ptr [rip + i@GOTPCREL]
mov eax, dword ptr [rax]
which is relaxable to
lea rax, [rip + i]
mov eax, dword ptr [rax]
if the RIP-relative distance is encodable. This requires the static linker to split large text sections and emit multiple GOTs in larger binaries. We should emit this instruction sequence even for static globals that are known to be in the same DSO since they may be far away and unrelaxable. The small code model emits
mov eax, dword ptr [rip + i]
in the case of a static global, meaning in smaller binaries with the large code model we only pay an extra lea for static globals over the small code model. In larger binaries this results in extra GOT entries for unrelaxed data accesses and GOT loads, but I believe that's a good tradeoff for "only pay for what you use". This, alongside thunks, also obsoletes the requirement of having a "PIC register" available throughout the function which uses up a register.
To avoid the confusion of having multiple discontiguous .ltext sections with the same name, we can number extras like .ltext.1.
The extra GOTs will only be used for GOTPCREL/GOTPCRELX relocations since those relocations directly give you the address of the relevant GOT entry. Other GOT relocations can be used together in a sequence and we want to ensure that the same GOT is consistently used, so those will all point to the "main" GOT.

If we instead have every GOT-referencing relocation reference a nearby GOT, that may break functions that are split and use multiple GOT-referencing instructions together as they may reference different GOTs. 32-bit relocations aside from GOTPCREL/GOTPCRELX shouldn't be used in this large code model.
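
To make the intended handling of a single GOTPCREL/GOTPCRELX relocation concrete, here is a minimal sketch: relax to lea when the symbol itself is within rel32 reach, otherwise fall back to the closest GOT that is reachable. Everything here is an assumption for illustration: the function names are hypothetical, each GOT is modeled as a single base address rather than a specific entry, and the symbol is assumed to be non-preemptible so that relaxation is legal.

```python
# Hypothetical sketch of resolving one GOTPCREL relocation with multiple GOTs.
# Assumptions: each GOT is modeled by its base address, and the symbol is
# non-preemptible (otherwise the load through the GOT must be kept).

REL32_MIN = -(1 << 31)
REL32_MAX = (1 << 31) - 1


def in_rel32_range(next_insn_addr: int, target_addr: int) -> bool:
    """RIP-relative displacements are measured from the next instruction."""
    return REL32_MIN <= target_addr - next_insn_addr <= REL32_MAX


def resolve_gotpcrel(next_insn_addr, symbol_addr, got_bases):
    """Return ("lea", None) if relaxable, else ("got", index of nearest GOT)."""
    if in_rel32_range(next_insn_addr, symbol_addr):
        return ("lea", None)
    reachable = [
        (abs(base - next_insn_addr), index)
        for index, base in enumerate(got_bases)
        if in_rel32_range(next_insn_addr, base)
    ]
    if not reachable:
        raise ValueError("no GOT within reach; text must be partitioned further")
    return ("got", min(reachable)[1])
```

The error case corresponds to the linker having failed to keep a GOT within 2GB of every piece of .ltext, which is exactly the invariant the partitioning is meant to guarantee.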

With multiple GOTs we'd like multiple RELRO segments

Currently it's unclear as to whether or not the spec allows multiple RELRO segments. This proposes that we explicitly allow multiple RELRO segments.
Various parts of the ecosystem would need to be updated to support multiple RELRO segments, the major one being glibc's dynamic loader (https://sourceware.org/git/?p=glibc.git;a=blob;f=elf/dl-reloc.c;h=26a1e7adfc4525525a1af8f8fa193dfa9e6b173b;hb=HEAD#l351).

This has implementation implications when paired with thunks since both are generating code/data that needs to be inserted every 2/4GB; I don't have a prototype yet to see how problematic this is but I will be working on a prototype.
https://github.com/llvm/llvm-project/pull/174508#issuecomment-3761064027 claims that CHERI supports multiple GOTs and CheriBSD supports multiple RELRO segments.

Interoperability with precompiled small code model libraries

Most of the world does not run into humongous binaries and so compiles their code with the default small code model. We'd like to make precompiled small code model libraries compatible with large code model built-from-source code.
The small code model puts code and data in non-SHF_X86_64_LARGE sections like .text, .data, etc. The large code model (and medium code model for globals above a certain size) puts data in SHF_X86_64_LARGE sections like .ldata. This allows some level of compatibility between small and large code model object files. We'd like to place large code model code in a SHF_X86_64_LARGE .ltext section which is laid out far away from the "small" data/text sections, meaning it does not contribute to relocation pressure of small code model object files linked into the binary.

Clang actually already does this as of https://github.com/llvm/llvm-project/pull/73037, and lld properly lays out .ltext as of https://github.com/llvm/llvm-project/pull/70358, but we'd like to standardize this.
Currently the layout of sections for PIC binaries as laid out by lld is
.ltext .lrodata .rodata .text .data .bss .ldata .lbss
Ideally .ltext is placed somewhere between .lrodata and .ldata to maximize the number of relaxed data accesses, rather than e.g. before .lrodata where the distance to .ldata is further. Perhaps placing it after .bss and before .ldata will minimize the number of segments, since placing it between .rodata and .lrodata will split the RO segment?
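
To illustrate the placement argument, here is a small sketch that computes the worst-case distance between .ltext and a data section for two layouts. The section sizes are made-up numbers in MiB, alignment and segment padding are ignored, and the helper names are hypothetical; only the two section orders come from the discussion above.

```python
# Hypothetical sketch: worst-case distance from .ltext to a data section.
# Sizes are invented (in MiB); alignment and segment padding are ignored.

CURRENT_LAYOUT = [".ltext", ".lrodata", ".rodata", ".text",
                  ".data", ".bss", ".ldata", ".lbss"]
ALTERNATIVE_LAYOUT = [".lrodata", ".rodata", ".text", ".data",
                      ".bss", ".ltext", ".ldata", ".lbss"]
EXAMPLE_SIZES_MIB = {".ltext": 1000, ".lrodata": 500, ".rodata": 100,
                     ".text": 200, ".data": 100, ".bss": 100,
                     ".ldata": 800, ".lbss": 200}


def section_ranges(layout, sizes):
    """Assign each section a [start, end) range by laying them out in order."""
    ranges = {}
    addr = 0
    for name in layout:
        ranges[name] = (addr, addr + sizes[name])
        addr += sizes[name]
    return ranges


def worst_case_distance(layout, sizes, src, dst):
    """Largest distance between any byte of src and any byte of dst."""
    ranges = section_ranges(layout, sizes)
    (s_lo, s_hi), (d_lo, d_hi) = ranges[src], ranges[dst]
    return max(abs(d_hi - s_lo), abs(s_hi - d_lo))
```

With these example sizes, moving .ltext after .bss shortens the worst-case reach to .ldata but lengthens the reach to .lrodata, so the best placement depends on where the data actually lives.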

Performance of large binaries

We'd like to keep the small code model data access patterns for built-from-source hot code. This means that rather than the code model unconditionally affecting an entire TU, we'd like the option to choose which individual symbols are "small" or "large" in the compiler. As long as the compiler can see that less than 2GB of the binary is hot, it can mark all of those symbols as small.
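
One way the per-symbol choice could look is sketched below. This is purely an assumption about how such a pass might work: it presumes per-symbol sizes and a profile-derived hotness score are available, and uses a simple greedy pick under a 2GB budget. The function name and the input format are hypothetical.

```python
# Hypothetical sketch: mark the hottest symbols "small" while their total
# size stays under 2GB; everything else is "large". Greedy selection and
# the (name, size, hotness) input format are assumptions for illustration.

SMALL_BUDGET = 1 << 31  # 2GB


def classify_symbols(symbols, budget=SMALL_BUDGET):
    """symbols: iterable of (name, size_in_bytes, hotness). Returns name -> class."""
    result = {}
    used = 0
    for name, size, _hotness in sorted(symbols, key=lambda s: s[2], reverse=True):
        if used + size < budget:
            result[name] = "small"
            used += size
        else:
            result[name] = "large"
    return result
```

For example, with a toy budget of 16 bytes, `classify_symbols([("a", 10, 5.0), ("b", 10, 1.0), ("c", 5, 3.0)], budget=16)` keeps the two hottest symbols small and marks the coldest one large.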

Linker scripts

Linker scripts currently don't really fit the model of partitioning a section an unknown number of times; there are a couple of proposed solutions.
The linker itself needs some special handling of partitioned text/GOT sections since we have to associate GOTPCREL relocations with the correct GOT.
One solution to fit this into GNU ld's default linker script is to introduce a new linker script directive PARTITION_WITH_GOTS like
.ltext : { PARTITION_WITH_GOTS(*.ltext) }
which from a linker script point of view maps all input .ltext sections to one .ltext output section; this "virtual" .ltext section may actually be multiple .ltext.num/.lgot.num sections in the final binary.

lld doesn't have an explicit default linker script; rather it's implemented in code

It may also be possible to explicitly list out some N ltext/GOT sections, saying that the first 2GB of .ltext goes in .ltext.0, etc
.got.ltext.0 : { ... }
.ltext.0 : { PARTITION(*.ltext, 0, 2GB, .got.ltext.0) }
.got.ltext.1 : { ... }
.ltext.1 : { PARTITION(*.ltext, 2GB, 4GB, .got.ltext.1) }
... (up to some arbitrary N)

This is pretty hacky and unprincipled

Worst case we can claim that linker scripts don't work with partitioned text GOTs and that the implementation has to be done within the linker itself
We're still investigating other solutions here
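
Whichever syntax is chosen, the partitioning step itself could be sketched roughly as below: walk the input .ltext sections in link order and start a new chunk whenever the next section would push the current chunk past the limit. This is a hypothetical sketch, not an existing linker feature; the function names are made up, and a real limit would need headroom below 2GB for the GOT itself and for any thunks. The output names follow the .got.ltext.N/.ltext.N numbering used in the example above.

```python
# Hypothetical sketch of partitioning input .ltext sections into chunks,
# each paired with its own GOT. Not an existing GNU ld or lld feature.

CHUNK_LIMIT = 1 << 31  # assumption: a real linker would leave headroom


def partition_sections(section_sizes, limit=CHUNK_LIMIT):
    """section_sizes: list of (name, size) in link order. Returns list of chunks."""
    chunks = []
    current = []
    used = 0
    for name, size in section_sizes:
        if size > limit:
            raise ValueError(f"{name} alone exceeds the chunk limit")
        if current and used + size > limit:
            chunks.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        chunks.append(current)
    return chunks


def output_names(chunks):
    """Pair each chunk with numbered GOT and text output section names."""
    return [(f".got.ltext.{i}", f".ltext.{i}") for i in range(len(chunks))]
```

A binary whose .ltext fits in a single chunk ends up with exactly one GOT/text pair, so small binaries see no extra sections.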

TLS

As far as I'm aware, people are not running into TLS relocation overflows, so no change there.

Compatibility with the current large code model

The current large code model's instruction sequences are compatible with the new proposed large code model layout, so in the rare case there are large code model object files lying around, they will continue to link.
The main change in terms of compatibility is that currently large code model text goes in .text but will go in .ltext with this proposal, which won't be an issue. The previous large code model assumed that all code linked into the binary was built with the large code model, and the .text/.ltext separation is a relaxation of that assumption.

Replacing the existing large code model

This proposal assumes that nobody is depending on specifics of the current implementation of the large code model. If this isn't true, this proposal can be a new code model instead of replacing the existing large code model.

In summary, to prevent relocation overflows once and for all while keeping reasonable performance, we'd like to change the large code model. We want it to perform better, especially in cases where the final linked binary ends up being smaller, and be more compatible with prebuilt small code model code. The main proposed changes are thunks and partitioning .ltext with GOTs. If there is consensus that this makes sense, I'll send out a change to https://gitlab.com/x86-psABIs/x86-64-ABI updating the specification for the large code model.

Discussion is greatly appreciated, alongside a general "yes this is a good idea"/"no this is a bad idea".

H.J. Lu

May 7, 2026, 5:23:59 PM

to Arthur Eubanks, X86-64 System V Application Binary Interface

On Fri, May 8, 2026 at 1:57 AM 'Arthur Eubanks' via X86-64 System V
Application Binary Interface <x86-6...@googlegroups.com> wrote:
>
> This RFC is in a Google Doc (https://docs.google.com/document/d/1F-ok3vCRVXuLoRBoM45UZ3eHpNsOZr4K_8AjEfEDPjA) and also pasted here for convenience. Preferably high level comments are posted in this thread and comments about specific details go in the doc, but any way is fine.
>
>
> For background on the various x86-64 code models, https://eli.thegreenplace.net/2012/01/03/understanding-the-x64-code-models does a great job describing them.
>
>
> At Google we tend to link our C++ binaries statically, and these keep getting larger over time for various reasons. This means we run into more and more relocation overflows. We've used the medium code model to add some headroom, but that's only a stopgap and we'd like a more permanent solution.
>
>
> The current x86-64 large code model prevents relocation overflows, but has pretty drastic performance implications. We'd like to change the large code model to produce more efficient instruction sequences both for smaller binaries where relocation overflows don't happen (only pay for what you use) and also larger binaries that have performance hotspots. It seems that there are very few, if any, users of the large code model, so we'd like to improve upon it instead of creating a new code model.
>
>
> This is an alternative to Meta's proposal that aims to solve the same issue (https://docs.google.com/document/d/1UspcVqzPNg99IDWkLlkVp5NdIYtNk0TENr3kmR_w8uQ), but we believe this is simpler to implement.

I really like this approach.

>
> Function calls
>
> The thunk discussion in https://groups.google.com/g/x86-64-abi/c/hz28LNnlBEc is exactly what we want for these proposed large code model changes. We can simply emit a normal call instruction that the linker will fix up if necessary, which doesn't have any overhead if the binary is small.
>
> In lld there are some unresolved issues with AArch64 thunks creation, e.g. https://github.com/llvm/llvm-project/issues/61250; we'll make sure to address these.
>
> Data references
>
> In position independent code, instead of the long instruction sequence like https://godbolt.org/z/jcobvaocK, assume that there is a GOT within 2GB of any code in .ltext and emit the small code model extern global instruction sequence

I think in this large code model, call over GOT should apply to both
PIE and PDE.

> mov rax, qword ptr [rip + i@GOTPCREL]
> mov eax, dword ptr [rax]
> which is relaxable to
> lea rax, [rip + i]
> mov eax, dword ptr [rax]
> if the RIP-relative distance is encodable. This requires the static linker to split large text sections and emit multiple GOTs in larger binaries. We should emit this instruction sequence even for static globals that are known to be in the same DSO since they may be far away and unrelaxable. The small code model emits
> mov eax, dword ptr [rip + i]
> in the case of a static global, meaning in smaller binaries with the large code model we only pay an extra lea for static globals over the small code model. In larger binaries this results in extra GOT entries for unrelaxed data accesses and GOT loads, but I believe that's a good tradeoff for "only pay for what you use". This, alongside thunks, also obsoletes the requirement of having a "PIC register" available throughout the function which uses up a register.
>
> To avoid the confusion of having multiple discontiguous .ltext sections with the same name, we can number extras like .ltext.1.
>
> The extra GOTs will only be used for GOTPCREL/GOTPCRELX relocations since those relocations directly give you the address of the relevant GOT entry. Other GOT relocations can be used together in a sequence and we want to ensure that the same GOT is consistently used, so those will all point to the "main" GOT.
>

Two GOTPCREL relocations are needed: one for instructions that can't
be relaxed and one for instructions that can. Should we add new GOTPCREL
relocations to tell the linker to prepare for multiple GOTs, so that
unsupported linkers won't generate corrupted output by accident?

> If we instead have every GOT-referencing relocation reference a nearby GOT, that may break functions that are split and use multiple GOT-referencing instructions together as they may reference different GOTs. 32-bit relocations aside from GOTPCREL/GOTPCRELX shouldn't be used in this large code model.

Does it mean that all access to symbols, data or functions, must go through
GOT? What about labels/symbols local to the function?

> With multiple GOTs we'd like multiple RELRO segments
>
> Currently it's unclear as to whether or not the spec allows multiple RELRO segments. This proposes that we explicitly allow multiple RELRO segments.
>
> Various parts of the ecosystem would need to be updated to support multiple RELRO segments, the major one being glibc's dynamic loader (https://sourceware.org/git/?p=glibc.git;a=blob;f=elf/dl-reloc.c;h=26a1e7adfc4525525a1af8f8fa193dfa9e6b173b;hb=HEAD#l351).
>
> This has implementation implications when paired with thunks since both are generating code/data that needs to be inserted every 2/4GB; I don't have a prototype yet to see how problematic this is but I will be working on a prototype.
>
> https://github.com/llvm/llvm-project/pull/174508#issuecomment-3761064027 claims that CHERI supports multiple GOTs and CheriBSD supports multiple RELRO segments.
>
> Interoperability with precompiled small code model libraries
>
> Most of the world does not run into humongous binaries and so compiles their code with the default small code model. We'd like to make precompiled small code model libraries compatible with large code model built-from-source code.
>
> The small code model puts code and data in non-SHF_X86_64_LARGE sections like .text, .data, etc. The large code model (and medium code model for globals above a certain size) puts data in SHF_X86_64_LARGE sections like .ldata. This allows some level of compatibility between small and large code model object files. We'd like to place large code model code in a SHF_X86_64_LARGE .ltext section which is laid out far away from the "small" data/text sections, meaning it does not contribute to relocation pressure of small code model object files linked into the binary.
>
> Clang actually already does this as of https://github.com/llvm/llvm-project/pull/73037, and lld properly lays out .ltext as of https://github.com/llvm/llvm-project/pull/70358, but we'd like to standardize this.
>
> Currently the layout of sections for PIC binaries as laid out by lld is
> .ltext .lrodata .rodata .text .data .bss .ldata .lbss
>
> Ideally .ltext is placed somewhere between .lrodata and .ldata to maximize the number of relaxed data accesses, rather than e.g. before .lrodata where the distance to .ldata is further. Perhaps placing it after .bss and before .ldata will minimize the number of segments, since placing it between .rodata and .lrodata will split the RO segment?

We can fine tune the code layout later.

> Performance of large binaries
>
> We'd like to keep the small code model data access patterns for built-from-source hot code. This means that rather than the code model unconditionally affecting an entire TU, we'd like the option to choose which individual symbols are "small" or "large" in the compiler. As long as the compiler can see that less than 2GB of the binary is hot, it can mark all of those symbols as small.

I think new relocations may help here.

> Linker scripts
>
> Linker scripts currently don't really fit the model of partitioning a section an unknown number of times, there are a couple of proposed solutions
>
> The linker itself needs some special handling of partitioned text/GOT sections since we have to associate GOTPCREL relocations with the correct GOT
>
> One solution to fit this into GNU ld's default linker script is to introduce a new linker script directive PARTITION_WITH_GOTS like
> .ltext : { PARTITION_WITH_GOTS(*.ltext) }
> which from a linker script point of view maps all input .ltext sections to one .ltext output section; this "virtual" .ltext section may actually be multiple .ltext.num/.lgot.num sections in the final binary.
>
> lld doesn't have an explicit default linker script; rather it's implemented in code
>
> It may also be possible to explicitly list out some N ltext/GOT sections, saying that the first 2GB of .ltext goes in .ltext.0, etc
> .got.ltext.0 : { ... }
> .ltext.0 : { PARTITION(*.ltext, 0, 2GB, .got.ltext.0) }
> .got.ltext.1 : { ... }
> .ltext.1 : { PARTITION(*.ltext, 2GB, 4GB, .got.ltext.1) }
> ... (up to some arbitrary N)
>
> This is pretty hacky and unprincipled

We can deal with it later when implementing this in BFD linker.

> Worst case we can claim that linker scripts don't work with partitioned text GOTs and that the implementation has to be done within the linker itself
>
> We're still investigating other solutions here
>
> TLS
>
> As far as I'm aware, people are not running into TLS relocation overflows, so no change there.
>
> Compatibility with the current large code model
>
> The current large code model's instruction sequences are compatible with the new proposed large code model layout, so in the rare case there are large code model object files laying around, they will continue to link.

Since glibc doesn't support the current large code model anyway:

https://bugzilla.redhat.com/show_bug.cgi?id=1713891

we don't need to consider compatibility with the current large code model.

> The main change in terms of compatibility is that currently large code model text goes in .text but will go in .ltext with this proposal, which won't be an issue. The previous large code model assumed that all code linked into the binary was built with the large code model, and the .text/.ltext separation is a relaxation of that assumption.
>
> Replacing the existing large code model
>
> This proposal assumes that nobody is depending on specifics of the current implementation of the large code model. If this isn't true, this proposal can be a new code model instead of replacing the existing large code model.
>

That is true since glibc doesn't support it.

>
> In summary, in order to for once and for all prevent relocation overflows but with reasonable performance we'd like to change the large code model. We want it to perform better, especially in cases where the final linked binary ends up being smaller, and be more compatible with prebuilt small code model code. The main proposed changes are thunks and partitioning .ltext with GOTs. If there is consensus that this makes sense, I'll send out a change to https://gitlab.com/x86-psABIs/x86-64-ABI updating the specification for the large code model.
>
>
> Discussion is greatly appreciated, alongside a general "yes this is a good idea"/"no this is a bad idea".
>

Yes, this is an excellent idea.

--
H.J.

H.J. Lu

May 8, 2026, 10:49:00 PM

to Arthur Eubanks, X86-64 System V Application Binary Interface

One more thing. The C/C++ small model run-time libraries are
incompatible with the current large model. Will the new large
model be compatible with the C/C++ small model run-time libraries?

--
H.J.

Fangrui Song

May 11, 2026, 3:27:08 AM

to X86-64 System V Application Binary Interface

On Thursday, May 7, 2026 at 2:23:59 PM UTC-7 hjl....@gmail.com wrote:

On Fri, May 8, 2026 at 1:57 AM 'Arthur Eubanks' via X86-64 System V
Application Binary Interface <x86-6...@googlegroups.com> wrote:
>
> This RFC is in a Google Doc (https://docs.google.com/document/d/1F-ok3vCRVXuLoRBoM45UZ3eHpNsOZr4K_8AjEfEDPjA) and also pasted here for convenience. Preferably high level comments are posted in this thread and comments about specific details go in the doc, but any way is fine.
>
>
> For background on the various x86-64 code models, https://eli.thegreenplace.net/2012/01/03/understanding-the-x64-code-models does a great job describing them.
>
>
> At Google we tend to link our C++ binaries statically, and these keep getting larger over time for various reasons. This means we run into more and more relocation overflows. We've used the medium code model to add some headroom, but that's only a stopgap and we'd like a more permanent solution.
>
>
> The current x86-64 large code model prevents relocation overflows, but has pretty drastic performance implications. We'd like to change the large code model to produce more efficient instruction sequences both for smaller binaries where relocation overflows don't happen (only pay for what you use) and also larger binaries that have performance hotspots. It seems that there are very few, if any, users of the large code model, so we'd like to improve upon it instead of creating a new code model.
>
>
> This is an alternative to Meta's proposal that aims to solve the same issue (https://docs.google.com/document/d/1UspcVqzPNg99IDWkLlkVp5NdIYtNk0TENr3kmR_w8uQ), but we believe this is simpler to implement.

I really like this approach.

I like the range-extension-thunks + multi-GOT + multi-RELRO direction. It solves the two reach problems (.text to .text via thunks, .text to data via partitioned GOTs) while keeping the "only pay for what you use" property.

One area the proposal doesn't cover: .text to .eh_frame/.eh_frame_hdr.

>
> Function calls
>
> The thunk discussion in https://groups.google.com/g/x86-64-abi/c/hz28LNnlBEc is exactly what we want for these proposed large code model changes. We can simply emit a normal call instruction that the linker will fix up if necessary, which doesn't have any overhead if the binary is small.
>
> In lld there are some unresolved issues with AArch64 thunks creation, e.g. https://github.com/llvm/llvm-project/issues/61250; we'll make sure to address these.

I believe H.J. has previously floated using call *foo@GOTPCREL(%rip) (6 bytes, indirecting through a .got.plt slot) in place of thunks -- essentially -fno-plt extended to all calls, including local ones.

It trades one byte per callsite plus an indirect-branch predictor cost against not needing the linker to materialize and place thunks, and shifts the model from "pay-when-needed" (thunks only appear once reach is actually exceeded) to "pay-always at every callsite."

AFAIK nobody has actually compared the two on a realistic workload; it would be worth measuring before settling on thunks.
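
For the size side of that comparison, the arithmetic can be sketched as below. The instruction sizes are the standard encodings (5 bytes for `call rel32`, 6 bytes for the RIP-relative indirect call). The 8-byte slot per distinct callee and the absence of any linker relaxation back to a direct call are assumptions, and the function name is hypothetical. This says nothing about the branch predictor cost, which would need measurement.

```python
# Hypothetical size estimate for "indirect call at every callsite".
# Assumes one 8-byte GOT slot per distinct callee and no linker relaxation.

DIRECT_CALL_SIZE = 5    # call rel32: opcode E8 + 4-byte displacement
INDIRECT_CALL_SIZE = 6  # call *foo@GOTPCREL(%rip): FF 15 + 4-byte displacement
GOT_SLOT_SIZE = 8


def indirect_call_overhead(num_callsites, num_distinct_callees):
    """Return (extra code bytes, GOT slot bytes) versus direct calls."""
    code_bytes = num_callsites * (INDIRECT_CALL_SIZE - DIRECT_CALL_SIZE)
    slot_bytes = num_distinct_callees * GOT_SLOT_SIZE
    return code_bytes, slot_bytes
```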

Yes, also musl's dynamic loader.

>
> This has implementation implications when paired with thunks since both are generating code/data that needs to be inserted every 2/4GB; I don't have a prototype yet to see how problematic this is but I will be working on a prototype.
>
> https://github.com/llvm/llvm-project/pull/174508#issuecomment-3761064027 claims that CHERI supports multiple GOTs and CheriBSD supports multiple RELRO segments.
>
> Interoperability with precompiled small code model libraries
>
> Most of the world does not run into humongous binaries and so compiles their code with the default small code model. We'd like to make precompiled small code model libraries compatible with large code model built-from-source code.
>
> The small code model puts code and data in non-SHF_X86_64_LARGE sections like .text, .data, etc. The large code model (and medium code model for globals above a certain size) puts data in SHF_X86_64_LARGE sections like .ldata. This allows some level of compatibility between small and large code model object files. We'd like to place large code model code in a SHF_X86_64_LARGE .ltext section which is laid out far away from the "small" data/text sections, meaning it does not contribute to relocation pressure of small code model object files linked into the binary.
>
> Clang actually already does this as of https://github.com/llvm/llvm-project/pull/73037, and lld properly lays out .ltext as of https://github.com/llvm/llvm-project/pull/70358, but we'd like to standardize this.
>
> Currently the layout of sections for PIC binaries as laid out by lld is
> .ltext .lrodata .rodata .text .data .bss .ldata .lbss
>
> Ideally .ltext is placed somewhere between .lrodata and .ldata to maximize the number of relaxed data accesses, rather than e.g. before .lrodata where the distance to .ldata is further. Perhaps placing it after .bss and before .ldata will minimize the number of segments, since placing it between .rodata and .lrodata will split the RO segment?

We can fine tune the code layout later.

> Performance of large binaries
>
> We'd like to keep the small code model data access patterns for built-from-source hot code. This means that rather than the code model unconditionally affecting an entire TU, we'd like the option to choose which individual symbols are "small" or "large" in the compiler. As long as the compiler can see that less than 2GB of the binary is hot, it can mark all of those symbols as small.

I think new relocations may help here.

Do we really need new GOTPCREL relocations? Unpatched linkers will run into relocation overflow issues and report errors, as expected.

> Linker scripts
>
> Linker scripts currently don't really fit the model of partitioning a section an unknown number of times, there are a couple of proposed solutions
>
> The linker itself needs some special handling of partitioned text/GOT sections since we have to associate GOTPCREL relocations with the correct GOT
>
> One solution to fit this into GNU ld's default linker script is to introduce a new linker script directive PARTITION_WITH_GOTS like
> .ltext : { PARTITION_WITH_GOTS(*.ltext) }
> which from a linker script point of view maps all input .ltext sections to one .ltext output section; this "virtual" .ltext section may actually be multiple .ltext.num/.lgot.num sections in the final binary.
>
> lld doesn't have an explicit default linker script; rather it's implemented in code
>
> It may also be possible to explicitly list out some N ltext/GOT sections, saying that the first 2GB of .ltext goes in .ltext.0, etc
> .got.ltext.0 : { ... }
> .ltext.0 : { PARTITION(*.ltext, 0, 2GB, .got.ltext.0) }
> .got.ltext.1 : { ... }
> .ltext.1 : { PARTITION(*.ltext, 2GB, 4GB, .got.ltext.1) }
> ... (up to some arbitrary N)
>
> This is pretty hacky and unprincipled

We can deal with it later when implementing this in BFD linker.

A real design pass on linker scripts is overdue here.

INSERT AFTER / INSERT BEFORE already exist in GNU ld and lld and are exactly the kind of composable primitive worth leaning on.

Stepping back, this is my biggest concern about both proposals: they risk introducing knobs and defaults to lld that overfit company-specific binary shapes.

Meta's --sort-text-by-code-model is the clearest example: it is tuned for their binaries and doesn't fit into upstream lld.

Google's .ltext placement reasoning similarly bakes in an assumption about which segment cost dominates.

lld shouldn't be where workload-specific layout heuristics live. Prefer principled, composable primitives users wire up themselves -- and make .text.N/.got.N placement scriptable -- over shipping defaults that quietly assume one company's binary shape.

> Worst case we can claim that linker scripts don't work with partitioned text GOTs and that the implementation has to be done within the linker itself
>
> We're still investigating other solutions here
>
> TLS
>
> As far as I'm aware, people are not running into TLS relocation overflows, so no change there.
>
> Compatibility with the current large code model
>
> The current large code model's instruction sequences are compatible with the new proposed large code model layout, so in the rare case there are large code model object files laying around, they will continue to link.

Since glibc doesn't support the current large code model anyway:

https://bugzilla.redhat.com/show_bug.cgi?id=1713891

we don't need to consider compatibility with the current large code model.

The GNU ld failure is in crtbeginS.o -- a GCC object file built with the small code model with a tight data reference.

This example actually links with lld, which places .rodata before .text.

> The main change in terms of compatibility is that currently large code model text goes in .text but will go in .ltext with this proposal, which won't be an issue. The previous large code model assumed that all code linked into the binary was built with the large code model, and the .text/.ltext separation is a relaxation of that assumption.
>
> Replacing the existing large code model
>
> This proposal assumes that nobody is depending on specifics of the current implementation of the large code model. If this isn't true, this proposal can be a new code model instead of replacing the existing large code model.
>

That is true since glibc doesn't support it.

That said, I do agree that the current large code model is effectively unusable in practice, and redesigning it is the right call.

Farid Zakaria

May 11, 2026, 4:15:36 PM

to X86-64 System V Application Binary Interface

Hey Arthur!

Nice to see the RFC :)
I am traveling abroad so please excuse any omissions/mistakes in my response but I wanted to share some early feedback and early thoughts.

1. I continue to be confused and surprised by the performance of the large code model. I've reached the limit of my hardware knowledge, but I would be interested to know whether others have seen the point at which performance finally suffers. We've observed a <2% CPU performance difference on various workloads without any new relaxations. I also explored new large code model relaxations and got the difference potentially even lower.

2. It might be good for each of us to add a section on how we think our proposals differ. I actually think they are nearly identical where it matters. IIRC, the only difference was that at Meta we had thought to have a single text section, since with thunks there is no "large" code type any longer, and to intersperse the GOTs from the start, unifying the code. After having come across a few bugs in lld, bolt & GCC where the lack of support for prefixing a section with "l" caused bugs, I am more in favor of this unification, but I think it's ultimately minor. I am motivated to help us support larger binaries and am willing & happy to consolidate where it makes sense & as needed. FWIW, unifying the text and having multiple GOTs from the start means we no longer need to segment the binary into "hot/small" and "cold/large" areas.

3. We discussed that the thunks require %r11 as a new scratch register even for calls within the same function. Does that potentially break ABI usage with previously compiled objects?

4. I noticed now we are mainly suffering from relocation overflows from .gcc_except_table -> .data since these two sections tend to bookend our binaries. gcc_except_table and eh_frame both include encodings that are merged from the TUs that comprise the binary. If any of these objects are small-code, then they are 4-bytes and restricted to the same 2GiB offset potentially. I've been pushing & exploring an approach to automatically expand all entries as needed to 8 bytes (outdated but relevant: https://github.com/llvm/llvm-project/compare/main...fzakaria:llvm-project:eh-frame-relax-reverse?expand=1). Without this functionality and similar support for eh_frame_hdr (https://github.com/llvm/llvm-project/pull/179089), it makes re-using small code model objects a non-starter.

Arthur Eubanks

May 11, 2026, 5:21:11 PM

to H.J. Lu, X86-64 System V Application Binary Interface

> I think in this large code model, call over GOT should apply to both
> PIE and PDE.

Do you mean we shouldn't have a separate PDE large code model? I agree with that, if you're using the large code model, you're already not getting maximal performance in all cases.

> mov rax, qword ptr [rip + i@GOTPCREL]
> mov eax, dword ptr [rax]
> which is relaxable to
> lea rax, [rip + i]
> mov eax, dword ptr [rax]
> if the RIP-relative distance is encodable. This requires the static linker to split large text sections and emit multiple GOTs in larger binaries. We should emit this instruction sequence even for static globals that are known to be in the same DSO since they may be far away and unrelaxable. The small code model emits
> mov eax, dword ptr [rip + i]
> in the case of a static global, meaning in smaller binaries with the large code model we only pay an extra lea for static globals over the small code model. In larger binaries this results in extra GOT entries for unrelaxed data accesses and GOT loads, but I believe that's a good tradeoff for "only pay for what you use". This, alongside thunks, also obsoletes the requirement of having a "PIC register" available throughout the function which uses up a register.
>
> To avoid the confusion of having multiple discontiguous .ltext sections with the same name, we can number extras like .ltext.1.
>
> The extra GOTs will only be used for GOTPCREL/GOTPCRELX relocations since those relocations directly give you the address of the relevant GOT entry. Other GOT relocations can be used together in a sequence and we want to ensure that the same GOT is consistently used, so those will all point to the "main" GOT.
>

> 2 GOTPCREL relocations are needed. One is for instructions which can't
> be relaxed and the other can. Should we add new GOTPCREL relocations
> to tell the linker to prepare for multiple GOT so that the unsupported
> linkers won't generate corrupted outputs by accident?

As maskray said, we won't have corrupted outputs, only relocation overflows. It does seem nice to have a way to indicate that only certain GOTPCREL relocations may require new GOTs. I'm unsure how necessary this is though, as we could either define GOTPCREL to point to a nearby GOT (main GOT if binary is small enough) and not introduce new relocations, or we could define GOTPCREL to point to the "main" GOT and a new GOTPCREL2 relocation to point to a nearby GOT (main GOT if binary is small enough). I don't see a huge benefit in new relocations, but perhaps prototyping this would give guidance.

If we do create new relocations, they also need REX/REX2 variants?

> If we instead have every GOT-referencing relocation reference a nearby GOT, that may break functions that are split and use multiple GOT-referencing instructions together as they may reference different GOTs. 32-bit relocations aside from GOTPCREL/GOTPCRELX shouldn't be used in this large code model.

> Does it mean that all access to symbols, data or functions, must go through
> GOT? What about labels/symbols local to the function?

We shouldn't worry about an individual section being larger than 2GB (at least for now), so references to labels/symbols within a section can be like the small code model. All other references yes must go through the GOT.

> Compatibility with the current large code model
>
> The current large code model's instruction sequences are compatible with the new proposed large code model layout, so in the rare case there are large code model object files laying around, they will continue to link.

> Since glibc doesn't support the current large code model anyway:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1713891
>
> we don't need to consider compatibility with the current large code model.

As maskray mentioned, this is because non-large data is taking up too much space which doesn't align with the small code model. I think the actual failure mode with the current large code model is slightly different. If you have too much text, which the current large code model puts in .text, then the small code model crtbeginS.o may have relocation overflows reaching across .text. The part of the proposal about compatibility with precompiled small code model code (e.g. crtbeginS.o) should address this by putting large code model text in .ltext, making sure small code model sections are placed contiguously, unaffected by linked in large code model symbols, and therefore don't span more than 2GB.

> One more thing. The C/C++ small model run-time libraries are
> incompatible with the current large model. Will the new large
> model be compatible with the C/C++ small model run-time libraries?

If I understand your question correctly, the part of the proposal about compatibility with precompiled small code model code should address this?

Arthur Eubanks

May 11, 2026, 7:13:44 PM

to Fangrui Song, X86-64 System V Application Binary Interface

> One area the proposal doesn't cover: .text to .eh_frame/.eh_frame_hdr

eh_frame/eh_frame_hdr allow 8 byte values, even though most producers emit 4 byte values. I was under the impression that since the spec allowed 8 byte values, we don't need any spec changes but need to fix up producers. Is there anything else I'm missing?

https://refspecs.linuxfoundation.org/LSB_1.3.0/gLSB/gLSB/ehframehdr.html
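
As a rough illustration of the 4-byte versus 8-byte choice a producer faces, the sketch below picks a PC-relative pointer encoding based on whether the distance fits in 32 bits. The DW_EH_PE_* values are the standard encoding constants; pick_eh_encoding is a hypothetical helper for illustration only, not something taken from lld, GNU ld, or any compiler.

```c
#include <stdint.h>

/* Standard DWARF EH pointer-encoding constants. */
enum {
    DW_EH_PE_sdata4 = 0x0b, /* signed 4-byte value */
    DW_EH_PE_sdata8 = 0x0c, /* signed 8-byte value */
    DW_EH_PE_pcrel  = 0x10  /* value is relative to the field's address */
};

/* Hypothetical helper: choose the narrowest PC-relative signed encoding
   able to represent the distance from field_addr to target_addr. */
unsigned pick_eh_encoding(uint64_t field_addr, uint64_t target_addr)
{
    int64_t delta = (int64_t)(target_addr - field_addr);
    if (delta >= INT32_MIN && delta <= INT32_MAX)
        return DW_EH_PE_pcrel | DW_EH_PE_sdata4;
    return DW_EH_PE_pcrel | DW_EH_PE_sdata8;
}
```

A producer that cannot know final addresses would have to make this choice conservatively, or leave it to the linker.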

>
> Function calls
>
> The thunk discussion in https://groups.google.com/g/x86-64-abi/c/hz28LNnlBEc is exactly what we want for these proposed large code model changes. We can simply emit a normal call instruction that the linker will fix up if necessary, which doesn't have any overhead if the binary is small.
>
> In lld there are some unresolved issues with AArch64 thunks creation, e.g. https://github.com/llvm/llvm-project/issues/61250; we'll make sure to address these.

> I believe H.J. has previously floated using call *foo@GOTPCREL(%rip) (6 bytes, indirecting through a .got.plt slot) in place of thunks -- essentially -fno-plt extended to all calls, including local ones.
> It trades one byte per callsite plus an indirect-branch predictor cost against not needing the linker to materialize and place thunks, and shifts the model from "pay-when-needed" (thunks only appear once reach is actually exceeded) to "pay-always at every callsite."
>
> AFAIK nobody has actually compared the two on a realistic workload; it would be worth measuring before settling on thunks.

Yes, we should do this experiment. We can do it on a smaller binary to get the overhead, but to test on a larger binary we'd need this proposal prototyped.

> Linker scripts
>
> Linker scripts currently don't really fit the model of partitioning a section an unknown number of times, there are a couple of proposed solutions
>
> The linker itself needs some special handling of partitioned text/GOT sections since we have to associate GOTPCREL relocations with the correct GOT
>
> One solution to fit this into GNU ld's default linker script is to introduce a new linker script directive PARTITION_WITH_GOTS like
> .ltext : { PARTITION_WITH_GOTS(*.ltext) }
> which from a linker script point of view maps all input .ltext sections to one .ltext output section; this "virtual" .ltext section may actually be multiple .ltext.num/.lgot.num sections in the final binary.
>
> lld doesn't have an explicit default linker script; rather it's implemented in code
>
> It may also be possible to explicitly list out some N ltext/GOT sections, saying that the first 2GB of .ltext goes in .ltext.0, etc
> .got.ltext.0 : { ... }
> .ltext.0 : { PARTITION(*.ltext, 0, 2GB, .got.ltext.0) }
> .got.ltext.1 : { ... }
> .ltext.1 : { PARTITION(*.ltext, 2GB, 4GB, .got.ltext.1) }
> ... (up to some arbitrary N)
>
> This is pretty hacky and unprincipled

We can deal with it later when implementing this in BFD linker.

> A real design pass on linker scripts is overdue here.
> INSERT AFTER / INSERT BEFORE already exist in GNU ld and lld and are exactly the kind of composable primitive worth leaning on.

We do use a linker script and we care about e.g. alignment of segments specified in linker scripts, so we want a solution that works with linker scripts. As someone with linker script thoughts, what are your thoughts on something like the proposed PARTITION_WITH_GOTS inside an INSERT AFTER/BEFORE? I think I'd need to actually implement a prototype to understand the ramifications, e.g. ordering of section creation/address assignment within the linker.

> Stepping back, this is my biggest concern about both proposals: they risk introducing knobs and defaults to lld that overfit company-specific binary shapes.
> Meta's --sort-text-by-code-model is the clearest example tuned for their examples and doesn't fit into upstream lld.
> Google's .ltext placement reasoning similarly bakes in an assumption about which segment cost dominates.

I did state segment count as my reason for putting .ltext there, but that was kind of an arbitrary reason; the cost of segments probably isn't going to matter too much. It was an "all else being equal, it'd be nice to reduce the number of segments" argument. I think the most important thing is that for binaries on the edge, we get as many relaxations as possible via section placement, for smaller GOTs and fewer startup dynamic relocations. Perhaps there's something else I'm missing that people may care about, but that's the main thing that comes to mind, so IMO it seems reasonable to use the number of relaxations as the primary goal for where to place sections. And ultimately, for the performance of binaries in general, as the proposal states, we'll move performance-sensitive text and data into small sections.

> lld shouldn't be where workload-specific layout heuristics live. Prefer principled, composable primitives users wire up themselves -- and make .text.N/.got.N placement scriptable -- over shipping defaults that quietly assume one company's binary shape.

I generally agree with this, but I feel like I'm missing something concrete with this statement in relation to this proposal. Are you saying that there shouldn't be a default placement of a section like .ltext and it should explicitly be put in a linker script? In my mind it seems unlikely that people will care about specific placement of .ltext as long as it's somewhere reasonable, but I might be wrong.

You're correct that knobs are frustrating and should be avoided when possible, but sometimes I think there's a pretty obvious "best" default, or it doesn't matter, and it's important to distinguish between those cases (not only for maintainers, but also users).

> I like the range-extension-thunks + multi-GOT + multi-RELRO direction. It solves the two reach problems (.text to .text via thunks, .text to data via partitioned GOTs) by keeping the "only pay for what you use" property

I'm glad people are supportive of the general approach.

H.J. Lu

May 11, 2026, 7:41:21 PM

to Arthur Eubanks, X86-64 System V Application Binary Interface

On Tue, May 12, 2026 at 5:21 AM Arthur Eubanks <aeub...@google.com> wrote:
>>
>> I think in this large code model, call over GOT should apply to both
>> PIE and PDE.
>
> Do you mean we shouldn't have a separate PDE large code model? I agree with that, if you're using the large code model, you're already not getting maximal performance in all cases.

I meant always generate PIE code for the large model, even when compiled with -fno-PIE.

H.J.

Arthur Eubanks

May 13, 2026, 5:08:11 PM

to Farid Zakaria, X86-64 System V Application Binary Interface

> 1. I continue to be confused & surprised by the performance of the large code model. I've reached the limit of my hardware knowledge, but I would be interested to know if others have experience with where performance finally does suffer. We've observed a <2% CPU performance regression on various workloads without any new relaxations. I also explored new large code model relaxations and got the difference potentially even lower.

On smaller microbenchmarks I'm seeing lots of >20% regressions, and also many that are unaffected. It looks like ThinLTO builds tend to be more resistant, presumably because ThinLTO can do cross-module inlining, so all the hot code gets inlined and there are no large code model call instruction sequences.

On a larger benchmark, I'm seeing a similar ~2% regression, which is unusable at least for performance critical binaries.

> 2. It might be good for each of us to put together a section on how we think our proposals differ. I actually think they are nearly identical where it matters. IIRC, the only difference was that at Meta we had thought to have a single text section, since with thunks there is no "large" code type any longer, and to simply intersperse the GOTs from the start, unifying the code. After having come across a few bugs in lld, BOLT & GCC caused by missing support for "l"-prefixed sections, I am more in favor of this unification, but I think it's ultimately minor. I am motivated to help us support larger binaries and am willing & happy to consolidate where it makes sense & as needed. FWIW, unifying the text and using multiple GOTs from the start means we no longer need to segment the binary into "hot/small" and "cold/large" areas.

The small/large distinction is necessary for precompiled small code model libraries. I think your approach of sorting symbols that have PC32 relocations is more complicated and more work to implement in the linker than the small/large distinction, but I may be missing something. It's a huge plus to not have to modify precompiled small code model code linked into a large code model binary, and this proposal achieves that. Also, with handwritten assembly that references globals defined in C files via PC32 relocations, e.g. in ffmpeg, we can just build those C sources with the small code model and everything works out.

> 3. We discussed that the thunks require %r11 as a new scratch register even for calls within the same function. Does that potentially break ABI usage with previously compiled objects?

If you're in the small code model, you shouldn't need a thunk for jumps within a function since the small text section is <2GB, so it shouldn't affect precompiled small code model objects. Otherwise %r11 is caller saved so there shouldn't be issues?

> 4. I noticed now we are mainly suffering from relocation overflows from .gcc_except_table -> .data since these two sections tend to bookend our binaries. gcc_except_table and eh_frame both include encodings that are merged from the TUs that comprise the binary. If any of these objects are small-code, then they are 4-bytes and restricted to the same 2GiB offset potentially. I've been pushing & exploring an approach to automatically expand all entries as needed to 8 bytes (outdated but relevant: https://github.com/llvm/llvm-project/compare/main...fzakaria:llvm-project:eh-frame-relax-reverse?expand=1). Without this functionality and similar support for eh_frame_hdr (https://github.com/llvm/llvm-project/pull/179089), it makes re-using small code model objects a non-starter.

I realized the linker only generates eh_frame_hdr, not eh_frame, so to answer Fangrui's question above about eh_frame as well as gcc_except_table here: if we keep eh_frame/gcc_except_table in the small sections we should be able to continue using 4-byte entries in precompiled small code model libraries. The large code model text/data won't be in the way, as they are in ltext/ldata/lrodata on the edges of the binary. eh_frame/gcc_except_table will use 8-byte entries to reference large code model text/data. I think this ends up being a really nice side effect of the small/large distinction.

It's worth looking into how large eh_frame/gcc_except_table can get in a large binary. Will investigate.

H.J. Lu

May 13, 2026, 5:13:37 PM

to Arthur Eubanks, Farid Zakaria, X86-64 System V Application Binary Interface

On Thu, May 14, 2026 at 5:08 AM 'Arthur Eubanks' via X86-64 System V Application Binary Interface <x86-6...@googlegroups.com> wrote:
>>

>> 1. I continue to be confused & surprised at the performance of large code-model. I've reached the depth of my hardware knowledge but I would be interested to know if others have experience at which the performance does finally suffer. We've observed <2% CPU performance on various workloads without any new relaxations. I also explored new large code-model relaxations and got the difference potentially even less.
>
>
> On smaller microbenchmarks I'm seeing lots of >20% regressions, and also many that are unaffected. It looks like ThinLTO builds tend to be more resistant, presumably because ThinLTO can do cross-module inlining, so all the hot code gets inlined and there are no large code model call instruction sequences.
>
> On a larger benchmark, I'm seeing a similar ~2% regression, which is unusable at least for performance critical binaries.
>
>>
>> 2. It might be good for us to consolidate each a section on how we think our proposals differ. I actually think they are nearly identical enough where it matters and hardly different. IIRC, the only difference was that at Meta we had thought to have a single text since with thunks there is no "large" code-type any longer and merely intersperse the GOT from the start unifying the code. After having come across a few bugs in lld, bolt & GCC where the lack of support for prefixing a section with "l" caused bugs, I am more in favor of this unification but I think it's ultimately minor. I am motivated in helping us get larger binaries and willing & happy to consolidate where it makes sense & as needed. Unifying the text and multiple GOT from the start means we no longer need to segment the binary into "hot/small" and "cold/large" areas FWIW.
>
>
> The small/large distinction is necessary for precompiled small code model libraries. I think your approach of sorting symbols that have PC32 relocations is more complicated and more work to implement in the linker than the small/large distinction, but I may be missing something. It's a huge plus to not have to modify precompiled small code model code linked into a large code model binary, which this proposal does. Also with handwritten assembly referencing globals defined in C files via PC32 relocations, e.g. in ffmpeg, we can just build those C sources with the small code model and everything works out.
>
>>
>> 3. We discussed that the thunks requires %r11 as a new scratch register even for calls within the same function. Does that break ABI usage potentially with prior compiled objects?
>
>
> If you're in the small code model, you shouldn't need a thunk for jumps within a function since the small text section is <2GB, so it shouldn't affect precompiled small code model objects. Otherwise %r11 is caller saved so there shouldn't be issues?

Please collect performance data with requiring -fno-plt for large model:

[hjl@gnu-tgl-3 tmp]$ cat x.c
extern void foo (void);

void
bar (void)
{
  foo ();
}
[hjl@gnu-tgl-3 tmp]$ gcc -O2 -fno-plt x.c -S
[hjl@gnu-tgl-3 tmp]$ cat x.s
        .file   "x.c"
        .text
        .p2align 4
        .globl  bar
        .type   bar, @function
bar:
.LFB0:
        .cfi_startproc
        jmp     *foo@GOTPCREL(%rip)
        .cfi_endproc
.LFE0:
        .size   bar, .-bar
        .ident  "GCC: (GNU) 16.1.1 20260501 (Red Hat 16.1.1-1)"
        .section        .note.GNU-stack,"",@progbits
[hjl@gnu-tgl-3 tmp]$


--
H.J.

Farid Zakaria

May 13, 2026, 5:31:33 PM

to H.J. Lu, Arthur Eubanks, X86-64 System V Application Binary Interface

I see. Since the thunks are only for the large code model, does that mean any precompiled large code model objects would no longer be compatible? (OK by me, but just trying to understand.)
Today you can have large code model code (https://godbolt.org/z/b4edfqYdr) that uses %r11 and assumes it isn't clobbered as a scratch register.

For eh_frame & gcc_except_table, we found that if you push .text close to 2GiB you tend to see relocation overflows no matter how you organize them, simply as a function of being close to the limit.

Here is some output we have lld emit via an internal patch to make it easier for users to understand what is causing pressure:

Memory layout:

0x061d5414 - .gcc_except_table (84.06 MiB)
0x062c6a64 - >>> SOURCE ---+
0x0b5e42a0 - .abcefgh.base (1 B) |
0x0b5e42b0 - protodesc_cold (409.80 KiB) |
0x0b64a9e0 - .eh_frame_hdr (25.04 MiB) |
0x0cf55430 - .eh_frame (149.53 MiB) |
0x164ddf3c - [gap] (4.00 KiB) |
0x164def40 - .cudaRegisterAll (1.15 MiB) |
0x16604888 - [gap] (4.01 KiB) |
0x16605890 - .cuda_something (24.53 MiB) |
0x17e8de48 - [gap] (4.18 KiB) |
0x17e8ef00 - .text (1.69 GiB) |
0x83f9c48c - .init (23 B) |
0x83f9c4a4 - .fini (9 B) |
0x83f9c4ad - .foobar (224 B) |
0x83f9c590 - .abc (4.74 KiB) |
0x83f9d888 - malloc_hook (585 B) |
0x83f9dae0 - .plt (53.72 KiB) |
0x83fab1c0 - [gap] (7.56 KiB) |
0x83fad000 - .tdata (4.65 KiB) |
0x83fae298 - .fini_array (88 B) |
0x83fae2f0 - .init_array (411.98 KiB) |
0x83faf000 - .tbss (373.34 KiB) |
0x8400c560 - [gap] (35.41 KiB) |
0x84015300 - .data.rel.ro (35.51 MiB) |
0x863973c0 - .ctors (48 B) |
0x863973f0 - .dynamic (1.52 KiB) |
0x86397a00 - .got (401.92 KiB) |
0x863fc1b0 - .relro_padding (3.58 KiB) |
0x863fd000 - [gap] (4.00 KiB) |
0x863fe000 - .data (6.60 MiB) |
0x863fe660 - >>> TARGET <--+
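
To make the overflow in this dump concrete, here is a small sketch of the reach arithmetic. It assumes the SOURCE and TARGET markers are, respectively, the address of the 4-byte field and the address it refers to, and it ignores any addend; the helper names are made up for illustration.

```c
#include <stdint.h>

/* 1 if the signed distance from `from` to `to` fits a 32-bit PC-relative field. */
int fits_in_rel32(uint64_t from, uint64_t to)
{
    int64_t delta = (int64_t)(to - from);
    return delta >= INT32_MIN && delta <= INT32_MAX;
}

/* Bytes by which a forward reference exceeds INT32_MAX, or 0 if it fits. */
uint64_t forward_overshoot(uint64_t from, uint64_t to)
{
    int64_t delta = (int64_t)(to - from);
    return delta > INT32_MAX ? (uint64_t)(delta - INT32_MAX) : 0;
}
```

With the addresses above, forward_overshoot(0x062c6a64, 0x863fe660) is 1276925, so under these assumptions the reference misses the 2GiB limit by roughly 1.2 MiB.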

I was under the impression that, as a result, I have to expand the eh_frame and gcc_except_table entries accordingly.
It felt easier to just force-expand everything to sdata8 (or as needed).


H.J. Lu

May 13, 2026, 5:36:31 PM

to Farid Zakaria, Arthur Eubanks, X86-64 System V Application Binary Interface

On Thu, May 14, 2026 at 5:31 AM Farid Zakaria <fmza...@meta.com> wrote:
>
> I see since the thunks are only for the large code model then we are making any precompiled large code-model non-compatible any longer? (OK by me but just trying to understand)

-fno-plt should be compatible with small model and large model with multiple
GOT. Its downside is no lazy binding. We should avoid using %r11.

--
H.J.

Arthur Eubanks

May 13, 2026, 5:50:01 PM

to Farid Zakaria, H.J. Lu, X86-64 System V Application Binary Interface

> For eh_frame & gcc_except_table, we found that if you push .text close to 2GiB you tend to see relocation overflows no matter how you organize them, simply as a function of being close to the limit.

With the large code model proposed here, .text should be fairly small: it contains only precompiled small code model text, assuming you build the rest of your source code with the large code model, which goes in .ltext. So 4-byte values in eh_frame/gcc_except_table in precompiled code should be fine.

Arthur Eubanks

May 13, 2026, 5:51:07 PM

to H.J. Lu, Farid Zakaria, X86-64 System V Application Binary Interface

> -fno-plt should be compatible with small model and large model with multiple
> GOT. Its downside is no lazy binding. We should avoid using %r11.

Yes this is an interesting direction to investigate, will look into this.

H.J. Lu

May 13, 2026, 8:15:51 PM

to Arthur Eubanks, Farid Zakaria, X86-64 System V Application Binary Interface

We can compile glibc crt files with -fno-plt -mno-direct-extern-access.
The resulting crt files should be compatible with both small model and
large model.

--
H.J.

Arthur Eubanks

May 14, 2026, 5:10:15 PM

to H.J. Lu, X86-64 System V Application Binary Interface

One thing I'm noticing with the PIC -fno-plt GOTPCRELX call is that its encoding is one byte longer than a PLT32 jump. So in a small binary after relaxation we pay an extra nop byte per call, which is slightly unfortunate.
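
For reference, a sketch of the byte encodings being compared. The opcodes are the standard x86-64 ones; the emit_* helpers are hypothetical and only write bytes into a buffer. The relaxed form shown pads with a trailing nop, which matches the "extra nop byte" above; as far as I know the psABI also allows an addr32 prefix (0x67) before the call instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Store a 32-bit displacement in little-endian byte order. */
static void put_le32(uint8_t *p, int32_t v)
{
    uint32_t u = (uint32_t)v;
    p[0] = (uint8_t)(u & 0xff);
    p[1] = (uint8_t)((u >> 8) & 0xff);
    p[2] = (uint8_t)((u >> 16) & 0xff);
    p[3] = (uint8_t)((u >> 24) & 0xff);
}

/* call rel32: E8 + disp32, 5 bytes (the usual PLT32 call site). */
size_t emit_call_rel32(uint8_t *buf, int32_t disp)
{
    buf[0] = 0xE8;
    put_le32(buf + 1, disp);
    return 5;
}

/* call *disp32(%rip): FF 15 + disp32, 6 bytes (-fno-plt, GOTPCRELX). */
size_t emit_call_gotpcrel(uint8_t *buf, int32_t disp)
{
    buf[0] = 0xFF;
    buf[1] = 0x15;
    put_le32(buf + 2, disp);
    return 6;
}

/* Relaxed GOTPCRELX call: the 6-byte slot is kept, holding a direct call
   followed by a one-byte nop (90). */
size_t emit_relaxed_call(uint8_t *buf, int32_t disp)
{
    buf[0] = 0xE8;
    put_le32(buf + 1, disp);
    buf[5] = 0x90;
    return 6;
}
```

The jmp forms follow the same pattern (E9 for the direct jump, FF 25 for the indirect one), so tail calls pay the same extra byte.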

H.J. Lu

May 14, 2026, 9:12:39 PM

to Arthur Eubanks, X86-64 System V Application Binary Interface

On Fri, May 15, 2026 at 5:10 AM Arthur Eubanks <aeub...@google.com> wrote:
>
> One thing I'm noticing with the PIC -fno-plt GOTPCRELX call is that its encoding is one byte longer than a PLT32 jump. So in a small binary after relaxation we pay an extra nop byte per call, which is slightly unfortunate.

Yes, this is the price to pay. Can you evaluate the impact of the
extra nop byte?


--
H.J.

Arthur Eubanks

May 26, 2026, 12:51:48 PM

to H.J. Lu, X86-64 System V Application Binary Interface

Sorry I'm still gathering performance numbers.

However, on the binary size side it's not great: it's a 0.5-1% increase, which is quite a lot.

I did discover that clang hoists the address of the target under -fno-plt if we call a function multiple times: https://godbolt.org/z/7P8WW8EMq. A local hack to work around that significantly mitigates the binary size increase of -fno-plt in some cases (such as building clang itself) but has very little impact in most other binaries.

Another thing that came up when discussing this with some people was that even if we compile large code model code with -fno-plt, precompiled small code model code is not compiled with it. So precompiled code still uses PLT32 relocations for function calls. If the linker unconditionally relaxes PLT32 when the callee is linked into the same linkage unit, we'll get relocation overflows to large code model functions. This can be worked around by not unconditionally relaxing PLT32 relocations (we did something similar for GOTPCRELX in https://reviews.llvm.org/D157020). If the relocation would overflow, we can fall back to the PLT stub. Even better would be to generate a thunk instead, as discussed previously, to avoid the PLT overhead. If I understand correctly, the problem with thunks for arbitrary calls was the usage of r11, but in this case the PLT stub would also use r11. Does this make sense?

H.J. Lu

May 27, 2026, 1:04:44 AM

to Arthur Eubanks, X86-64 System V Application Binary Interface

On Wed, May 27, 2026 at 12:51 AM Arthur Eubanks <aeub...@google.com> wrote:
>
> Sorry I'm still gathering performance numbers.
>
> However on the binary size side it's not great, it's a 0.5-1% increase which is quite a lot.
>
> I did discover that clang hoists the address of the target under -fno-plt if we call a function multiple times: https://godbolt.org/z/7P8WW8EMq. A local hack to work around that significantly mitigates the binary size increase of -fno-plt in some cases (such as building clang itself) but has very little impact in most other binaries.
>
> Another thing that came up when discussing this with some people was that even if we compile large code model code with -fno-plt, precompiled small code model code is not compiled with it. So precompiled code still uses PLT32 relocations for function calls. If the linker unconditionally relaxes PLT32 when the callee is linked into the same linkage unit, we'll get relocation overflows to large code model functions. This can be worked around by not unconditionally relaxing PLT32 relocations (we did something similar for GOTPCRELX in https://reviews.llvm.org/D157020). If the relocation would overflow, we can fall back to the PLT stub. Even better would be to generate a thunk instead, as discussed previously, to avoid the PLT overhead. If I understand correctly, the problem with thunks for arbitrary calls was the usage of r11, but in this case the PLT stub would also use r11. Does this make sense?

The normal PLT doesn't use r11. I hope that the same glibc binaries can be used to create small and large model binaries by compiling a small subset of glibc in large model. It shouldn't be an issue for dynamic executables. But building large model static executables won't work unless the whole glibc is compiled with large model.

> On Thu, May 14, 2026 at 6:12 PM H.J. Lu <hjl....@gmail.com> wrote:
>>
>> On Fri, May 15, 2026 at 5:10 AM Arthur Eubanks <aeub...@google.com> wrote:
>> >
>> > One thing I'm noticing with the PIC -fno-plt GOTPCRELX call is that its encoding is one byte longer than a PLT32 jump. So in a small binary after relaxation we pay an extra nop byte per call, which is slightly unfortunate.
>>
>> Yes, this is the price to pay. Can you evaluate the impact of the
>> extra nop byte?
>>
>> > On Wed, May 13, 2026 at 5:15 PM H.J. Lu <hjl....@gmail.com> wrote:
>> >>
>> >> On Thu, May 14, 2026 at 5:51 AM Arthur Eubanks <aeub...@google.com> wrote:
>> >> >>
>> >> >> -fno-plt should be compatible with small model and large model with multiple
>> >> >> GOT. Its downside is no lazy binding. We should avoid using %r11.
>> >> >
>> >> >
>> >> > Yes this is an interesting direction to investigate, will look into this.
>> >>
>> >> We can compile glibc crt files with -fno-plt -mno-direct-extern-access.
>> >> The resulting crt files should be compatible with both small model and
>> >> large model.
>> >>
>> >> --
>> >> H.J.
>>
>>
>>
>> --
>> H.J.

--
H.J.

Arthur Eubanks

Jun 5, 2026, 1:47:54 PM

to H.J. Lu, X86-64 System V Application Binary Interface

I ran some smaller performance benchmarks and haven't seen any performance regressions. (I'm still having issues with the larger benchmark infrastructure.)

One thing we specifically want is compatibility with precompiled small code model libraries that aren't built with -fno-plt. So we'd like to be in a situation where we don't have to recompile glibc with -fno-plt to make it compatible with the large code model. Precompiled small code model code can call into large code model code, and right now linkers unconditionally relax those calls. We'd like to instead make it so that if the jump is too far, the linker redirects it to a thunk that does an indirect call through the GOT, so no %r11 usage. These thunks can live in .plt alongside PLT stubs and would only be used for small code model jumps. -fno-plt in large code model code is still good.

Does this make sense?
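
To make the "jump is too far" condition concrete, here is a minimal sketch in C of the reachability test behind it. It is an illustration only, not code from lld or any other linker, and every name in it (`fits_rel32`, `resolve_call`, `call_kind`) is hypothetical. It assumes the call site is an ordinary `call rel32`, whose 4-byte displacement field at address `place` is measured from the end of that field.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical outcome of resolving one call site. */
enum call_kind { CALL_DIRECT, CALL_VIA_GOT_THUNK };

/* `place` is the address of the 4-byte displacement field of a call rel32.
   The CPU adds the signed 32-bit displacement to the address just past
   that field, so the value to encode is target - (place + 4). */
bool fits_rel32(uint64_t place, uint64_t target)
{
    assert(place <= UINT64_MAX - 4); /* address arithmetic must not wrap */
    int64_t disp = (int64_t)(target - (place + 4));
    return disp >= INT32_MIN && disp <= INT32_MAX;
}

/* Keep the direct call when the callee is reachable; otherwise redirect
   the call to a thunk (the thunk itself must then be within reach). */
enum call_kind resolve_call(uint64_t place, uint64_t target)
{
    return fits_rel32(place, target) ? CALL_DIRECT : CALL_VIA_GOT_THUNK;
}
```

With this rule a small binary never takes the thunk path, which is the "only pay for what you use" property the proposal is after.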

Arthur Eubanks

Jun 9, 2026, 6:11:17 PM

to H.J. Lu, X86-64 System V Application Binary Interface

Another idea to prevent binary size increase in small binaries is to go back to thunks for all out of range jumps, but make the thunk an indirect call through the GOT which doesn't require %r11. That should make small binaries unimpacted, but still not have %r11 concerns in larger binaries.

H.J. Lu

Jul 6, 2026, 11:09:22 PM

to Arthur Eubanks, X86-64 System V Application Binary Interface

On Wed, Jun 10, 2026 at 6:11 AM Arthur Eubanks <aeub...@google.com> wrote:
>
> Another idea to prevent binary size increase in small binaries is to go back to thunks for all out of range jumps, but make the thunk an indirect call through the GOT which doesn't require %r11. That should make small binaries unimpacted, but still not have %r11 concerns in larger binaries.

Currently, the default C run-time is incompatible with the large model.
Changing the large model to access data and code symbols via GOT is
an improvement. Can we compile all *crt.o files with

-fno-plt -mno-direct-extern-access

for the small model so that they can be used to create dynamic
large model binaries?

--
H.J.

Arthur Eubanks

Jul 8, 2026, 7:44:04 PM

to H.J. Lu, X86-64 System V Application Binary Interface

On Mon, Jul 6, 2026 at 8:09 PM H.J. Lu <hjl....@gmail.com> wrote:

> On Wed, Jun 10, 2026 at 6:11 AM Arthur Eubanks <aeub...@google.com> wrote:
> >
> > Another idea to prevent binary size increase in small binaries is to go back to thunks for all out of range jumps, but make the thunk an indirect call through the GOT which doesn't require %r11. That should make small binaries unimpacted, but still not have %r11 concerns in larger binaries.
>
> Currently, the default C run-time is incompatible with the large model.
> Changing the large model to access data and code symbols via GOT is
> an improvement. Can we compile all *crt.o files with
>
> -fno-plt -mno-direct-extern-access
>
> for the small model so that they can be used to create dynamic
> large model binaries?

I'm saying that we cannot recompile all prebuilt small code model libraries, so to avoid that, we should have the linker create thunks that do an indirect call (like a -fno-plt call, but inside a thunk) when the target is not reachable. This applies to all function calls, whether to or from text/ltext. Since we will have a GOT within 2GB of any text, the indirect call + GOT entry should be enough.

This takes care of both the precompiled small code model compatibility issue (including *crt.o), and not having to use -fno-plt on large code model function calls which increases binary size for small binaries compiled with large code model. Does that make sense?
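
As a rough illustration of why a GOT within 2GB is enough, the sketch below encodes one possible thunk body. It assumes, and this is only an assumption, that the thunk is a single 6-byte `jmp *disp32(%rip)` through the callee's GOT slot, so the callee returns straight to the original caller and no scratch register such as %r11 is needed. The function name `write_got_thunk` and its parameters are hypothetical and do not correspond to any existing linker API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define THUNK_SIZE 6

/* Writes `jmp *disp32(%rip)` (bytes FF 25 disp32) into `out`.
   Returns false when the GOT slot is not reachable from the thunk with
   a signed 32-bit displacement. */
bool write_got_thunk(uint8_t out[THUNK_SIZE], uint64_t thunk_addr,
                     uint64_t got_entry_addr)
{
    assert(out != NULL);
    /* RIP-relative displacements count from the end of the instruction. */
    int64_t disp = (int64_t)(got_entry_addr - (thunk_addr + THUNK_SIZE));
    if (disp < INT32_MIN || disp > INT32_MAX)
        return false;

    uint32_t bits = (uint32_t)(int32_t)disp;
    out[0] = 0xff; /* opcode FF /4: jmp r/m64 */
    out[1] = 0x25; /* ModRM: mod=00, reg=4, rm=101 (RIP-relative) */
    for (int i = 0; i < 4; i++)
        out[2 + i] = (uint8_t)(bits >> (8 * i)); /* little-endian disp32 */
    return true;
}
```

The GOT slot holds a full 64-bit address, so the callee can be anywhere in the address space. Only the thunk-to-GOT distance is limited to 2GB.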

H.J. Lu

Jul 8, 2026, 8:15:34 PM

to Arthur Eubanks, X86-64 System V Application Binary Interface

On Thu, Jul 9, 2026 at 7:44 AM Arthur Eubanks <aeub...@google.com> wrote:
>
> On Mon, Jul 6, 2026 at 8:09 PM H.J. Lu <hjl....@gmail.com> wrote:
>>
>> On Wed, Jun 10, 2026 at 6:11 AM Arthur Eubanks <aeub...@google.com> wrote:
>> >
>> > Another idea to prevent binary size increase in small binaries is to go back to thunks for all out of range jumps, but make the thunk an indirect call through the GOT which doesn't require %r11. That should make small binaries unimpacted, but still not have %r11 concerns in larger binaries.
>>
>> Currently, the default C run-time is incompatible with the large model.
>> Changing the large model to access data and code symbols via GOT is
>> an improvement. Can we compile all *crt.o files with
>>
>> -fno-plt -mno-direct-extern-access
>>
>> for the small model so that they can be used to create dynamic
>> large model binaries?
>
>
> I'm saying that we cannot recompile all prebuilt small code model libraries, so to avoid that, we should have the linker create thunks that do an indirect call (like a -fno-plt call, but inside a thunk) when the target is not reachable. This applies to all function calls, whether to or from text/ltext. Since we will have a GOT within 2GB of any text, the indirect call + GOT entry should be enough.

By the default run-time, I meant the crt files and other bits from compiler and libc which are used to create dynamic executable and shared libraries. They don't include any other libraries. The default run-time should be a small set of binaries.

Yes, linker should do whatever is needed to create a large model binary.

> This takes care of both the precompiled small code model compatibility issue (including *crt.o), and not having to use -fno-plt on large code model function calls which increases binary size for small binaries compiled with large code model. Does that make sense?

Does it work if the default run-time isn't compiled with
-mno-direct-extern-access?
Won't data access displacement beyond 2GB fail?

--
H.J.

Arthur Eubanks

Jul 14, 2026, 4:01:59 PM

to H.J. Lu, X86-64 System V Application Binary Interface

>> This takes care of both the precompiled small code model compatibility issue (including *crt.o), and not having to use -fno-plt on large code model function calls which increases binary size for small binaries compiled with large code model. Does that make sense?
>
> Does it work if the default run-time isn't compiled with
> -mno-direct-extern-access?
> Won't data access displacement beyond 2GB fail?

What data access from the default runtime would have a displacement >2GB? The large code model data will be in large data sections that don't interfere with the small colocated text/data portion of the binary. So the runtime, which is built with the small code model, will live alongside other prebuilt small code model text/data, which should span much less than 2GB if most of your code is built with the large code model. The only problem would be if small code model code references things defined in large code model text/data directly with a PC32 relocation.

H.J. Lu

Jul 14, 2026, 8:29:57 PM

to Arthur Eubanks, X86-64 System V Application Binary Interface

[hjl@gnu-tgl-3 tmp]$ cat x.c
extern int foo;

int
func (void)
{
  return foo;
}
[hjl@gnu-tgl-3 tmp]$ gcc -O2 -fno-plt -mno-direct-extern-access -c x.c
[hjl@gnu-tgl-3 tmp]$ objdump -dwr x.o

x.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <func>:
   0:   48 8b 05 00 00 00 00    mov    0x0(%rip),%rax        # 7 <func+0x7>
                        3: R_X86_64_REX_GOTPCRELX       foo-0x4
                           ^^^^^^^^ This may overflow 2GB displacement.
   7:   8b 00                   mov    (%rax),%eax
   9:   c3                      ret
[hjl@gnu-tgl-3 tmp]$

This is where multi-GOT comes in. Linker should allocate another GOT closer to this location when relocation overflows.

I suggest compiling the crt files and other bits from compiler and libc which are used to create dynamic executable and shared libraries with -fno-plt -mno-direct-extern-access. These bits should work with both small model and large model. We don't have to worry about where bits from these files are placed.

--
H.J.
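
The multi-GOT idea above can be sketched as a simple selection rule. The C fragment below is a hypothetical illustration, not how ld or lld lay out GOTs: it assumes each existing GOT already has a slot for the symbol at a known address, and `pick_got` is an invented name. It returns the closest slot that a signed 32-bit RIP-relative displacement can reach from the relocation site, or -1 to signal that another GOT would have to be allocated nearer to the code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* `place` is the address of the 4-byte displacement field being relocated
   (for example by R_X86_64_REX_GOTPCRELX). `slots[i]` is the address of
   the symbol's entry in the i-th GOT. Returns the index of the nearest
   reachable slot, or -1 if every slot would overflow. */
long pick_got(const uint64_t *slots, size_t n, uint64_t place)
{
    assert(slots != NULL || n == 0);
    long best = -1;
    uint64_t best_mag = 0;
    for (size_t i = 0; i < n; i++) {
        /* The displacement counts from the end of the 4-byte field. */
        int64_t disp = (int64_t)(slots[i] - (place + 4));
        if (disp < INT32_MIN || disp > INT32_MAX)
            continue; /* this GOT is out of reach */
        uint64_t mag = disp < 0 ? (uint64_t)0 - (uint64_t)disp
                                : (uint64_t)disp;
        if (best < 0 || mag < best_mag) {
            best = (long)i;
            best_mag = mag;
        }
    }
    return best;
}
```

A return value of -1 corresponds to the overflow case in the objdump annotation, where the linker would need to place a further GOT within 2GB of `func`.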
