This RFC is in a Google Doc (https://docs.google.com/document/d/1F-ok3vCRVXuLoRBoM45UZ3eHpNsOZr4K_8AjEfEDPjA) and also pasted here for convenience. Preferably high level comments are posted in this thread and comments about specific details go in the doc, but any way is fine.
For background on the various x86-64 code models, https://eli.thegreenplace.net/2012/01/03/understanding-the-x64-code-models does a great job describing them.
At Google we tend to link our C++ binaries statically, and these keep getting larger over time for various reasons. This means we run into more and more relocation overflows. We've used the medium code model to add some headroom, but that's only a stopgap and we'd like a more permanent solution.
The current x86-64 large code model prevents relocation overflows, but has pretty drastic performance implications. We'd like to change the large code model to produce more efficient instruction sequences both for smaller binaries where relocation overflows don't happen (only pay for what you use) and also larger binaries that have performance hotspots. It seems that there are very few, if any, users of the large code model, so we'd like to improve upon it instead of creating a new code model.
This is an alternative to Meta's proposal that aims to solve the same issue (https://docs.google.com/document/d/1UspcVqzPNg99IDWkLlkVp5NdIYtNk0TENr3kmR_w8uQ), but we believe this is simpler to implement.
Function calls
The thunk discussion in https://groups.google.com/g/x86-64-abi/c/hz28LNnlBEc is exactly what we want for these proposed large code model changes. We can simply emit a normal call instruction that the linker will fix up if necessary, which doesn't have any overhead if the binary is small.
In lld there are some unresolved issues with AArch64 thunks creation, e.g. https://github.com/llvm/llvm-project/issues/61250; we'll make sure to address these.
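As a rough sketch of the "fix up if necessary" decision, assuming the linker only needs to know whether the displacement of a 5-byte call rel32 fits in a signed 32-bit field (the function names below are hypothetical and not taken from any linker):

```python
# Sketch only: helper names are illustrative, not an existing linker API.
REL32_MIN = -(1 << 31)
REL32_MAX = (1 << 31) - 1
CALL_REL32_SIZE = 5  # opcode e8 followed by a 4-byte displacement


def call_displacement(call_addr: int, target_addr: int) -> int:
    """Displacement stored in `call rel32`, relative to the next instruction."""
    return target_addr - (call_addr + CALL_REL32_SIZE)


def needs_thunk(call_addr: int, target_addr: int) -> bool:
    """True when the direct call cannot reach the target and must go via a thunk."""
    disp = call_displacement(call_addr, target_addr)
    return not (REL32_MIN <= disp <= REL32_MAX)


if __name__ == "__main__":
    print(needs_thunk(0x1000, 0x2000))   # nearby target
    print(needs_thunk(0x1000, 3 << 30))  # target about 3GB away
```

In a small binary every call is in range, so no thunks are created and the call sites are identical to the small code model.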
Data references
In position independent code, instead of the long instruction sequence like https://godbolt.org/z/jcobvaocK, assume that there is a GOT within 2GB of any code in .ltext and emit the small code model extern global instruction sequence
mov rax, qword ptr [rip + i@GOTPCREL]
mov eax, dword ptr [rax]
which is relaxable to
lea rax, [rip + i]
mov eax, dword ptr [rax]
if the RIP-relative distance is encodable. This requires the static linker to split large text sections and emit multiple GOTs in larger binaries. We should emit this instruction sequence even for static globals that are known to be in the same DSO since they may be far away and unrelaxable. The small code model emits
mov eax, dword ptr [rip + i]
in the case of a static global, meaning in smaller binaries with the large code model we only pay an extra lea for static globals over the small code model. In larger binaries this results in extra GOT entries for unrelaxed data accesses and GOT loads, but I believe that's a good tradeoff for "only pay for what you use". This, alongside thunks, also obsoletes the requirement of having a "PIC register" available throughout the function which uses up a register.
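A minimal sketch of the relaxation choice described above, assuming the symbol is non-preemptible and that both forms are the 7-byte REX.W encodings with a trailing disp32 (the function name and the preemptible flag are hypothetical):

```python
# Sketch only: models the linker's choice for one GOTPCREL(X) data reference.
REL32_MIN = -(1 << 31)
REL32_MAX = (1 << 31) - 1
INSN_SIZE = 7  # `mov rax, [rip + disp32]` and `lea rax, [rip + disp32]` are both 7 bytes

GOT_LOAD = "mov rax, qword ptr [rip + i@GOTPCREL]"
RELAXED = "lea rax, [rip + i]"


def select_data_access(insn_addr: int, symbol_addr: int, preemptible: bool = False) -> str:
    """Pick the relaxed lea when the RIP-relative distance is encodable."""
    disp = symbol_addr - (insn_addr + INSN_SIZE)
    if not preemptible and REL32_MIN <= disp <= REL32_MAX:
        return RELAXED
    # Otherwise keep the load, which needs a GOT entry within reach of the code.
    return GOT_LOAD


if __name__ == "__main__":
    print(select_data_access(0x1000, 0x5000))
    print(select_data_access(0x1000, 5 << 30))
```

Only the second case costs a GOT entry and a load, which is the "only pay for what you use" property.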
To avoid the confusion of having multiple discontiguous .ltext sections with the same name, we can number extras like .ltext.1.
The extra GOTs will only be used for GOTPCREL/GOTPCRELX relocations since those relocations directly give you the address of the relevant GOT entry. Other GOT relocations can be used together in a sequence and we want to ensure that the same GOT is consistently used, so those will all point to the "main" GOT.
If we instead have every GOT-referencing relocation reference a nearby GOT, that may break functions that are split and use multiple GOT-referencing instructions together as they may reference different GOTs. 32-bit relocations aside from GOTPCREL/GOTPCRELX shouldn't be used in this large code model.
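To illustrate the partitioning and GOT selection rules, here is a sketch that assumes a simple greedy packing of input sections and hypothetical GOT names (".got" for the main GOT, ".got.ltext.N" for the per-partition ones); a real linker would also have to account for thunks and alignment:

```python
# Sketch only: packing policy, size limit, and GOT names are assumptions.
GOTPCREL_RELOCS = {
    "R_X86_64_GOTPCREL",
    "R_X86_64_GOTPCRELX",
    "R_X86_64_REX_GOTPCRELX",
}


def partition_ltext(section_sizes, max_partition_size):
    """Greedily pack input .ltext section sizes into partitions of bounded size."""
    partitions = [[]]
    used = 0
    for size in section_sizes:
        if size > max_partition_size:
            raise ValueError("input section larger than a partition")
        if partitions[-1] and used + size > max_partition_size:
            partitions.append([])  # start a new partition with its own GOT
            used = 0
        partitions[-1].append(size)
        used += size
    return partitions


def got_for_relocation(reloc_type: str, partition_index: int) -> str:
    """GOTPCREL-style relocations use the nearby GOT; everything else uses the main GOT."""
    if reloc_type in GOTPCREL_RELOCS:
        return f".got.ltext.{partition_index}"
    return ".got"


if __name__ == "__main__":
    print(partition_ltext([3, 3, 3], 6))
    print(got_for_relocation("R_X86_64_GOTPCRELX", 1))
    print(got_for_relocation("R_X86_64_GOT64", 1))
```

Keeping non-GOTPCREL relocations on the main GOT is what guarantees that a multi-instruction sequence never mixes GOTs.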
With multiple GOTs we'd like multiple RELRO segments
Currently it's unclear as to whether or not the spec allows multiple RELRO segments. This proposes that we explicitly allow multiple RELRO segments.
Various parts of the ecosystem would need to be updated to support multiple RELRO segments, the major one being glibc's dynamic loader (https://sourceware.org/git/?p=glibc.git;a=blob;f=elf/dl-reloc.c;h=26a1e7adfc4525525a1af8f8fa193dfa9e6b173b;hb=HEAD#l351).
This has implementation implications when paired with thunks since both are generating code/data that needs to be inserted every 2/4GB; I don't have a prototype yet to see how problematic this is but I will be working on a prototype.
https://github.com/llvm/llvm-project/pull/174508#issuecomment-3761064027 claims that CHERI supports multiple GOTs and CheriBSD supports multiple RELRO segments.
Interoperability with precompiled small code model libraries
Most of the world does not run into humongous binaries and so compiles their code with the default small code model. We'd like to make precompiled small code model libraries compatible with large code model built-from-source code.
The small code model puts code and data in non-SHF_X86_64_LARGE sections like .text, .data, etc. The large code model (and medium code model for globals above a certain size) puts data in SHF_X86_64_LARGE sections like .ldata. This allows some level of compatibility between small and large code model object files. We'd like to place large code model code in a SHF_X86_64_LARGE .ltext section which is laid out far away from the "small" data/text sections, meaning it does not contribute to relocation pressure of small code model object files linked into the binary.
Clang actually already does this as of https://github.com/llvm/llvm-project/pull/73037, and lld properly lays out .ltext as of https://github.com/llvm/llvm-project/pull/70358, but we'd like to standardize this.
Currently the layout of sections for PIC binaries as laid out by lld is
.ltext .lrodata .rodata .text .data .bss .ldata .lbss
Ideally .ltext is placed somewhere between .lrodata and .ldata to maximize the number of relaxed data accesses, rather than e.g. before .lrodata where the distance to .ldata is further. Perhaps placing it after .bss and before .ldata will minimize the number of segments, since placing it between .rodata and .lrodata will split the RO segment?
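The placement tradeoff can be estimated with a small model. The sketch below assumes made-up section sizes (in GB), ignores alignment and segment boundaries, and measures the worst-case span between .ltext and each large data section for two candidate orders:

```python
# Sketch only: section sizes are invented for illustration.
def layout(order, sizes):
    """Assign contiguous [start, end) ranges to sections in the given order."""
    placed, addr = {}, 0
    for name in order:
        placed[name] = (addr, addr + sizes[name])
        addr += sizes[name]
    return placed


def worst_span(placed, a, b):
    """Largest distance between any byte of section a and any byte of section b."""
    (a0, a1), (b0, b1) = placed[a], placed[b]
    return max(a1, b1) - min(a0, b0)


def max_span_from_ltext(order, sizes, targets=(".lrodata", ".ldata", ".lbss")):
    placed = layout(order, sizes)
    return max(worst_span(placed, ".ltext", t) for t in targets)


SIZES = {".ltext": 3, ".lrodata": 1, ".rodata": 1, ".text": 1,
         ".data": 1, ".bss": 1, ".ldata": 1, ".lbss": 1}
CURRENT = [".ltext", ".lrodata", ".rodata", ".text", ".data", ".bss", ".ldata", ".lbss"]
AFTER_BSS = [".lrodata", ".rodata", ".text", ".data", ".bss", ".ltext", ".ldata", ".lbss"]

if __name__ == "__main__":
    print(max_span_from_ltext(CURRENT, SIZES))
    print(max_span_from_ltext(AFTER_BSS, SIZES))
```

With these particular sizes, moving .ltext after .bss shortens the distance to .ldata and .lbss but lengthens the distance to .lrodata, so the best order depends on the actual sizes.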
Performance of large binaries
We'd like to keep the small code model data access patterns for built-from-source hot code. This means that rather than the code model unconditionally affecting an entire TU, we'd like the option to choose which individual symbols are "small" or "large" in the compiler. As long as the compiler can see that less than 2GB of the binary is hot, it can mark all of those symbols as small.
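One way to picture the per-symbol choice is a greedy selection by hotness under a size budget. This sketch assumes the compiler has some hotness measure per symbol (for example from profile data) and a fixed budget for "small" symbols; the tuple layout and function name are hypothetical:

```python
# Sketch only: the hotness metric, budget, and selection policy are assumptions.
SMALL_BUDGET = 1 << 31  # 2GB


def mark_small(symbols, budget=SMALL_BUDGET):
    """Return the names of symbols to treat as "small", hottest first.

    symbols: iterable of (name, size_in_bytes, hotness) tuples.
    """
    small, used = set(), 0
    for name, size, hotness in sorted(symbols, key=lambda s: s[2], reverse=True):
        if hotness <= 0:
            break  # cold symbols stay "large"
        if used + size <= budget:
            small.add(name)
            used += size
    return small


if __name__ == "__main__":
    syms = [("hot_loop", 10, 5), ("rare_path", 10, 1), ("helper", 5, 3)]
    print(sorted(mark_small(syms, budget=15)))
```

Everything not selected keeps the large code model sequences, so only cold code pays for the extra indirection.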
Linker scripts
Linker scripts currently don't really fit the model of partitioning a section an unknown number of times. There are a couple of proposed solutions:
The linker itself needs some special handling of partitioned text/GOT sections since we have to associate GOTPCREL relocations with the correct GOT
One solution to fit this into GNU ld's default linker script is to introduce a new linker script directive PARTITION_WITH_GOTS like
.ltext : { PARTITION_WITH_GOTS(*.ltext) }
which from a linker script point of view maps all input .ltext sections to one .ltext output section; this "virtual" .ltext section may actually be multiple .ltext.num/.lgot.num sections in the final binary.
lld doesn't have an explicit default linker script; rather it's implemented in code
It may also be possible to explicitly list out some N ltext/GOT sections, saying that the first 2GB of .ltext goes in .ltext.0, etc
.got.ltext.0 : { ... }
.ltext.0 : { PARTITION(*.ltext, 0, 2GB, .got.ltext.0) }
.got.ltext.1 : { ... }
.ltext.1 : { PARTITION(*.ltext, 2GB, 4GB, .got.ltext.1) }
... (up to some arbitrary N)
This is pretty hacky and unprincipled
Worst case we can claim that linker scripts don't work with partitioned text GOTs and that the implementation has to be done within the linker itself
We're still investigating other solutions here
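For the explicit-listing variant, the N entries could be generated mechanically rather than written by hand. The sketch below assumes the hypothetical PARTITION directive and section names from the example above and keeps the "{ ... }" placeholder for the GOT contents as-is:

```python
# Sketch only: PARTITION is the proposed (not existing) linker script directive.
def partition_script(n: int, chunk_gb: int = 2) -> str:
    """Emit n numbered .got.ltext.N/.ltext.N output section descriptions."""
    lines = []
    for i in range(n):
        lo, hi = i * chunk_gb, (i + 1) * chunk_gb
        lo_text = "0" if lo == 0 else f"{lo}GB"
        lines.append(f".got.ltext.{i} : {{ ... }}")
        lines.append(
            f".ltext.{i} : {{ PARTITION(*.ltext, {lo_text}, {hi}GB, .got.ltext.{i}) }}"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    print(partition_script(2))
```

This does not remove the need to pick N up front, which is the main reason this variant is unattractive.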
TLS
As far as I'm aware, people are not running into TLS relocation overflows, so no change there.
Compatibility with the current large code model
The current large code model's instruction sequences are compatible with the new proposed large code model layout, so in the rare case there are large code model object files lying around, they will continue to link.
The main change in terms of compatibility is that currently large code model text goes in .text but will go in .ltext with this proposal, which won't be an issue. The previous large code model assumed that all code linked into the binary was built with the large code model, and the .text/.ltext separation is a relaxation of that assumption.
Replacing the existing large code model
This proposal assumes that nobody is depending on specifics of the current implementation of the large code model. If this isn't true, this proposal can be a new code model instead of replacing the existing large code model.
In summary, we'd like to change the large code model to prevent relocation overflows once and for all while keeping reasonable performance. We want it to perform better, especially in cases where the final linked binary ends up being smaller, and be more compatible with prebuilt small code model code. The main proposed changes are thunks and partitioning .ltext with GOTs. If there is consensus that this makes sense, I'll send out a change to https://gitlab.com/x86-psABIs/x86-64-ABI updating the specification for the large code model.
Discussion is greatly appreciated, alongside a general "yes this is a good idea"/"no this is a bad idea".
On Fri, May 8, 2026 at 1:57 AM 'Arthur Eubanks' via X86-64 System V
Application Binary Interface <x86-6...@googlegroups.com> wrote:
>
> This RFC is in a Google Doc (https://docs.google.com/document/d/1F-ok3vCRVXuLoRBoM45UZ3eHpNsOZr4K_8AjEfEDPjA) and also pasted here for convenience. Preferably high level comments are posted in this thread and comments about specific details go in the doc, but any way is fine.
>
>
> For background on the various x86-64 code models, https://eli.thegreenplace.net/2012/01/03/understanding-the-x64-code-models does a great job describing them.
>
>
> At Google we tend to link our C++ binaries statically, and these keep getting larger over time for various reasons. This means we run into more and more relocation overflows. We've used the medium code model to add some headroom, but that's only a stopgap and we'd like a more permanent solution.
>
>
> The current x86-64 large code model prevents relocation overflows, but has pretty drastic performance implications. We'd like to change the large code model to produce more efficient instruction sequences both for smaller binaries where relocation overflows don't happen (only pay for what you use) and also larger binaries that have performance hotspots. It seems that there are very few, if any, users of the large code model, so we'd like to improve upon it instead of creating a new code model.
>
>
> This is an alternative to Meta's proposal that aims to solve the same issue (https://docs.google.com/document/d/1UspcVqzPNg99IDWkLlkVp5NdIYtNk0TENr3kmR_w8uQ), but we believe this is simpler to implement.
I really like this approach.
>
> Function calls
>
> The thunk discussion in https://groups.google.com/g/x86-64-abi/c/hz28LNnlBEc is exactly what we want for these proposed large code model changes. We can simply emit a normal call instruction that the linker will fix up if necessary, which doesn't have any overhead if the binary is small.
>
> In lld there are some unresolved issues with AArch64 thunks creation, e.g. https://github.com/llvm/llvm-project/issues/61250; we'll make sure to address these.
>
> This has implementation implications when paired with thunks since both are generating code/data that needs to be inserted every 2/4GB; I don't have a prototype yet to see how problematic this is but I will be working on a prototype.
>
> https://github.com/llvm/llvm-project/pull/174508#issuecomment-3761064027 claims that CHERI supports multiple GOTs and CheriBSD supports multiple RELRO segments.
>
> Interoperability with precompiled small code model libraries
>
> Most of the world does not run into humongous binaries and so compiles their code with the default small code model. We'd like to make precompiled small code model libraries compatible with large code model built-from-source code.
>
> The small code model puts code and data in non-SHF_X86_64_LARGE sections like .text, .data, etc. The large code model (and medium code model for globals above a certain size) puts data in SHF_X86_64_LARGE sections like .ldata. This allows some level of compatibility between small and large code model object files. We'd like to place large code model code in a SHF_X86_64_LARGE .ltext section which is laid out far away from the "small" data/text sections, meaning it does not contribute to relocation pressure of small code model object files linked into the binary.
>
> Clang actually already does this as of https://github.com/llvm/llvm-project/pull/73037, and lld properly lays out .ltext as of https://github.com/llvm/llvm-project/pull/70358, but we'd like to standardize this.
>
> Currently the layout of sections for PIC binaries as laid out by lld is
> .ltext .lrodata .rodata .text .data .bss .ldata .lbss
>
> Ideally .ltext is placed somewhere between .lrodata and .ldata to maximize the number of relaxed data accesses, rather than e.g. before .lrodata where the distance to .ldata is further. Perhaps placing it after .bss and before .ldata will minimize the number of segments, since placing it between .rodata and .lrodata will split the RO segment?
We can fine tune the code layout later.
> Performance of large binaries
>
> We'd like to keep the small code model data access patterns for built-from-source hot code. This means that rather than the code model unconditionally affecting an entire TU, we'd like the option to choose which individual symbols are "small" or "large" in the compiler. As long as the compiler can see that less than 2GB of the binary is hot, it can mark all of those symbols as small.
I think new relocations may help here.
> Linker scripts
>
> Linker scripts currently don't really fit the model of partitioning a section an unknown number of times. There are a couple of proposed solutions:
>
> The linker itself needs some special handling of partitioned text/GOT sections since we have to associate GOTPCREL relocations with the correct GOT
>
> One solution to fit this into GNU ld's default linker script is to introduce a new linker script directive PARTITION_WITH_GOTS like
> .ltext : { PARTITION_WITH_GOTS(*.ltext) }
> which from a linker script point of view maps all input .ltext sections to one .ltext output section; this "virtual" .ltext section may actually be multiple .ltext.num/.lgot.num sections in the final binary.
>
> lld doesn't have an explicit default linker script; rather it's implemented in code
>
> It may also be possible to explicitly list out some N ltext/GOT sections, saying that the first 2GB of .ltext goes in .ltext.0, etc
> .got.ltext.0 : { ... }
> .ltext.0 : { PARTITION(*.ltext, 0, 2GB, .got.ltext.0) }
> .got.ltext.1 : { ... }
> .ltext.1 : { PARTITION(*.ltext, 2GB, 4GB, .got.ltext.1) }
> ... (up to some arbitrary N)
>
> This is pretty hacky and unprincipled
We can deal with it later when implementing this in the BFD linker.
> Worst case we can claim that linker scripts don't work with partitioned text GOTs and that the implementation has to be done within the linker itself
>
> We're still investigating other solutions here
>
> TLS
>
> As far as I'm aware, people are not running into TLS relocation overflows, so no change there.
>
> Compatibility with the current large code model
>
> The current large code model's instruction sequences are compatible with the new proposed large code model layout, so in the rare case there are large code model object files lying around, they will continue to link.
Since glibc doesn't support the current large code model anyway:
https://bugzilla.redhat.com/show_bug.cgi?id=1713891
we don't need to consider compatibility with the current large code model.
> The main change in terms of compatibility is that currently large code model text goes in .text but will go in .ltext with this proposal, which won't be an issue. The previous large code model assumed that all code linked into the binary was built with the large code model, and the .text/.ltext separation is a relaxation of that assumption.
>
> Replacing the existing large code model
>
> This proposal assumes that nobody is depending on specifics of the current implementation of the large code model. If this isn't true, this proposal can be a new code model instead of replacing the existing large code model.
>
That is true since glibc doesn't support it.
I think in this large code model, call over GOT should apply to both
PIE and PDE.
> mov rax, qword ptr [rip + i@GOTPCREL]
> mov eax, dword ptr [rax]
> which is relaxable to
> lea rax, [rip + i]
> mov eax, dword ptr [rax]
> if the RIP-relative distance is encodable. This requires the static linker to split large text sections and emit multiple GOTs in larger binaries. We should emit this instruction sequence even for static globals that are known to be in the same DSO since they may be far away and unrelaxable. The small code model emits
> mov eax, dword ptr [rip + i]
> in the case of a static global, meaning in smaller binaries with the large code model we only pay an extra lea for static globals over the small code model. In larger binaries this results in extra GOT entries for unrelaxed data accesses and GOT loads, but I believe that's a good tradeoff for "only pay for what you use". This, alongside thunks, also obsoletes the requirement of having a "PIC register" available throughout the function which uses up a register.
>
> To avoid the confusion of having multiple discontiguous .ltext sections with the same name, we can number extras like .ltext.1.
>
> The extra GOTs will only be used for GOTPCREL/GOTPCRELX relocations since those relocations directly give you the address of the relevant GOT entry. Other GOT relocations can be used together in a sequence and we want to ensure that the same GOT is consistently used, so those will all point to the "main" GOT.
>
Two GOTPCREL relocations are needed: one for instructions which can't
be relaxed and one for those which can. Should we add new GOTPCREL
relocations to tell the linker to prepare for multiple GOTs, so that
unsupported linkers won't generate corrupted outputs by accident?
> If we instead have every GOT-referencing relocation reference a nearby GOT, that may break functions that are split and use multiple GOT-referencing instructions together as they may reference different GOTs. 32-bit relocations aside from GOTPCREL/GOTPCRELX shouldn't be used in this large code model.
Does it mean that all accesses to symbols, data or functions, must go through
the GOT? What about labels/symbols local to the function?
One area the proposal doesn't cover: .text to .eh_frame/.eh_frame_hdr
>
> Function calls
>
> The thunk discussion in https://groups.google.com/g/x86-64-abi/c/hz28LNnlBEc is exactly what we want for these proposed large code model changes. We can simply emit a normal call instruction that the linker will fix up if necessary, which doesn't have any overhead if the binary is small.
>
> In lld there are some unresolved issues with AArch64 thunks creation, e.g. https://github.com/llvm/llvm-project/issues/61250; we'll make sure to address these.
I believe H.J. has previously floated using call *foo@GOTPCREL(%rip) (6 bytes, indirecting through a .got slot) in place of thunks -- essentially -fno-plt extended to all calls, including local ones. It trades one byte per callsite plus an indirect-branch predictor cost against not needing the linker to materialize and place thunks, and shifts the model from "pay-when-needed" (thunks only appear once reach is actually exceeded) to "pay-always at every callsite." AFAIK nobody has actually compared the two on a realistic workload; it would be worth measuring before settling on thunks.
> Linker scripts
>
> Linker scripts currently don't really fit the model of partitioning a section an unknown number of times. There are a couple of proposed solutions:
>
> The linker itself needs some special handling of partitioned text/GOT sections since we have to associate GOTPCREL relocations with the correct GOT
>
> One solution to fit this into GNU ld's default linker script is to introduce a new linker script directive PARTITION_WITH_GOTS like
> .ltext : { PARTITION_WITH_GOTS(*.ltext) }
> which from a linker script point of view maps all input .ltext sections to one .ltext output section; this "virtual" .ltext section may actually be multiple .ltext.num/.lgot.num sections in the final binary.
>
> lld doesn't have an explicit default linker script; rather it's implemented in code
>
> It may also be possible to explicitly list out some N ltext/GOT sections, saying that the first 2GB of .ltext goes in .ltext.0, etc
> .got.ltext.0 : { ... }
> .ltext.0 : { PARTITION(*.ltext, 0, 2GB, .got.ltext.0) }
> .got.ltext.1 : { ... }
> .ltext.1 : { PARTITION(*.ltext, 2GB, 4GB, .got.ltext.1) }
> ... (up to some arbitrary N)
>
> This is pretty hacky and unprincipled
We can deal with it later when implementing this in the BFD linker.
A real design pass on linker scripts is overdue here. INSERT AFTER / INSERT BEFORE already exist in GNU ld and lld and are exactly the kind of composable primitive worth leaning on.
Stepping back, this is my biggest concern about both proposals: they risk introducing knobs and defaults to lld that overfit company-specific binary shapes. Meta's --sort-text-by-code-model is the clearest example: it is tuned for their own examples and doesn't fit into upstream lld. Google's .ltext placement reasoning similarly bakes in an assumption about which segment cost dominates.
lld shouldn't be where workload-specific layout heuristics live. Prefer principled, composable primitives users wire up themselves -- and make .text.N/.got.N placement scriptable -- over shipping defaults that quietly assume one company's binary shape.