[RFC] Redefining the Large Code Model

Arthur Eubanks

May 7, 2026, 1:57:58 PM
to X86-64 System V Application Binary Interface

This RFC is in a Google Doc (https://docs.google.com/document/d/1F-ok3vCRVXuLoRBoM45UZ3eHpNsOZr4K_8AjEfEDPjA) and also pasted here for convenience. Preferably high level comments are posted in this thread and comments about specific details go in the doc, but any way is fine.


For background on the various x86-64 code models, https://eli.thegreenplace.net/2012/01/03/understanding-the-x64-code-models does a great job describing them.


At Google we tend to link our C++ binaries statically, and these keep getting larger over time for various reasons. This means we run into more and more relocation overflows. We've used the medium code model to add some headroom, but that's only a stopgap and we'd like a more permanent solution.


The current x86-64 large code model prevents relocation overflows, but has pretty drastic performance implications. We'd like to change the large code model to produce more efficient instruction sequences both for smaller binaries where relocation overflows don't happen (only pay for what you use) and also larger binaries that have performance hotspots. It seems that there are very few, if any, users of the large code model, so we'd like to improve upon it instead of creating a new code model.


This is an alternative to Meta's proposal that aims to solve the same issue (https://docs.google.com/document/d/1UspcVqzPNg99IDWkLlkVp5NdIYtNk0TENr3kmR_w8uQ), but we believe this is simpler to implement.


  • Function calls

    • The thunk discussion in https://groups.google.com/g/x86-64-abi/c/hz28LNnlBEc is exactly what we want for these proposed large code model changes. We can simply emit a normal call instruction that the linker will fix up if necessary, which doesn't have any overhead if the binary is small.

    • In lld there are some unresolved issues with AArch64 thunks creation, e.g. https://github.com/llvm/llvm-project/issues/61250; we'll make sure to address these.

  • Data references

    • In position independent code, instead of the long instruction sequence like https://godbolt.org/z/jcobvaocK, assume that there is a GOT within 2GB of any code in .ltext and emit the small code model extern global instruction sequence
        mov rax, qword ptr [rip + i@GOTPCREL]
        mov eax, dword ptr [rax]
      which is relaxable to
        lea rax, [rip + i]
        mov eax, dword ptr [rax]
      if the RIP-relative distance is encodable. This requires the static linker to split large text sections and emit multiple GOTs in larger binaries. We should emit this instruction sequence even for static globals that are known to be in the same DSO since they may be far away and unrelaxable. The small code model emits
        mov eax, dword ptr [rip + i]
      in the case of a static global, meaning in smaller binaries with the large code model we only pay an extra lea for static globals over the small code model. In larger binaries this results in extra GOT entries for unrelaxed data accesses and GOT loads, but I believe that's a good tradeoff for "only pay for what you use". This, alongside thunks, also obsoletes the requirement of having a "PIC register" available throughout the function which uses up a register.

    • To avoid the confusion of having multiple discontiguous .ltext sections with the same name, we can number extras like .ltext.1.

    • The extra GOTs will only be used for GOTPCREL/GOTPCRELX relocations since those relocations directly give you the address of the relevant GOT entry. Other GOT relocations can be used together in a sequence and we want to ensure that the same GOT is consistently used, so those will all point to the "main" GOT.

      • If we instead have every GOT-referencing relocation reference a nearby GOT, that may break functions that are split and use multiple GOT-referencing instructions together as they may reference different GOTs. 32-bit relocations aside from GOTPCREL/GOTPCRELX shouldn't be used in this large code model.

    • With multiple GOTs we'd like multiple RELRO segments

      • Currently it's unclear as to whether or not the spec allows multiple RELRO segments. This proposes that we explicitly allow multiple RELRO segments.

      • Various parts of the ecosystem would need to be updated to support multiple RELRO segments, the major one being glibc's dynamic loader (https://sourceware.org/git/?p=glibc.git;a=blob;f=elf/dl-reloc.c;h=26a1e7adfc4525525a1af8f8fa193dfa9e6b173b;hb=HEAD#l351).

    • This has implementation implications when paired with thunks since both are generating code/data that needs to be inserted every 2/4GB; I don't have a prototype yet to see how problematic this is but I will be working on a prototype.

    • https://github.com/llvm/llvm-project/pull/174508#issuecomment-3761064027 claims that CHERI supports multiple GOTs and CheriBSD supports multiple RELRO segments.

  • Interoperability with precompiled small code model libraries

    • Most of the world does not run into humongous binaries and so compiles their code with the default small code model. We'd like to make precompiled small code model libraries compatible with large code model built-from-source code.

    • The small code model puts code and data in non-SHF_X86_64_LARGE sections like .text, .data, etc. The large code model (and medium code model for globals above a certain size) puts data in SHF_X86_64_LARGE sections like .ldata. This allows some level of compatibility between small and large code model object files. We'd like to place large code model code in a SHF_X86_64_LARGE .ltext section which is laid out far away from the "small" data/text sections, meaning it does not contribute to relocation pressure of small code model object files linked into the binary.

      • Clang actually already does this as of https://github.com/llvm/llvm-project/pull/73037, and lld properly lays out .ltext as of https://github.com/llvm/llvm-project/pull/70358, but we'd like to standardize this.

      • Currently the layout of sections for PIC binaries as laid out by lld is
        .ltext .lrodata .rodata .text .data .bss .ldata .lbss

      • Ideally .ltext is placed somewhere between .lrodata and .ldata to maximize the number of relaxed data accesses, rather than e.g. before .lrodata where the distance to .ldata is further. Perhaps placing it after .bss and before .ldata will minimize the number of segments, since placing it between .rodata and .lrodata will split the RO segment?

  • Performance of large binaries

    • We'd like to keep the small code model data access patterns for built-from-source hot code. This means that rather than the code model unconditionally affecting an entire TU, we'd like the option to choose which individual symbols are "small" or "large" in the compiler. As long as the compiler can see that less than 2GB of the binary is hot, it can mark all of those symbols as small.

  • Linker scripts

    • Linker scripts currently don't really fit the model of partitioning a section an unknown number of times. There are a couple of proposed solutions:

    • The linker itself needs some special handling of partitioned text/GOT sections since we have to associate GOTPCREL relocations with the correct GOT

    • One solution to fit this into GNU ld's default linker script is to introduce a new linker script directive PARTITION_WITH_GOTS like
      .ltext : { PARTITION_WITH_GOTS(*.ltext) }
      which from a linker script point of view maps all input .ltext sections to one .ltext output section; this "virtual" .ltext section may actually be multiple .ltext.num/.lgot.num sections in the final binary.

      • lld doesn't have an explicit default linker script; rather it's implemented in code

    • It may also be possible to explicitly list out some N ltext/GOT sections, saying that the first 2GB of .ltext goes in .ltext.0, etc
      .got.ltext.0 : { ... }
      .ltext.0 : { PARTITION(*.ltext, 0, 2GB, .got.ltext.0) }
      .got.ltext.1 : { ... }
      .ltext.1 : { PARTITION(*.ltext, 2GB, 4GB, .got.ltext.1) }
      ... (up to some arbitrary N)

      • This is pretty hacky and unprincipled

    • Worst case we can claim that linker scripts don't work with partitioned text GOTs and that the implementation has to be done within the linker itself

    • We're still investigating other solutions here

  • TLS

    • As far as I'm aware, people are not running into TLS relocation overflows, so no change there.

  • Compatibility with the current large code model

    • The current large code model's instruction sequences are compatible with the new proposed large code model layout, so in the rare case there are large code model object files lying around, they will continue to link.

    • The main change in terms of compatibility is that currently large code model text goes in .text but will go in .ltext with this proposal, which won't be an issue. The previous large code model assumed that all code linked into the binary was built with the large code model, and the .text/.ltext separation is a relaxation of that assumption.

  • Replacing the existing large code model

    • This proposal assumes that nobody is depending on specifics of the current implementation of the large code model. If this isn't true, this proposal can be a new code model instead of replacing the existing large code model.
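
As a rough illustration of the data reference scheme above, the linker-side choice for a GOTPCREL/GOTPCRELX load could look like the following Python sketch. The function names, the example addresses, and the assumption that the symbol is non-preemptible are all made up for illustration; this is not how any particular linker implements it.

```python
# Sketch only: models the decision, not real linker code.
INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1


def fits_rel32(next_insn_addr: int, target_addr: int) -> bool:
    """Return True if target_addr is reachable with a signed 32-bit
    RIP-relative displacement measured from the end of the instruction."""
    return INT32_MIN <= target_addr - next_insn_addr <= INT32_MAX


def resolve_got_load(next_insn_addr: int, sym_addr: int, got_entry_addr: int) -> str:
    """Pick the final form of `mov rax, qword ptr [rip + i@GOTPCREL]`.

    Assumes the symbol is non-preemptible, so relaxing to lea is allowed.
    """
    if fits_rel32(next_insn_addr, sym_addr):
        # The symbol itself is in range: relax the GOT load into an lea.
        return "lea rax, [rip + i]"
    if fits_rel32(next_insn_addr, got_entry_addr):
        # The symbol is too far, but a GOT entry is in range: keep the load.
        return "mov rax, qword ptr [rip + i@GOTPCREL]"
    # Neither is reachable: the linker would need a GOT closer to this code.
    raise OverflowError("no GOT entry within reach of this instruction")


print(resolve_got_load(0x1000, 0x2000, 0x3000))
print(resolve_got_load(0x1000, 0x1000 + 2**31, 0x3000))
```

The second call is the case larger binaries pay for: the symbol is just past the 2GB reach, so the GOT load stays and a GOT entry is consumed.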


In summary, to prevent relocation overflows once and for all while keeping reasonable performance, we'd like to change the large code model. We want it to perform better, especially in cases where the final linked binary ends up being smaller, and be more compatible with prebuilt small code model code. The main proposed changes are thunks and partitioning .ltext with GOTs. If there is consensus that this makes sense, I'll send out a change to https://gitlab.com/x86-psABIs/x86-64-ABI updating the specification for the large code model.


Discussion is greatly appreciated, alongside a general "yes this is a good idea"/"no this is a bad idea".
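
To make "partitioning .ltext with GOTs" a bit more concrete, here is a minimal Python sketch of one possible greedy split. It is an assumption-laden toy: the partition limit is left as a parameter (a real linker would have to leave headroom for the GOT and thunks), the greedy in-order packing is just one possible policy, and the .ltext.N/.lgot.N output names are hypothetical.

```python
def partition_ltext(section_sizes, max_partition_bytes):
    """Greedily pack input .ltext sections, in order, into partitions that
    each stay within max_partition_bytes. Returns lists of input indices."""
    partitions = []
    used = 0
    for index, size in enumerate(section_sizes):
        if size > max_partition_bytes:
            raise ValueError(f"input section {index} does not fit in one partition")
        # Start a new partition when this section would overflow the current one.
        if not partitions or used + size > max_partition_bytes:
            partitions.append([])
            used = 0
        partitions[-1].append(index)
        used += size
    return partitions


def output_section_names(partition_count):
    """Hypothetical naming: one text section and one GOT per partition."""
    return [(f".ltext.{n}", f".lgot.{n}") for n in range(partition_count)]


parts = partition_ltext([3, 3, 3], max_partition_bytes=6)
print(parts)
print(output_section_names(len(parts)))
```

Each GOTPCREL/GOTPCRELX relocation in a partition would then be resolved against that partition's own GOT, while other GOT relocations keep using the main GOT.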

H.J. Lu

May 7, 2026, 5:23:59 PM
to Arthur Eubanks, X86-64 System V Application Binary Interface
On Fri, May 8, 2026 at 1:57 AM 'Arthur Eubanks' via X86-64 System V
Application Binary Interface <x86-6...@googlegroups.com> wrote:
>
> This RFC is in a Google Doc (https://docs.google.com/document/d/1F-ok3vCRVXuLoRBoM45UZ3eHpNsOZr4K_8AjEfEDPjA) and also pasted here for convenience. Preferably high level comments are posted in this thread and comments about specific details go in the doc, but any way is fine.
>
>
> For background on the various x86-64 code models, https://eli.thegreenplace.net/2012/01/03/understanding-the-x64-code-models does a great job describing them.
>
>
> At Google we tend to link our C++ binaries statically, and these keep getting larger over time for various reasons. This means we run into more and more relocation overflows. We've used the medium code model to add some headroom, but that's only a stopgap and we'd like a more permanent solution.
>
>
> The current x86-64 large code model prevents relocation overflows, but has pretty drastic performance implications. We'd like to change the large code model to produce more efficient instruction sequences both for smaller binaries where relocation overflows don't happen (only pay for what you use) and also larger binaries that have performance hotspots. It seems that there are very few, if any, users of the large code model, so we'd like to improve upon it instead of creating a new code model.
>
>
> This is an alternative to Meta's proposal that aims to solve the same issue (https://docs.google.com/document/d/1UspcVqzPNg99IDWkLlkVp5NdIYtNk0TENr3kmR_w8uQ), but we believe this is simpler to implement.

I really like this approach.

>
> Function calls
>
> The thunk discussion in https://groups.google.com/g/x86-64-abi/c/hz28LNnlBEc is exactly what we want for these proposed large code model changes. We can simply emit a normal call instruction that the linker will fix up if necessary, which doesn't have any overhead if the binary is small.
>
> In lld there are some unresolved issues with AArch64 thunks creation, e.g. https://github.com/llvm/llvm-project/issues/61250; we'll make sure to address these.
>
> Data references
>
> In position independent code, instead of the long instruction sequence like https://godbolt.org/z/jcobvaocK, assume that there is a GOT within 2GB of any code in .ltext and emit the small code model extern global instruction sequence

I think in this large code model, call over GOT should apply to both
PIE and PDE.

> mov rax, qword ptr [rip + i@GOTPCREL]
> mov eax, dword ptr [rax]
> which is relaxable to
> lea rax, [rip + i]
> mov eax, dword ptr [rax]
> if the RIP-relative distance is encodable. This requires the static linker to split large text sections and emit multiple GOTs in larger binaries. We should emit this instruction sequence even for static globals that are known to be in the same DSO since they may be far away and unrelaxable. The small code model emits
> mov eax, dword ptr [rip + i]
> in the case of a static global, meaning in smaller binaries with the large code model we only pay an extra lea for static globals over the small code model. In larger binaries this results in extra GOT entries for unrelaxed data accesses and GOT loads, but I believe that's a good tradeoff for "only pay for what you use". This, alongside thunks, also obsoletes the requirement of having a "PIC register" available throughout the function which uses up a register.
>
> To avoid the confusion of having multiple discontiguous .ltext sections with the same name, we can number extras like .ltext.1.
>
> The extra GOTs will only be used for GOTPCREL/GOTPCRELX relocations since those relocations directly give you the address of the relevant GOT entry. Other GOT relocations can be used together in a sequence and we want to ensure that the same GOT is consistently used, so those will all point to the "main" GOT.
>

Two GOTPCREL relocations are needed: one for instructions which can be
relaxed and one for instructions which can't. Should we add new GOTPCREL
relocations to tell the linker to prepare for multiple GOTs, so that
linkers without multi-GOT support won't generate corrupted outputs by
accident?

> If we instead have every GOT-referencing relocation reference a nearby GOT, that may break functions that are split and use multiple GOT-referencing instructions together as they may reference different GOTs. 32-bit relocations aside from GOTPCREL/GOTPCRELX shouldn't be used in this large code model.

Does it mean that all access to symbols, data or functions, must go through
GOT? What about labels/symbols local to the function?

> With multiple GOTs we'd like multiple RELRO segments
>
> Currently it's unclear as to whether or not the spec allows multiple RELRO segments. This proposes that we explicitly allow multiple RELRO segments.
>
> Various parts of the ecosystem would need to be updated to support multiple RELRO segments, the major one being glibc's dynamic loader (https://sourceware.org/git/?p=glibc.git;a=blob;f=elf/dl-reloc.c;h=26a1e7adfc4525525a1af8f8fa193dfa9e6b173b;hb=HEAD#l351).
>
> This has implementation implications when paired with thunks since both are generating code/data that needs to be inserted every 2/4GB; I don't have a prototype yet to see how problematic this is but I will be working on a prototype.
>
> https://github.com/llvm/llvm-project/pull/174508#issuecomment-3761064027 claims that CHERI supports multiple GOTs and CheriBSD supports multiple RELRO segments.
>
> Interoperability with precompiled small code model libraries
>
> Most of the world does not run into humongous binaries and so compiles their code with the default small code model. We'd like to make precompiled small code model libraries compatible with large code model built-from-source code.
>
> The small code model puts code and data in non-SHF_X86_64_LARGE sections like .text, .data, etc. The large code model (and medium code model for globals above a certain size) puts data in SHF_X86_64_LARGE sections like .ldata. This allows some level of compatibility between small and large code model object files. We'd like to place large code model code in a SHF_X86_64_LARGE .ltext section which is laid out far away from the "small" data/text sections, meaning it does not contribute to relocation pressure of small code model object files linked into the binary.
>
> Clang actually already does this as of https://github.com/llvm/llvm-project/pull/73037, and lld properly lays out .ltext as of https://github.com/llvm/llvm-project/pull/70358, but we'd like to standardize this.
>
> Currently the layout of sections for PIC binaries as laid out by lld is
> .ltext .lrodata .rodata .text .data .bss .ldata .lbss
>
> Ideally .ltext is placed somewhere between .lrodata and .ldata to maximize the number of relaxed data accesses, rather than e.g. before .lrodata where the distance to .ldata is further. Perhaps placing it after .bss and before .ldata will minimize the number of segments, since placing it between .rodata and .lrodata will split the RO segment?

We can fine tune the code layout later.

> Performance of large binaries
>
> We'd like to keep the small code model data access patterns for built-from-source hot code. This means that rather than the code model unconditionally affecting an entire TU, we'd like the option to choose which individual symbols are "small" or "large" in the compiler. As long as the compiler can see that less than 2GB of the binary is hot, it can mark all of those symbols as small.

I think new relocations may help here.

> Linker scripts
>
> Linker scripts currently don't really fit the model of partitioning a section an unknown number of times, there are a couple of proposed solutions
>
> The linker itself needs some special handling of partitioned text/GOT sections since we have to associate GOTPCREL relocations with the correct GOT
>
> One solution to fit this into GNU ld's default linker script is to introduce a new linker script directive PARTITION_WITH_GOTS like
> .ltext : { PARTITION_WITH_GOTS(*.ltext) }
> which from a linker script point of view maps all input .ltext sections to one .ltext output section; this "virtual" .ltext section may actually be multiple .ltext.num/.lgot.num sections in the final binary.
>
> lld doesn't have an explicit default linker script; rather it's implemented in code
>
> It may also be possible to explicitly list out some N ltext/GOT sections, saying that the first 2GB of .ltext goes in .ltext.0, etc
> .got.ltext.0 : { ... }
> .ltext.0 : { PARTITION(*.ltext, 0, 2GB, .got.ltext.0) }
> .got.ltext.1 : { ... }
> .ltext.1 : { PARTITION(*.ltext, 2GB, 4GB, .got.ltext.1) }
> ... (up to some arbitrary N)
>
> This is pretty hacky and unprincipled

We can deal with it later when implementing this in BFD linker.

> Worst case we can claim that linker scripts don't work with partitioned text GOTs and that the implementation has to be done within the linker itself
>
> We're still investigating other solutions here
>
> TLS
>
> As far as I'm aware, people are not running into TLS relocation overflows, so no change there.
>
> Compatibility with the current large code model
>
> The current large code model's instruction sequences are compatible with the new proposed large code model layout, so in the rare case there are large code model object files laying around, they will continue to link.

Since glibc doesn't support the current large code model anyway:

https://bugzilla.redhat.com/show_bug.cgi?id=1713891

we don't need to consider compatibility with the current large code model.

> The main change in terms of compatibility is that currently large code model text goes in .text but will go in .ltext with this proposal, which won't be an issue. The previous large code model assumed that all code linked into the binary was built with the large code model, and the .text/.ltext separation is a relaxation of that assumption.
>
> Replacing the existing large code model
>
> This proposal assumes that nobody is depending on specifics of the current implementation of the large code model. If this isn't true, this proposal can be a new code model instead of replacing the existing large code model.
>

That is true since glibc doesn't support it.

>
> In summary, in order to for once and for all prevent relocation overflows but with reasonable performance we'd like to change the large code model. We want it to perform better, especially in cases where the final linked binary ends up being smaller, and be more compatible with prebuilt small code model code. The main proposed changes are thunks and partitioning .ltext with GOTs. If there is consensus that this makes sense, I'll send out a change to https://gitlab.com/x86-psABIs/x86-64-ABI updating the specification for the large code model.
>
>
> Discussion is greatly appreciated, alongside a general "yes this is a good idea"/"no this is a bad idea".
>

Yes, this is an excellent idea.


--
H.J.

H.J. Lu

May 8, 2026, 10:49:00 PM
to Arthur Eubanks, X86-64 System V Application Binary Interface
One more thing. The C/C++ small model run-time libraries are
incompatible with the current large model. Will the new large
model be compatible with the C/C++ small model run-time libraries?

--
H.J.

Fangrui Song

May 11, 2026, 3:27:08 AM
to X86-64 System V Application Binary Interface
On Thursday, May 7, 2026 at 2:23:59 PM UTC-7 hjl....@gmail.com wrote:
On Fri, May 8, 2026 at 1:57 AM 'Arthur Eubanks' via X86-64 System V
Application Binary Interface <x86-6...@googlegroups.com> wrote:
>
> This RFC is in a Google Doc (https://docs.google.com/document/d/1F-ok3vCRVXuLoRBoM45UZ3eHpNsOZr4K_8AjEfEDPjA) and also pasted here for convenience. Preferably high level comments are posted in this thread and comments about specific details go in the doc, but any way is fine.
>
>
> For background on the various x86-64 code models, https://eli.thegreenplace.net/2012/01/03/understanding-the-x64-code-models does a great job describing them.
>
>
> At Google we tend to link our C++ binaries statically, and these keep getting larger over time for various reasons. This means we run into more and more relocation overflows. We've used the medium code model to add some headroom, but that's only a stopgap and we'd like a more permanent solution.
>
>
> The current x86-64 large code model prevents relocation overflows, but has pretty drastic performance implications. We'd like to change the large code model to produce more efficient instruction sequences both for smaller binaries where relocation overflows don't happen (only pay for what you use) and also larger binaries that have performance hotspots. It seems that there are very few, if any, users of the large code model, so we'd like to improve upon it instead of creating a new code model.
>
>
> This is an alternative to Meta's proposal that aims to solve the same issue (https://docs.google.com/document/d/1UspcVqzPNg99IDWkLlkVp5NdIYtNk0TENr3kmR_w8uQ), but we believe this is simpler to implement.

I really like this approach.


I like the range-extension-thunks + multi-GOT + multi-RELRO direction. It solves the two reach problems (.text to .text via thunks, .text to data via partitioned GOTs) while keeping the "only pay for what you use" property.

One area the proposal doesn't cover: .text to .eh_frame/.eh_frame_hdr.

>
> Function calls
>
> The thunk discussion in https://groups.google.com/g/x86-64-abi/c/hz28LNnlBEc is exactly what we want for these proposed large code model changes. We can simply emit a normal call instruction that the linker will fix up if necessary, which doesn't have any overhead if the binary is small.
>
> In lld there are some unresolved issues with AArch64 thunks creation, e.g. https://github.com/llvm/llvm-project/issues/61250; we'll make sure to address these.


I believe H.J. has previously floated using call *foo@GOTPCREL(%rip) (6 bytes, indirecting through a .got.plt slot) in place of thunks -- essentially -fno-plt extended to all calls, including local ones.
It trades one byte per callsite plus an indirect-branch predictor cost against not needing the linker to materialize and place thunks, and shifts the model from "pay-when-needed" (thunks only appear once reach is actually exceeded) to "pay-always at every callsite."
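
To put rough numbers on the code-size side of this tradeoff, here is a small Python sketch. The lengths of the two call forms are the standard encodings (5 bytes for call rel32, 6 bytes for the RIP-relative indirect call) and a GOT slot is 8 bytes; the thunk size is a placeholder assumption since no thunk sequence is fixed in this thread, and the workload numbers are invented.

```python
DIRECT_CALL_BYTES = 5   # call rel32
GOT_CALL_BYTES = 6      # call *foo@GOTPCREL(%rip)
GOT_SLOT_BYTES = 8      # one 64-bit address per distinct callee


def thunk_model_bytes(callsites: int, thunks: int, thunk_bytes: int = 16) -> int:
    """Bytes for direct calls plus linker-inserted thunks.

    thunk_bytes is an assumed size, not taken from any specification.
    """
    return callsites * DIRECT_CALL_BYTES + thunks * thunk_bytes


def got_call_model_bytes(callsites: int, distinct_callees: int) -> int:
    """Bytes for indirect calls plus one GOT slot per distinct callee."""
    return callsites * GOT_CALL_BYTES + distinct_callees * GOT_SLOT_BYTES


# Invented workload: small enough that no thunks are needed.
callsites, callees, thunks = 1_000_000, 100_000, 0
print(thunk_model_bytes(callsites, thunks))
print(got_call_model_bytes(callsites, callees))
```

This only counts bytes; it says nothing about the indirect-branch predictor cost, which needs a real measurement.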

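AFAIK nobody has actually compared the two on a realistic workload; it would be worth measuring before settling on thunks.

> Various parts of the ecosystem would need to be updated to support multiple RELRO segments, the major one being glibc's dynamic loader (https://sourceware.org/git/?p=glibc.git;a=blob;f=elf/dl-reloc.c;h=26a1e7adfc4525525a1af8f8fa193dfa9e6b173b;hb=HEAD#l351).

Yes, also musl's dynamic loader.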
 

>
> This has implementation implications when paired with thunks since both are generating code/data that needs to be inserted every 2/4GB; I don't have a prototype yet to see how problematic this is but I will be working on a prototype.
>
> https://github.com/llvm/llvm-project/pull/174508#issuecomment-3761064027 claims that CHERI supports multiple GOTs and CheriBSD supports multiple RELRO segments.
>
> Interoperability with precompiled small code model libraries
>
> Most of the world does not run into humongous binaries and so compiles their code with the default small code model. We'd like to make precompiled small code model libraries compatible with large code model built-from-source code.
>
> The small code model puts code and data in non-SHF_X86_64_LARGE sections like .text, .data, etc. The large code model (and medium code model for globals above a certain size) puts data in SHF_X86_64_LARGE sections like .ldata. This allows some level of compatibility between small and large code model object files. We'd like to place large code model code in a SHF_X86_64_LARGE .ltext section which is laid out far away from the "small" data/text sections, meaning it does not contribute to relocation pressure of small code model object files linked into the binary.
>
> Clang actually already does this as of https://github.com/llvm/llvm-project/pull/73037, and lld properly lays out .ltext as of https://github.com/llvm/llvm-project/pull/70358, but we'd like to standardize this.
>
> Currently the layout of sections for PIC binaries as laid out by lld is
> .ltext .lrodata .rodata .text .data .bss .ldata .lbss
>
> Ideally .ltext is placed somewhere between .lrodata and .ldata to maximize the number of relaxed data accesses, rather than e.g. before .lrodata where the distance to .ldata is further. Perhaps placing it after .bss and before .ldata will minimize the number of segments, since placing it between .rodata and .lrodata will split the RO segment?

We can fine tune the code layout later.

> Performance of large binaries
>
> We'd like to keep the small code model data access patterns for built-from-source hot code. This means that rather than the code model unconditionally affecting an entire TU, we'd like the option to choose which individual symbols are "small" or "large" in the compiler. As long as the compiler can see that less than 2GB of the binary is hot, it can mark all of those symbols as small.

I think new relocations may help here.

Do we really need new GOTPCREL relocations? Unpatched linkers will run into relocation overflow issues and report errors, as expected. 


> Linker scripts
>
> Linker scripts currently don't really fit the model of partitioning a section an unknown number of times, there are a couple of proposed solutions
>
> The linker itself needs some special handling of partitioned text/GOT sections since we have to associate GOTPCREL relocations with the correct GOT
>
> One solution to fit this into GNU ld's default linker script is to introduce a new linker script directive PARTITION_WITH_GOTS like
> .ltext : { PARTITION_WITH_GOTS(*.ltext) }
> which from a linker script point of view maps all input .ltext sections to one .ltext output section; this "virtual" .ltext section may actually be multiple .ltext.num/.lgot.num sections in the final binary.
>
> lld doesn't have an explicit default linker script; rather it's implemented in code
>
> It may also be possible to explicitly list out some N ltext/GOT sections, saying that the first 2GB of .ltext goes in .ltext.0, etc
> .got.ltext.0 : { ... }
> .ltext.0 : { PARTITION(*.ltext, 0, 2GB, .got.ltext.0) }
> .got.ltext.1 : { ... }
> .ltext.1 : { PARTITION(*.ltext, 2GB, 4GB, .got.ltext.1) }
> ... (up to some arbitrary N)
>
> This is pretty hacky and unprincipled

We can deal with it later when implementing this in BFD linker.

A real design pass on linker scripts is overdue here.
INSERT AFTER / INSERT BEFORE already exist in GNU ld and lld and are exactly the kind of composable primitive worth leaning on.

Stepping back, this is my biggest concern about both proposals: they risk introducing knobs and defaults to lld that overfit company-specific binary shapes.
Meta's --sort-text-by-code-model is the clearest example: it is tuned for their binaries and doesn't fit into upstream lld.
Google's .ltext placement reasoning similarly bakes in an assumption about which segment cost dominates.
lld shouldn't be where workload-specific layout heuristics live. Prefer principled, composable primitives users wire up themselves -- and make .text.N/.got.N placement scriptable -- over shipping defaults that quietly assume one company's binary shape.
 


> Worst case we can claim that linker scripts don't work with partitioned text GOTs and that the implementation has to be done within the linker itself
>
> We're still investigating other solutions here
>
> TLS
>
> As far as I'm aware, people are not running into TLS relocation overflows, so no change there.
>
> Compatibility with the current large code model
>
> The current large code model's instruction sequences are compatible with the new proposed large code model layout, so in the rare case there are large code model object files laying around, they will continue to link.

Since glibc doesn't support the current large code model anyway:

https://bugzilla.redhat.com/show_bug.cgi?id=1713891

we don't need to consider compatibility with the current large code model.


The GNU ld failure is in crtbeginS.o -- a GCC object file built with the small code model with a tight data reference.
This example actually links with lld, which places .rodata before .text.

 
> The main change in terms of compatibility is that currently large code model text goes in .text but will go in .ltext with this proposal, which won't be an issue. The previous large code model assumed that all code linked into the binary was built with the large code model, and the .text/.ltext separation is a relaxation of that assumption.
>
> Replacing the existing large code model
>
> This proposal assumes that nobody is depending on specifics of the current implementation of the large code model. If this isn't true, this proposal can be a new code model instead of replacing the existing large code model.
>

That is true since glibc doesn't support it.

That said, I do agree that the current large code model is effectively unusable in practice, and redesigning it is the right call.

Farid Zakaria

May 11, 2026, 4:15:36 PM
to X86-64 System V Application Binary Interface
Hey Arthur!

Nice to see the RFC :)
I am traveling abroad so please excuse any omissions/mistakes in my response but I wanted to share some early feedback and early thoughts.

1. I continue to be confused & surprised at the performance of the large code model. I've reached the depth of my hardware knowledge, but I would be interested to know if others have experience with where the performance finally does suffer. We've observed a <2% CPU performance difference on various workloads without any new relaxations. I also explored new large code model relaxations and got the difference potentially even lower.

2. It might be good for us to each write up a section on how we think our proposals differ. I actually think they are nearly identical where it matters. IIRC, the only difference was that at Meta we had thought to have a single text section, since with thunks there is no "large" code type any longer, and merely intersperse the GOTs from the start, unifying the code. After having come across a few bugs in lld, bolt & GCC where the lack of support for prefixing a section with "l" caused bugs, I am more in favor of this unification, but I think it's ultimately minor. I am motivated to help us get to larger binaries and willing & happy to consolidate where it makes sense & as needed. Unifying the text and multiple GOTs from the start means we no longer need to segment the binary into "hot/small" and "cold/large" areas FWIW.

3. We discussed that thunks require %r11 as a new scratch register even for calls within the same function. Does that potentially break ABI compatibility with previously compiled objects?

4. I noticed now we are mainly suffering from relocation overflows from .gcc_except_table -> .data since these two sections tend to bookend our binaries. gcc_except_table and eh_frame both include encodings that are merged from the TUs that comprise the binary. If any of these objects are small-code, then they are 4-bytes and restricted to the same 2GiB offset potentially. I've been pushing & exploring an approach to automatically expand all entries as needed to 8 bytes (outdated but relevant: https://github.com/llvm/llvm-project/compare/main...fzakaria:llvm-project:eh-frame-relax-reverse?expand=1). Without this functionality and similar support for eh_frame_hdr (https://github.com/llvm/llvm-project/pull/179089), it makes re-using small code model objects a non-starter.
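To make the 4-byte versus 8-byte point in item 4 concrete, here is a minimal Python sketch of the size decision for one PC-relative entry. The DW_EH_PE_* values are the standard exception-handling pointer encodings; the helper name `pick_encoding` and the sample addresses are made up for illustration and are not taken from the linked patches.

```python
# Standard DWARF EH pointer-encoding constants.
DW_EH_PE_sdata4 = 0x0B  # signed 4-byte value
DW_EH_PE_sdata8 = 0x0C  # signed 8-byte value
DW_EH_PE_pcrel = 0x10   # value is relative to the field's own address

INT32_MIN, INT32_MAX = -(1 << 31), (1 << 31) - 1


def pick_encoding(field_addr: int, target_addr: int) -> int:
    """Return the smallest signed pcrel encoding that can hold the offset."""
    offset = target_addr - field_addr
    if INT32_MIN <= offset <= INT32_MAX:
        return DW_EH_PE_pcrel | DW_EH_PE_sdata4
    return DW_EH_PE_pcrel | DW_EH_PE_sdata8


# Hypothetical layout: the table near the start, one target 1MiB away and
# another 3GiB away (for example .data at the far end of the binary).
near = pick_encoding(0x40_0000, 0x40_0000 + (1 << 20))
far = pick_encoding(0x40_0000, 0x40_0000 + 3 * (1 << 30))
print(hex(near), hex(far))  # 0x1b 0x1c
```

A producer that hard-codes the 4-byte form takes the first branch unconditionally, which is where the overflow would come from once the two sections are more than 2GiB apart.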

Arthur Eubanks

May 11, 2026, 5:21:11 PM (2 days ago) May 11
to H.J. Lu, X86-64 System V Application Binary Interface
> I think in this large code model, call over GOT should apply to both
> PIE and PDE.

Do you mean we shouldn't have a separate PDE large code model? I agree with that, if you're using the large code model, you're already not getting maximal performance in all cases.

>   mov rax, qword ptr [rip + i@GOTPCREL]
>   mov eax, dword ptr [rax]
> which is relaxable to
>   lea rax, [rip + i]
>   mov eax, dword ptr [rax]
> if the RIP-relative distance is encodable. This requires the static linker to split large text sections and emit multiple GOTs in larger binaries. We should emit this instruction sequence even for static globals that are known to be in the same DSO since they may be far away and unrelaxable. The small code model emits
>   mov eax, dword ptr [rip + i]
> in the case of a static global, meaning in smaller binaries with the large code model we only pay an extra lea for static globals over the small code model. In larger binaries this results in extra GOT entries for unrelaxed data accesses and GOT loads, but I believe that's a good tradeoff for "only pay for what you use". This, alongside thunks, also obsoletes the requirement of having a "PIC register" available throughout the function which uses up a register.
>
> To avoid the confusion of having multiple discontiguous .ltext sections with the same name, we can number extras like .ltext.1.
>
> The extra GOTs will only be used for GOTPCREL/GOTPCRELX relocations since those relocations directly give you the address of the relevant GOT entry. Other GOT relocations can be used together in a sequence and we want to ensure that the same GOT is consistently used, so those will all point to the "main" GOT.
>

> 2 GOTPCREL relocations are needed.  One is for instructions which can't
> be relaxed and the other can.   Should we add new GOTPCREL relocations
> to tell the linker to prepare for multiple GOT so that the unsupported
> linkers won't generate corrupted outputs by accident.

As maskray said, we won't have corrupted outputs, only relocation overflows. It does seem nice to have a way to indicate that only certain GOTPCREL relocations may require new GOTs. I'm unsure how necessary this is though, as we could either define GOTPCREL to point to a nearby GOT (main GOT if binary is small enough) and not introduce new relocations, or we could define GOTPCREL to point to the "main" GOT and a new GOTPCREL2 relocation to point to a nearby GOT (main GOT if binary is small enough). I don't see a huge benefit in new relocations, but perhaps prototyping this would give guidance.

If we do create new relocations, they also need REX/REX2 variants?
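As a rough illustration of the "nearby GOT, main GOT if the binary is small enough" option, here is a Python sketch of how a linker might resolve one reference. The function names, the return tuples and all addresses are hypothetical, and it simplifies by treating each GOT as a single address rather than looking at the individual entry.

```python
INT32_MIN, INT32_MAX = -(1 << 31), (1 << 31) - 1


def fits_rel32(next_insn_addr: int, target_addr: int) -> bool:
    """RIP-relative operands are a signed 32-bit offset from the next instruction."""
    return INT32_MIN <= target_addr - next_insn_addr <= INT32_MAX


def resolve_gotpcrel(next_insn_addr, symbol_addr, got_addrs, relaxable):
    """Decide how one GOTPCREL-style reference could be resolved.

    got_addrs holds the hypothetical address of each GOT, main GOT first.
    Returns ("relax", None), ("got", index) or ("overflow", None).
    """
    # GOTPCRELX-style relaxation: the mov through the GOT becomes a direct lea.
    if relaxable and fits_rel32(next_insn_addr, symbol_addr):
        return ("relax", None)
    # Otherwise use the first GOT in reach; listing the main GOT first means
    # small binaries keep using it.
    for index, got_addr in enumerate(got_addrs):
        if fits_rel32(next_insn_addr, got_addr):
            return ("got", index)
    return ("overflow", None)


GIB = 1 << 30
gots = [0x1000, 5 * GIB]  # main GOT plus one extra GOT (made-up addresses)
print(resolve_gotpcrel(0x2000, 0x3000, gots, relaxable=True))    # ('relax', None)
print(resolve_gotpcrel(0x2000, 4 * GIB, gots, relaxable=True))   # ('got', 0)
print(resolve_gotpcrel(4 * GIB, 0x3000, gots, relaxable=False))  # ('got', 1)
```

A linker that only knows about the main GOT corresponds to passing a one-element `got_addrs`, in which case the third call reports an overflow rather than silently producing a wrong address.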


> If we instead have every GOT-referencing relocation reference a nearby GOT, that may break functions that are split and use multiple GOT-referencing instructions together as they may reference different GOTs. 32-bit relocations aside from GOTPCREL/GOTPCRELX shouldn't be used in this large code model.

> Does it mean that all access to symbols, data or functions, must go through
> GOT? What about labels/symbols local to the function?

We shouldn't worry about an individual section being larger than 2GB (at least for now), so references to labels/symbols within a section can be like the small code model. Yes, all other references must go through the GOT.

> Compatibility with the current large code model
>
> The current large code model's instruction sequences are compatible with the new proposed large code model layout, so in the rare case there are large code model object files laying around, they will continue to link.

> Since glibc doesn't support the current large code model anyway:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1713891
>
> we don't need to consider compatibility with the current large code model.

As maskray mentioned, this is because non-large data is taking up too much space which doesn't align with the small code model. I think the actual failure mode with the current large code model is slightly different. If you have too much text, which the current large code model puts in .text, then the small code model crtbeginS.o may have relocation overflows reaching across .text. The part of the proposal about compatibility with precompiled small code model code (e.g. crtbeginS.o) should address this by putting large code model text in .ltext, making sure small code model sections are placed contiguously, unaffected by linked in large code model symbols, and therefore don't span more than 2GB.
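The failure mode above can be sketched numerically. The following Python snippet lays out sections in order and measures how far the small code model sections span; the section sizes, the base address and the choice to place .ltext first are assumptions for illustration only, and alignment and segment boundaries are ignored.

```python
GIB = 1 << 30


def lay_out(sections, base=0x40_0000):
    """Assign addresses in order; sections is a list of (name, size) pairs."""
    addrs, cursor = {}, base
    for name, size in sections:
        addrs[name] = (cursor, cursor + size)
        cursor += size
    return addrs


def small_span(addrs, small_names):
    """Distance from the lowest start to the highest end of the small sections."""
    starts = [addrs[n][0] for n in small_names]
    ends = [addrs[n][1] for n in small_names]
    return max(ends) - min(starts)


small = [".text", ".rodata", ".data"]
# Current large code model: all code lands in .text, stretching the small region.
current = lay_out([(".text", 3 * GIB), (".rodata", GIB // 4), (".data", GIB // 4)])
# Proposed: large code goes to .ltext, placed outside the contiguous small region.
proposed = lay_out([(".ltext", 3 * GIB), (".text", GIB // 8),
                    (".rodata", GIB // 4), (".data", GIB // 4)])
print(small_span(current, small) < 2 * GIB)   # False
print(small_span(proposed, small) < 2 * GIB)  # True
```

In the second layout the precompiled small code model pieces only ever reach across the small region, regardless of how large .ltext grows.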

> One more thing.  The C/C++ small model run-time libraries are
> incompatible with the current large model.   Will the new large
> model be compatible with the C/C++ small model run-time libraries?

If I understand your question correctly, the part of the proposal about compatibility with precompiled small code model code should address this?

Arthur Eubanks

May 11, 2026, 7:13:44 PM (2 days ago) May 11
to Fangrui Song, X86-64 System V Application Binary Interface
> One area the proposal doesn't cover: .text to .eh_frame/.eh_frame_hdr

eh_frame/eh_frame_hdr allow 8 byte values, even though most producers emit 4 byte values. I was under the impression that since the spec allowed 8 byte values, we don't need any spec changes but need to fix up producers. Is there anything else I'm missing?

 
>
> Function calls
>
> The thunk discussion in https://groups.google.com/g/x86-64-abi/c/hz28LNnlBEc is exactly what we want for these proposed large code model changes. We can simply emit a normal call instruction that the linker will fix up if necessary, which doesn't have any overhead if the binary is small.
>
> In lld there are some unresolved issues with AArch64 thunks creation, e.g. https://github.com/llvm/llvm-project/issues/61250; we'll make sure to address these.


> I believe H.J. has previously floated using call *foo@GOTPCREL(%rip) (6 bytes, indirecting through a .got.plt slot) in place of thunks -- essentially -fno-plt extended to all calls, including local ones.
> It trades one byte per callsite plus an indirect-branch predictor cost against not needing the linker to materialize and place thunks, and shifts the model from "pay-when-needed" (thunks only appear once reach is actually exceeded) to "pay-always at every callsite."
>
> AFAIK nobody has actually compared the two on a realistic workload; it would be worth measuring before settling on thunks.
 
Yes, we should do this experiment. We can do it on a smaller binary to get the overhead, but to test on a larger binary we'd need this proposal prototyped.
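Before running that experiment, the static size side of the tradeoff can be estimated on paper. This Python sketch is only a back-of-the-envelope model: the 5- and 6-byte call encodings are the standard ones, but the 13-byte thunk (a movabs into %r11 followed by an indirect jmp) and the workload numbers are assumptions, and the model says nothing about branch predictor cost, which is the part that needs measuring.

```python
CALL_REL32 = 5    # E8 + rel32
CALL_GOT = 6      # FF 15 + disp32, i.e. call *foo@GOTPCREL(%rip)
THUNK_BYTES = 13  # assumption: movabs $target, %r11 (10) + jmp *%r11 (3)
GOT_SLOT = 8      # one 64-bit GOT entry per distinct callee


def thunk_model_bytes(callsites: int, thunks: int) -> int:
    """Direct calls everywhere, plus one thunk per target that needs one."""
    return callsites * CALL_REL32 + thunks * THUNK_BYTES


def got_call_model_bytes(callsites: int, distinct_callees: int) -> int:
    """Indirect calls everywhere, plus one GOT slot per distinct callee."""
    return callsites * CALL_GOT + distinct_callees * GOT_SLOT


# Made-up small binary: every call is in range, so no thunks are created.
small_thunks = thunk_model_bytes(callsites=1_000_000, thunks=0)
small_got = got_call_model_bytes(callsites=1_000_000, distinct_callees=100_000)
print(small_thunks, small_got)  # 5000000 6800000
```

With zero thunks the first model is identical to the small code model, which is the "pay-when-needed" property described above.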

> Linker scripts
>
> Linker scripts currently don't really fit the model of partitioning a section an unknown number of times, there are a couple of proposed solutions
>
> The linker itself needs some special handling of partitioned text/GOT sections since we have to associate GOTPCREL relocations with the correct GOT
>
> One solution to fit this into GNU ld's default linker script is to introduce a new linker script directive PARTITION_WITH_GOTS like
> .ltext : { PARTITION_WITH_GOTS(*.ltext) }
> which from a linker script point of view maps all input .ltext sections to one .ltext output section; this "virtual" .ltext section may actually be multiple .ltext.num/.lgot.num sections in the final binary.
>
> lld doesn't have an explicit default linker script; rather it's implemented in code
>
> It may also be possible to explicitly list out some N ltext/GOT sections, saying that the first 2GB of .ltext goes in .ltext.0, etc
> .got.ltext.0 : { ... }
> .ltext.0 : { PARTITION(*.ltext, 0, 2GB, .got.ltext.0) }
> .got.ltext.1 : { ... }
> .ltext.1 : { PARTITION(*.ltext, 2GB, 4GB, .got.ltext.1) }
> ... (up to some arbitrary N)
>
> This is pretty hacky and unprincipled

> We can deal with it later when implementing this in BFD linker.
>
> A real design pass on linker scripts is overdue here.
> INSERT AFTER / INSERT BEFORE already exist in GNU ld and lld and are exactly the kind of composable primitive worth leaning on.

We do use a linker script and we care about e.g. alignment of segments specified in linker scripts, so we want a solution that works with linker scripts. As someone with linker script thoughts, what are your thoughts on something like the proposed PARTITION_WITH_GOTS inside an INSERT AFTER/BEFORE? I think I'd need to actually implement a prototype to understand the ramifications, e.g. ordering of section creation/address assignment within the linker.
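For reference while prototyping, the partitioning step itself is simple to state. Below is a Python sketch of a greedy split of input .ltext sections into numbered partitions, each paired with its own GOT. The `partition` helper, the 2GiB limit and the input sizes are assumptions for illustration; a real limit would have to be lower so that the paired GOT stays within reach of all code in its partition.

```python
GIB = 1 << 30


def partition(input_sizes, limit=2 * GIB):
    """Greedily pack input section sizes, in order, into partitions <= limit."""
    partitions, current, used = [], [], 0
    for size in input_sizes:
        if size > limit:
            raise ValueError("a single input section exceeds the partition limit")
        # Start a new partition when the next section would not fit.
        if current and used + size > limit:
            partitions.append(current)
            current, used = [], 0
        current.append(size)
        used += size
    if current:
        partitions.append(current)
    return partitions


sizes = [GIB, GIB // 2, GIB, GIB // 2, GIB]  # made-up input .ltext sizes
parts = partition(sizes)
for n, part in enumerate(parts):
    print(f".lgot.{n} + .ltext.{n}: {sum(part) / GIB:.1f} GiB")
```

The number of partitions is only known after all inputs have been seen, which is the part that doesn't map cleanly onto a linker script listing a fixed set of output sections.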
 

> Stepping back, this is my biggest concern about both proposals: they risk introducing knobs and defaults to lld that overfit company-specific binary shapes.
> Meta's --sort-text-by-code-model is the clearest example tuned for their examples and doesn't fit into upstream lld.
> Google's .ltext placement reasoning similarly bakes in an assumption about which segment cost dominates.
 
I did state segment count as my reason for putting .ltext there, but that was a fairly arbitrary reason; the cost of segments probably isn't going to matter much. It was an "all else being equal, it'd be nice to reduce the number of segments" argument. I think the most important thing is that for binaries on the edge, we get as many relaxations as possible via section placement, which means smaller GOTs and fewer startup dynamic relocations. Perhaps there's something else I'm missing that people may care about, but that's the main thing that comes to mind, so IMO it seems reasonable to use the number of relaxations as the primary goal for where to place sections. And ultimately for performance of binaries in general, as the proposal states, we'll move performance sensitive text and data into small sections.

> lld shouldn't be where workload-specific layout heuristics live. Prefer principled, composable primitives users wire up themselves -- and make .text.N/.got.N placement scriptable -- over shipping defaults that quietly assume one company's binary shape.

I generally agree with this, but I feel like I'm missing something concrete with this statement in relation to this proposal. Are you saying that there shouldn't be a default placement of a section like .ltext and it should explicitly be put in a linker script? In my mind it seems unlikely that people will care about specific placement of .ltext as long as it's somewhere reasonable, but I might be wrong.

You're correct that knobs are frustrating and should be avoided when possible, but sometimes I think there's a pretty obvious "best" default, or it doesn't matter, and it's important to distinguish between those cases (not only for maintainers, but also users).

> I like the range-extension-thunks + multi-GOT + multi-RELRO direction. It solves the two reach problems (.text to .text via thunks, .text to data via partitioned GOTs) by keeping the "only pay for what you use" property 

I'm glad people are supportive of the general approach.

H.J. Lu

May 11, 2026, 7:41:21 PM (2 days ago) May 11
to Arthur Eubanks, X86-64 System V Application Binary Interface
On Tue, May 12, 2026 at 5:21 AM Arthur Eubanks <aeub...@google.com> wrote:
>>
>> I think in this large code model, call over GOT should apply to both
>> PIE and PDE.
>
> Do you mean we shouldn't have a separate PDE large code model? I agree with that, if you're using the large code model, you're already not getting maximal performance in all cases.

I meant always generating PIE code for the large model, even when compiled with -fno-PIE.

H.J.