RELLEB: A compact relocation format

Fangrui Song

Mar 9, 2024, 8:49:45 PM
to Generic System V Application Binary Interface
Hi, I'd like to propose a LEB128-based relocation format to replace REL and RELA formats.

ELF's natural-size guideline for control structures presents significant drawbacks for relocations.
Relocations are typically processed sequentially, so they do not benefit from the random access that fixed-size entries enable.
The large 24-byte Elf64_Rela structure highlights the cost.
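
For reference, here is the 24-byte structure in question, as defined by the generic ABI:

```cpp
// Elf64_Rela from the generic ABI: a fixed 24 bytes per relocation.
typedef struct {
  Elf64_Addr r_offset;    // 8 bytes: where to apply the relocation
  Elf64_Xword r_info;     // 8 bytes: symbol index (high 32 bits) + type (low 32 bits)
  Elf64_Sxword r_addend;  // 8 bytes: constant addend
} Elf64_Rela;
```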

I've written a detailed analysis of the size problem and my solution at https://maskray.me/blog/2024-03-09-a-compact-relocation-format-for-elf ,
and developed an LLVM/Clang/lld prototype at https://github.com/MaskRay/llvm-project/tree/demo-relleb (clang -mrelleb, llvm-readelf -r, ld.lld).

Below, you'll find an overview of the size issue, followed by excerpts from my blog post that detail the "RELLEB relocation format" and my "RELLEB proposal for the generic ABI."

---

In contrast, the WebAssembly object file format uses LEB128 encoding for relocations and other control structures, offering a significant size advantage over ELF.

In my Clang+lld prototype,
the total relocation section size has decreased from 28768176 bytes to 5318839 bytes, 18.4% of the original size.
The .o file size has decreased by 20.5%.

Applying zstd (SHF_COMPRESSED) on top would reduce the SHT_RELLEB section size by a further 26.4%,
but the resulting overall decrease of only 1.088% in `.o` file sizes indicates that plain RELLEB offers a reasonable compromise.

---

## Use cases

* Dynamic relocations
* Marker relocations (R_RISCV_RELAX, etc)
* `.llvm_addrsig`
* `.llvm.call-graph-profile`
* `.debug_names`

Refer to https://maskray.me/blog/2024-03-09-a-compact-relocation-format-for-elf#use-cases

---

## RELLEB relocation format

The 1990 ELF paper _ELF: An Object File to Mitigate Mischievous Misoneism_ says "ELF allows extension and redefinition for other control structures."
Inspired by WebAssembly, let's explore RELLEB, a new and more compact relocation format designed to replace RELA.
Our emphasis is on simplicity over absolute minimal encoding. See the end of the article for a detailed format description.

A `SHT_RELLEB` section (preferred name: `.relleb<name>`) holds compact relocation entries that decode to `Elf32_Rela` or `Elf64_Rela` depending on the object file class (32-bit or 64-bit).
Its content begins with a ULEB128-encoded relocation count, followed by entries encoding `r_offset`, `r_type`, `r_symidx`, and `r_addend`.

Here are key design choices:

*Relocation count (ULEB128)*:

This allows for efficient retrieval of the relocation count without decoding the entire section.
While a `uint32_t` (like [`SHT_HASH`](https://www.sco.com/developers/gabi/latest/ch5.dynamic.html#hash)) could be used, ULEB128 aligns with subsequent entries, removes endianness differences, and offers a slight size advantage in the common case that the relocation count can be encoded in one to three bytes.
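
As background, LEB128 stores seven value bits per byte and uses the high bit as a continuation flag, so values below 128 take one byte and values below 16384 take two. A minimal encoder sketch (hypothetical helper, matching the encoding used throughout):

```cpp
#include <cstdint>
#include <string>

// Minimal ULEB128 encoder sketch: emit 7 bits per byte, setting the high bit
// while more significant bits remain.
void encodeULEB128(uint64_t v, std::string &out) {
  do {
    uint8_t byte = v & 0x7f;
    v >>= 7;
    if (v)
      byte |= 0x80;
    out += static_cast<char>(byte);
  } while (v);
}
```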

*Delta encoding for `r_offset` (ULEB128)*:

Section offsets can be large, and relocations are typically ordered. Storing the difference between consecutive offsets offers compression potential.
In most cases, a single byte will suffice.
While there are exceptions (general dynamic TLS model of s390/s390x uses a local "out-of-order" pair: `R_390_PLT32DBL(offset=o) R_390_TLS_GDCALL(offset=o-2)`), we are optimizing for the common case.
Switching to SLEB128 would increase the total `.o` size by 0.1%.

For ELFCLASS32, `r_offset` members are calculated using modular arithmetic modulo 4294967296.

*Delta encoding for `r_type` (SLEB128)*:

Some psABIs utilize relocation types greater than 128.
AArch64's static relocation types begin at 257 and dynamic relocation types begin at 1024, necessitating two bytes with ULEB128/SLEB128 encoding in the absence of delta encoding.
Delta encoding allows all but the first relocation's type to be encoded in a single byte.
An alternative design is to define a base type in the header and encode types relative to the base type, which would introduce slight complexity.

If the AArch32 psABI could be redesigned, allocating `[0,64)` for Thumb relocation types and `[64,*)` for ARM relocation types would optimize delta encoding even further.

While sharing a single type code for multiple relocations would be efficient, it would require reordering relocations.
This conflicts with order requirements imposed by several psABIs and could complicate linker implementations.
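
Before moving on, a hypothetical worked example of the type deltas (relocation type values from the AArch64 psABI):

```cpp
// A run of AArch64 static relocations and their SLEB128 type deltas:
//   R_AARCH64_ADR_PREL_PG_HI21 (275) -> delta +275 (two bytes)
//   R_AARCH64_ADD_ABS_LO12_NC  (277) -> delta   +2 (one byte)
//   R_AARCH64_CALL26           (283) -> delta   +6 (one byte)
//   R_AARCH64_CALL26           (283) -> delta    0 (one byte)
```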

*Symbol index and addend presence (SLEB128)*:

ULEB128 optimizes for the common case when the symbol index is encodable in one or two bytes.
Using SLEB128 and delta encoding instead of ULEB128 for the symbol index field would increase the total size by 0.4%.
Potential gains for dynamic relocations with consecutive indexes are outweighed by the loss in static relocations and added complexity, hence avoiding delta encoding.
We indicate addend omission using the sign bit (see below).
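
A minimal sketch of the sign-bit convention (hypothetical helper; the actual encoders appear later in the thread):

```cpp
#include <cstdint>

// A non-negative value is a plain symbol index; the bitwise NOT of the index
// (always negative) signals that an addend delta follows. The decoder tests
// the sign: if (x < 0) { x = ~x; /* read the addend delta */ }
int64_t encodeSymidxField(uint32_t symidx, bool addendChanged) {
  return addendChanged ? ~static_cast<int64_t>(symidx)
                       : static_cast<int64_t>(symidx);
}
```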

*Delta encoding for addend (SLEB128)*:

Consecutive static relocations often have identical addends, especially on architectures with frequent zero addends (AArch64, PowerPC, RISC-V, etc).
Addend omission optimizes this scenario.
Additionally, it benefits architectures like AArch32 and x86, which often have identical negative addends (call instructions) for consecutive relocations.

```c
// Many consecutive R_X86_64_PLT32 relocations referencing g, each with addend -4.
void g(void);
void f(void) { g(); g(); g(); }
```

Dynamic relocations (except `R_*_RELATIVE` and `R_*_IRELATIVE`) typically have zero addends, also benefiting from our scheme.

The sign bit of the previous member flags whether the addend has changed relative to the prior entry.
When the addend changes, we use an addend delta. This offers a slight size advantage (about 0.20%) and optimizes for cases like:

```asm
.quad .data + 0x78
.quad .data + 0x80
.quad .data + 0x88
...
```

Note: when the bitwise NOT code path is taken, a zero delta addend never occurs: an unchanged addend uses the non-negated code path instead.

### RELLEB for dynamic relocations

Redacted. My blog post presents a comparison with Android packed relocations.


## RELLEB proposal for the generic ABI

In <https://www.sco.com/developers/gabi/latest/ch4.sheader.html>, make the following changes.

_In Figure 4-9: Section Types, sh_type, append a row_

`SHT_RELLEB` | 20

_Add text:_

SHT_RELLEB - The section holds compact relocation entries with explicit addends. An object file may have multiple relocation sections. See ''Relocation'' below for details.

_In Figure 4-16: Special Sections, append a row_

`.rellebname` | `SHT_RELLEB` | see below

_Change the text below:_

.relname, .relaname, and .rellebname

These sections hold relocation information, as described in ''Relocation''. If the file has a loadable segment that includes relocation, the sections' attributes will include the SHF_ALLOC bit; otherwise, that bit will be off. Conventionally, name is supplied by the section to which the relocations apply. Thus a relocation section for .text normally would have the name .rel.text, .rela.text, or .relleb.text.

_In Figure 4-23: Relocation Entries, add:_

```c
typedef struct {
  Elf32_Addr r_offset;
  Elf32_Word r_type;
  Elf32_Word r_symidx;
  Elf32_Sxword r_addend;
} Elf32_Relleb;

typedef struct {
  Elf64_Addr r_offset;
  Elf64_Word r_type;
  Elf64_Word r_symidx;
  Elf64_Sxword r_addend;
} Elf64_Relleb;
```

_Add text above "A relocation section references two other sections":_

A `SHT_RELLEB` section holds compact relocation entries that decode to `Elf32_Relleb` or `Elf64_Relleb` depending on the object file class (32-bit or 64-bit).
Its content begins with a ULEB128-encoded relocation count, followed by entries encoding `r_offset`, `r_type`, `r_symidx`, and `r_addend`.
Note that the `r_info` member in traditional REL/RELA formats has been split into separate `r_type` and `r_symidx` members, allowing `uint32_t` relocation types for ELFCLASS32 as well.

* Delta offset: (ULEB128-encoded) The difference in `r_offset` relative to the previous entry. The difference is truncated to 32-bit/64-bit for ELFCLASS32/ELFCLASS64, respectively.
* Delta type: (SLEB128-encoded) The difference in relocation type relative to the previous entry.
* Symbol table index and addend presence: (SLEB128-encoded) If the `r_addend` is equal to the previous entry's, the encoded value represents the symbol index; otherwise, the bitwise NOT of the encoded value indicates the symbol index and a delta addend follows.
* Delta addend: (SLEB128-encoded, only if indicated in the previous member) The difference in `r_addend` relative to the previous entry. The difference is truncated to 32-bit/64-bit for ELFCLASS32/ELFCLASS64, respectively.

Example C++ encoder for `Elf64_Relleb`:

```cpp
  // encodeULEB128(uint64_t, raw_ostream &os);
  // encodeSLEB128(int64_t, raw_ostream &os);

  uint64_t offset = 0, addend = 0;
  uint32_t type = 0;
  encodeULEB128(relocs.size(), os);
  for (const Reloc &rel : relocs) {
    encodeULEB128(rel.offset - offset, os);
    offset = rel.offset;
    encodeSLEB128(static_cast<int32_t>(rel.type - type), os);
    type = rel.type;
    if (addend == rel.addend) {
      encodeSLEB128(rel.symidx, os);
    } else {
      encodeSLEB128(~static_cast<int64_t>(rel.symidx), os);
      encodeSLEB128(rel.addend - addend, os);
      addend = rel.addend;
    }
  }
```

For the first relocation entry, the previous offset, type, and addend members are treated as zero.
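
For completeness, a matching decoder sketch (not part of the proposal text; `decodeULEB128`/`decodeSLEB128` are assumed helpers that advance `p`, and `rels` is the output array):

```cpp
// uint64_t decodeULEB128(uint8_t *&p);
// int64_t decodeSLEB128(uint8_t *&p);

size_t count = decodeULEB128(p);
uint64_t offset = 0, addend = 0;
uint32_t type = 0;
for (size_t i = 0; i != count; ++i) {
  offset += decodeULEB128(p);
  type += decodeSLEB128(p);
  int64_t x = decodeSLEB128(p);
  if (x < 0) {            // bitwise NOT: an addend delta follows
    x = ~x;
    addend += decodeSLEB128(p);
  }
  rels[i] = {offset, static_cast<uint32_t>(x), type, addend};
}
```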

_In Figure 5-10: Dynamic Array Tags, d_tag, add:_

`DT_RELLEB` | 38 | `d_ptr` | optional | optional

_Add text below:_

* `DT_RELLEB` - This element is similar to `DT_RELA`, except its table uses the RELLEB format.

Florian Weimer

Mar 11, 2024, 4:42:23 AM
to 'Fangrui Song' via Generic System V Application Binary Interface
* via Generic System V. Application Binary Interface:

> Hi, I'd like to propose a LEB128-based relocation format to replace
> REL and RELA formats.

Do you intend this for run-time relocations as well?

The leb128 encoding is comparatively hard to decode on many CPUs because
they lack a bit gather (parallel bit extract) instruction. It would be
easier to decode if the unary length count is in the first byte
(although that limits the overall magnitude of values, which shouldn't
matter for this application).

Thanks,
Florian

Rui Ueyama

Mar 11, 2024, 9:08:59 AM
to Generic System V Application Binary Interface
The proposal overall seems reasonable to me, but I'm more interested in how we can transition from the existing relocation format to the new one. Historically, many proposals that were good in theory failed to become viable due to the lack of a migration strategy. I think we need to answer the following questions:

1. How do we make the older linkers print out user-friendly error messages for object files containing only .relleb relocation tables? I believe if we don't do anything, .relleb sections would just be ignored, and such object files would be treated as if they don't contain any relocations, resulting in broken output files.

2. How do we make the new relocation format the standard one for object files in the long term? If there's no migration strategy, people would conservatively stick with the existing format for the sake of compatibility, and the efforts to make the new relocation format usable would become less effective.

I don't know the answer for (1), but for (2), I have an idea to make it happen. Even today, psABIs are gradually evolving, and it is already required to use a newer linker to use new features such as TLSDESC or x86's GOT32X relocations. We can bundle .relleb with such new features -- i.e., when a feature that needs recent versions of the linker is requested, we can safely enable .relleb. For example, we can bundle .relleb with RISC-V's TLSDESC. Or maybe with new relocations for Intel's new instruction format, APX. This bundling strategy might be able to provide a gradual migration path.

Fangrui Song

Mar 11, 2024, 2:55:23 PM
to gener...@googlegroups.com
On 2024-03-11, Florian Weimer wrote:
>* via Generic System V. Application Binary Interface:
>
>> Hi, I'd like to propose a LEB128-based relocation format to replace
>> REL and RELA formats.
>
>Do you intend this for run-time relocations as well?
>
>The leb128 encoding is comparatively hard to decode on many CPUs because
>they lack a bit gather (parallel bit extract) instruction. It would be
>easier to decode if the unary length count is in the first byte
>(although that limits the overall magnitude of values, which shouldn't
>matter for this application).

I have added more description to https://maskray.me/blog/2024-03-09-a-compact-relocation-format-for-elf#relleb-for-dynamic-relocations

RELLEB is not efficient for dynamic relocations, as it does not utilize two properties:

* There are far fewer relocation types.
* The offsets often increase by exactly the size of Elf_Addr, and no two dynamic relocations can share the same offset.

Nevertheless, RELLEB improves the non-relative portion, compressing it to a decent 16.3% of its original size.
However, it's overshadowed by Android packed relocations, which reach 6.3%.

My blog post proposes a generalized RELR encoding that looks like:

```
// R_*_GLOB_DAT/absolute address relocation group
Encode(length of the group)
Encode(R_*_GLOB_DAT)
Use RELR to encode offsets
Encode symbol indexes separately

// R_*_JUMP_SLOT group
Encode(length of the group)
Encode(R_*_JUMP_SLOT)
Use RELR to encode offsets
Encode symbol indexes separately

...
```

However, relative relocations usually outnumber non-relative relocations.
Perhaps non-relative dynamic relocations are not worth compacting using an
optimal format?

On 2024-03-11, Rui Ueyama wrote:
>The proposal overall seems reasonable to me, but I'm more interested in how
>we can transition from the existing relocation format to the new one.
>Historically, we have a lot of proposals that are good in theory which,
>however, have failed to be a viable option due to the lack of a migration
>strategy. I think we need to answer the following questions:
>
>1. How do we make the older linkers print out user-friendly error messages
>for object files containing only .relleb relocation tables? I believe if we
>don't do anything, .relleb sections would just be ignored, and such object
>files would be treated as if they don't contain any relocations, resulting
>in broken output files.

Thanks for mentioning this.

GNU ld allows certain unknown section types:

* [SHT_LOUSER,SHT_HIUSER] and non-SHF_ALLOC
* [SHT_LOOS,SHT_HIOS] and non-SHF_OS_NONCONFORMING

but reports errors and stops linking for others (unless --no-warn-mismatch is
specified). When linking a relocatable file using SHT_RELLEB, you might
encounter errors like the following:

% clang -mrelleb -fuse-ld=bfd a.c b.c
/usr/bin/ld.bfd: unknown architecture of input file `/tmp/a-1e0778.o' is incompatible with i386:x86-64 output
/usr/bin/ld.bfd: unknown architecture of input file `/tmp/b-9963f0.o' is incompatible with i386:x86-64 output
/usr/bin/ld.bfd: error in /tmp/a-1e0778.o(.eh_frame); no .eh_frame_hdr table will be created
/usr/bin/ld.bfd: error in /tmp/b-9963f0.o(.eh_frame); no .eh_frame_hdr table will be created
clang: error: linker command failed with exit code 1 (use -v to see invocation)

gold, lld, and mold do not report errors. I have filed:

https://github.com/llvm/llvm-project/issues/84812
https://github.com/rui314/mold/issues/1215

In addition, when there is one .eh_frame section with CIE pieces but no
relocation, _bfd_elf_parse_eh_frame will report an error.

>2. How do we make the new relocation format the standard one for object
>files in the long term? If there's no migration strategy, people would
>conservatively stick with the existing format for the sake of
>compatibility, and the efforts to make the new relocation format usable
>would become less effective.
>
>I don't know the answer for (1), but for (2), I have an idea to make it
>happen. Even today, psABIs are gradually evolving, and it is already
>required to use a newer linker to use new features such as TLSDESC or x86's
>GOT32X relocations. We can bundle .relleb with such new features -- i.e.,
>when a feature that needs recent versions of the linker is requested, we
>can safely enable .relleb. For example, we can bundle .relleb with RISC-V's
>TLSDESC. Or maybe with new relocations for Intel's new instruction format,
>APX. This bundling strategy might be able to provide a gradual migration
>path.

As toolchain maintainers, we can encourage both existing and new architectures to
implement the new relocation format. If we implement the new relocation format
for important components in the ecosystem, new architectures will be motivated
to start with RELLEB.

We probably would never force any existing architectures to drop REL/RELA.
REL/RELA will never go away anyway.

Rui Ueyama

Mar 11, 2024, 11:28:18 PM
to Generic System V Application Binary Interface
I didn't suggest we would drop REL/RELA. What I wanted to say is that we need a strategy to enable gradual adoption of the new feature, instead of forcing toolchain maintainers to make a hard decision to do so.

If we do not have a strategy, chances are the new feature would be adopted only within a big organization/company where a developer team has complete control over their toolchains, and the world outside of it would never adopt it because of the fear of breaking backward compatibility.

So, my suggestion is to mandate that linker developers support .relleb before supporting future generic/psABI extensions, and to encourage toolchain developers to enable .relleb by default if other future generic/psABI extensions are explicitly enabled by the user. This way, we can ensure gradual adoption of .relleb even in an environment where no one controls the toolchain choices.

Fangrui Song

Mar 14, 2024, 4:50:24 PM
to gener...@googlegroups.com
On Mon, Mar 11, 2024 at 8:28 PM Rui Ueyama <rui...@gmail.com> wrote:
>
> I didn't suggest we would drop REL/RELA. What I wanted to say is that we need a strategy to enable gradual adoption of the new feature, instead of forcing toolchain maintainers to make a hard decision to do so.
>
> If we do not have a strategy, chances are the new feature would be adopted only within a big organization/company where a developer team has complete control over their toolchains, and the world outside of it would never adopt it because of the fear of breaking backward compatibility.
>
> So, my suggestion is to mandating linker developers support .relleb before supporting future generic/psABI extensions, and encourage toolchain developers to enable .relleb by default if other future generic/psABI extensions are explicitly enabled by the user. This way, we can ensure gradual adoption of .relleb even in an environment where no one controls the toolchain choices.

When a psABI adds certain drastic changes, we can strongly encourage
someone to implement the new relocation format first:)
For instance, https://github.com/riscv-non-isa/riscv-elf-psabi-doc/pull/423
proposes vendor-specific relocations, which makes the Elf32_Rel/Elf32_Rela
relocation type constraint more noticeable.

-----

Previously, the sign bit of symidx indicated addend omission.
I have changed the design (see
https://maskray.me/blog/2024-03-09-a-compact-relocation-format-for-elf)
so that both type and addend can be omitted.

* Delta offset: (ULEB128-encoded) The difference in r_offset relative to the previous entry, represented as a 32-bit or 64-bit unsigned integer for ELFCLASS32/ELFCLASS64, respectively.
* Symbol table index and type/addend presence: (SLEB128-encoded) If the current type and addend are equal to the previous type and addend, the encoded value represents the symbol index and the following two members are omitted; otherwise, the bitwise NOT of the encoded value (a 64-bit signed integer) indicates the symbol index and the following two members are present. For example, the bitwise NOT of symbol index 0xffffffff is -0x100000000 (64-bit) instead of 0 (32-bit).
* Delta type: (SLEB128-encoded, only if not omitted) The difference in relocation type relative to the previous entry, represented as a 32-bit signed integer.
* Delta addend: (SLEB128-encoded, only if not omitted) The difference in r_addend relative to the previous entry, represented as a 32-bit or 64-bit signed integer for ELFCLASS32/ELFCLASS64, respectively.

Example C++ encoder:

```
// encodeULEB128(uint64_t, raw_ostream &os);
// encodeSLEB128(int64_t, raw_ostream &os);

Elf_Addr offset = 0, addend = 0;
uint32_t type = 0;
encodeULEB128(relocs.size(), os);
for (const Reloc &rel : relocs) {
  encodeULEB128(rel.offset - offset, os);
  offset = rel.offset;
  int64_t symidx = rel.symidx;
  if (type == rel.type && addend == rel.addend) {
    encodeSLEB128(symidx, os);
  } else {
    encodeSLEB128(~symidx, os);
    encodeSLEB128(static_cast<int32_t>(rel.type - type), os);
    type = rel.type;
    encodeSLEB128(std::make_signed_t<Elf_Addr>(rel.addend - addend), os);
    addend = rel.addend;
  }
}
```
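
A matching decoder sketch for this revision (assumed helpers as above; `rels` is the output array):

```cpp
// uint64_t decodeULEB128(uint8_t *&p);
// int64_t decodeSLEB128(uint8_t *&p);

size_t count = decodeULEB128(p);
Elf_Addr offset = 0, addend = 0;
uint32_t symidx = 0, type = 0;
for (size_t i = 0; i != count; ++i) {
  offset += decodeULEB128(p);
  int64_t x = decodeSLEB128(p);
  if (x < 0) {            // bitwise NOT: delta type and delta addend follow
    x = ~x;
    type += decodeSLEB128(p);
    addend += decodeSLEB128(p);
  }
  symidx = static_cast<uint32_t>(x);
  rels[i] = {offset, symidx, type, addend};
}
```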

Now, the new encoding works better for static relocations and non-relative
dynamic relocations.
(RISC-V static relocations might be less optimized now, because adjacent
relocations often change types, so the type/addend omission applies less
often.)

RELLEB compresses the non-relative dynamic relocations down to 9.7%
(0x00239d / 0x016e90).
Android packed relocations compress the portion down to 6.3%.
Android's advantage is due to RELOCATION_GROUPED_BY_INFO_FLAG sharing
r_info.

% llvm-readelf -S clang | grep ' \.rel.*\.'
  [ 8] .rela.dyn      RELA          0000000000f28998 f28998  e177d0 18   A  3   0  8
  [ 9] .rela.plt      RELA          0000000001d40168 1d40168 0025b0 18  AI  3  26  8
% llvm-readelf -S clang.relr | grep ' \.rel.*\.'
  [ 8] .rela.dyn      RELA          0000000000f28998 f28998  016e90 18   A  3   0  8
  [ 9] .relr.dyn      RELR          0000000000f3f828 f3f828  02ccd0 08   A  0   0  8
  [10] .rela.plt      RELA          0000000000f6c4f8 f6c4f8  0025b0 18  AI  3  27  8
% llvm-readelf -S clang.relr+android | grep ' \.rel.*\.'
  [ 8] .rela.dyn      ANDROID_RELA  0000000000f28998 f28998  001707 01   A  3   0  8
  [ 9] .relr.dyn      RELR          0000000000f2a0a0 f2a0a0  02ccd0 08   A  0   0  8
  [10] .rela.plt      RELA          0000000000f56d70 f56d70  0025b0 18  AI  3  27  8
% llvm-readelf -S clang.relr+rellebd | grep ' \.rel.*\.'
  [ 8] .relleb.dyn    RELLEB        0000000000f28998 f28998  00239d 00   A  3   0  8
  [ 9] .relr.dyn      RELR          0000000000f2ad38 f2ad38  02ccd0 08   A  0   0  8
  [10] .rela.plt      RELA          0000000000f57a08 f57a08  0025b0 18  AI  3  27  8



--
宋方睿

Fangrui Song

Mar 22, 2024, 12:29:12 AM
to gener...@googlegroups.com
I have made another revision (combining delta offset and r_info omission) to exploit this pattern in relocations:

* Symbol index/type equal to the previous entry's while addend changes, common for STT_SECTION symbols, e.g. .rodata, .eh_frame, .debug_str_offsets, .debug_names, .debug_line, and .debug_addr.

In a -O1 -g build, .relleb*/.rela* size ratio drops from 21.1% to 11.2%!
In a -O3 -g -gpubnames -gsplit-dwarf build, .relleb*/.rela* size ratio drops from 23.1% to 14.8%.

build | format | `.o size` | `size(.rel*)` | `size(.rel*)/.o size`
---------------------------------------+-------------------+-------------+---------------+----------------------
x86_64 -O3 | RELA | 135996752 | 28234416 | 20.8%
x86_64 -O3 | compressed RELA | 116424688 | 8446950 | 7.3%
x86_64 -O3 | RELLEB | 112011104 | 4065883 | 3.6%
x86_64 -O3 | compressed RELLEB | 111383192 | 3437953 | 3.1%
aarch64 -O3 | RELA | 124965808 | 25855272 | 20.7%
aarch64 -O3 | RELLEB | 103815168 | 4528236 | 4.4%
riscv64 -O3 | RELA | 227133296 | 91378680 | 40.2%
riscv64 -O3 | RELLEB | 153458488 | 17516119 | 11.4%
x86_64 -O0 -g | RELA | 2174149216 | 467462904 | 21.5%
x86_64 -O0 -g | RELLEB | 1768587248 | 59677108 | 3.4%
x86_64 -O1 -g | RELA | 1506173736 | 340965576 | 22.6%
x86_64 -O1 -g | RELLEB | 1203502544 | 38114639 | 3.2%
x86_64 -O3 -g -gpubnames -gsplit-dwarf | RELA | 548986280 | 104226264 | 19.0%
x86_64 -O3 -g -gpubnames -gsplit-dwarf | RELLEB | 460408336 | 15442990 | 3.4%

We need to draw a line before this format becomes a full-blown data compression algorithm.
This technique adds very few lines to the encoder/decoder.

I have attached C++ encoder/decoder in the proposal.

---

In <https://www.sco.com/developers/gabi/latest/ch4.sheader.html>, make the following changes.

_In Figure 4-9: Section Types, sh_type, append a row_

`SHT_RELLEB` | 20

_Add text:_

SHT_RELLEB - The section holds compact relocation entries with explicit addends. An object file may have multiple relocation sections. See ''Relocation'' below for details.

_In Figure 4-16: Special Sections, append_

`.rellebname` | `SHT_RELLEB` | see below

_Change the text below:_

.relname, .relaname, and .rellebname

These sections hold relocation information, as described in ''Relocation''. If the file has a loadable segment that includes relocation, the sections' attributes will include the SHF_ALLOC bit; otherwise, that bit will be off. Conventionally, name is supplied by the section to which the relocations apply. Thus a relocation section for .text normally would have the name .rel.text, .rela.text, or .relleb.text.

_In Figure 4-23: Relocation Entries, add:_

```c
typedef struct {
  Elf32_Addr r_offset;
  Elf32_Word r_symidx;
  Elf32_Word r_type;
  Elf32_Sxword r_addend;
} Elf32_Relleb;

typedef struct {
  Elf64_Addr r_offset;
  Elf64_Word r_symidx;
  Elf64_Word r_type;
  Elf64_Sxword r_addend;
} Elf64_Relleb;
```

_Add text above "A relocation section references two other sections":_

A `SHT_RELLEB` section holds compact relocation entries that decode to `Elf32_Relleb` or `Elf64_Relleb` depending on the object file class (32-bit or 64-bit).
Its content begins with a ULEB128-encoded relocation count, followed by entries encoding `r_offset`, `r_symidx`, `r_type`, and `r_addend`.
Note that the `r_info` member in traditional REL/RELA formats has been split into separate `r_symidx` and `r_type` members, allowing `uint32_t` relocation types for ELFCLASS32 as well.

In the following description, `Elf_Addr`/`Elf_SAddr` denote `uint32_t`/`int32_t` for ELFCLASS32 or `uint64_t`/`int64_t` for ELFCLASS64.

* First member (ULEB128): Holds `2 * delta_offset + eq` (33-bit or 65-bit unsigned), where:
+ `delta_offset`: Difference in `r_offset` from the previous entry (`Elf_Addr`).
+ `eq`: Indicates if the symbol index/type match the previous entry (1 for match, 0 otherwise).
* Second Member (SLEB128) if `eq` is 1:
+ Difference in `r_addend` from the previous entry (`Elf_SAddr`).
* Second Member (SLEB128) if `eq` is 0:
+ If type and addend match the previous entry, the encoded value is the symbol index; type and addend are omitted.
+ Otherwise, the bitwise NOT of the encoded value (33-bit signed) is the symbol index; delta type and delta addend follow:
- Delta type (SLEB128): The difference in relocation type from the previous entry (32-bit signed).
- Delta addend (SLEB128): The difference in `r_addend` relative to the previous entry (`Elf_SAddr`).

The bitwise NOT of symbol index 0xffffffff is -0x100000000 (33-bit) instead of 0 (32-bit).
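
A one-liner illustrating why the NOT must be computed in a wider type:

```cpp
// Widening before NOT keeps symbol index 0xffffffff distinguishable from 0:
int64_t x = ~static_cast<int64_t>(0xffffffffu); // -0x100000000, not 0
```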

Encoder in pseudo-code:
```c
Elf_Addr offset = 0, addend = 0;
uint32_t symidx = 0, type = 0;
encodeULEB128(relocs.size());
for (const Reloc &rel : relocs) {
  if (symidx == rel.r_symidx && type == rel.r_type) {
    // Symbol index/type match the previous entry. Encode the addend delta.
    encodeULEB128(2 * uint128_t(rel.r_offset - offset) + 1); // at most 65-bit
    encodeSLEB128(rel.r_addend - addend);
  } else {
    encodeULEB128(2 * uint128_t(rel.r_offset - offset)); // at most 65-bit
    if (type == rel.r_type && addend == rel.r_addend) {
      // Type/addend match the previous entry. Encode the symbol index.
      encodeSLEB128(rel.r_symidx);
    } else {
      // No optimization is applied. Encode symbol index, type, and addend.
      encodeSLEB128(~static_cast<int64_t>(rel.r_symidx));
      encodeSLEB128(static_cast<int32_t>(rel.r_type - type));
      type = rel.r_type;
      encodeSLEB128(static_cast<Elf_SAddr>(rel.r_addend - addend));
    }
    symidx = rel.r_symidx;
  }
  offset = rel.r_offset;
  addend = rel.r_addend;
}
```

Encoding/decoding an unsigned 65-bit value does not need multi-precision arithmetic. We can just unroll and special-case the first iteration.
Example C++ encoder:

```cpp
// encodeULEB128(uint64_t, raw_ostream &os);
// encodeSLEB128(int64_t, raw_ostream &os);

Elf_Addr offset = 0, addend = 0;
uint32_t symidx = 0, type = 0;
encodeULEB128(relocs.size(), os);
for (const Reloc &rel : relocs) {
  auto deltaOffset = static_cast<uint64_t>(rel.r_offset - offset);
  offset = rel.r_offset;
  uint8_t odd = rel.r_symidx == symidx && rel.r_type == type,
          b = deltaOffset * 2 + odd;
  if (deltaOffset < 0x40) {
    os << char(b);
  } else {
    os << char(b | 0x80);
    encodeULEB128(deltaOffset >> 6, os);
  }
  symidx = rel.r_symidx;
  if (!odd && type == rel.r_type && addend == rel.r_addend) {
    encodeSLEB128(symidx, os);
  } else {
    if (!odd) {
      encodeSLEB128(~static_cast<int64_t>(symidx), os);
      encodeSLEB128(static_cast<int32_t>(rel.r_type - type), os);
      type = rel.r_type;
    }
    encodeSLEB128(std::make_signed_t<Elf_Addr>(rel.r_addend - addend), os);
    addend = rel.r_addend;
  }
}
```

Example C++ decoder:

```cpp
// uint64_t decodeULEB128(uint8_t *&p);
// int64_t decodeSLEB128(uint8_t *&p);

size_t count = decodeULEB128(p);
Elf_Addr offset = 0, addend = 0;
uint32_t symidx = 0, type = 0;
for (size_t i = 0; i != count; ++i) {
  const uint8_t b = *p++;
  offset += b >> 1;
  if (b >= 0x80)
    offset += (decodeULEB128(p) << 6) - 0x40;
  int64_t x = decodeSLEB128(p);
  if (b & 1) {
    addend += x;
  } else {
    if (x < 0) {
      x = ~x;
      type += decodeSLEB128(p);
      addend += decodeSLEB128(p);
    }
    symidx = x;
  }
  rels[i] = {offset, symidx, type, addend};
}
```

Note: the 0x40 (0x80 >> 1) compensates for not masking out the continuation bit 0x80 before `b >> 1`.

For the first relocation entry, the previous offset, symbol index, type, and addend members are treated as zero.

_In Figure 5-10: Dynamic Array Tags, d_tag, add:_

`DT_RELLEB` | 38 | `d_ptr` | optional | optional

_Add text below:_

* `DT_RELLEB` - This element is similar to `DT_RELA`, except its table uses the RELLEB format. The relocation count can be inferred from the header.

Roland McGrath

Mar 22, 2024, 2:38:04 AM
to gener...@googlegroups.com
STT_SECTION symbols are always STB_LOCAL.  So why wouldn't the static linker just always convert a reloc to any STB_LOCAL symbol into a R_*_RELATIVE reloc (and thus collapse it into DT_RELR format)?

I don't see anything wrong with having the "header compression" style of reloc compression include tricks that are good for runs of relocs using the same symbol with different addends if that is indeed a pattern that comes up not infrequently.  But I'm curious to understand why in practice it should come up at all, or more than quite rarely.  Off hand I think TLS dynamic relocs might be the only ones that couldn't be collapsed to RELATIVE (or IRELATIVE) when the symbol is STB_LOCAL.

Or are you expecting to see this pattern of runs of relocs only in ET_REL files?

Incidentally, I think with the various "header compression" style tricks you're adding here this is getting a bit far afield from just "LEB128 relocs".  I think there's good reason to believe these kinds of tricks will pay off very well in getting much more compact reloc sections that aren't especially hard to decode in practice.  But really using LEB128 as an aspect of the format is now a somewhat small part of the whole scheme in terms of complexity and likely size impact.  So I think we should be calling this something more like "compressed relocs" or "compact relocs".  Perhaps {SHT,DT}_RELC and *_RELAC?  Or *_CREL and *_CRELA?

Especially for dynamic relocations, using in-place addends rather than in-reloc addends is a clear winner for space savings.  Some machines use REL for ET_REL files anyway, and so they could use CREL instead of CRELA so not even one bit in the reloc section is spent on addends.  AFAIK every "RELA machine" can do RELA->REL conversion for dynamic relocs (when not doing TEXTRELs) if the dynamic linker supports it, and certainly any new dynamic linker supporting DT_CREL should support DT_REL as well as DT_RELA too.

Also I think it might make sense to move beyond LEB128 as a basis here.  With all the tricks going on here now, you're not really encoding separate fields in LEB128.  You're encoding a bunch of things, as well as encoding some variable-sized values.  But they're all an interdependent unit.  So the fundamental feature of LEB128 that it's trivially self-describing at every step and integer values that fit in 7 bits or fewer take only one byte total, is not necessarily so valuable.  Perhaps having some bespoke header bits at the beginning of the whole entry that give the bit or byte widths and other flags about all the fields that follow could be encoded in a way where the total entry fits in fewer bits than something that's always using LEB128 as the base encoding, but still be simple enough to decode quickly.


Fangrui Song

Mar 22, 2024, 7:50:41 PM
to 'Roland McGrath' via Generic System V Application Binary Interface
On 2024-03-21, 'Roland McGrath' via Generic System V Application Binary Interface wrote:
>STT_SECTION symbols are always STB_LOCAL. So why wouldn't the static
>linker just always convert a reloc to any STB_LOCAL symbol into a
>R_*_RELATIVE reloc (and thus collapse it into DT_RELR format)?
>
>I don't see anything wrong with having the "header compression" style of
>reloc compression include tricks that are good for runs of relocs using the
>same symbol with different addends if that is indeed a pattern that comes
>up not infrequently. But I'm curious to understand why in practice it
>should come up at all, or more than quite rarely. Off hand I think TLS
>dynamic relocs might be the only ones that couldn't be collapsed to
>RELATIVE (or IRELATIVE) when the symbol is STB_LOCAL.
>
>Or are you expecting to see this pattern of runs of relocs only in ET_REL
>files?
>
>Incidentally, I think with the various "header compression" style tricks
>you're adding here this is getting a bit far afield from just "LEB128
>relocs". I think there's good reason believe these kinds of tricks will
>pay off very well in getting much more compact reloc sections that aren't
>especially hard to decode in practice. But really using LEB128 as an
>aspect of the format is now a somewhat small part of the whole scheme in
>terms of complexity and likely size impact. So I think we should be
>calling this something more like "compressed relocs" or "compact relocs".
>Perhaps {SHT,DT}_RELC and *_RELAC? Or *_CREL and *_CRELA?

Thanks for the feedback! You make a good point. As I refine the original
pure LEB-based format, I agree that "RELLEB" might not be the most
fitting name.

I like SHT_CREL/DT_CREL/.crel and I've updated
https://maskray.me/blog/2024-03-09-a-compact-relocation-format-for-elf
and my LLVM prototype
https://github.com/MaskRay/llvm-project/tree/demo-crel

(By the way, I intentionally left out "RELLEB" in the blog post title to
keep the naming flexible.)

I want to avoid REL/RELA variants for the new format. Read on. I think
I've found a good format that makes addend encoding not wasteful.

>Especially for dynamic relocations, using in-place addends rather than
>in-reloc addends is a clear winner for space savings. Some machines use
>REL for ET_REL files anyway, and so they could use CREL instead of CRELA so
>not even one bit in in the reloc section is spent on addends. AFAIK every
>"RELA machine" can do RELA->REL conversion for dynamic relocs (when not
>doing TEXTRELs) if the dynamic linker supports it, and certainly any new
>dynamic linker supporting DT_CREL should support DT_REL as well as DT_RELA
>too.
>
>Also I think it might make sense to move beyond LEB128 as a basis here.
>With all the tricks going on here now, you're not really encoding separate
>fields in LEB128. You're encoding a bunch of things, as well as encoding
>some variable-sized values. But they're all an interdependent unit. So
>the fundamental feature of LEB128 that it's trivially self-describing at
>every step and integer values that fit in 7 bits or fewer take only one
>byte total, is not necessarily so valuable. Perhaps having some bespoke
>header bits at the beginning of the whole entry that give the bit or byte
>widths and other flags about all the fields that follow could be encoded in
>a way where the total entry fits in fewer bits than something that's always
>using LEB128 as the base encoding, but still be simple enough to decode
>quickly.

I'm hesitant to move to a non-byte-oriented varint scheme due to the
potential for increased complexity. My experiments indicate that we can
achieve space savings by using a shifted offset technique. This is
possible because:

64-bit data sections frequently have absolute relocations spaced 8 bytes
apart. Additionally, in RISC architectures, offsets are often multiples
of 2 or 4.
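
A small worked example of the shifted offset technique (hypothetical offsets; `std::countr_zero` is C++20):

```cpp
#include <bit>
#include <cstdint>
#include <cstdio>

// Absolute relocations in a 64-bit data section, 8 bytes apart: every offset
// has three zero low bits, so shift = 3 and the repeated deltas become 1.
int main() {
  uint64_t offsets[] = {0x1000, 0x1008, 0x1010, 0x1018};
  uint64_t offsetMask = 8, prev = 0;
  for (uint64_t o : offsets)
    offsetMask |= o;
  int shift = std::countr_zero(offsetMask); // 3
  for (uint64_t o : offsets) {
    std::printf("delta_offset = %#llx\n",
                static_cast<unsigned long long>((o - prev) >> shift));
    prev = o;
  }
  // Prints 0x200, then 1, 1, 1: the repeats stay within one byte even after
  // being shifted left to make room for the flag bits.
}
```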

Here is another revision; I have renamed the new format to CREL.
The format is simpler and achieves a better size decrease.


```cpp
// encodeULEB128(uint64_t, raw_ostream &os);
// encodeSLEB128(int64_t, raw_ostream &os);

Elf_Addr offsetMask = 8, offset = 0, addend = 0;
uint32_t symidx = 0, type = 0;
for (const Reloc &rel : relocs)
  offsetMask |= rel.r_offset;
int shift = std::countr_zero(offsetMask);
encodeULEB128(relocs.size() * 4 + shift, os);
for (const Reloc &rel : relocs) {
  Elf_Addr deltaOffset = (rel.r_offset - offset) >> shift;
  offset = rel.r_offset;
  uint8_t b = deltaOffset * 8 + (symidx != rel.r_symidx) +
              (type != rel.r_type ? 2 : 0) + (addend != rel.r_addend ? 4 : 0);
  if (deltaOffset < 0x10) {
    os << char(b);
  } else {
    os << char(b | 0x80);
    encodeULEB128(deltaOffset >> 4, os);
  }
  if (b & 1) {
    encodeSLEB128(static_cast<int32_t>(rel.r_symidx - symidx), os);
    symidx = rel.r_symidx;
  }
  if (b & 2) {
    encodeSLEB128(static_cast<int32_t>(rel.r_type - type), os);
    type = rel.r_type;
  }
  if (b & 4) {
    encodeSLEB128(std::make_signed_t<Elf_Addr>(rel.r_addend - addend), os);
    addend = rel.r_addend;
  }
}
```
}


build | format | `.o size` | `size(.rel*)` | .o size decrease
--------------------------------+-----------------+-------------+---------------+-----------------
-O3 | RELA | 136012504 | 28235448 |
-O3 | CREL | 111583312 | 3806234 | 18.0%
aarch64 -O3 | RELA | 124965808 | 25855800 |
aarch64 -O3 | CREL | 102529784 | 3388307 | 18.0%
riscv64 -O3 | RELA | 227189744 | 91396344 |
riscv64 -O3 | CREL | 149343352 | 13549699 | 34.3%
-O1 -g | RELA | 1506173760 | 340965576 |
-O1 -g | CREL | 1202445768 | 37237274 | 20.2%
-O3 -g -gpubnames -gsplit-dwarf | RELA | 549003848 | 104227128 |
-O3 -g -gpubnames -gsplit-dwarf | CREL | 459768736 | 14992114 | 16.3%

Fangrui Song

Mar 22, 2024, 7:54:50 PM
to Generic System V Application Binary Interface
In <https://www.sco.com/developers/gabi/latest/ch4.sheader.html>, make the following changes.

_In Figure 4-9: Section Types, sh_type, append a row_

`SHT_CREL` | 20

_Add text:_

SHT_CREL - The section holds compact relocation entries with explicit addends. An object file may have multiple relocation sections. See ''Relocation'' below for details.


_In Figure 4-16: Special Sections, append_

`.crelname` | `SHT_CREL` | see below


_Change the text below:_

.relname, .relaname, and .crelname

These sections hold relocation information, as described in ''Relocation''. If the file has a loadable segment that includes relocation, the sections' attributes will include the SHF_ALLOC bit; otherwise, that bit will be off. Conventionally, name is supplied by the section to which the relocations apply. Thus a relocation section for .text normally would have the name .rel.text, .rela.text, or .crel.text.


_In Figure 4-23: Relocation Entries, add:_

```c
typedef struct {
  Elf32_Addr r_offset;
  Elf32_Word r_symidx;
  Elf32_Word r_type;
  Elf32_Sxword r_addend;
} Elf32_Crel;

typedef struct {
  Elf64_Addr r_offset;
  Elf64_Word r_symidx;
  Elf64_Word r_type;
  Elf64_Sxword r_addend;
} Elf64_Crel;
```

_Add text above "A relocation section references two other sections":_

A `SHT_CREL` section holds compact relocation entries that decode to `Elf32_Crel` or `Elf64_Crel` depending on the object file class (32-bit or 64-bit).
Its content begins with a ULEB128-encoded value `count * 4 + shift` (34-bit or 66-bit unsigned), where:

* `count`: Relocation count
* `shift`: The shift value used by `delta_offset` in relocation entries

Relocation entries (`r_offset`, `r_symidx`, `r_type`, and `r_addend`) follow the header.
Note: `r_info` in traditional REL/RELA formats has been split into `r_symidx` and `r_type`, allowing `uint32_t` relocation types for ELFCLASS32 as well.


In the following description, `Elf_Addr`/`Elf_SAddr` denote `uint32_t`/`int32_t` for ELFCLASS32 or `uint64_t`/`int64_t` for ELFCLASS64.

* Delta offset and flags (ULEB128): Holds `delta_offset * 8 + flags` (35-bit or 67-bit unsigned), where:
  + `delta_offset`: Difference in `r_offset` from the previous entry (`Elf_Addr`), right shifted by `shift`.
  + `flags & 1`: Indicates whether a delta symbol index is present.
  + `flags & 2`: Indicates whether a delta type is present.
  + `flags & 4`: Indicates whether a delta addend is present.
* Delta symbol index (SLEB128, if present): The difference in symbol index from the previous entry (32-bit signed).
* Delta type (SLEB128, if present): The difference in relocation type from the previous entry (32-bit signed).
* Delta addend (SLEB128, if present): The difference in addend from the previous entry (`Elf_SAddr`).

Encoding/decoding delta offset and flags does not need multi-precision arithmetic. We can just unroll and special case the first iteration.
The header can be encoded/decoded in a similar way. An implementation can assume that the relocation count cannot be larger than 2**61 and simplify the code.

Example C++ encoder:

```cpp
// encodeULEB128(uint64_t, raw_ostream &os);
// encodeSLEB128(int64_t, raw_ostream &os);

Elf_Addr offsetMask = 8, offset = 0, addend = 0;
uint32_t symidx = 0, type = 0;
for (const Reloc &rel : relocs)
  offsetMask |= rel.r_offset;
int shift = std::countr_zero(offsetMask);
encodeULEB128(relocs.size() * 4 + shift, os);
for (const Reloc &rel : relocs) {
  Elf_Addr deltaOffset = (rel.r_offset - offset) >> shift;
  offset = rel.r_offset;
  uint8_t b = deltaOffset * 8 + (symidx != rel.r_symidx) +
              (type != rel.r_type ? 2 : 0) + (addend != rel.r_addend ? 4 : 0);
  if (deltaOffset < 0x10) {
    os << char(b);
  } else {
    os << char(b | 0x80);
    encodeULEB128(deltaOffset >> 4, os);
  }
  if (b & 1) {
    encodeSLEB128(static_cast<int32_t>(rel.r_symidx - symidx), os);
    symidx = rel.r_symidx;
  }
  if (b & 2) {
    encodeSLEB128(static_cast<int32_t>(rel.r_type - type), os);
    type = rel.r_type;
  }
  if (b & 4) {
    encodeSLEB128(std::make_signed_t<Elf_Addr>(rel.r_addend - addend), os);
    addend = rel.r_addend;
  }
}
```

Example C++ decoder:

```cpp
// uint64_t decodeULEB128(uint8_t *&p);
// int64_t decodeSLEB128(uint8_t *&p);

const auto countShift = decodeULEB128(p);
const size_t count = countShift / 4, shift = countShift % 4;
Elf_Addr offset = 0, addend = 0;
uint32_t symidx = 0, type = 0;
for (size_t i = 0; i != count; ++i) {
  const uint8_t b = *p++;
  offset += b >> 3;
  if (b >= 0x80)
    offset += (decodeULEB128(p) << 4) - 0x10;
  if (b & 1)
    symidx += decodeSLEB128(p);
  if (b & 2)
    type += decodeSLEB128(p);
  if (b & 4)
    addend += decodeSLEB128(p);
  rels[i] = {offset << shift, symidx, type, addend};
}
```

Note: In the `(b >= 0x80)` code path, 0x10 (0x80 >> 3) compensates for not masking out the bit 0x80.


For the first relocation entry, the previous offset, symbol index, type, and addend members are treated as zero.

_In Figure 5-10: Dynamic Array Tags, d_tag, add:_

`DT_CREL` | 38 | `d_ptr` | optional | optional

_Add text below:_

* `DT_CREL` - This element is similar to `DT_RELA`, except its table uses the CREL format. The relocation count can be inferred from the header.

Florian Weimer

Mar 25, 2024, 4:06:14 AM
to 'Fangrui Song' via Generic System V Application Binary Interface
* via Generic System V. Application Binary Interface:

> Its content begins with a ULEB128-encoded value `count * 4 + shift`
> (34-bit or 66-bit unsigned), where:

I want to mention again that ULEB128 is really terrible to decode on
many CPUs.

It would probably be best if we could avoid any form of unary encoding
(so that decoders don't need a CLZ-like instruction).

> * Delta offset and flags (ULEB128): Holds `delta_offset * 8 + flags` (35-bit or 67-bit unsigned), where:
> + `delta_offset`: Difference in `r_offset` from the previous entry (`Elf_Addr`), right shifted by `shift
> `.
> + `flags & 1`: Indicate if delta symbol index is present.
> + `flags & 2`: Indicate if delta type is present.
> + `flags & 4`: Indicate if delta addend is present.
> * Delta symbol index (SLEB128, if present): The difference in symbol index from the previous entry
> (32-bit signed).
> * Delta type (SLEB128, if present): The difference in relocation type from the previous entry (32-bit
> signed).

32-bit signed does not permit encoding sequences like 0x0, 0xffff_ffff
or 0xffff_ffff, 0x0. Is this your intent?

I think this also applies to the initial 35-bit or 67-bit estimate
because the delta_offset is affected by this as well.

The delta-based encoding of relocation types works on x86-64, but it's
really not ideal on AArch64 because there, R_AARCH64_ABS64 (257) and
AARCH64_GLOB_DAT (1025) are common, and their difference is greater than
127. Maybe this could be fixed by introducing an alternative value for
R_AARCH64_ABS64 that is greater than 1024.

On the other hand, this encoding needs two bits for encoding the
*length* of the type delta, when in practice, barely more than one bit
is needed to encode the entire type field. That seems quite wasteful.

Thanks,
Florian

Fangrui Song

Mar 25, 2024, 2:02:39 PM
to Generic System V Application Binary Interface
On Monday, March 25, 2024 at 1:06:14 AM UTC-7 Florian Weimer wrote:
* via Generic System V. Application Binary Interface:

> Its content begins with a ULEB128-encoded value `count * 4 + shift`
> (34-bit or 66-bit unsigned), where:

I want to mention again that ULEB128 is really terrible to decode on
many CPUs.

It would probably be best if we could avoid any form of unary encoding
(so that decoders don't need a CLZ-like instructions).

 
I appreciate your interest in the proposal, and I've carefully considered the remarks regarding LEB128 performance, including https://news.ycombinator.com/item?id=11263378 and MLIR's use of PrefixVarint (https://mlir.llvm.org/docs/TargetLLVMIR/).
These concerns focus on scenarios where longer LEB128 encodings dominate. In our case, where one-byte encodings dominate, LEB128 is actually the best choice.

I implemented a quick analysis of ld.lld -z crel to measure the frequency of different encoding sizes for offset/flags.

    +static int z1,z2,z3,z4,z5,z6;
    +[[gnu::destructor]] static void foo() {
    +  printf("1: %d\n2: %d\n3: %d\n4: %d\n5: %d\n6: %d\n", z1,z2,z3,z4,z5,z6);
    +}
     template <class uint> bool CrelSection<uint>::updateAllocSize() {
       const size_t count = relocs.size();
       uint offsetMask = 8;
    @@ -2024,6 +2029,12 @@ template <class uint> bool CrelSection<uint>::updateAllocSize() {
         uint8_t b = (deltaOffset << flagBits) + (symidx != rel.r_symidx) +
                     (type != rel.r_type ? 2 : 0) +
                     (addend != uint(rel.r_addend && config->isRela) ? 4 : 0);
    +    if (deltaOffset<<flagBits < 1<<7) z1++;
    +    else if (deltaOffset<<flagBits < 1<<14) z2++;
    +    else if (deltaOffset<<flagBits < 1<<21) z3++;
    +    else if (deltaOffset<<flagBits < 1<<28) z4++;
    +    else if (deltaOffset<<flagBits < 1<<35) z5++;
    +    else if (deltaOffset<<flagBits < 1<<42) z6++;

    % LLD_IN_TEST=1 fld.lld @response.txt -z pack-relative-relocs -z crel -z now -o clang.relr+crel
    1: 4690
    2: 1708
    3: 76
    4: 6
    5: 0
    6: 0
    % LLD_IN_TEST=1 fld.lld @response.txt -z crel -z now -o clang.relr+crel -o clang.crel
    1: 1067336
    2: 1820
    3: 78
    4: 8
    5: 0
    6: 0

The results confirm that one-byte encodings are indeed the dominant case, and the number of 3+ byte encodings is negligible.

I think the CPU will predict the continuation branch of a straightforward implementation as not-taken. We can even special-case the one-byte encoding as a guarantee.
https://gist.github.com/binji/e70270f1e385ddb91c0a highlights that LEB128 is indeed superior.

Furthermore, LEB128 naturally extends to SLEB128 (sign extension) and 67-bit integers, while a PrefixVarint-style encoding would be more involved.

I've incorporated some of the findings into a revised version of my blog post:
https://maskray.me/blog/2024-03-09-a-compact-relocation-format-for-elf#leb128-among-variable-length-integer-encodings

    While we could utilize zigzag encoding (i>>31) ^ (i<<1) to convert SLEB128-encoded type/addend to use ULEB128 instead, the generated code is inferior to or on par with SLEB128 for one-byte encodings on x86, AArch64, and RISC-V.
   
    // One-byte case for SLEB128
    int64_t from_signext(uint64_t v) {
      return v < 64 ? v : v - 128;
    }
   
    // One-byte case for ULEB128 with zig-zag encoding
    int64_t from_zigzag(uint64_t z) {
      return (z >> 1) ^ -(z & 1);
    }
    While some variable-length integer schemes allocate more integers within the one-byte bucket, I do not believe they would lead to noticeable improvement over LEB128. For example, if I modify the prototype to increase the shift value by 1, .crel.dyn would decrease by 2.5%.
   
    Here is an extremely simple C decoder implementation for ULEB128 and SLEB128. The clever use of 64/128 is from Stefan O'Rear. The return type uint64_t can be changed to size_t when used in a dynamic loader.
   
    static uint64_t read_leb128(unsigned char **buf, uint64_t sleb_uleb) {
      uint64_t tmp = 0, shift = 0, byte;
      do {
        byte = *(*buf)++;
        tmp |= (byte - 128*(byte >= sleb_uleb)) << shift;
        shift += 7;
      } while (byte >= 128);
      return tmp;
    }
   
    uint64_t read_uleb128(unsigned char **buf) { return read_leb128(buf, 128); }
    int64_t read_sleb128(unsigned char **buf) { return read_leb128(buf, 64); }

 

 
> * Delta offset and flags (ULEB128): Holds `delta_offset * 8 + flags` (35-bit or 67-bit unsigned), where:
> + `delta_offset`: Difference in `r_offset` from the previous entry (`Elf_Addr`), right shifted by `shift
> `.
> + `flags & 1`: Indicate if delta symbol index is present.
> + `flags & 2`: Indicate if delta type is present.
> + `flags & 4`: Indicate if delta addend is present.
> * Delta symbol index (SLEB128, if present): The difference in symbol index from the previous entry
> (32-bit signed).
> * Delta type (SLEB128, if present): The difference in relocation type from the previous entry (32-bit
> signed).

32-bit signed does not permit encoding sequences like 0x0, 0xffff_ffff
or 0xffff_ffff, 0x0. Is this your intent?

This should probably be clarified to state that we use 32-bit wraparound arithmetic.
To switch from type 0 to type 0xffffffff, we encode an int32_t value -1.
To switch from type 0 to type 0x80000000, we encode an int32_t value -0x80000000.
 


I think this also applies to the initial 35-bit or 67-bit estimate
because the delta_offset is affected by this as well.

On ELFCLASS64, we use 64-bit wraparound arithmetic for delta_offset.
To switch from offset 0 to 0xffffffffffffffff, we encode (0xffffffffffffffff-0)*8 + flags (67-bit).
To switch from offset 0xfffffffffffffffe to 0, we encode 2*8 + flags (67-bit).

If we use PrefixVarInt instead, the code will be more complex.

BTW, I have made another revision to support implicit addends:

  • Delta offset and flags (ULEB128): Holds delta_offset * (addend_bit ? 8 : 4) + flags (35-bit or 67-bit unsigned), where:
    • delta_offset: Difference in r_offset from the previous entry (Elf_Addr), right shifted by shift.
    • flags: 0 to 7 if addend_bit is 1; otherwise 0 to 3.
    • flags & 1: Indicate if delta symbol index is present.
    • flags & 2: Indicate if delta type is present.
    • flags & 4: Indicate if delta addend is present.
    Dynamic loaders prioritizing simplicity can hardcode the desired addend_bit value, just like glibc supports either REL or RELA, but not both in one build.



    The delta-based encoding of relocation types works on x86-64, but it's
    really not ideal on AArch64 because there, R_AARCH64_ABS64 (257) and
    AARCH64_GLOB_DAT (1025) are common, and their difference is greater than
    127. Maybe this could be fixed by introducing an alternative value for
    R_AARCH64_ABS64 that is greater than 1024.

    On the other hand, this encoding needs two bits for encoding the
    *length* of the type delta, when in practice, barely more than one bit
    is needed to encode the entire type field. That seems quite wasteful.

This is not an issue. My ld.lld -z crel implementation sorts relocations by type, so we would encode

    ABS64(x), ABS64(y), ABS64(z), GLOB_DAT(a), GLOB_DAT(b), GLOB_DAT(c), TLS_TPREL64(u), TLS_TPREL64(v)

The differences are: ABS64, 0, 0, GLOB_DAT-ABS64, 0, 0, TLS_TPREL64-GLOB_DAT, 0.

We encode the delta type field just once for each relocation type.



Fangrui Song

Mar 25, 2024, 2:42:44 PM
to Generic System V Application Binary Interface
An updated proposal is attached at the end to support implicit addends.
If we had CREL earlier (0.3% .crel.dyn in a mostly statically linked lld), I suspect we would not have RELR (0.1% .relr.dyn), since CREL achieves the majority of the advantage of RELR in a more general way :)

Considering implicit addends for CREL

Many dynamic relocations have zero addends:

• COPY/GLOB_DAT/JUMP_SLOT relocations only use zero addends.
• Absolute relocations could use non-zero addends with an STT_SECTION symbol, but linkers convert them to relative relocations.

Usually only RELATIVE/IRELATIVE and potentially TPREL/TPOFF might require non-zero addends. Switching from DT_RELA to DT_REL offers a minor size advantage.

I considered defining two separate dynamic tags (DT_CREL and DT_CRELA) to distinguish between implicit and explicit addends. However, this would have introduced complexity:

• Should llvm-readelf -r dump the zero addends for DT_CRELA?
• Should dynamic loaders support both dynamic tags?

I placed the delta addend bit next to the offset bits so that it can be reused for offsets. Thanks to Stefan O'Rear for convincing me that my original idea of reserving a single bit flag (addend_bit) within the CREL header is elegant. Dynamic loaders prioritizing simplicity can hardcode the desired addend_bit value.

ld.lld -z crel defaults to implicit addends (addend_bit==0), but the option of using in-relocation addends is available with -z crel -z rela.

DT_AARCH64_AUTH_RELR vs CREL

The AArch64 PAuth ABI introduces DT_AARCH64_AUTH_RELR as a variant of RELR for signed relocations. However, its benefit seems limited.

In a release build of Clang 16, using -z crel -z rela resulted in a .crel.dyn section size of only 1.0% of the file size. Notably, enabling implicit addends with -z crel -z rel further reduced the size to just 0.3%. While DT_AARCH64_AUTH_RELR will achieve a noticeably smaller relocation size if most relative relocations are encoded with it, the advantage seems less significant considering CREL's already compact size.

Furthermore, DT_AARCH64_AUTH_RELR introduces additional complexity to the linker due to its 32-bit addend limitation: the in-place 64-bit value encodes a 32-bit schema, giving just 32 bits to the implicit addend. If the addend does not fit into 32 bits, DT_AARCH64_AUTH_RELR cannot be used. CREL with addends would avoid this complexity.

I have filed "Quantifying the benefits of DT_AARCH64_AUTH_RELR".


    ---

    Its content begins with a ULEB128-encoded value `count * 8 + implicit_addend * 4 + shift` (35-bit or 67-bit unsigned), where:

    * `count`: Relocation count (32-bit or 64-bit unsigned).
    * `addend_bit`: 1 indicates that relocation entries encode addends. 0 indicates implicit addends (stored in the location to be modified).
    * `shift`: The shift value (0 to 3) applies to `delta_offset` in relocation entries.

    Relocation entries (which encode `r_offset`, `r_symidx`, `r_type`, and `r_addend`) follow the header.

    Note: `r_info` in traditional REL/RELA formats has been split into `r_symidx` and `r_type`, allowing `uint32_t` relocation types for ELFCLASS32 as well.

    * Delta offset and flags (ULEB128): Holds `delta_offset * (addend_bit ? 8 : 4) + flags` (35-bit or 67-bit unsigned), where:
      + `delta_offset`: Difference in `r_offset` from the previous entry (truncated to `Elf32_Addr` or `Elf64_Addr`), right shifted by `shift`.
      + `flags`: 0 to 7 if `addend_bit` is 1; otherwise 0 to 3.

      + `flags & 1`: Indicates whether the delta symbol index is present.
      + `flags & 2`: Indicates whether the delta type is present.
      + `flags & 4`: Indicates whether the delta addend is present.
    * Delta symbol index (SLEB128, if present): The difference in symbol index from the previous entry, truncated to a 32-bit signed integer.
    * Delta type (SLEB128, if present): The difference in relocation type from the previous entry, truncated to a 32-bit signed integer.
    * Delta addend (SLEB128, if present): The difference in addend from the previous entry, truncated to a 32-bit or 64-bit signed integer depending on the object file class.

    ULEB128 or SLEB128 encoded values use the canonical representation (i.e., the shortest byte sequence).
    For the first relocation entry, the previous offset, symbol index, type, and addend members are treated as zero.


    Encoding/decoding the delta offset and flags does not need multi-precision arithmetic; we can unroll and special-case the first iteration.
    The header can be encoded/decoded in a similar way. An implementation can assume that the relocation count cannot be larger than 2**61 and simplify the code.
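
    For example (a non-normative sketch), because the shift and addend_bit fields occupy the low bits of the first header byte, a decoder can read the flags from that byte and derive the count with ordinary 64-bit arithmetic:

    ```cpp
    // Peek at the first byte for the flag fields, then decode the whole
    // header with 64-bit arithmetic (assumes count < 2**61).
    const uint8_t first = *p;
    const uint64_t hdr = decodeULEB128(p);
    const bool addendBit = first & 4; // equals hdr & 4
    const int shift = first & 3;      // equals hdr % 4
    const uint64_t count = hdr / 8;
    ```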

    Example C++ encoder:

    ```cpp
    // encodeULEB128(uint64_t, raw_ostream &os);
    // encodeSLEB128(int64_t, raw_ostream &os);

    const uint8_t addendBit = config->isRela ? 4 : 0, flagBits = config->isRela ? 3 : 2;
    Elf_Addr offsetMask = 8, offset = 0, addend = 0;
    uint32_t symidx = 0, type = 0;
    for (const Elf_Crel &rel : relocs)
      offsetMask |= rel.r_offset;

    // Exploit the common alignment of offsets with a shift.
    int shift = std::countr_zero(offsetMask);
    encodeULEB128(relocs.size() * 8 + addendBit + shift, os);
    for (const Elf_Crel &rel : relocs) {
      Elf_Addr deltaOffset = (rel.r_offset - offset) >> shift;
      offset = rel.r_offset;
      uint8_t b = (deltaOffset << flagBits) + (symidx != rel.r_symidx) +
                  (type != rel.r_type ? 2 : 0) + (addend != rel.r_addend ? 4 : 0);
      if (deltaOffset < (0x80 >> flagBits)) {
        os << char(b);
      } else {
        // Set the continuation bit and emit the remaining offset bits.
        os << char(b | 0x80);
        encodeULEB128(deltaOffset >> (7 - flagBits), os);
      }
      if (b & 1) {
        encodeSLEB128(static_cast<int32_t>(rel.r_symidx - symidx), os);
        symidx = rel.r_symidx;
      }
      if (b & 2) {
        encodeSLEB128(static_cast<int32_t>(rel.r_type - type), os);
        type = rel.r_type;
      }
      if (b & 4 & addendBit) {
        encodeSLEB128(std::make_signed_t<Elf_Addr>(rel.r_addend - addend), os);
        addend = rel.r_addend;
      }
    }
    ```

    Example C++ decoder:

    ```cpp
    // uint64_t decodeULEB128(uint8_t *&p);
    // int64_t decodeSLEB128(uint8_t *&p);

    const auto hdr = decodeULEB128(p);
    const size_t count = hdr / 8, flagBits = hdr & 4 ? 3 : 2, shift = hdr % 4;
    Elf_Addr offset = 0, addend = 0;
    uint32_t symidx = 0, type = 0;
    for (size_t i = 0; i != count; ++i) {
      const uint8_t b = *p++;
      offset += b >> flagBits;
      if (b >= 0x80)
        offset += (decodeULEB128(p) << (7 - flagBits)) - (0x80 >> flagBits);
      if (b & 1)
        symidx += decodeSLEB128(p);
      if (b & 2)
        type += decodeSLEB128(p);
      if (b & 4 & hdr)
        addend += decodeSLEB128(p);
      rels[i] = {offset << shift, symidx, type, addend};
    }
    ```

    Both encoder and decoder can be simplified if the desired `addend_bit` is hardcoded, making `flagBits` an integer literal.


    _In Figure 5-10: Dynamic Array Tags, d_tag, add:_

    `DT_CREL` | 38 | `d_ptr` | optional | optional

    _Add text below:_

    * `DT_CREL` - This element is similar to `DT_REL`, except its table uses the CREL format. The relocation count can be inferred from the header.

    _Update `DT_PLTREL` and `DT_PLTRELSZ`:_

    * `DT_PLTRELSZ`: This element holds the total size, in bytes, of the relocation entries associated with the procedure linkage table. If an entry of type `DT_JMPREL` is present and the `DT_PLTREL` entry value is `DT_REL` or `DT_RELA`, a `DT_PLTRELSZ` must accompany it.
    * `DT_PLTREL`: This member specifies the type of relocation entry to which the procedure linkage table refers. The `d_val` member holds `DT_REL`, `DT_RELA`, or `DT_CREL`, as appropriate. All relocations in a procedure linkage table must use the same relocation type.

    Ali Bahrami

    unread,
    Apr 1, 2024, 1:15:07 AMApr 1
    to gener...@googlegroups.com
    I haven't commented on this proposal yet, because I wanted to see where it
    would end up. The work that has gone into it is considerable, and interesting,
    and I think it's important to have explored the space.

    The spec at

    https://maskray.me/blog/2024-03-09-a-compact-relocation-format-for-elf#relleb-for-dynamic-relocations

    says that the "emphasis is on simplicity over absolute minimal encoding", and
    I don't doubt that it could have been far more complex. All good, but for me,
    the overall impact of introducing LEB to ELF, and then the roughly 3 page
    description for how it works, isn't stone simple. It's not the most complex thing,
    but there's some complexity.

    I'm more concerned about the big picture though, and no one has commented about
    that, so I guess it's time for me to stir the pot a bit, and see if I'm the
    only one with these concerns.

    Part of the discussion involved how to encourage toolchains to quickly
    adopt it. We can assume that some will (especially for binutils and
    llvm, if you drive the work as you did for RELR), but not all. That's
    just how standards work. It's probably the wrong goal --- implementations
    will adopt it if it solves a pressing problem for them, and not otherwise,
    so you can never assume that it's everywhere. I have yet to implement
    RELR, because the lack isn't causing problems here, and I'm happy to
    have less code to maintain. I don't see why CREL would be different.

    I do understand that the number of relocations are going up, and that
    there is some pressure to do something about that. And it seems that
    creating fewer relocations isn't going to be that solution, so there's
    a natural temptation to try and squeeze their encoding to compensate.
    That dynamic is as old as linking.

    All things being equal, who doesn't want smaller objects? However, all
    things are not equal here. We need to balance space, time, and complexity.
    This solution optimizes space at the expense of the other two, and I'm not
    convinced that it is the right tradeoff. While I would not mind if relocations
    took less space, the size of relocation sections isn't a limiting concern,
    and I don't feel urgent pressure to solve it at any cost. Objects with
    relocation sections that are too large to deal with are a somewhat fringe
    case. Traditional Rel/Rela make simple things easy, and extreme things possible,
    and that works pretty well.

    And so, I'm not excited to see more types of relocation section. We
    started with REL and RELA, and arguably, that was one too many. To me,
    REL seems like a mistake that RELA fixed. Then came RELR. I didn't
    really feel the need for it myself, but I went along, and even tried
    to help improve it, taking it on faith that it was needed, that it
    would be acceptable to allow just one more thing into this area, and
    that the gABI is the right place for it. I expected that to be the
    end of the relocation compression discussion, and was surprised
    to see RELLEB proposed, which has now evolved to CREL. And then,
    sad to read this:

    > If we had CREL earlier (0.3% .crel.dyn in a mostly statically
    > linked lld), I suspect we would not have RELR (0.1% .relr.dyn),
    > since CREL achieves the majority of the advantage of RELR in
    > a more general way:)

    That suggests that we standardized RELR too quickly. We designed it
    on list, took it on faith that it was a fairly permanent solution
    that had been thought through, and immediately added it to the standard.
    No doubt it works as designed, but now we're already discussing an additional
    relocation section type, one that could have rendered RELR unnecessary.
    How many relocation section types are we going to end up with? It's easy
    to add features to an ABI, but impossible to remove them, so it's important
    to have a clear idea of the final destination before adding more.

    The process for CREL (designed on list, an immediate move to standardize)
    is a lot like the one for RELR. How can we know that this is the last
    one (no more relocation sections, no additional tweaks to CREL)?

    Questions:

    - Is this problem really worth solving? Note that I'm not asking
    if it would be nice, or even useful, but rather, if it's essential.
    What happens if we don't, other than already large objects being
    somewhat larger?

    - Can it be made (a lot) simpler? Is LEB the best approach, or just
    one we know from DWARF?

    - Can we drop RELR as part of the deal?

    - How do we know when we have the right, enduring solution, so that
    we won't be back with more proposals that each nibble away at smaller
    and smaller problems?

    Hard questions, I know. I appreciate the discussion, and the energy
    behind it. I accept that CREL makes smaller relocation sections, but as with
    RELR, that's not necessarily an urgent problem for us, and so, I'm reluctant
    to embrace even more plumbing.

    Thanks.

    - Ali

    Fangrui Song

    unread,
    Apr 1, 2024, 3:26:17 PMApr 1
    to gener...@googlegroups.com, Ali Bahrami
    > I haven't commented on this proposal yet, because I wanted to see where it
    > would end up. The work that has gone into it is considerable, and interesting,
    > and I think its important to have explored the space.
    >
    > The spec at
    >
    > https://maskray.me/blog/2024-03-09-a-compact-relocation-format-for-elf#relleb-for-dynamic-relocations
    >
    > says that the "emphasis is on simplicity over absolute minimal encoding", and
    > I don't doubt that it could have been far more complex. All good, but for me,
    > the overall impact of introducing LEB to ELF, and then the roughly 3 page
    > description for how it works, isn't stone simple. It's not the most complex thing,
    > but there's some complexity.
    >
    > I'm more concerned about the big picture though, and no one has commented about
    > that, so I guess it's time for me to stir the pot a bit, and see if I'm the
    > only one with these concerns.

    Hi Ali,

    Thanks for taking the time to provide your feedback on the proposal.
    I'll discuss benefits, complexity, and RELR below.

    > Part of the discussion involved how to encourage toolchains to quickly
    > adopt it. We can assume that some will (especially for binutils and
    > llvm, if you drive the work as you did for RELR), but not all. That's
    > just how standards work. It's probably the wrong goal --- implementations
    > will adopt it if it solves a pressing problem for them, and not otherwise,
    > so you can never assume that it's everywhere. I have yet to implement
    > RELR, because the lack isn't causing problems here, and I'm happy to
    > have less code to maintain. I don't see why CREL would be different.

    You're right, toolchain adoption depends on a pressing need, not just
    enthusiasm.
    I'll discuss why the size of relocation sections is a limiting concern below.

    For LLVM, I have mature implementations for the key components:
    compiler driver, assembler, linker, and binary utilities.
    With an official specification (or possibly enough interested parties
    given the evolution status of the generic ABI), I'm confident progress
    can be swift.

    On the binutils side, I've submitted a feature request.
    While I'm hoping for a volunteer, I'm also open to contributing
    myself, potentially starting with objdump/readelf support.
    For RELR, we were fortunate to have HJ Lu and Alan Modra's work with
    GNU ld's x86 and ppc64 ports.

    > I do understand that the number of relocations are going up, and that
    > there is some pressure to do something about that. And it seems that
    > creating fewer relocations isn't going to be that solution, so there's
    > a natural temptation to try and squeeze their encoding to compensate.
    > That dynamic is as old as linking.
    >
    > All things being equal, who doesn't want smaller objects? However, all
    > things are not equal here. We need to balance space, time, and complexity.
    > This solution optimizes space at the expense of the other two, and I'm not
    > convinced that it is the right tradeoff. While I would not mind if relocations
    > took less space, the size of relocation sections isn't a limiting concern,
    > and I don't feel urgent pressure to solve it at any cost. Objects with
    > relocation sections that are too large to deal with are a somewhat fringe
    > case. Traditional Rel/Rela make simple things easy, and extreme things possible,
    > and that works pretty well.

    RELA is the most size-inefficient control structure and its proportion in
    relocatable files is significant: around 20% for x86-64 (slightly more with -g,
    slightly less with -g -gsplit-dwarf) and around 40% for riscv64 with linker
    relaxation.

    Before CREL, I explored precursor formats
    (https://maskray.me/blog/2024-01-14-exploring-object-file-formats#relocations)
    and existing dynamic relocation solutions (RELR and Android's packed
    relocations).
    CREL offers an incredibly efficient alternative.

    Regarding decoder complexity, the decoder can be implemented in just a
    few lines when decode[US]LEB128 are available:

    ```cpp
    const auto hdr = decodeULEB128(p);
    const size_t count = hdr / 8, flagBits = hdr & 4 ? 3 : 2, shift = hdr % 4;
    Elf_Addr offset = 0, addend = 0;
    uint32_t symidx = 0, type = 0;
    for (size_t i = 0; i != count; ++i) {
      const uint8_t b = *p++;
      offset += b >> flagBits;
      if (b >= 0x80)
        offset += (decodeULEB128(p) << (7 - flagBits)) - (0x80 >> flagBits);
      if (b & 1)
        symidx += decodeSLEB128(p);
      if (b & 2)
        type += decodeSLEB128(p);
      if (b & 4 & hdr)
        addend += decodeSLEB128(p);
      rels[i] = {offset << shift, symidx, type, addend};
    }
    ```


    I have mentioned the 12-line decode[US]LEB128 implementations in my blog post.
    Note: many implementations already have LEB128 library functions in
    place (for DWARF).
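
    For reference, a sketch of such helpers (modeled on common DWARF routines, not the exact code from the blog post):

    ```cpp
    // ULEB128/SLEB128 decoders; p is advanced past the encoded value.
    static uint64_t decodeULEB128(const uint8_t *&p) {
      uint64_t value = 0;
      for (int shift = 0;; shift += 7) {
        const uint8_t b = *p++;
        value |= uint64_t(b & 0x7f) << shift;
        if (!(b & 0x80))
          return value;
      }
    }
    static int64_t decodeSLEB128(const uint8_t *&p) {
      uint64_t value = 0;
      int shift = 0;
      uint8_t b;
      do {
        b = *p++;
        value |= uint64_t(b & 0x7f) << shift;
        shift += 7;
      } while (b & 0x80);
      if (shift < 64 && (b & 0x40)) // sign-extend the last byte
        value |= uint64_t(-1) << shift;
      return int64_t(value);
    }
    ```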

    While the decoder is longer than RELR's, it is still comparable.
    The decoder can be simplified if the desired `addend_bit` is
    hardcoded, making flagBits an integer literal.

    For comparison, a typical RELR decoder:

    ```cpp
    relrlim = (const Elf_Relr *)((const char *)obj->relr + obj->relrsize);
    for (relr = obj->relr; relr < relrlim; relr++) {
      Elf_Relr entry = *relr;
      if ((entry & 1) == 0) {
        where = (Elf_Addr *)(obj->relocbase + entry);
        *where++ += (Elf_Addr)obj->relocbase;
      } else {
        // Bitmap word: each set bit relocates one word after `where`.
        for (long i = 0; (entry >>= 1) != 0; i++)
          if ((entry & 1) != 0)
            where[i] += (Elf_Addr)obj->relocbase;
        where += CHAR_BIT * sizeof(Elf_Relr) - 1;
      }
    }
    ```
    > - Can we drop RELR as part of the deal?
    >
    > - How do we know when we have the right, enduring solution, so that
    > we won't be back with more proposals that each nibble away at smaller
    > and smaller problems?
    >
    > Hard questions, I know. I appreciate the discussion, and the energy
    > behind it. I accept that CREL makes smaller relocation sections, but as with
    > RELR, that's not necessarily an urgent problem for us, and so, I'm reluctant
    > to embrace even more plumbing.
    >
    > Thanks.

    I understand the concern about the proliferation of relocation formats.
    Careful consideration went into the development of CREL, aiming for an
    optimal balance between size, time, and implementation complexity.

    As explained in a previous paragraph, I think CREL balances
    size/time/complexity very well.
    CREL's primary target is the substantial space occupied by static
    relocations (20% for aarch64/x86-64 and 40% for riscv64).
    Additionally, it offers significant compression for non-relative
    dynamic relocations.
    While the non-relative relocations' proportion is usually smaller, GUI
    libraries such as GTK and Qt spend 10+% on .rela.dyn.

    RELR effectively addresses bulky relative relocations (reducing them
    from 10% to 0.1% of an executable).
    For this specific use case, RELR remains a superior choice.
    A small increase in complexity may be justified for those who
    prioritize the greater compression it offers over CREL.
    0.1% of a 100MiB executable is 100KiB, far larger than the size needed
    for a RELR decoder.
    Firefox's elfhack/relrhack (https://glandium.org/blog/?p=4297)
    demonstrates this point.

    If our goal is a unified, versatile solution, CREL is compelling.
    In a future port with no REL/RELA compatibility burden, CREL could
    potentially become the sole supported static relocation format.
    If we ever have ELF version 2
    (https://maskray.me/blog/2024-04-01-light-elf-exploring-potential-size-reduction),
    we can drop REL/RELA.

    The dramatic savings achieved by RELR (executable: 10% to 0.1%) and
    those projected for CREL (.o: 20% to 3%) reinforce the principle of
    diminishing returns.
    New formats would face a high justification bar as they're unlikely to
    offer substantial additional savings.
    This is quite similar to the ELFCOMPRESS_ZSTD situation.

    > - Can it be made (a lot) simpler? Is LEB the best approach, or just
    > one we know from DWARF?

    https://maskray.me/blog/2024-03-09-a-compact-relocation-format-for-elf#leb128-among-variable-length-integer-encodings
    describes my process of finding the best variable-length integer encoding
    for our goals.
    The clever implementation from Stefan O'Rear is hard to beat.

    Michael Matz

    unread,
    Apr 3, 2024, 11:42:15 AMApr 3
    to 'Fangrui Song' via Generic System V Application Binary Interface, Ali Bahrami
    Hello,

    On Mon, 1 Apr 2024, 'Fangrui Song' via Generic System V Application Binary Interface wrote:

    > Regarding decoder complexity, the decoder can be implemented in just a
    > few lines when decode[US]LEB128 are available:

    How to decode/encode things isn't the only dimension of complexity:
    * the RELA format can be mmaped and accessed (and modified!) in place. CREL
    needs decoding and encoding.
    * REL(A) can be random-accessed, CREL cannot.

    I.e. the very need for decode/encode itself is already complexity.

    I have a fancy for optimal encoding of stuff myself, and had I been around
    when ELF was designed I certainly would have had something to say about
    the RELA format. I probably would have made the addend optional per
    reloc, and found better encoding of type and symbol. _Without_ giving up
    on random access and mmap-ability. (Ending up in something not quite as
    small as CREL but still reasonably small)

    I needed to resist the temptation to suggest improvements for CREL here
    and there :-) Because: yes, whenever I look at ELF files and see section
    sizes of REL(A) sections I go "meh, what a design". OTOH, I'm not
    often worried about sizes of .o files. (I do worry sometimes, for the
    reasons you already stated, and for instance because LTO uses temporary .o
    files and the way we're doing it with GCC is partly influenced by the wish
    to not need too large tmp space for those, and smaller RELs would help a
    little with that).

    So, it remains a tradeoff, and I haven't convinced myself (yet?) that
    another relocation format is really the best thing in the grand scheme of
    things. Or that CREL, as it's proposed, is really the best solution if a
    new reloc format should be introduced.

    (And yes: the timing for CREL is really bad, only a couple moons after
    RELR :-( )

    So, hmm, I don't know...


    Ciao,
    Michael.

    Fangrui Song

    unread,
    Apr 3, 2024, 5:07:36 PMApr 3
    to Generic System V Application Binary Interface
    On Wednesday, April 3, 2024 at 8:42:15 AM UTC-7 Michael Matz wrote:
    Hello,

    On Mon, 1 Apr 2024, 'Fangrui Song' via Generic System V Application Binary Interface wrote:

    > Regarding decoder complexity, the decoder can be implemented in just a
    > few lines when decode[US]LEB128 are available:

    How to decode/encode things isn't the only dimension of complexity:
    * the RELA format can be mmaped and accessed (and modified!) in place. CREL
    needs decoding and encoding.
    * REL(A) can be random-accessed, CREL cannot.

    I.e. the very need for decode/encode itself is already complexity.

    I have a fancy for optimal encoding of stuff myself, and had I been around
    when ELF was designed I certainly would have had something to say about
    the RELA format. I probably would have made the addend optional per
    reloc, and found better encoding of type and symbol. _Without_ giving up
    on random access and mmap-ability. (Ending up in something not quite as
    small as CREL but still reasonably small)

    Thanks for sharing your perspective.
    I agree that CREL gives up the random-access capability. The very beginning of https://maskray.me/blog/2024-03-09-a-compact-relocation-format-for-elf specifies:

    > This principle, outlined in _Proceedings of the Summer 1990 USENIX Conference, ELF: An Object File to Mitigate Mischievous Misoneism_, promotes ease of random access for structures like program headers, section headers, and symbols. ... Since relocations are typically processed sequentially, they don't gain the same random-access advantages.

    I have surveyed tools including rtld, linkers, and objcopy implementations and concluded that this sacrifice is acceptable.

    glibc's lazy binding scheme relies on random access to relocation entries within the DT_JMPREL table (https://maskray.me/blog/2021-09-19-all-about-procedure-linkage-table#:~:text=_dl_fixup).
    This is a specialized use case; it only requires random access because glibc is unwilling to allocate memory to hold the decoded symbol indexes.
    (A lot of distributions have migrated to Full RELRO, rendering lazy binding unneeded.)

    Most other tools scan relocations sequentially.

    Many tools build internal representations, making in-place modifiability less essential. For example:

    * objcopy: Builds an internal representation (`struct reloc_cache_entry`) and converts symbol indexes in `elf_slurp_reloc_table_from_section`.
    * llvm-objcopy: Creates an internal representation transforming symbol indexes to Symbol * pointers.

    The internal representations allow tools to easily manage removed/added symbols and obtain updated symbol indexes (or 0 for removed symbols).
    While it's technically possible for tools to prepare a symbol index mapping (e.g. [0,1,2,3]) to utilize in-place modifiability, handling removed symbols generically would be a challenge.

     

    I needed to resist the temptation to suggest improvements for CREL here
    and there :-) Because: yes, whenever I look at ELF files and see section
    sizes of REL(A) sections I go "meh, what a design". OTOH, I'm not
    often worried about sizes of .o files. (I do worry sometimes, for the
    reasons you already stated, and for instance because LTO uses temporary .o
    files and the way we're doing it with GCC is partly influenced by the wish
    to not need too large tmp space for those, and smaller RELs would help a
    little with that).

    So, it remains a tradeoff, and I haven't convinced myself (yet?) that
    another relocation format is really the best thing in the grand scheme of
    things. Or that CREL, as it's proposed, is really the best solution if a
    new reloc format should be introduced.
     
    I'm glad to hear that you, too, see room for improving REL/RELA.
    Do you have specific suggestions for a new relocation format that gives up random-access capability?
    Or do you see a case where in-place random-access capability is critically needed?

    As I'm sure you understand, I didn't propose CREL lightly.
    It's the result of extensive research and consideration of cross-architecture requirements.
    I feel a pressing need for a new format, driven by several factors:

    * Compilers are fighting against a wrong metric (relocation size due to REL/RELA bloat): DWARF evolution, .llvm_addrsig, ...
    * Some ELFCLASS32 architectures are running out of relocation types.
    * Overall file size: ~20% for some architectures, and ~40% for RISC-V.

    My reply at https://sourceware.org/pipermail/binutils/2024-March/133229.html might be useful for this group as well.

    * Relocation sizes affect DWARF evolution and we were/are using an
      imperfect metric due to overly bloated REL/RELA. .debug_str_offsets
      does not get much traction in GCC, probably partly because it needs
      relocations. DWARF v5 introduced changes to keep relocations small.
      Many are good on their own, but we need to be cautious of relocation
      concerns causing us to pick the wrong trade-off in the future.
    * On many Linux targets, Clang emits .llvm_addrsig by default to allow
      ld.lld --icf=safe. .llvm_addrsig stores symbol indexes in ULEB128
      instead of using relocations to prevent a significant size increase.
    * Static relocations make .a files larger.
    * Some users care about the build artifact size due to limited disk space.
      + I believe part of the reasons -ffunction-sections -fdata-sections
        do not get more adoption is due to the relocatable file size concern.
      + I prefer to place build directories in Linux tmpfs. 12G vs 10G in
        memory matters to me :)
      + Large .o files => more I/O. This may be more significant
        when the storage is remote.
     
    (And yes: the timing for CREL is really bad, only a couple moons after
    RELR :-( )

    So, hmm, I don't know...


    Ciao,
    Michael.

    RELR was finalized in Feb 2018, more than 6 years ago now.
    Considering ELF's long lifespan, a six-year interval (long enough for many projects) could be argued as either long or short.

    In terms of complexity reduction, perhaps the adoption of CREL for dynamic relocations could have prevented future RELR variants like DT_AARCH64_AUTH_RELR
    (https://github.com/ARM-software/abi-aa/issues/252).
     

    James Y Knight

    unread,
    Apr 4, 2024, 8:10:49 AMApr 4
    to gener...@googlegroups.com
    I wonder if it might be a good idea to restrict this proposal to only relocatable object files and not executable files. The requirements are somewhat different both in terms of expected processing, and in terms of expected sorts of relocations needed.

    E.g. Apple has come up with an interesting scheme for MachO to encode (at least most) relocations in a "chained fixups" encoding for executable binaries. This encoding makes it easy to apply a single page's relocations on demand during page-in from backing file, inside the kernel. I'm not familiar with all the details of the encoding, but, relocations are indexed by page offset within the segment to make it easy to find all of the relocations for a given page.

    The purpose is to reduce startup time/memory overhead, by avoiding the need to apply relocations to unused parts of the program, AND allows pages to be dropped under memory pressure without having to write them out to swap, as they don't get marked dirty.

    I'm not suggesting anyone implement this sort of scheme for ELF now, but, maybe it'd be good to avoid the disruption of an executable-file relocation format change unless it could also enable this sort of optimization?

    For the relocatable object file case (which is where it seems like CREL is also of the most benefit?), a page-indexed encoding would be useless, so the above concern is irrelevant if CREL is proposed only for relocatable object files.


    r...@google.com

    unread,
    Apr 4, 2024, 12:29:21 PMApr 4
    to Generic System V Application Binary Interface
    +1 to what James said.

    I was also wondering if the System V generic-abi@ group has or wants to have something similar to the C++ WG21 "technical specification" process, where a format extension is proposed and documented, but it's not required that every ELF implementation support the extension.

    As James points out, CREL is particularly useful for relocatable object files, especially when the final build artifacts are all executables and shared objects. Setting aside static archives for the moment: in this case, CREL is an implementation detail of the build that saves build resources. It is a handshake between the compiler and the static linker, and is easy to disable with a compiler flag if a different static linker is needed or there is some other consumer of the relocatable objects.

    Fangrui Song

    unread,
    Apr 5, 2024, 6:55:37 PMApr 5
    to Generic System V Application Binary Interface
    Thanks for the support and feedback, James!

    On Thursday, April 4, 2024 at 5:10:49 AM UTC-7 James Y Knight wrote:
    I wonder if it might be a good idea to restrict this proposal to only relocatable object files and not executable files.

    I think we don't need to.
    I believe demonstrating CREL's versatility for both static and dynamic relocations is valuable.
    This highlights its potential as a unified format, making a future dynamic-only scheme less likely.

    Actually, the dynamic relocation description of the proposal is short (https://groups.google.com/g/generic-abi/c/yb0rjw56ORw/m/qQWVkGpuAQAJ):
    it proposes a new dynamic tag DT_CREL, and updates DT_PLTRELSZ/DT_PLTREL wording to mention availability of DT_CREL.

    The generic ABI leaves a lot to the implementation.
     
    The requirements are somewhat different both in terms of expected processing, and in terms of expected sorts of relocations needed.

    Dynamic relocations, RELATIVE/IRELATIVE or not, fall into one of the two patterns described at
    https://maskray.me/blog/2024-03-09-a-compact-relocation-format-for-elf#crel-relocation-format

    * Symbol index/type remain constant with varying addends, common for STT_SECTION symbols, e.g. .rodata, .eh_frame, .debug_str_offsets, .debug_names, .debug_line, and .debug_addr.
    * Type/addend remain constant with varying symbol indexes, common for non-section symbols, e.g. function calls, C++ virtual tables, and dynamic relocations.

    CREL is designed to exploit both patterns for optimal encoding for both relocatable files and executable/shared object files:)
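
    A hypothetical encoding of the second pattern (numbers invented for illustration): three R_X86_64_PLT32 relocations at offsets 0x10/0x20/0x30 against symbols 5/7/9, each with addend -4, take 9 bytes instead of 72 (three 24-byte Elf64_Rela entries):

    ```
    header:  3*8 + 4 + 3 = 0x1f      (count=3, addend_bit=1, shift=3)
    entry 1: 0x17 0x05 0x04 0x7c     (flags=7: symidx +5, type +4, addend -4)
    entry 2: 0x11 0x02               (flags=1: symidx +2)
    entry 3: 0x11 0x02               (flags=1: symidx +2)
    ```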
     
    E.g. Apple has come up with an interesting scheme for MachO to encode (at least most) relocations in a "chained fixups" encoding for executable binaries. This encoding makes it easy to apply a single page's relocations on demand during page-in from backing file, inside the kernel. I'm not familiar with all the details of the encoding, but, relocations are indexed by page offset within the segment to make it easy to find all of the relocations for a given page.

    The purpose is to reduce startup time/memory overhead, by avoiding the need to apply relocations to unused parts of the program, AND allows pages to be dropped under memory pressure without having to write them out to swap, as they don't get marked dirty.

    I'm not suggesting anyone implement this sort of scheme for ELF now, but, maybe it'd be good to avoid the disruption of an executable-file relocation format change unless it could also enable this sort of optimization?

    For the relocatable object file case (which is where it seems like CREL is also of the most benefit?), a page-indexed encoding would be useless, so the above concern is irrelevant if CREL is proposed only for relocatable object files.

    I'm familiar with Apple's chained fixups concept and have studied relevant resources
    (https://github.com/qyang-nj/llios/blob/main/dynamic_linking/chained_fixups.md , lld/MachO and llvm-objdump --macho --dyld-info code).

    Apple's scheme consists of two parts: the concept and the implementation.
    While intriguing, some aspects of the concept make chained fixups inherently less suitable as a generic mechanism applicable to all ELF OSes:

    * It necessitates kernel understanding of dynamic linking data and communication with the dynamic loader.
    * The encoding repurposes the ELF "implicit addend" for page offsets, limiting its use for larger addends. Some addend bits are stolen to encode a singly linked list (dyld_chained_ptr_64_bind::next, dyld_chained_ptr_64_rebase::next).
      + For the rebase operation (ELF relative relocation), 36 least significant bits and 8 most significant bits are encoded. This is insufficient if we need a larger addend.
      + The page offset is a multiple of 4, further losing generality.

    The implementation relies heavily on two dynamic relocation types (rebase and bind: similar to relative and symbolic (R_X86_64_64/R_X86_64_GLOB_DAT/general-dynamic TLS)), which doesn't directly translate to ELF's broader range.
    For ELF, we would need REL/RELA/CREL like encoding for the relocation type.

    That said, if an ELF OS were to adopt a similar scheme, CREL could potentially replace certain aspects of the chained fixups implementation.
    We could encode the first relocations along with the relocation type information in each page in CREL.
    We could ditch the import table in favor of CREL's symbol index encoding and remove DYLD_CHAINED_IMPORT/DYLD_CHAINED_IMPORT_ADDEND/DYLD_CHAINED_IMPORT_ADDEND64 complexity.
    However, such optimizations might remain OS-specific and never be part of the generic ABI.