On the implementation of IBT-enabled PLT with lazy binding

Fāng-ruì Sòng

Apr 3, 2019, 2:11:44 AM

to x86-6...@googlegroups.com, binu...@sourceware.org

Chapter 13 "Intel CET Extension" of the x86-64 psABI describes an
alternative PLT scheme for IBT (Indirect Branch Tracking). With GCC >= 8
and the latest ld.bfd (in binutils-gdb), we can see the synthesized PLT
with:

gcc -g -fuse-ld=bfd -fcf-protection=branch a.c -Wl,-z,ibtplt,-z,now -o a  # -mibt for some older GCC 8 releases
objdump -d a

A PLT function, say `putchar`, has instruction sequences in both .plt
and .plt.sec:

.plt (16 bytes)
1030: f3 0f 1e fa endbr64
1034: 68 00 00 00 00 pushq $0x0
1039: f2 e9 e1 ff ff ff bnd jmpq 1020 <.plt>
103f: 90 nop

.plt.sec (16 bytes)
0000000000001060 <putchar@plt>:
1060: f3 0f 1e fa endbr64
1064: f2 ff 25 5d 2f 00 00 bnd jmpq *0x2f5d(%rip) # 3fc8 <putchar@GLIBC_2.2.5>
106b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
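
The addresses in the listing above can be cross-checked from the raw bytes. The following is a minimal Python sketch, not part of any tool mentioned here; the helper name `rip_relative_target` is made up for illustration. It assumes only that both jumps end in a signed 32-bit little-endian displacement measured from the address of the next instruction.

```python
import struct

def rip_relative_target(insn_addr: int, insn_bytes: bytes) -> int:
    """Return insn_addr + len(insn) + disp32, where disp32 is the signed
    little-endian displacement stored in the last 4 bytes of the instruction."""
    (disp,) = struct.unpack("<i", insn_bytes[-4:])
    return insn_addr + len(insn_bytes) + disp

# bnd jmpq *0x2f5d(%rip) at 0x1064: address of the GOT slot it jumps through
got_slot = rip_relative_target(0x1064, bytes.fromhex("f2ff255d2f0000"))

# bnd jmpq 1020 <.plt> at 0x1039: target of the direct jump (the PLT header)
plt0 = rip_relative_target(0x1039, bytes.fromhex("f2e9e1ffffff"))

print(hex(got_slot), hex(plt0))  # 0x3fc8 0x1020
```

Both results agree with the annotations objdump printed: the GOT slot at 0x3fc8 and the start of .plt at 0x1020.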

.text uses `callq 1060` to call putchar@plt. 0x1064 jumps to 0x1030
for the initial call (lazy binding). After the stub at 0x1030 resolves
the GOT slot to the real entry, future 0x1060 calls will jump directly
to the real entry.
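
The control flow described above can be modelled in a few lines. This is a toy Python simulation under stated assumptions, not loader code: `REAL_PUTCHAR` is an invented address, and `resolve` is a stand-in for whatever the dynamic loader does after the jump to the start of .plt.

```python
REAL_PUTCHAR = 0x7F0000001000  # hypothetical run-time address of putchar
LAZY_STUB = 0x1030             # .plt entry: endbr64; pushq $0x0; jmp to .plt start
GOT_SLOT = 0x3FC8              # slot read by the .plt.sec entry at 0x1060

# Before the first call, the GOT slot points back at the lazy stub in .plt.
got = {GOT_SLOT: LAZY_STUB}
resolver_calls = 0

def resolve(reloc_index: int) -> int:
    """Stand-in for the loader's resolver: patch the GOT slot, return the target."""
    global resolver_calls
    resolver_calls += 1
    got[GOT_SLOT] = REAL_PUTCHAR
    return REAL_PUTCHAR

def call_putchar_plt() -> int:
    """Model `callq 1060`: jump through the GOT slot and report where we land."""
    target = got[GOT_SLOT]
    if target == LAZY_STUB:
        # The stub pushes relocation index 0 and enters the resolver.
        target = resolve(0)
    return target

first = call_putchar_plt()   # goes through the stub and the resolver
second = call_putchar_plt()  # jumps straight to the real entry
print(hex(first), hex(second), resolver_calls)
```

In this model the resolver runs exactly once; every later call reads the patched slot and never touches the .plt stub again.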

I have several questions regarding the second PLT scheme.

1. Should psABI change .splt to .plt.sec?

The implementation uses .plt.sec for this feature.
PLT sections do not have a dedicated section type, and in practice
they are usually recognized by the name .plt. Tools that do this
include, but are not limited to, disassemblers (objdump,
llvm-objdump), assemblers (e.g. llvm-mc, which emits warnings for
unusual section flags), binary instrumentation tools, profilers, and
debuggers.

If the implementations pick names different from the ABI, tools have
to understand both .plt.sec and .splt to be ABI conforming. The
complexity could have been avoided if implementations and the ABI
agreed on the same name: .plt.sec

I prefer .plt.sec to .splt because the convention is already used in
several other places to assign fine-grained semantics to sections,
e.g. .text.hot .text.startup .text.unlikely
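
To illustrate why the name matters to tools, here is a small Python sketch of a name-based check. The function `is_plt_by_prefix` is hypothetical and is not taken from any of the tools listed above; it only shows that a rule written around the dotted-suffix convention picks up .plt.sec but not .splt.

```python
def is_plt_by_prefix(section_name: str) -> bool:
    """Treat `.plt` and any `.plt.<suffix>` section as a PLT section."""
    return section_name == ".plt" or section_name.startswith(".plt.")

names = [".plt", ".plt.sec", ".splt", ".text"]
results = {name: is_plt_by_prefix(name) for name in names}
print(results)
# {'.plt': True, '.plt.sec': True, '.splt': False, '.text': False}
```

A tool using such a rule would need a separate special case to recognize .splt.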

2. Merge .plt and .plt.sec

As I proposed at https://reviews.llvm.org/D59780#1451608 , since we
don't emit the bnd prefix (0xf2) for MPX (dropped by GCC 9), we can
merge .plt and .plt.sec entries as follows:

4 endbr64
6 jmpq *xxx(%rip) ; jump to the next endbr64 for lazy binding
4 endbr64
5 pushq ; relocation index
5 jmpq xxx ; direct jump to .plt

This PLT entry takes 4+6+4+5+5=24 bytes (the indirect jmpq is ff 25
plus a 4-byte displacement), and fits exactly in a 24-byte entry size
if we aim for 8-byte alignment.

Not having to deal with .plt.sec simplifies implementation of PLT-aware tools.

(If MPX is ever revived (I am not sure how likely that is), the bnd
prefixes before the two jmpq instructions will take another 2 bytes
and the PLT entry will no longer fit in a 24-byte entry. We can expand
it to 32 bytes then.)
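
The size arithmetic can be checked mechanically. The Python sketch below is not part of the proposal itself; its constant and function names are illustrative. It takes instruction lengths from the standard x86-64 encodings, which also match the objdump listing earlier in this message once the f2 prefix is dropped: ff 25 disp32 (6 bytes) for the RIP-relative indirect jump and e9 rel32 (5 bytes) for the direct jump.

```python
ENDBR64 = 4       # f3 0f 1e fa
JMP_INDIRECT = 6  # ff 25 disp32
PUSH_IMM32 = 5    # 68 imm32
JMP_DIRECT = 5    # e9 rel32
BND_PREFIX = 1    # f2, one per jump when MPX is in use

def align_up(size: int, alignment: int) -> int:
    """Round size up to the next multiple of alignment."""
    return (size + alignment - 1) // alignment * alignment

def merged_entry_size(with_bnd: bool) -> int:
    """Byte length of the merged entry: endbr64, indirect jmp, endbr64, push, direct jmp."""
    prefix = BND_PREFIX if with_bnd else 0
    return (ENDBR64 + prefix + JMP_INDIRECT
            + ENDBR64 + PUSH_IMM32 + prefix + JMP_DIRECT)

no_bnd = merged_entry_size(with_bnd=False)
with_bnd = merged_entry_size(with_bnd=True)
print(no_bnd, align_up(no_bnd, 8))      # 24 24
print(with_bnd, align_up(with_bnd, 8))  # 26 32
```

By this count the merged entry is 24 bytes without bnd prefixes, so it exactly fills a 24-byte slot, and 26 bytes with them, which rounds up to 32 at 8-byte alignment.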

3. Necessity of the second PLT

It was raised in https://reviews.llvm.org/D58102 that splitting the
instruction sequences into .plt and .plt.sec may improve code cache
locality. As I understand it, in theory .plt.sec is hot while .plt is
cold (each entry is used only on the first call). That being said, we
have seen no evidence or benchmark results supporting the claim.

The other argument is that it provides compatibility with other tools
that have a hardcoded PLT entry size of 16 bytes.

I found that top-of-tree gdb/objdump cannot symbolize 16-byte .plt and
.plt.sec entries without the bnd prefixes
(https://reviews.llvm.org/D59780#1451608), so even if the entry size
stays at 16, existing tools have to adopt new rules to symbolize PLT
entries.
Thus, we are causing trouble for existing tools whether or not we
introduce the second PLT.
Given the complexity of the second PLT, not having it might be
better.
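
To make the entry-size dependence concrete, here is a hedged Python sketch of one simple symbolization strategy: assume entry i of a PLT section serves relocation i and name it accordingly. This is an assumption for illustration only; the tools discussed above may instead match instruction byte patterns, which would explain why dropping the bnd prefix affects them. The function name, the second symbol `puts`, and the assumption that the section starts at 0x1060 are all hypothetical.

```python
def plt_entry_symbols(sec_addr, entry_size, reloc_symbols, header_size=0):
    """Map each entry address to '<symbol>@plt', assuming entry i serves relocation i."""
    return {
        sec_addr + header_size + i * entry_size: f"{name}@plt"
        for i, name in enumerate(reloc_symbols)
    }

relocs = ["putchar", "puts"]  # hypothetical relocation order

# A tool hardcoding 16-byte entries versus one told about 24-byte entries.
syms16 = plt_entry_symbols(0x1060, 16, relocs)
syms24 = plt_entry_symbols(0x1060, 24, relocs)

print({hex(a): s for a, s in syms16.items()})
print({hex(a): s for a, s in syms24.items()})
```

The first entry lands at 0x1060 either way, but every later entry is placed differently (0x1070 versus 0x1078 for the second one), so a tool with a hardcoded 16 would mislabel a section that uses larger entries.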

BTW, I want to remind readers that the subject of this email contains
"lazy binding". A 32-byte entry imposes more overhead on libcs that do
not support lazy binding. For musl, a single 6-byte indirect `jmpq`
instruction suffices, but of course `-fno-plt` may be a better
solution that avoids dealing with the PLT at all.

P.S. Don't get me wrong. The new security enhancement technology
attracts me. I've done a few ROP-style CTF pwn challenges in the past
and can imagine how useful IBT is, but I hope it introduces less
complexity to toolchains :)

--
宋方睿

H.J. Lu

Apr 3, 2019, 9:12:54 AM

to Fāng-ruì Sòng, x86-64-abi, Binutils

Sounds reasonable.

> I prefer .plt.sec to .splt because the convention is already used in
> several other places to assign fine-grained semantics to sections,
> e.g. .text.hot .text.startup .text.unlikely
>
> 2. Merge .plt and .plt.sec
>
> As I proposed at https://reviews.llvm.org/D59780#1451608 , since we
> don't emit the bnd prefix (0xf2) for MPX (dropped by GCC 9), we can
> merge .plt and .plt.sec entries as follows:
>
> 4 endbr64
> 6 jmpq *xxx(%rip) ; jump to the next endbr64 for lazy binding
> 4 endbr64
> 5 pushq ; relocation index
> 5 jmpq xxx ; direct jump to .plt
>
> This PLT entry takes 4+6+4+5+5=24 bytes (the indirect jmpq is ff 25
> plus a 4-byte displacement), and fits exactly in a 24-byte entry size
> if we aim for 8-byte alignment.
> Not having to deal with .plt.sec simplifies implementation of PLT-aware tools.
>
> (If MPX is ever revived (I am not sure how likely that is), the bnd
> prefixes before the two jmpq instructions will take another 2 bytes
> and the PLT entry will no longer fit in a 24-byte entry. We can expand
> it to 32 bytes then.)
>
> 3. Necessity of the second PLT
>
> It was raised in https://reviews.llvm.org/D58102 that splitting the
> instruction sequences into .plt and .plt.sec may improve code cache
> locality. As I understand it, in theory .plt.sec is hot while .plt is
> cold (each entry is used only on the first call). That being said, we
> have seen no evidence or benchmark results supporting the claim.
>

It may not show up in your benchmarks, but improving cache locality
is a good thing for overall system performance. We need every bit
of performance for CET.

> The other argument is that it provides compatibility with other tools
> that have a hardcoded PLT entry size of 16 bytes.
>
> I found that top-of-tree gdb/objdump cannot symbolize 16-byte .plt and
> .plt.sec entries without the bnd prefixes
> (https://reviews.llvm.org/D59780#1451608), so even if the entry size
> stays at 16, existing tools have to adopt new rules to symbolize PLT
> entries.
> Thus, we are causing trouble for existing tools whether or not we
> introduce the second PLT.
> Given the complexity of the second PLT, not having it might be
> better.

A single PLT is simpler to implement, but we designed the two-PLT scheme
with performance in mind. We implemented it many years ago, starting with
MPX. It shouldn't be changed just because it is "hard" to implement.

> BTW, I want to remind readers that the subject of this email contains
> "lazy binding". A 32-byte entry imposes more overhead on libcs that do
> not support lazy binding. For musl, a single 6-byte indirect `jmpq`
> instruction suffices, but of course `-fno-plt` may be a better
> solution that avoids dealing with the PLT at all.
>
> P.S. Don't get me wrong. The new security enhancement technology
> attracts me. I've done a few ROP-style CTF pwn challenges in the past
> and can imagine how useful IBT is, but I hope it introduces less
> complexity to toolchains :)
>

The two PLTs are a small piece of the CET run-time compared with the
kernel, GCC, and glibc, partly because ld has a very flexible PLT
framework that accommodates different PLT schemes with a lazy PLT (the
first PLT) and a non-lazy PLT (the second PLT).

--
H.J.
