
Re: x86-64: new CET-enabled PLT format proposal


H.J. Lu

Feb 27, 2022, 10:07:14 AM
to Rui Ueyama, Andi Kleen, x86-64-abi, Binutils
On Sat, Feb 26, 2022 at 7:19 PM Rui Ueyama via Binutils
<binu...@sourceware.org> wrote:
>
> Hello,
>
> I'd like to propose an alternative instruction sequence for the Intel
> CET-enabled PLT section. Compared to the existing one, the new scheme is
> simpler and more compact (16 bytes per PLT entry instead of 32) and does
> not require a separate second PLT section (.plt.sec).
>
> Here is the proposed code sequence:
>
> PLT0:
>
> f3 0f 1e fa // endbr64
> 41 53 // push %r11
> ff 35 00 00 00 00 // push GOT[1]
> ff 25 00 00 00 00 // jmp *GOT[2]
> 0f 1f 40 00 // nop
> 0f 1f 40 00 // nop
> 0f 1f 40 00 // nop
> 66 90 // nop
>
> PLTn:
>
> f3 0f 1e fa // endbr64
> 41 bb 00 00 00 00 // mov $namen_reloc_index, %r11d
> ff 25 00 00 00 00 // jmp *GOT[namen_index]

All PLT calls will have an extra MOV.

> GOT[namen_index] is initialized to the address of PLT0 for all PLT entries,
> so that when a PLT entry is called for the first time, control is passed to
> PLT0, which invokes the resolver function.
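>
> To make the two code paths concrete, here is a rough sketch of the
> control flow (illustrative; the resolver protocol is glibc's usual
> lazy binding):
>
> First call through foo's entry:
>
> endbr64 // PLTn
> mov $foo_reloc_index, %r11d // remember which relocation to resolve
> jmp *GOT[foo_index] // slot still points to PLT0
> endbr64 // PLT0
> push %r11 // relocation index for the resolver
> push GOT[1] // link map of this module
> jmp *GOT[2] // _dl_runtime_resolve patches GOT[foo_index],
> // then tail-calls the real foo
>
> Subsequent calls:
>
> endbr64 // PLTn
> mov $foo_reloc_index, %r11d // now unused, but %r11 is scratch anyway
> jmp *GOT[foo_index] // slot now points directly at foo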
>
> It uses %r11 as a scratch register. The x86-64 psABI explicitly allows PLT
> entries to clobber this register (*1), and the resolver function
> (_dl_runtime_resolve) already clobbers it.
>
> (*1) x86-64 psABI p.24 footnote 17: "Note that %r11 is neither required to be
> preserved, nor is it used to pass arguments. Making this register available as
> scratch register means that code in the PLT need not spill any registers when
> computing the address to which control needs to be transferred."
>
> FYI, this is the current CET-enabled PLT:
>
> PLT0:
>
> ff 35 00 00 00 00 // push GOT[1]
> f2 ff 25 e3 2f 00 00 // bnd jmp *GOT[2]
> 0f 1f 00 // nop
>
> PLTn in .plt:
>
> f3 0f 1e fa // endbr64
> 68 00 00 00 00 // push $namen_reloc_index
> f2 e9 e1 ff ff ff // bnd jmpq PLT0
> 90 // nop
>
> PLTn in .plt.sec:
>
> f3 0f 1e fa // endbr64
> f2 ff 25 ad 2f 00 00 // bnd jmpq *GOT[namen_index]
> 0f 1f 44 00 00 // nop
>
> In the proposed format, PLT0 is 32 bytes long and each entry is 16 bytes. In
> the existing format, PLT0 is 16 bytes and each entry is 32 bytes. Usually, we
> have many PLT entries but only one header, so in practice the proposed
> format is almost 50% smaller than the existing one.
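>
> To put numbers on it: with N entries, the existing format takes
> 16 + N*(16 + 16) bytes (.plt header, .plt entries, .plt.sec entries),
> while the proposed one takes 32 + N*16 bytes. For N = 1000 that is
> 32,016 vs. 16,032 bytes.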

Does it have any impact on performance? .plt.sec can be placed
in a different page from .plt.

> The proposed PLT does not use jump instructions with BND prefix, as Intel MPX
> has been deprecated.
>
> I already implemented the proposed scheme in my linker
> (https://github.com/rui314/mold) and it looks like it's working fine.
>
> Any thoughts?

I'd like to see visible performance improvements or new features in
a new PLT layout.

I cc'ed the x86-64 psABI mailing list.


--
H.J.

H.J. Lu

Feb 28, 2022, 7:05:12 PM
to Rui Ueyama, Moreira, Joao, Andi Kleen, x86-64-abi, Binutils
On Sun, Feb 27, 2022 at 7:46 PM Rui Ueyama <rui...@gmail.com> wrote:
>
> On Mon, Feb 28, 2022 at 12:07 AM H.J. Lu <hjl....@gmail.com> wrote:
> >
> > On Sat, Feb 26, 2022 at 7:19 PM Rui Ueyama via Binutils
> > <binu...@sourceware.org> wrote:
> > >
> > > Hello,
> > >
> > > I'd like to propose an alternative instruction sequence for the Intel
> > > CET-enabled PLT section. Compared to the existing one, the new scheme is
> > > simpler and more compact (16 bytes per PLT entry instead of 32) and does
> > > not require a separate second PLT section (.plt.sec).
> > >
> > > Here is the proposed code sequence:
> > >
> > > PLT0:
> > >
> > > f3 0f 1e fa // endbr64
> > > 41 53 // push %r11
> > > ff 35 00 00 00 00 // push GOT[1]
> > > ff 25 00 00 00 00 // jmp *GOT[2]
> > > 0f 1f 40 00 // nop
> > > 0f 1f 40 00 // nop
> > > 0f 1f 40 00 // nop
> > > 66 90 // nop
> > >
> > > PLTn:
> > >
> > > f3 0f 1e fa // endbr64
> > > 41 bb 00 00 00 00 // mov $namen_reloc_index, %r11d
> > > ff 25 00 00 00 00 // jmp *GOT[namen_index]
> >
> > All PLT calls will have an extra MOV.
>
> One extra load-immediate mov instruction is executed per function call
> through a PLT entry. The overhead is so tiny that I couldn't see any
> difference in real-world apps: neither a slowdown nor a visible
> improvement. I might be able to craft a microbenchmark that hammers PLT
> entries really hard in some pattern to show a difference, but I don't
> think that would make much sense. The size reduction is real, though.

I am aware that there are 2 other proposals to use R11 in the PLT/function
calls, but they introduce new features. I don't think we should use R11 in
the PLT without any real performance improvement.

> > I cc'ed the x86-64 psABI mailing list.
> >
> >
> > --
> > H.J.



--
H.J.

Joao Moreira

Mar 1, 2022, 4:17:02 AM
to H.J. Lu, Rui Ueyama, Moreira, Joao, Andi Kleen, x86-64-abi, Binutils, i...@maskray.me
(also replying to Fangrui, whose e-mail, for whatever reason, did not
come to this mailbox).

I can see the benefits of having 16-byte single PLT entries. Yet, the
R11 clobbering on every PLT transition is not amusing... If we want PLT
entries to have only 16 bytes and not have a .plt.sec section, maybe we
could try:

<plt_header>
pop %r11 // address pushed by the call in the entry
sub $plt_header, %r11 // distance from the start of the PLT
shr $0x4, %r11 // divide by the 16-byte entry size to get the index
push %r11 // relocation index for the resolver
jmp _dl_runtime_resolve_shstk_thunk

<foo>:
endbr64 // 4b
jmp *GOT[foo] // 6b
call plt_header // 5b

Here, each PLT entry is 16 bytes and pushes an address inside itself onto
the stack by calling plt_header. That address is then popped in plt_header
and turned into the relocation index by subtracting the PLT base from it
and dividing by 16. Then, the final step to make it shstk-compatible is
jumping to a special implementation of _dl_runtime_resolve (shstk_thunk),
which would contain the following snippet (similar to glibc's __longjmp):

testl $X86_FEATURE_1_SHSTK, %fs:FEATURE_1_OFFSET // is shstk active?
jz 1f
mov $1, %r11
incsspq %r11 // unwind the shadow-stack entry left by the call in the PLT
1:
jmp _dl_runtime_resolve

I don't think the above test fits alongside the other instructions in
plt_header if we want it to be 32 bytes at most, hence the suggestion to
have it as a _dl_runtime_resolve thunk. Another possibility is to resolve
the relocation to the special thunk only if shstk is in place; if not,
resolve it directly to _dl_runtime_resolve to avoid the extra resolving
overhead in the absence of shstk.
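
As a rough size estimate (encodings from memory, so take them with a
grain of salt): pop %r11 (2 bytes) + sub $imm32, %r11 (7) + shr (4) +
push %r11 (2) + jmp rel32 (5) already comes to 20 bytes, and the shstk
test sequence above would add roughly 30 more, so the combined version
clearly overflows a 32-byte header.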

I think this solves both the size and the dummy-mov overheads. The logic
is a bit more convoluted, but perhaps we can work on making it simpler.
FWIW, I did not test or implement anything.

Ah, also, pardon any asm mistakes/obvious details that I may have missed
:)

Joao Moreira

Mar 1, 2022, 4:28:00 AM
to Rui Ueyama, H.J. Lu, Moreira, Joao, Andi Kleen, x86-64-abi, Binutils, i...@maskray.me
> This is what I tried first, but I then realized that I needed to insert
> another `endbr` between `jmp` and `call`. With CET enabled, `jmp *GOT[foo]`
> can only land on an `endbr`, so it can't directly jump to the following
> `call`.
>
Ugh, there we go... dead. Thanks for not letting me waste a ton of time
:)
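
For the record, a fixed-up entry would need endbr64 (4 bytes) + jmp (6) +
endbr64 (4) + call (5) = 19 bytes, so it would blow the 16-byte budget
anyway.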

Joao Moreira

Mar 1, 2022, 4:45:24 AM
to Rui Ueyama, H.J. Lu, Moreira, Joao, Andi Kleen, x86-64-abi, Binutils, i...@maskray.me
On 2022-03-01 01:32, Rui Ueyama wrote:
> I actually wasted my time by implementing it only to find that it
> wouldn't work. :) If you are interested, this is my commit to my
> linker.
> https://github.com/rui314/mold/commit/4ec0bbf04841e514aca2000f3d780d14efcaefc9

I'm glad I posted it here before trying to go and implement :)

Regarding the projects mentioned by HJ, I assume one of them is this (in
case you are curious):

https://static.sched.com/hosted_files/lssna2021/8f/LSS_FINEIBT_JOAOMOREIRA.pdf

In FineIBT we use R11 to pass hashes around through direct calls to
enable fine-grained CFI on top of IBT.
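
To give a flavor of the mechanism (a minimal sketch based on the slides;
the hash value and the violation-handler label below are illustrative
assumptions, not the actual FineIBT ABI):

// caller side: load the prototype hash before the control transfer
mov $0x89abcdef, %r11d // illustrative 32-bit prototype hash
call *%rax // lands on the endbr64 below

// callee side: coarse-grained IBT landing pad, then the hash check
foo:
endbr64
cmp $0x89abcdef, %r11d // caller's hash must match the prototype
jne .Lcfi_violation // illustrative violation handler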

Florian Weimer

Mar 1, 2022, 5:35:17 AM
to H.J. Lu, Rui Ueyama, Andi Kleen, x86-64-abi, Binutils
I do wonder if time would be better spent on making symbol binding faster
in general and eliminating the semantic difference between BIND_NOW and
lazy binding (like musl has done, albeit in an IFUNC-less context).

An example of the current performance issues:

ld.so has poor performance characteristics when loading large
quantities of .so files
<https://sourceware.org/bugzilla/show_bug.cgi?id=27695>

I'm not suggesting we bring back prelink. There must be other
approaches to make binding go faster.

Thanks,
Florian
