Re: Relocations to use when eliding plts

H.J. Lu

May 28, 2015, 7:27:29 AM
to Richard Henderson, IA32 System V Application Binary Interface, x86-6...@googlegroups.com, g...@gcc.gnu.org, Binutils, libc-alpha
Adding ia32/x86-64 psABI.

On Wed, May 27, 2015 at 5:44 PM, H.J. Lu <hjl....@gmail.com> wrote:
> On Wed, May 27, 2015 at 1:03 PM, Richard Henderson <r...@redhat.com> wrote:
>> There's one problem with the couple of patches that I've seen go by wrt eliding
>> PLTs with -z now, and relaxing inlined PLTs (aka -fno-plt):
>>
>> They're currently using the same relocations used by data, and thus the linker
>> and dynamic linker must ensure that pointer equality is maintained. Which
>> results in branch-to-branch-(to-branch) situations.
>>
>> E.g. the attached test case, in which main has a plt entry for function A in
>> a.so, and the function B in b.so calls A.
>>
>> $ LD_BIND_NOW=1 gdb main
>> ...
>> (gdb) b b
>> Breakpoint 1 at 0x400540
>> (gdb) run
>> Starting program: /home/rth/x/main
>> Breakpoint 1, b () at b.c:2
>> 2 void b(void) { a(); }
>> (gdb) si
>> 2 void b(void) { a(); }
>> => 0x7ffff7bf75f4 <b+4>: callq 0x7ffff7bf74e0
>> (gdb)
>> 0x00007ffff7bf74e0 in ?? () from ./b.so
>> => 0x7ffff7bf74e0: jmpq *0x20034a(%rip) # 0x7ffff7df7830
>> (gdb)
>> 0x0000000000400560 in a@plt ()
>> => 0x400560 <a@plt>: jmpq *0x20057a(%rip) # 0x600ae0
>> (gdb)
>> a () at a.c:2
>> 2 void a() { printf("Hello, World!\n"); }
>> => 0x7ffff7df95f0 <a>: sub $0x8,%rsp
>>
>>
>> If we use -fno-plt, we eliminate the first callq, but do still have two
>> consecutive jmpq's.

You get consecutive jmpq's because x86 PLT entry is used as the
canonical function address. If you compile main with -fno-plt -fPIE, you
get:

(gdb) b b
Breakpoint 1 at 0x7ffff7bf75f0: file b.c, line 4.
(gdb) r
Starting program: /export/home/hjl/bugs/binutils/pr18458/main

Breakpoint 1, b () at b.c:4
4 {
(gdb) si
5 a();
(gdb)
a () at a.c:4
4 {
(gdb)

>> It seems to me that we ought to have different relocations when we're only
>> going to use a pointer for branching, and when we need a pointer to be
>> canonicalized for pointer comparisons.
>>
>> In the linked image, we already have these: R_X86_64_GLOB_DAT vs
>> R_X86_64_JUMP_SLOT. Namely, GLOB_DAT implies "data" (and therefore pointer
>> equality), while JUMP_SLOT implies "code" (and therefore we can resolve past
>> plt stubs in the main executable).
>>
>> Which means that HJ's patch of May 16 (git hash 25070364), is less than ideal.
>> I do like the smaller PLT entries, but I don't like the fact that it now emits
>> GLOB_DAT for the relocations instead of JUMP_SLOT.
>
> ld.so just does whatever is arranged by ld. I am not sure changing ld.so
> is a good idea. I don't know what kind of optimization we can do when a
> function is called and its address is taken.
>
>>
>> In the relocatable image, when we're talking about -fno-plt, we should think
>> about what relocation we'd like to emit. Yes, the existing R_X86_64_GOTPCREL
>> works with existing toolchains, and there's something to be said for that.
>> However, if we're talking about adding a new relocation for relaxing an
>> indirect call via GOTPCREL, then:
>>
>> If we want -fno-plt to be able to hoist function addresses, then we're going to
>> want the address that we load for the call to also not be subject to possible
>> jump-to-jump.
>>
>> Unless we want the linker to do an unreasonable amount of x86 code examination
>> in order to determine mov vs call for relaxation, we need two different
>> relocations (preferably using the same assembler mnemonic, and thus the correct
>> relocation is enforced by the assembler).
>>
>> On the users/hjl/relax branch (and posted on list somewhere), the new
>> relocation is called R_X86_64_RELAX_GOTPCREL. I'm not keen on that "relax"
>> name, despite that being exactly what it's for.
>>
>> I suggest R_X86_64_GOTPLTPCREL_{CALL,LOAD} for the two relocation names. That
>> is, the address is in the .got.plt section, it's a pc-relative relocation, and
>> it's being used by a call or load (mov) insn.
>
> Since it is used for indirect call, how about R_X86_64_INBR_GOTPCREL?
>
> I updated users/hjl/relax branch to convert the relocation in *foo@GOTPCREL(%rip)
> from R_X86_64_GOTPCREL to R_X86_64_RELAX_GOTPCREL so that
> existing assembly code works automatically with a new binutils.
>
>> With those two, we can fairly easily relax call/jmp to direct branches, and mov
>> to lea. Yes, LTO can perform the same optimization, but I'll also agree that
>> there are many projects for which LTO is both overkill and unworkable.
>>
>> This does leave open other optimization questions, mostly around weak
>> functions. Consider constructs like
>>
>> if (foo) foo();
>>
>> Do we, within the compiler, try to CSE GOTPCREL and GOTPLTPCREL, accepting the
>> possibility (not certainty) of jump-to-jump but definitely avoiding a separate
>> load insn and the latency implied by that?
>>
>>
>> Comments?
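
For readers following along, the weak-function construct quoted above can be written out as a short C sketch. The names optional_hook and call_hook_if_present are hypothetical, and the weak attribute is a GCC/Clang extension that is assumed here to target ELF. The point is only that the same symbol is used once as data (the null test, which needs the canonical address) and once as a branch target, which is where the question of sharing one GOT load comes from.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical optional hook.  Declared weak so the program still links
   when no object defines it; its address is then null at run time.  */
extern void optional_hook(void) __attribute__((weak));

/* Calls the hook if it exists and counts the call in *calls.
   Returns 1 if the hook was called, 0 if it was absent.  */
int call_hook_if_present(int *calls)
{
    assert(calls != NULL);
    if (optional_hook) {      /* use as data: address comparison */
        optional_hook();      /* use as code: branch target */
        ++*calls;
        return 1;
    }
    return 0;
}
```

With no definition of optional_hook in the link, the test is false and the call is skipped.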

Here is the new proposal to add R_X86_64_INDBR_GOTPCREL and
R_386_INDBR_GOT32. Compared with the last proposal, I used
_INDBR_ instead of _RELAX_, and I also used the same assembler
mnemonic. Since only indirect branch instructions take a single
JumpAbsolute operand, it is quite easy to generate the INDBR relocation
for indirect branches.


H.J.
-----
To avoid indirect branch to internal functions, I am proposing to add a
new relocation, R_X86_64_INDBR_GOTPCREL, to x86-64 psABI:

1. When branching to an external function, foo, toolchain generates
       call/jmp *foo@GOTPCREL(%rip)
   with R_X86_64_INDBR_GOTPCREL relocation, instead of
       call/jmp foo[@PLT]
2. When function foo is locally defined, linker converts
       call/jmp *foo@GOTPCREL(%rip)
   to
       nop call/jmp foo
3. Otherwise, linker treats R_X86_64_INDBR_GOTPCREL the same way as
   R_X86_64_GOTPCREL.
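
As a rough illustration of step 2, here is a minimal C sketch of the byte-level rewrite a linker could perform. It is not the binutils implementation. The function name relax_indirect_branch is hypothetical, and the choice of the addr32 prefix (0x67) as the one-byte pad is an assumption, made because the objdump runs in this thread count addr32 prefixes. The opcodes themselves (FF 15 and FF 25 for the RIP-relative indirect forms, E8 and E9 for the direct forms) are the standard x86-64 encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    OP_GROUP5      = 0xff, /* first byte of an indirect call/jmp */
    MODRM_CALL_RIP = 0x15, /* call *disp32(%rip) */
    MODRM_JMP_RIP  = 0x25, /* jmp  *disp32(%rip) */
    OP_CALL_REL32  = 0xe8, /* call rel32 */
    OP_JMP_REL32   = 0xe9, /* jmp  rel32 */
    PREFIX_PAD     = 0x67  /* assumption: addr32 prefix used as the pad */
};

/* Rewrite the 6-byte indirect branch at insn, which will run at address
   insn_addr, into a 1-byte pad plus a 5-byte direct branch to
   target_addr.  Returns 1 on success, 0 if the bytes are not a
   recognized indirect branch or the target is out of rel32 range.  */
int relax_indirect_branch(uint8_t *insn, uint64_t insn_addr,
                          uint64_t target_addr)
{
    uint8_t direct_opcode;
    int64_t delta;
    uint32_t rel;

    assert(insn != NULL);
    if (insn[0] != OP_GROUP5)
        return 0;
    if (insn[1] == MODRM_CALL_RIP)
        direct_opcode = OP_CALL_REL32;
    else if (insn[1] == MODRM_JMP_RIP)
        direct_opcode = OP_JMP_REL32;
    else
        return 0;

    /* rel32 is relative to the end of the instruction, and the total
       length stays 6 bytes, so the end address does not move.  */
    delta = (int64_t)(target_addr - (insn_addr + 6));
    if (delta < INT32_MIN || delta > INT32_MAX)
        return 0;
    rel = (uint32_t)(int32_t)delta;

    insn[0] = PREFIX_PAD;
    insn[1] = direct_opcode;
    insn[2] = (uint8_t)(rel & 0xff);          /* little-endian rel32 */
    insn[3] = (uint8_t)((rel >> 8) & 0xff);
    insn[4] = (uint8_t)((rel >> 16) & 0xff);
    insn[5] = (uint8_t)((rel >> 24) & 0xff);
    return 1;
}
```

Because the instruction length is unchanged, no other code or relocation in the section has to move, which is why the cost is limited to the one pad byte.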

For i386 psABI, we add R_386_INDBR_GOT32:

1. When branching to an external function, foo, in non-PIC mode,
   toolchain generates
       call/jmp *foo@GOT
   with R_386_INDBR_GOT32 relocation, instead of
       call/jmp foo
   and in PIC mode
       call/jmp *foo@GOT(%reg)
   with R_386_INDBR_GOT32 relocation and REG holds the address
   of GOT, instead of
       call/jmp foo@PLT
2. When function foo is locally defined, linker converts
       call/jmp *foo@GOT[(%reg)]
   to
       nop call/jmp foo
3. Otherwise,
   a. In PIC mode, linker treats R_386_INDBR_GOT32 the same way as
      R_386_GOT32 and "call/jmp *foo@GOT" is unsupported.
   b. In non-PIC mode, linker computes its relocation value as relocation
      value of R_386_GOT32 plus the address of GOT and converts
          call/jmp *foo@GOT(%reg)
      to
          call/jmp *foo@GOT
      if needed.
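
Step 3b can be sketched the same way. The C function below is an illustration only, with a hypothetical name (make_got_branch_absolute) and hypothetical inputs (got_base for the address of the GOT, got_offset for the value an R_386_GOT32 relocation would produce). It rewrites the register-relative form FF /2 or FF /4 with a disp32 into the absolute-address form and stores the sum as the new displacement. A %esp base register is not handled, because that encoding needs an extra SIB byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert "call/jmp *disp32(%reg)" into "call/jmp *abs32" in 32-bit
   mode.  Returns 1 on success, 0 if the bytes are not recognized.  */
int make_got_branch_absolute(uint8_t *insn, uint32_t got_base,
                             uint32_t got_offset)
{
    uint8_t modrm, abs_modrm;
    uint32_t value;

    assert(insn != NULL);
    if (insn[0] != 0xff)
        return 0;
    modrm = insn[1];
    if ((modrm & 0x07) == 4)          /* %esp base needs a SIB byte */
        return 0;
    if ((modrm & 0xf8) == 0x90)       /* call: mod=10, reg=/2 */
        abs_modrm = 0x15;             /* call *abs32: mod=00, rm=101 */
    else if ((modrm & 0xf8) == 0xa0)  /* jmp: mod=10, reg=/4 */
        abs_modrm = 0x25;             /* jmp *abs32: mod=00, rm=101 */
    else
        return 0;

    /* Relocation value: GOT-relative value plus the address of GOT.  */
    value = got_base + got_offset;

    insn[1] = abs_modrm;
    insn[2] = (uint8_t)(value & 0xff);          /* little-endian abs32 */
    insn[3] = (uint8_t)((value >> 8) & 0xff);
    insn[4] = (uint8_t)((value >> 16) & 0xff);
    insn[5] = (uint8_t)((value >> 24) & 0xff);
    return 1;
}
```

Both forms are 6 bytes long, so this conversion also leaves the surrounding code layout untouched.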

This new relocation effectively turns off lazy binding on function, foo.

For i386, compiler is free to choose any register to hold the address of
GOT and there is no need to make EBX a fixed register when branching to
an external function in PIC mode.

With this new relocation, only a one-byte NOP prefix overhead is added
when function, foo, which compiler determines is external, turns out to
be local at link-time, because of -Bsymbolic or a definition in another
input object file which compiler has no knowledge of.

The new -fno-plt GCC option can use R_X86_64_INDBR_GOTPCREL and
R_386_INDBR_GOT32 relocations if linker supports them to avoid indirect
branch to internal functions.

For x86-64 GCC, it is implemented in assembler and linker. Assembler should
generate R_X86_64_INDBR_GOTPCREL relocation, instead of
R_X86_64_GOTPCREL relocation, for “call/jmp *foo@GOTPCREL(%rip)”.

For i386 GCC, most is implemented in assembler and linker. Assembler should
generate R_386_INDBR_GOT32 relocation, instead of R_386_GOT32 relocation,
for “call/jmp *foo@GOT(%reg)”. GCC also needs to be modified to generate
“call/jmp *foo@GOT” in non-PIC mode.

H.J. Lu

May 28, 2015, 11:42:46 AM
to Richard Henderson, IA32 System V Application Binary Interface, x86-6...@googlegroups.com, g...@gcc.gnu.org, Binutils, libc-alpha
On Thu, May 28, 2015 at 8:29 AM, Richard Henderson <r...@redhat.com> wrote:
> On 05/28/2015 04:27 AM, H.J. Lu wrote:
>> You get consecutive jmpq's because x86 PLT entry is used as the
>> canonical function address. If you compile main with -fno-plt -fPIE, you
>> get:
>
> Well, duh. If the main executable has no PLTs, they aren't used as the
> canonical function address. Surely you aren't proposing that as a solution?
>

I was just explaining where those consecutive jmpq's came from.
I wasn't suggesting a solution.


--
H.J.

H.J. Lu

May 28, 2015, 12:09:07 PM
to Jakub Jelinek, Richard Henderson, IA32 System V Application Binary Interface, x86-6...@googlegroups.com, g...@gcc.gnu.org, Binutils, libc-alpha
On Thu, May 28, 2015 at 9:02 AM, Jakub Jelinek <ja...@redhat.com> wrote:
> On Thu, May 28, 2015 at 08:52:28AM -0700, Richard Henderson wrote:
>> I did explain it. In the quite long message.
>>
>> No comments about the rest of it, wherein I suggest a solution that doesn't
>> require the main executable to be compiled with -fno-plt in order to avoid them?
>
> And even that wouldn't help, you'd need to compile the binaries with -fpie -fno-plt,
> as -fno-plt doesn't affect normal non-PIC calls.
>

Funny you should mention it. Here is a patch to extend -fno-plt
to normal non-PIC calls. 64-bit works with the current binutils. 32-bit
only works with the users/hjl/relax branch. I need to add a configure test
to enable it for 32-bit.


--
H.J.
---
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index e77cd04..db7ce3d 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -25611,7 +25611,22 @@ ix86_output_call_insn (rtx_insn *insn, rtx call_op)
   if (SIBLING_CALL_P (insn))
     {
       if (direct_p)
-        xasm = "%!jmp\t%P0";
+        {
+          if (!flag_plt
+              && !flag_pic
+              && !TARGET_MACHO
+              && !TARGET_SEH
+              && !TARGET_PECOFF)
+            {
+              /* Avoid PLT. */
+              if (TARGET_64BIT)
+                xasm = "%!jmp\t*%p0@GOTPCREL(%%rip)";
+              else
+                xasm = "%!jmp\t*%p0@GOT";
+            }
+          else
+            xasm = "%!jmp\t%P0";
+        }
       /* SEH epilogue detection requires the indirect branch case
          to include REX.W. */
       else if (TARGET_SEH)
@@ -25654,7 +25669,22 @@ ix86_output_call_insn (rtx_insn *insn, rtx call_op)
     }

   if (direct_p)
-    xasm = "%!call\t%P0";
+    {
+      if (!flag_plt
+          && !flag_pic
+          && !TARGET_MACHO
+          && !TARGET_SEH
+          && !TARGET_PECOFF)
+        {
+          /* Avoid PLT. */
+          if (TARGET_64BIT)
+            xasm = "%!call\t*%p0@GOTPCREL(%%rip)";
+          else
+            xasm = "%!call\t*%p0@GOT";
+        }
+      else
+        xasm = "%!call\t%P0";
+    }
   else
     xasm = "%!call\t%A0";
H.J. Lu

May 29, 2015, 1:59:58 PM
to Richard Henderson, Rich Felker, Jakub Jelinek, Richard Henderson, IA32 System V Application Binary Interface, x86-6...@googlegroups.com, g...@gcc.gnu.org, Binutils, libc-alpha
On Fri, May 29, 2015 at 8:38 AM, Richard Henderson <r...@twiddle.net> wrote:
> On 05/28/2015 01:36 PM, Rich Felker wrote:
>> On Thu, May 28, 2015 at 09:40:57PM +0200, Jakub Jelinek wrote:
>>> On Thu, May 28, 2015 at 03:29:02PM -0400, Rich Felker wrote:
>>>>> You're not missing anything. But do you want the performance of a
>>>>> library to depend on how the main executable is compiled?
>>>>
>>>> Not directly. But I'd rather be in that situation than have
>>>> pessimizations in library codegen to avoid it. I'm worried about cases
>>>> where code both loads the address of a function and calls it, such as
>>>> this (stupid) example:
>>>>
>>>> a((void *)a);
>>>
>>> That can be handled by using just one GOT slot, the non-.got.plt one;
>>> only if there are only relocations that guarantee that address equality is
>>> not important it would use the faster (*_JUMP_SLOT?) relocations.
>>
>> How far would this extend, e.g. in the case of LTO or compiling the
>> whole library at once?
>
> It depends on how difficult that becomes, I suppose. It's certainly something
> that we can look for during LTO.
>
> I did in fact mention this exact point in the original message:
>
>> This does leave open other optimization questions, mostly around weak
>> functions. Consider constructs like
>>
>> if (foo) foo();
>>
>> Do we, within the compiler, try to CSE GOTPCREL and GOTPLTPCREL, accepting the
>> possibility (not certainty) of jump-to-jump but definitely avoiding a separate
>> load insn and the latency implied by that?
>
> As a last resort the two can always be unified at static link time, so that
> only one got slot is created, and only one runtime relocation exists. At which
> point we'd still have two loads in the insn stream. But barring preemption,
> the second load will be from cache and cost a single cycle.
>
> So which is less likely, this double-use of a function pointer, or a non-PIE
> executable?
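
To make the double-use case from the quoted discussion concrete, here is a minimal C sketch of a function whose address is both passed as data and used as a call target, as in a((void *)a). The wrapper call_with_own_address and the flag last_arg_was_self are hypothetical names added for illustration. Converting a function pointer to void * is not strictly ISO C, but POSIX systems support it. In a single file the comparison is trivially true; across shared objects it holds only because every module resolves the address of a to the same canonical value, which is the constraint under discussion.

```c
#include <assert.h>
#include <stddef.h>

static int last_arg_was_self;

/* Records whether the pointer received as data equals the function's
   own address.  */
void a(void *arg)
{
    assert(arg != NULL);
    last_arg_was_self = (arg == (void *)a);
}

/* Uses the symbol twice: once as data (the argument) and once as a
   branch target (the call).  Returns 1 if the two uses agree.  */
int call_with_own_address(void)
{
    a((void *)a);
    return last_arg_was_self;
}
```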

Can you try hjl/no-plt branch in GCC git mirror with -fno-plt?
I got

[hjl@gnu-6 pr18458]$ make
/export/build/gnu/gcc/build-x86_64-linux/gcc/xgcc
-B/export/build/gnu/gcc/build-x86_64-linux/gcc -O2 -g -fno-plt -c -o
main.o main.c
/export/build/gnu/gcc/build-x86_64-linux/gcc/xgcc
-B/export/build/gnu/gcc/build-x86_64-linux/gcc -O2 -g -fno-plt -fpic
-c -o a.o a.c
/export/build/gnu/gcc/build-x86_64-linux/gcc/xgcc
-B/export/build/gnu/gcc/build-x86_64-linux/gcc -O2 -g -fno-plt
-Wl,-z,now -shared -o a.so a.o
/export/build/gnu/gcc/build-x86_64-linux/gcc/xgcc
-B/export/build/gnu/gcc/build-x86_64-linux/gcc -O2 -g -fno-plt -fpic
-c -o b.o b.c
/export/build/gnu/gcc/build-x86_64-linux/gcc/xgcc
-B/export/build/gnu/gcc/build-x86_64-linux/gcc -O2 -g -fno-plt
-Wl,-z,now -shared -o b.so b.o a.so
/export/build/gnu/gcc/build-x86_64-linux/gcc/xgcc
-B/export/build/gnu/gcc/build-x86_64-linux/gcc -Wl,-rpath=. -Wl,-z,now
-o main main.o a.so b.so
./main
PASS
[hjl@gnu-6 pr18458]$ readelf -r main

Relocation section '.rela.dyn' at offset 0x4b0 contains 4 entries:
  Offset          Info           Type           Sym. Value    Sym. Name + Addend
000000600a20  000200000006 R_X86_64_GLOB_DAT 0000000000000000 b + 0
000000600a28  000500000006 R_X86_64_GLOB_DAT 0000000000000000 __libc_start_main@GLIBC_2.2.5 + 0
000000600a30  000600000006 R_X86_64_GLOB_DAT 0000000000000000 __gmon_start__ + 0
000000600a38  000800000006 R_X86_64_GLOB_DAT 0000000000000000 a + 0
[hjl@gnu-6 pr18458]$ gdb main
GNU gdb (GDB) Fedora 7.7.1-21.fc20
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from main...done.
(gdb) r
Starting program: /export/home/hjl/bugs/binutils/pr18458/main
PASS
[Inferior 1 (process 10663) exited normally]
Missing separate debuginfos, use: debuginfo-install glibc-2.18-19.2.fc20.x86_64
(gdb) b b
Breakpoint 1 at 0x7ffff7bf75f0: file b.c, line 5.
(gdb) r
Starting program: /export/home/hjl/bugs/binutils/pr18458/main

Breakpoint 1, b () at b.c:5
5 a();
(gdb) si
a () at a.c:5
5 printf("PASS\n");
(gdb)


--
H.J.

H.J. Lu

May 29, 2015, 3:35:28 PM
to Richard Henderson, Rich Felker, Jakub Jelinek, Richard Henderson, IA32 System V Application Binary Interface, x86-6...@googlegroups.com, g...@gcc.gnu.org, Binutils, libc-alpha
I built GCC with -fno-plt on hjl/no-plt branch with binutils users/hjl/relax
branch. I got

[hjl@gnu-mic-2 gcc]$ objdump -dw cc1plus | grep addr32 | wc -l
204864
[hjl@gnu-mic-2 gcc]$ objdump -dw cc1plus | grep jmpq | grep %rip | wc -l
877
[hjl@gnu-mic-2 gcc]$ objdump -dw cc1plus | grep callq | grep %rip | wc -l
20099
[hjl@gnu-mic-2 gcc]$

Relocation section '.rela.plt' at offset 0x199c68 contains 50 entries:

Those come from archives which aren't compiled with -fno-plt.

Without -fno-plt:

gnu-13:pts/19[5]> objdump -dw cc1plus | grep callq | grep %rip | wc -l
2083
gnu-13:pts/19[6]> objdump -dw cc1plus | grep jmpq | grep %rip | wc -l
603
gnu-13:pts/19[7]> objdump -dw cc1plus | grep addr32 | wc -l
0

Relocation section '.rela.plt' at offset 0x196f90 contains 514 entries:

--
H.J.