[RFC] RISC-V ELF FDPIC psABI addendum

Maciej W. Rozycki

Mar 1, 2020, 3:57:10 AM
to sw-...@groups.riscv.org
Hi,

I am currently working on FDPIC support for RISC-V/Linux targets in the
GNU toolchain (GCC and GNU binutils) and a couple of runtimes (uClibc,
musl, possibly glibc). While at this time I only intend to implement the
pieces I named above, the psABI extension I am going to base this stuff on
is meant to become a part of the RISC-V ELF psABI, once proved with the
implementation, available for everyone to suit their requirements.

Therefore below I am sending a preliminary document that specifies the
technical details of the extension in hope someone finds it useful or
would like to comment on it at this early stage of development.

This design was originally presented at LCA 2020 and a recording is
available here: <https://www.youtube.com/watch?v=GydyykyNjxs>.

I will appreciate your questions, comments and any other kind of
feedback.

Maciej

--------------------------------------------------------------------------
RISC-V FDPIC ELF psABI Addendum


Chapter 4 Object Files

4.1 Machine Information (normative)

A bit in the `e_flags' member of the ELF header shall identify, when set,
a file that conforms to this ABI:

#define EF_RISCV_FDPIC 0x0010
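
For illustration only (this is not part of the proposal), a loader or a
binary utility could test the flag roughly as follows. The helper name
`riscv_elf_is_fdpic' is hypothetical; only EF_RISCV_FDPIC and its value are
taken from the definition above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag value as proposed in section 4.1. */
#define EF_RISCV_FDPIC 0x0010

/* Hypothetical helper: true if the `e_flags' word of an ELF header marks
   the file as conforming to the FDPIC ABI.  Other flag bits are ignored. */
static bool riscv_elf_is_fdpic(uint32_t e_flags)
{
    return (e_flags & EF_RISCV_FDPIC) != 0;
}
```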


4.2 Relocation Types (normative)

The following relocation types have been defined to support this ABI.

Figure 4.1 Relocatable Fields, Relocated Bits Marked With X's

 15 12          0   15             0
+----+-----------+ +----------------+
|XXXX|           | |XXXXXXXXXXXXXXXX|
+----+-----------+ +----------------+
 hi20[15:12]        hi20[31:16]

 15             0   15         4   0
+----------------+ +------------+---+
|                | |XXXXXXXXXXXX|   |
+----------------+ +------------+---+
                    lo12i[11:0]

 15  11  7      0   15    9        0
+---+-----+------+ +-------+--------+
|   |XXXXX|      | |XXXXXXX|        |
+---+-----+------+ +-------+--------+
 lo12s[4:0]         lo12s[11:5]

 31                             0
+--------------------------------+
|XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX|
+--------------------------------+
 word32[31:0]

 15 12          0   15             0   15             0   15         4   0
+----+-----------+ +----------------+ +----------------+ +------------+---+
|XXXX|           | |XXXXXXXXXXXXXXXX| |                | |XXXXXXXXXXXX|   |
+----+-----------+ +----------------+ +----------------+ +------------+---+
 hi20lo12i[15:12]   hi20lo12i[31:16]                      hi20lo12i[11:0]

 63                                                             0
+----------------------------------------------------------------+
|XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX|
+----------------------------------------------------------------+
 word64[63:0]

Note: For the value inserted into these fields T specifies truncation and
V specifies a signed overflow check on a relocation by relocation basis.
In the T case any high-order bits that extend beyond the width of the
field and are not equal to the highest-order bit that still fits are
silently ignored. In the V case the presence of such high-order bits
causes the static linker to produce a link error.
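
The difference between the T and V cases can be sketched in C as below.
This is one reading of the note, not text from the proposal: all function
names are hypothetical, and the hi20/lo12 split shown is the usual RISC-V
convention in which the low part is sign-extended and the high part
compensates for it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Low 12 bits of v, sign-extended, as consumed by an I-type or S-type
   instruction. */
static int64_t lo12_signed(int64_t v)
{
    int64_t lo = (int64_t)((uint64_t)v & 0xfff);
    return lo >= 0x800 ? lo - 0x1000 : lo;
}

/* Upper part that pairs with lo12_signed().  (v - lo) is a multiple of
   4096, so the division is exact. */
static int64_t hi20_signed(int64_t v)
{
    return (v - lo12_signed(v)) / 4096;
}

/* T field: bits that do not fit are silently dropped. */
static uint32_t field_t_lo12(int64_t v)
{
    return (uint32_t)((uint64_t)v & 0xfff);
}

/* V field: report whether the hi20 part fits in 20 signed bits.  A static
   linker would produce a link error when this returns false. */
static bool field_v_hi20_fits(int64_t v)
{
    int64_t hi = hi20_signed(v);
    return hi >= -(INT64_C(1) << 19) && hi < (INT64_C(1) << 19);
}
```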

Table 4.1 Relocation Operands

 Operand | Description
=========+================================================================
 A       | Relocation addend.
---------+----------------------------------------------------------------
 DBA     | Data segment's base address; 0 in static link.
---------+----------------------------------------------------------------
 G       | The offset from GP of a GOT entry for the symbol referred by
         | the relocation.
---------+----------------------------------------------------------------
 GP      | The value of GP associated with the symbol referred, nominally
         | (DVMA + DBA + 2048).
---------+----------------------------------------------------------------
 P       | The place (offset or address) of the storage unit affected by
         | the relocation.
---------+----------------------------------------------------------------
 PLTE    | The address of a PLT entry associated with the symbol referred.
---------+----------------------------------------------------------------
 PLTI    | The address of a PLT entry designated to make indirect calls.
---------+----------------------------------------------------------------
 S       | The value of the symbol referred by the relocation.
---------+----------------------------------------------------------------
 TBA     | Text segment's base address; 0 in static link.

Table 4.2 Relocation Types

 Name                     | Value | Field       | Symbol    | Calculation
==========================+=======+=============+===========+=============
 R_RISCV_RELATIVE         |   3   | T-word32,64 | n/a       | TBA + A
 R_RISCV_REL_TEXT (alias) |       |             |           |
--------------------------+-------+-------------+-----------+-------------
 R_RISCV_GP               |  12   | T-word32,64 | any       | GP
--------------------------+-------+-------------+-----------+-------------
 R_RISCV_REL_DATA         |  13   | T-word32,64 | n/a       | DBA + A
==========================+=======+=============+===========+=============
                          |       |             | local     | S - P
 R_RISCV_CALL_PLT         |  19   | V-hi20lo12i | external  | PLTE - P
                          |       |             | n/a       | PLTI - P
--------------------------+-------+-------------+-----------+-------------
 R_RISCV_GPREL_HI20       |  59   | V-hi20      | local     | S - GP + A
--------------------------+-------+-------------+-----------+-------------
 R_RISCV_GPREL_LO12_I     |  60   | T-lo12i     | local     | S - GP + A
--------------------------+-------+-------------+-----------+-------------
 R_RISCV_GPREL_LO12_S     |  61   | T-lo12s     | local     | S - GP + A
--------------------------+-------+-------------+-----------+-------------
 R_RISCV_GPREL_GOT_HI20   |  62   | V-hi20      | any       | G
--------------------------+-------+-------------+-----------+-------------
 R_RISCV_GPREL_GOT_LO12_I |  63   | T-lo12i     | any       | G

Local symbols are never preempted and therefore they can be addressed
with relative addressing in PIC code. For text symbols PC-relative
addressing can be used both in ordinary PIC and FDPIC code and therefore
the same relocations are used in both cases.

PC-relative addressing cannot however be used in FDPIC code for data
symbols as the relative position of text and data with respect to each
other is not fixed and therefore a separate global pointer (GP) has to be
maintained. This ABI designates the x3 register to hold the value of the
GP and defines gp as an alias ABI name of this register. This register
is used to access local data using direct GP-relative addressing.

The R_RISCV_GPREL_HI20, R_RISCV_GPREL_LO12_I and R_RISCV_GPREL_LO12_S
static relocations are defined to support direct GP-relative addressing
suitable for local data access.

External symbols can be preempted and therefore have to be addressed
indirectly. The Global Offset Table (GOT) is used to hold the addresses
of external data symbols. GOT itself is local data and can therefore be
accessed with GP-relative addressing.

The R_RISCV_GPREL_GOT_HI20 and R_RISCV_GPREL_GOT_LO12_I static
relocations are defined to support indirect GP-relative addressing
suitable for external data access.

Occasionally a GOT entry will be created for local data to satisfy the
use of R_RISCV_GPREL_GOT_HI20 and R_RISCV_GPREL_GOT_LO12_I relocations in
code referring to such data. The R_RISCV_REL_DATA dynamic relocation is
defined to support GP-relative relocation of such GOT entries at program
load time.
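
As a rough sketch of how a dynamic loader might evaluate the three dynamic
relocations in Table 4.2, assuming it keeps the text base, data base and GP
of each module in a record: the structure and function names here are
hypothetical, and for R_RISCV_GP the record passed in is taken to be that of
the module providing the symbol referred.

```c
#include <assert.h>
#include <stdint.h>

/* Relocation numbers as proposed in Table 4.2. */
enum {
    R_RISCV_RELATIVE = 3,   /* alias: R_RISCV_REL_TEXT */
    R_RISCV_GP       = 12,
    R_RISCV_REL_DATA = 13
};

/* Hypothetical per-module record kept by the loader. */
struct fdpic_module {
    uintptr_t tba;  /* text segment's base address */
    uintptr_t dba;  /* data segment's base address */
    uintptr_t gp;   /* GP value of the module */
};

/* Value to store at the relocated place for the given relocation type.
   Returns 0 for types this sketch does not handle. */
static uintptr_t fdpic_reloc_value(unsigned type, uintptr_t addend,
                                   const struct fdpic_module *m)
{
    switch (type) {
    case R_RISCV_RELATIVE: return m->tba + addend;  /* TBA + A */
    case R_RISCV_GP:       return m->gp;            /* GP */
    case R_RISCV_REL_DATA: return m->dba + addend;  /* DBA + A */
    default:               return 0;
    }
}
```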


4.3 Procedure Calls (normative)

Local procedure calls use the same code sequence as with ordinary PIC
code. PC-relative addressing can be used as all code locations are fixed
with respect to each other and the address is not interpreted beyond
making the jump itself. GP does not change in the process of making a
local procedure call as control remains in the same module.

External calls need to set both the PC and the GP at the same time. This is
because external symbols can be preempted, in which case a call will pass
control to another module, which will usually require access to its local
data.

A data structure called Function Descriptor Table (FDT) is created by the
static linker to hold PC/GP pairs used in external procedure calls.
Addresses of individual FDT entries serve as pointers to the respective
procedures. An FDT entry is therefore created for each function symbol
that is external, whether defined or not, or whose address is taken for
a purpose other than making a call.

As the ultimate values of the PC and the GP are only determined at load
time the static linker attaches dynamic relocations to data in the FDT.
For external function symbols the R_RISCV_JUMP_SLOT and R_RISCV_GP
relocations are used for the PC and GP respectively, both referring to
the function symbol. For local function symbols whose address is taken
the R_RISCV_REL_TEXT and R_RISCV_GP relocations are used with no symbol
referred.

Figure 4.2 Function Descriptor Table

                                FDT          Outstanding dynamic relocations
  __riscv_fdt_func1 ---> +------------------+
                         | Text Pointer 1   | R_RISCV_JUMP_SLOT func1
                         +------------------+
                         | Global Pointer 1 | R_RISCV_GP        func1
  __riscv_fdt_func2 ---> +==================+
                         | Text Pointer 2   | R_RISCV_JUMP_SLOT func2
                         +------------------+
                         | Global Pointer 2 | R_RISCV_GP        func2
  __riscv_fdt_func3 ---> +==================+
                         | Text Pointer 3   | R_RISCV_REL_TEXT
                         +------------------+
                         | Global Pointer 3 | R_RISCV_GP
                         +==================+
                         |      . . .       |
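
Viewed from C, an FDT entry is simply a pair of pointer-sized words. The
sketch below is illustrative only; the type and helper names are
hypothetical, and the helper shows the local-function case, where the text
pointer is relocated by the text base address and the global pointer is the
module's own GP.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One FDT entry as drawn in Figure 4.2: a PC/GP pair. */
struct fdt_entry {
    uintptr_t text;  /* Text Pointer: entry point of the function */
    uintptr_t gp;    /* Global Pointer: GP of the providing module */
};

/* Hypothetical helper: fill the entry for a local function whose address
   is taken, i.e. the R_RISCV_REL_TEXT / R_RISCV_GP pair with no symbol
   referred.  `link_addr' is the link-time address of the function. */
static void fdt_fill_local(struct fdt_entry *e, uintptr_t tba,
                           uintptr_t link_addr, uintptr_t module_gp)
{
    e->text = tba + link_addr;
    e->gp = module_gp;
}
```

The global pointer sits one pointer-sized word after the text pointer,
which is where the [4|8] offset in PLT code comes from.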

A Procedure Linkage Table (PLT) is created to handle calls via the FDT,
so that the same code sequence is used in the program proper to make
direct procedure calls regardless of whether the function symbol called
is local or external. Since the PLT is local to the module its entries
can be reached with PC-relative addressing. Individual PLT entries are
created and called into for each external procedure called.

For direct calls an FDT entry is used that corresponds to the procedure
called and has been created in the module making the call. Therefore
code in the PLT can access the FDT entry directly as local data, using
GP-relative addressing.

For indirect calls the PLT is also used and an FDT entry is used that
corresponds to the procedure called and has been created in the module
providing the function symbol of the procedure.

If a function symbol is local, then the GP-relative address of the FDT
entry is directly used by the static linker as the value retrieved in
taking a function's address.

If a function symbol is external, then an external dynamic data symbol is
created that refers to that FDT entry and whose name is constructed by
prepending `__riscv_fdt_' to the function's symbol name.

If the address of an external function symbol is taken, then a GOT entry
is created for the corresponding `__riscv_fdt_' dynamic data symbol and
used to satisfy the reference.
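
The naming rule for these data symbols can be sketched as follows. The
function name `fdt_symbol_name' is hypothetical; only the `__riscv_fdt_'
prefix comes from the text above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the name of the FDT data symbol for function `func' in `buf'.
   Returns 0 on success, or -1 if the name does not fit in `size' bytes. */
static int fdt_symbol_name(char *buf, size_t size, const char *func)
{
    int n = snprintf(buf, size, "__riscv_fdt_%s", func);
    return (n >= 0 && (size_t)n < size) ? 0 : -1;
}
```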

When making an indirect call a dedicated PLT entry is used that is common
to all indirect calls and upon invocation of that PLT entry the x5 (t0)
register holds the address of the FDT entry in the module providing the
function symbol of the procedure to call.

Since the GP is different for each module the value held in the x3 (gp)
register can change in the course of making a procedure call. Therefore
under the FDPIC calling convention the x3 (gp) register is considered
call-clobbered and it has to be preserved by the caller when making a
call to an external function symbol unless it is known that the call does
not return or that the GP is no longer referred after the return from the
procedure called. A stack slot typically has to be allocated and
initialized in a function's prologue to preserve the x3 (gp) register
across calls.


4.4 Typical Code Sequences (informative)

In the sequences below expressions on the right-hand side of relocation
names denote the symbol and the addend specified with the relocation. In
the absence of a `+' operator only a symbol is specified, otherwise the
left-hand side of the addition is a symbol and the right-hand side is an
addend. If a symbol is specified as `*ABS*', then the value is 0 (the
symbol index is STN_UNDEF in the relocation). The value of ABS() is the
absolute (static-link-time) value of the expression in the parentheses.

4.4.1 Local Data Addressing

Ordinary PIC code, using PC-relative addressing:

# Outstanding static relocations
label:
auipc t0, %pcrel_hi(var+addend) # R_RISCV_PCREL_HI20 var+addend
lbu t1, %pcrel_lo(label)(t0) # R_RISCV_PCREL_LO12_I label
sb t2, %pcrel_lo(label)(t0) # R_RISCV_PCREL_LO12_S label

Corresponding FDPIC code, using GP-relative addressing:

# Outstanding static relocations
lui t0, %gprel_hi(var+addend) # R_RISCV_GPREL_HI20 var+addend
c.add t0, gp
lbu t1, %gprel_lo(var+addend)(t0) # R_RISCV_GPREL_LO12_I var+addend
sb t2, %gprel_lo(var+addend)(t0) # R_RISCV_GPREL_LO12_S var+addend


4.4.2 External Data Addressing

Ordinary PIC code, using GOT and PC-relative addressing:

# Outstanding static relocations
label:
auipc t0, %pcrel_got_hi(var) # R_RISCV_GOT_HI20 var
l[w|d] t0, %pcrel_lo(label)(t0) # R_RISCV_PCREL_LO12_I label
lb t1, addend(t0)
sb t2, addend(t0)

# Outstanding dynamic relocations for the GOT entry
# R_RISCV_32,64 var

# or if the data symbol turns out local at static link time
# R_RISCV_REL_DATA *ABS*+ABS(var)

Corresponding FDPIC code, using GOT and GP-relative addressing:

# Outstanding static relocations
lui t0, %gprel_got_hi(var) # R_RISCV_GPREL_GOT_HI20 var
c.add t0, gp
l[w|d] t0, %gprel_got_lo(var)(t0) # R_RISCV_GPREL_GOT_LO12_I var
lbu t1, addend(t0)
sb t2, addend(t0)

# Outstanding dynamic relocations for the GOT entry
# R_RISCV_32,64 var

# or if the data symbol turns out local at static link time
# R_RISCV_REL_DATA *ABS*+ABS(var)


4.4.3 Taking a Function's Address

FDPIC code, local function:

# Outstanding static relocations
lui t0, %gprel_hi(fun) # R_RISCV_GPREL_HI20 fun
c.add t0, gp
addi t1, t0, %gprel_lo(fun) # R_RISCV_GPREL_LO12_I fun

FDPIC code, external function:

# Outstanding static relocations
lui t0, %gprel_got_hi(fun) # R_RISCV_GPREL_GOT_HI20 fun
c.add t0, gp
addi t1, t0, %gprel_got_lo(fun) # R_RISCV_GPREL_GOT_LO12_I fun

# Outstanding dynamic relocations for the GOT entry
# R_RISCV_32,64 __riscv_fdt_fun

# or if the function symbol turns out local at static link time
# R_RISCV_REL_DATA *ABS*+ABS(__riscv_fdt_fun)


4.4.4 Procedure Calls Using the PLT

FDPIC code, direct call:

# Outstanding static relocations
label:
auipc ra, %pcrel_call_hi(fun@PLT) # R_RISCV_CALL_PLT fun
jalr ra, ra, %pcrel_call_lo(label)
l[w|d] gp, <gp_slot>(sp)

FDPIC code, indirect call (to a2):

# Outstanding static relocations
c.mv t0, a2
label:
auipc ra, %pcrel_call_hi(@PLT) # R_RISCV_CALL_PLT
jalr ra, ra, %pcrel_call_lo(label)
l[w|d] gp, <gp_slot>(sp)

# The R_RISCV_CALL_PLT relocation with no symbol referred resolves to
# the PLT entry associated with indirect calls.


Chapter 5 Program Loading

5.1 Base Addresses (normative)

A single individual base address is defined by the ELF gABI for a module
being loaded that determines the amount to relocate the module by. This
is unsuitable for FDPIC modules, which need to have their text segments
and data segments mapped in memory separately. This is so that where a
module is mapped multiple times in a no-MMU system, only a single copy of
its text segments is present in memory and serves all the mappings, while
a separate copy of its data segments is present in memory for each of the
mappings. Consequently the distance between text and data segments is no
longer constant between mappings and there is no single base address.

Instead separate text and data base addresses are defined, each as the
difference between the load address and the link address of the text
segment and the data segment respectively. These two base addresses are
used by the dynamic loader to relocate text and data respectively.
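
A small worked example of the two base addresses, under the assumption that
a loader records the link and load address of each segment. All names and
addresses are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical record for one mapped segment. */
struct segment {
    uintptr_t link_addr;  /* address assigned by the static linker */
    uintptr_t load_addr;  /* address the segment was actually mapped at */
};

/* Base address: load address minus link address (modular arithmetic, so
   it also works when the segment is loaded below its link address). */
static uintptr_t base_address(const struct segment *s)
{
    return s->load_addr - s->link_addr;
}

/* Relocate a link-time address by a base address. */
static uintptr_t relocate(uintptr_t link_time_addr, uintptr_t base)
{
    return link_time_addr + base;
}
```

With one shared text mapping and two data mappings of the same module, the
text base is the same for both while the data bases differ, so no single
base address could describe either mapping.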

In the initial module, such as a program interpreter, loaded by an OS or
other executive runtime the text base address of said initial module can
be determined by calculating a run-time difference between the actual
value of the PC for a given location, such as the beginning of the text
segment, obtained with a PC-relative reference to a symbol associated
with that location and the value of a corresponding absolute symbol
associated with the same location. The way to determine the data base
address and therefore the value of GP of the initial module is specific
to the individual OS or other executive runtime and therefore beyond the
scope of this specification. Possibilities include passing suitable
information via the initial stack, such as in the auxiliary vector,
preinitializing a processor register, providing a system call to retrieve
it, etc.

The presence of a separate text base address and a data base address also
means that ET_EXEC images cannot be supported with the FDPIC psABI as it
is not possible to make multiple copies of such image's data segment in a
no-MMU system without the ability to relocate it at load time.


5.2 Lazy Binding (normative)

Lazy binding can be optionally implemented by the dynamic loader. If it
is implemented, then the run-time relocation of R_RISCV_JUMP_SLOT and
their associated R_RISCV_GP relocations present in the FDT is done in two
stages.

In the first stage, which is done by the dynamic loader at the time a
module is loaded, R_RISCV_JUMP_SLOT and R_RISCV_GP relocations are
resolved respectively to the address of the lazy resolver and the value
of the global pointer associated with the module providing the lazy
resolver.

In the second stage, which is done when the lazy resolver is reached by
means of making a call through an FDT entry referring to it,
R_RISCV_JUMP_SLOT and R_RISCV_GP relocations are resolved respectively
to the address of the function symbol associated with the FDT entry and
the value of the global pointer associated with the module providing the
function symbol. To be able to do its work the lazy resolver is called
with certain registers containing values as follows:

* x3 (gp) holds the dynamic loader's GP value as with an ordinary FDT
entry (this is a consequence of the first stage of run-time relocation)

* x5 (t0) holds a pointer to the FDT entry to relocate

* x6 (t1) holds the caller's GP value

Registers have been assigned so as to work with the RV32E instruction
set as well.

Upon completion of the second stage the lazy resolver makes a jump to the
newly resolved address of the function symbol.
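
The two stages can be modelled in C as updates to one FDT entry. This is a
sketch under stated assumptions, not the proposed implementation: the type
and function names are hypothetical and symbol lookup is left out, so the
resolved PC and GP are simply passed in.

```c
#include <assert.h>
#include <stdint.h>

/* PC/GP pair held in an FDT entry. */
struct fdt_entry {
    uintptr_t text;
    uintptr_t gp;
};

/* Stage 1, at module load time: point the entry at the lazy resolver and
   at the GP of the module providing the resolver. */
static void lazy_bind_stage1(struct fdt_entry *e, uintptr_t resolver_pc,
                             uintptr_t resolver_gp)
{
    e->text = resolver_pc;
    e->gp = resolver_gp;
}

/* Stage 2, inside the resolver: `e' is the pointer received in x5 (t0).
   Store the final PC/GP pair and return the address to jump to. */
static uintptr_t lazy_bind_stage2(struct fdt_entry *e, uintptr_t func_pc,
                                  uintptr_t func_gp)
{
    e->text = func_pc;
    e->gp = func_gp;
    return func_pc;
}
```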


5.3 Example PLT Code (informative)

@PLT:
        l[w|d]  t2, 0(t0)
        mv      t1, gp
        l[w|d]  gp, [4|8](t0)
        jr      t2
fun1@PLT:
        lui     t0, %gprel_hi(FDT[fun1])
        addi    t0, t0, %gprel_lo(FDT[fun1])
        add     t0, t0, gp
        j       @PLT
fun2@PLT:
        lui     t0, %gprel_hi(FDT[fun2])
        addi    t0, t0, %gprel_lo(FDT[fun2])
        add     t0, t0, gp
        j       @PLT
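
The effect of the common @PLT entry on the register state can be modelled
in C as below. This is only a model for reasoning about the sequence: the
register file structure and the function name are hypothetical, and x5 (t0)
is assumed to already hold the address of the FDT entry.

```c
#include <assert.h>
#include <stdint.h>

/* PC/GP pair held in an FDT entry. */
struct fdt_entry {
    uintptr_t text;
    uintptr_t gp;
};

/* The subset of machine state touched by the @PLT entry. */
struct regs {
    uintptr_t pc, gp, t0, t1, t2;
};

/* Mirror the four instructions at @PLT, one statement each. */
static void plt_indirect(struct regs *r)
{
    const struct fdt_entry *e = (const struct fdt_entry *)r->t0;

    r->t2 = e->text;  /* l[w|d] t2, 0(t0)     */
    r->t1 = r->gp;    /* mv     t1, gp        */
    r->gp = e->gp;    /* l[w|d] gp, [4|8](t0) */
    r->pc = r->t2;    /* jr     t2            */
}
```

After the entry has run, t1 still holds the caller's GP and t0 the FDT
entry pointer, which matches the register contract given for the lazy
resolver in section 5.2.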

Jim Wilson

Mar 6, 2020, 7:51:53 PM
to Maciej W. Rozycki, RISC-V SW Dev
On Sun, Mar 1, 2020 at 12:57 AM Maciej W. Rozycki <ma...@wdc.com> wrote:
> I will appreciate your questions, comments and any other kind of
> feedback.

The style is different from the existing psABI, though it looks like a
better style. Maybe you could rewrite our existing psABI to improve
it?

This was mentioned in the RISC-V software meeting, and in a
riscv-elf-psabi-doc issue, so if there are others with opinions they
should comment soon. And if not, I think we should just go forward
with this plan.

> ---------+----------------------------------------------------------------
> GP | The value of GP associated with the symbol referred, nominally
> | (DVMA + DBA + 2048).

This uses DVMA without defining it.

> --------------------------+-------+-------------+-----------+-------------
> R_RISCV_GPREL_LO12_I | 60 | T-lo12i | local | S - GP + A

This is identical to the existing R_RISCV_GPREL_I reloc.

> --------------------------+-------+-------------+-----------+-------------
> R_RISCV_GPREL_LO12_S | 61 | T-lo12s | local | S - GP + A

This is identical to the existing R_RISCV_GPREL_S reloc.

Currently, the R_RISCV_GPREL_I and R_RISCV_GPREL_S can only be created
by linker relaxation, so we don't have assembler support for them, and
this is maybe also why the names are a little different than what you
expect.

> Corresponding FDPIC code, using GP-relative addressing:
>
> # Outstanding static relocations
> lui t0, %gprel_hi(var+addend) # R_RISCV_GPREL_HI20 var+addend
> c.add t0, gp
> lbu t1, %gprel_lo(var+addend)(t0) # R_RISCV_GPREL_LO12_I var+addend
> sb t2, %gprel_lo(var+addend)(t0) # R_RISCV_GPREL_LO12_S var+addend

Not all targets have compressed instructions. The assembler will
convert regular instructions to compressed instructions if it can, so
using add instead of c.add is more general with no code size
optimization loss.

For relaxation purposes, there should be a reloc on the add, so it should be
add t0,t0,gp,%gprel_add(var+addend)
With this extra reloc, if %gprel_hi(var+addend) is zero, then we can
relax the three instruction sequence for the load to one instruction,
deleting the first two, and modifying the load to
lbu t1,%gprel_lo(var+addend)(gp)
and likewise for the store.

See for instance the tprel_add reloc used for TLS which works the same
way. There is an example in the psABI doc.
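
A minimal sketch of the condition a linker could test before doing this
relaxation, assuming the usual hi20/lo12 split with a sign-extended low
part. The function name is hypothetical and this is not taken from any
existing linker.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if a GP-relative offset fits the signed 12-bit immediate of a load
   or store.  In that case the %gprel_hi part is zero, so the lui and the
   add can be deleted and gp used as the base register directly. */
static bool gprel_relaxable(int64_t offset)
{
    return offset >= -2048 && offset <= 2047;
}
```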

> auipc t0, %pcrel_got_hi(var) # R_RISCV_GOT_HI20 var

This should be %got_pcrel_hi(var). It was first added to llvm, and
then just added to GNU Binutils this week. It is already mentioned in
riscv-asm-manual, but needs to be mentioned in the psABI. That is on
my todo list.

> lui t0, %gprel_got_hi(var) # R_RISCV_GPREL_GOT_HI20 var
> c.add t0, gp
> l[w|d] t0, %gprel_got_lo(var)(t0) # R_RISCV_GPREL_GOT_LO12_I var
> lbu t1, addend(t0)

As above, adding a reloc, e.g. %gprel_got_add, to the add makes this relaxable.

Jim

Stef O'Rear

Mar 7, 2020, 10:48:56 PM
to Maciej W. Rozycki, RISC-V SW Dev, i...@maskray.me, dal...@aerifal.cx
On Sunday, March 1, 2020 at 3:57:10 AM UTC-5, Maciej W. Rozycki wrote:
> Hi,
>
> I am currently working on FDPIC support for RISC-V/Linux targets in the
> GNU toolchain (GCC and GNU binutils) and a couple of runtimes (uClibc,
> musl, possibly glibc). While at this time I only intend to implement the
> pieces I named above, the psABI extension I am going to base this stuff on
> is meant to become a part of the RISC-V ELF psABI, once proved with the
> implementation, available for everyone to suit their requirements.
>
> Therefore below I am sending a preliminary document that specifies the
> technical details of the extension in hope someone finds it useful or
> would like to comment on it at this early stage of development.
>
> This design has been originally presented at LCA 2020 and a recording is
> available here: <https://www.youtube.com/watch?v=GydyykyNjxs>.
>
> I will appreciate your questions, comments and any other kind of
> feedback.

I've discussed this proposal with Rich Felker and Fangrui Song (CCed)
in #musl; the following comments are exclusively mine.

The register usage is exactly what I had in mind, and most of the code
sequences seem approximately fine (several are not), but the relocation
structure is extremely different from the other existing FDPIC ABIs[1][2][3],
in a way which will make it difficult to support in generic code such as musl;
I believe the ABI should be made as consistent as possible to avoid surprises
like what we went through with TLS copy relocs.

[1]: http://ftp.redhat.com/pub/redhat/gnupro/FRV/FDPIC-ABI.txt
[2]: https://j-core.org/downloads/fdpic-sh.txt
[3]: https://github.com/mickael-guene/fdpic_doc/blob/master/abi.txt

None of the SH, FRV, or ARM FDPIC ABIs define anything equivalent to REL_DATA
or GP. Why is it there?

"Data segment base address" does not seem to be defined anywhere?

> ==========================+=======+=============+===========+=============
> | | | local | S - P
> R_RISCV_CALL_PLT | 19 | V-hi20lo12i | external | PLTE - P
> | | | n/a | PLTI - P

None of the SH, FRV, or ARM ABIs use anything like PLTI.

> --------------------------+-------+-------------+-----------+-------------
> R_RISCV_GPREL_HI20 | 59 | V-hi20 | local | S - GP + A
> --------------------------+-------+-------------+-----------+-------------
> R_RISCV_GPREL_LO12_I | 60 | T-lo12i | local | S - GP + A
> --------------------------+-------+-------------+-----------+-------------
> R_RISCV_GPREL_LO12_S | 61 | T-lo12s | local | S - GP + A
> --------------------------+-------+-------------+-----------+-------------
> R_RISCV_GPREL_GOT_HI20 | 62 | V-hi20 | any | G
> --------------------------+-------+-------------+-----------+-------------
> R_RISCV_GPREL_GOT_LO12_I | 63 | T-lo12i | any | G

The GPREL and GPREL_GOT relocations look correct. We also need assembler
syntax for them, and to decide whether they are %functions or @MODIFIERS.

We also need R_RISCV_FUNCDESC (canonical function descriptor),
R_RISCV_FUNCDESC_VALUE (copy of function descriptor),
R_RISCV_GPREL_GOTFUNCDESC_(HI20, LO12_I) (offset within GOT of a pointer-sized
slot which will receive a pointer to the canonical function descriptor),
R_RISCV_GPREL_FUNCDESC_(HI20, LO12).

R_RISCV_FUNCDESC and R_RISCV_FUNCDESC_VALUE are dynamic relocations.
Why do you need REL_DATA when ARM, FRV, and SH don't?

> 4.3 Procedure Calls (normative)
>
> Local procedure calls use the same code sequence as with ordinary PIC
> code. PC-relative addressing can be used as all code locations are fixed
> with respect to each other and the address is not interpreted beyond
> making the jump itself. GP does not change in the process of making a
> local procedure call as control remains in the same module.

Should clarify that while GP does not change as part of the call instruction
itself, the called procedure is allowed to clobber GP (this is necessary for
external tail calls).

> External calls need to set the PC and the GP both at a time. This is
> because external symbols can be preempted, in which case a call will pass
> control to another module, which will usually require access to its local
> data.
>
> A data structure called Function Descriptor Table (FDT) is created by the
> static linker to hold PC/GP pairs used in external procedure calls.
> Addresses of individual FDT entries serve as pointers to the respective
> procedures. An FDT entry is therefore created for each function symbol
> that is external, whether defined or not, or whose address is taken for
> a purpose other than making a call.

Canonical function descriptors are created by the *dynamic* linker, not ld,
and they exist outside of any load segment (except possibly when static
linking). Every function which is referred to gets a single canonical
function descriptor. Other FDPIC ABIs don't use the "FDT" term and I
think it detracts from clarity to use it here.

R_arch_FUNCDESC_VALUE can create a copy of a function descriptor at any
two-word aligned address in the load segment, but there is no "descriptor
table" as a cohesive entity.

> As the ultimate values of the PC and the GP are only determined at load
> time the static linker attaches dynamic relocations to data in the FDT.
> For external function symbols the R_RISCV_JUMP_SLOT and R_RISCV_GP
> relocations are used for the PC and GP respectively, both referring to
> the function symbol. For local function symbols whose address is taken
> the R_RISCV_REL_TEXT and R_RISCV_GP relocations are used with no symbol
> referred.

Every other FDPIC ABI uses a R_ARCH_FUNCDESC_VALUE relocation to fill in both
words of a function descriptor copy at once.

>
> Figure 4.2 Function Description Table
>
> FDT Outstanding dynamic relocations
> __riscv_fdt_func1 ---> +------------------+
> | Text Pointer 1 | R_RISCV_JUMP_SLOT func1
> +------------------+
> | Global Pointer 1 | R_RISCV_GP func1
> __riscv_fdt_func2 ---> +==================+
> | Text Pointer 2 | R_RISCV_JUMP_SLOT func2
> +------------------+
> | Global Pointer 2 | R_RISCV_GP func2
> __riscv_fdt_func3 ---> +==================+
> | Text Pointer 3 | R_RISCV_REL_TEXT
> +------------------+
> | Global Pointer 3 | R_RISCV_GP
> +==================+
> | . . . |

Again, this is gratuitously different from what every other arch does.

Other arches use 1 relocation per function descriptor copy, and they don't
create duplicate symbols.

> A Procedure Linkage Table (PLT) is created to handle calls via the FDT,
> so that the same code sequence is used in the program proper to make
> direct procedure calls regardless of whether the function symbol called
> is local or external. Since the PLT is local to the module its entries
> can be reached with PC-relative addressing. Individual PLT entries are
> created and called into for each external procedure called.
>
> For direct calls an FDT entry is used that corresponds to the procedure
> called and has been created in the module making the call. Therefore
> code in the PLT can access the FDT entry directly as local data, using
> GP-relative addressing.

Again, "FDT" is misleading about how function descriptors are created.

> For indirect calls the PLT is also used and an FDT entry is used that
> corresponds to the procedure called and has been created in the module
> providing the function symbol of the procedure.

This seems a bad idea and gratuitously different from every other FDPIC ABI.
Other FDPIC ABIs use code at the call site for indirect calls. If you are
doing this for code size reasons, a compiler generated function in a
.gnu.linkonce section is a much better idea because it does not create an ABI
constraint.

> If a function symbol is local, then the GP-relative address of the FDT
> entry is directly used by the static linker as the value retrieved in
> taking a function's address.

> If a function symbol is external, then an external dynamic data symbol is
> created that refers to that FDT entry and whose name is constructed by
> prepending `__riscv_fdt_' to the function's symbol name.

This is gratuitously different from other FDPIC ABIs, which use *FUNCDESC*
relocations to generate function descriptors.

It is also very inefficient since it doubles the number of symbols and symbol
names in a library.

> If the address of an external function symbol is taken, then a GOT entry
> is created for the corresponding `__riscv_fdt_' dynamic data symbol and
> used to satisfy the reference.

The compiler should generate an @GOTFUNCDESC reference and the linker should
generate a R_RISCV_FUNCDESC relocation, not create a new symbol.

> When making an indirect call a dedicated PLT entry is used that is common
> to all indirect calls and upon invocation of that PLT entry the x5 (t0)
> register holds the address of the FDT entry in the module providing the
> function symbol of the procedure to call.

No other FDPIC ABI does this.

This is good, subject to Jim's point about add vs c.add.

> 4.4.2 External Data Addressing
>
> Ordinary PIC code, using GOT and PC-relative addressing:
>
> # Outstanding static relocations
> label:
> auipc t0, %pcrel_got_hi(var) # R_RISCV_GOT_HI20 var
> l[w|d] t0, %pcrel_lo(label)(t0) # R_RISCV_PCREL_LO12_I label
> lb t1, addend(t0)
> sb t2, addend(t0)

> # Outstanding dynamic relocations for the GOT entry
> # R_RISCV_32,64 var

So far so good

> # or if the data symbol turns out local at static link time
> # R_RISCV_REL_DATA *ABS*+ABS(var)

I don't think this actually works, for one thing var might be in rodata, there
could also be multiple data segments. I don't see anything like REL_DATA in
other FDPIC ABIs, I think it always has to be R_RISCV_{32,64}, or whatever the
other arches do.

> Corresponding FDPIC code, using GOT and GP-relative addressing:
>
> # Outstanding static relocations
> lui t0, %gprel_got_hi(var) # R_RISCV_GPREL_GOT_HI20 var
> c.add t0, gp
> l[w|d] t0, %gprel_got_lo(var)(t0) # R_RISCV_GPREL_GOT_LO12_I var
> lbu t1, addend(t0)
> sb t2, addend(t0)
>
> # Outstanding dynamic relocations for the GOT entry
> # R_RISCV_32,64 var
>
> # or if the function turns out local at static link time
> # R_RISCV_REL_DATA *ABS*+ABS(var)

Code looks good, same concern about REL_DATA.

> 4.4.3 Taking a Function's Address
>
> FDPIC code, local function:
>
> # Outstanding static relocations
> lui t0, %gprel_hi(fun) # R_RISCV_GPREL_HI20 fun
> c.add t0, gp
> addi t1, t0, %gprel_lo(fun) # R_RISCV_GPREL_LO12_I fun
>
> FDPIC code, external function:
>
> # Outstanding static relocations
> lui t0, %gprel_got_hi(fun) # R_RISCV_GPREL_GOT_HI20 fun
> c.add t0, gp
> addi t1, t0, %gprel_got_lo(fun) # R_RISCV_GPREL_GOT_LO12_I fun

These are, unfortunately, not compatible with dynamic linking semantics. A
function needs to have the same address regardless of which module its address
is taken in, so you have to always get the canonical function descriptor, which
has to come from the GOT because canonical function descriptors are created by
the dynamic linker. This should be something like (same for both local and
external):

lui t0, %gprel_got_hi(fun@FUNCDESC) # R_RISCV_GPREL_GOTFUNCDESC_HI20 fun
add t0, t0, gp
l[w|d] t0, %gprel_got_lo(fun@FUNCDESC)(t0) # R_RISCV_GPREL_GOTFUNCDESC_LO12 fun

eventually resulting in dynamic relocations for the GOT entry:

R_RISCV_FUNCDESC fun
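
To illustrate why the descriptor has to be canonical, here is a minimal C sketch of a loader-side lookup that hands out one descriptor per (entry point, GP) pair, so that a function's address compares equal no matter which module took it. The `funcdesc` layout, the fixed-size table and the name `canonical_funcdesc` are assumptions made for this illustration only; they are not part of the proposal or of any existing dynamic loader.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical function descriptor: entry point plus the callee's GP. */
struct funcdesc {
    uintptr_t entry;
    uintptr_t gp;
};

#define MAX_DESCS 64

static struct funcdesc desc_table[MAX_DESCS];
static size_t desc_count;

/* Return the single canonical descriptor for (entry, gp), creating it on
 * first use.  Returns NULL if the illustrative table is full. */
const struct funcdesc *canonical_funcdesc(uintptr_t entry, uintptr_t gp)
{
    for (size_t i = 0; i < desc_count; i++) {
        if (desc_table[i].entry == entry && desc_table[i].gp == gp)
            return &desc_table[i];
    }
    if (desc_count == MAX_DESCS)
        return NULL;
    desc_table[desc_count].entry = entry;
    desc_table[desc_count].gp = gp;
    return &desc_table[desc_count++];
}
```

Two modules resolving the same function through this lookup receive the same pointer, which is what a GOT entry relocated with something like R_RISCV_FUNCDESC would hold.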

> # Outstanding dynamic relocations for the GOT entry
> # R_RISCV_32,64 __riscv_fdt_fun
>
> # or if the function symbol turns out local at static link time
> # R_RISCV_REL_DATA *ABS*+ABS(__riscv_fdt_fun)
>
>
> 4.4.4 Procedure Calls Using the PLT
>
> FDPIC code, direct call:
>
> # Outstanding static relocations
> label:
> auipc ra, %pcrel_call_hi(fun@PLT) # R_RISCV_CALL_PLT fun
> jalr ra, ra, %pcrel_call_lo(label)
> l[w|d] gp, <gp_slot>(sp)

This is the same as the local call case and looks correct.

> FDPIC code, indirect call (to a2):
>
> # Outstanding static relocations
> c.mv t0, a2
> label:
> auipc ra, %pcrel_call_hi(@PLT) # R_RISCV_CALL_PLT
> jalr ra, ra, %pcrel_call_lo(label)
> l[w|d] gp, <gp_slot>(sp)
>
> # The R_RISCV_CALL_PLT relocation with no symbol referred resolves to
> # the PLT entry associated with indirect calls.

As above I don't think it makes sense to handle this as a PLT entry. The call
should be generated inline:

lw t1, 0(a2)
lw gp, 4(a2)
jalr ra, t1
lw gp, <gp_slot>(sp)

> Chapter 5 Program Loading
>
> 5.1 Base Addresses (normative)
>
> A single individual base address is defined by the ELF gABI for a module
> being loaded that determines the amount to relocate the module by. This
> is unsuitable for FDPIC modules, which need to have their text segments
> and data segments mapped in memory separately. This is so that where a
> module is mapped multiple times in a no-MMU system, only a single copy of
> its text segments is present in memory and serves all the mappings, while
> a separate copy of its data segments is present in memory for each of the
> mappings. Consequently the distance between text and data segments is no
> longer constant between mappings and there is no single base address.
>
> Instead a separate text base address and a data base address is defined
> as a difference between the load address and the link address of the text
> segment and the data segment respectively. These two base addresses are
> used by the dynamic loader to relocate text and data respectively.

FDPIC does not have a "data base address"; there are one or more load segments,
relocated independently using a load map.

> In the initial module, such as a program interpreter, loaded by an OS or
> other executive runtime the text base address of said initial module can
> be determined by calculating a run-time difference between the actual
> value of the PC for a given location, such as the beginning of the text
> segment, obtained with a PC-relative reference to a symbol associated
> with that location and the value of a corresponding absolute symbol
> associated with the same location. The way to determine the data base
> address and therefore the value of GP of the initial module is specific
> to the individual OS or other executive runtime and therefore beyond the
> scope of this specification. Possibilities include passing suitable

Every other FDPIC ABI has a normative Start up section that specifies how
Linux will pass an elf32_fdpic_loadmap struct; it's in scope here.

Note that the Linux FDPIC support currently has 32-bit assumptions and
64-bit FDPIC will need to be documented here, much as the FRV ABI
supplement defined 32-bit FDPIC ptrace calls.
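
As a rough illustration of relocating through a load map rather than a single base address, the C sketch below translates a link-time address by finding the load segment that contains it. The struct layout is modelled loosely on the 32-bit Linux load map mentioned above; the type names, the fixed-size `segs` array and the function `loadmap_reloc` are assumptions for this example, not a definition of the kernel interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative mirror of a 32-bit FDPIC load map (names are assumptions). */
struct loadseg {
    uint32_t addr;     /* address the segment was actually loaded at */
    uint32_t p_vaddr;  /* link-time address from the program header */
    uint32_t p_memsz;  /* size of the segment in memory */
};

struct loadmap {
    uint16_t version;
    uint16_t nsegs;
    struct loadseg segs[4];  /* fixed size here; variable in practice */
};

/* Translate a link-time address to its run-time address by locating the
 * segment that contains it.  Returns 0 if no segment matches. */
uint32_t loadmap_reloc(const struct loadmap *map, uint32_t vaddr)
{
    for (size_t i = 0; i < map->nsegs; i++) {
        const struct loadseg *seg = &map->segs[i];
        /* Unsigned wrap-around makes this one comparison a range check. */
        if (vaddr - seg->p_vaddr < seg->p_memsz)
            return seg->addr + (vaddr - seg->p_vaddr);
    }
    return 0;
}
```

Each segment carries its own displacement, so text and any number of data segments can move independently without the ABI having to name a "data base address".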

> information via the initial stack, such as in the auxiliary vector,
> preinitializing a processor register, providing a system call to retrieve
> it, etc.
>
> The presence of a separate text base address and a data base address also
> means that ET_EXEC images cannot be supported with the FDPIC psABI as it
> is not possible to make multiple copies of such image's data segment in a
> no-MMU system without the ability to relocate it at load time.
>
>
> 5.2 Lazy Binding (normative)
>
> Lazy binding can be optionally implemented by the dynamic loader. If it
> is implemented, then the run-time relocation of R_RISCV_JUMP_SLOT and
> their associated R_RISCV_GP relocations present in the FDT is done in two
> stages.

These should be a single relocation, for consistency with other FDPIC ABIs.

Properly supporting lazy binding on FDPIC is very difficult for multithreaded
programs because it is impossible (on baseline RV*IA) to atomically update both
words that compose a function descriptor copy. Lazy binding is disabled on
modern distros as a hardening measure and not supported by musl as a matter of
policy, so it is likely not worth trying to make it work.

If you were to attempt to do this, it would be necessary to specify the order
of loads in PLT entries (always load the entry point first and the GOT second);
updates would write the correct GOT, issue a membarrier() syscall (a no-op on
uniprocessor or sequentially consistent systems, required for ordering
otherwise), and then write the new entry point.

This guarantees that the entry point can only be reached with the corresponding
GOT, however, it allows the lazy resolver to be called with _either_ the
initial GOT value for the lazy descriptor, _or_ the final symbol's GOT. As
such, the lazy resolver cannot depend(!) on the GOT register it receives.
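
The ordering described above can be modelled in C11 atomics as a sketch. Real PLT code would use plain loads, which is why the text calls for membarrier() on the writer side; in this model a release store paired with an acquire load stands in for that barrier. The `fdt_entry` layout and both function names are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical two-word function descriptor copy in the FDT. */
struct fdt_entry {
    _Atomic uintptr_t entry;  /* code address */
    _Atomic uintptr_t gp;     /* GP (GOT) value for that code */
};

/* Resolver side: write the GP first, then publish the entry point.  The
 * release store keeps the GP write ordered before the entry write. */
void fdt_resolve(struct fdt_entry *e, uintptr_t new_entry, uintptr_t new_gp)
{
    atomic_store_explicit(&e->gp, new_gp, memory_order_relaxed);
    atomic_store_explicit(&e->entry, new_entry, memory_order_release);
}

/* Caller side (what a PLT entry would do): load the entry point first,
 * then the GP. */
void fdt_load(struct fdt_entry *e, uintptr_t *entry, uintptr_t *gp)
{
    *entry = atomic_load_explicit(&e->entry, memory_order_acquire);
    *gp = atomic_load_explicit(&e->gp, memory_order_relaxed);
}
```

A caller that observes the new entry point is guaranteed to observe the new GP as well. A caller that still observes the resolver's entry point may see either GP, which is the case the last paragraph warns about.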

> In the first stage, which is done by the dynamic loader at the time a
> module is loaded, R_RISCV_JUMP_SLOT and R_RISCV_GP relocations are
> resolved respectively to the address of the lazy resolver and the value
> of the global pointer associated with the module providing the lazy
> resolver.

> In the second stage, which is done when the lazy resolver is reached by
> means of making a call through an FDT entry referring to it,
> R_RISCV_JUMP_SLOT and R_RISCV_GP relocations are resolved respectively
> to the address of the function symbol associated with the FDT entry and
> the value of the global pointer associated with the module providing the
> function symbol. To be able to do its work the lazy resolver is called
> with certain registers containing values as follows:
>
> * x3 (gp) holds the dynamic loader's GP value as with an ordinary FDT
> entry (this is a consequence of the first stage of run-time relocation)

The dynamic loader needs to be able to tolerate _any_ valid gp value. This
could be achieved by reserving a few words near gp and having the dynamic
loader store a pointer to its own state at a known offset from every GOT.

> * x5 (t0) holds a pointer to the FDT entry to relocate
>
> * x6 (t1) holds the caller's GP value

I don't think this is actually needed - the SH and ARM FDPIC ABIs
unconditionally clobber the caller's GP. Given a pointer to a function
descriptor copy (which is within one of the caller's data segments) the dynamic
linker can easily find the caller by walking a list of loaded libraries.
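
A minimal sketch of that lookup, assuming the dynamic linker keeps a linked list of loaded modules and records one data segment per module (a real implementation would check every segment in the module's load map). The `module` struct and `module_for_fdt_entry` are illustrative names, not taken from any existing loader.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-module record kept by the dynamic linker. */
struct module {
    const char *name;
    uintptr_t data_start;   /* run-time start of a data segment */
    size_t data_size;       /* its size in bytes */
    struct module *next;
};

/* Find the module whose data segment contains the given FDT entry. */
const struct module *module_for_fdt_entry(const struct module *head,
                                          uintptr_t fdt_entry)
{
    for (const struct module *m = head; m != NULL; m = m->next) {
        /* Unsigned subtraction turns the range test into one compare. */
        if (fdt_entry - m->data_start < m->data_size)
            return m;
    }
    return NULL;
}
```

With this, t0 alone identifies the caller, so passing the caller's GP in t1 becomes unnecessary.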

> Registers have been assigned such as to work with the RV32E instruction
> set as well.
>
> Upon completion of the second stage the lazy resolver makes a jump to the
> newly resolved address of the function symbol.

> 5.3 Example PLT Code (informative)
>
> @PLT:
> l[w|d] t2, 0(t0)
> mv t1, gp

We don't need to save t1 here; we could save 2 bytes per PLT entry by moving
the adds into this function.

> l[w|d] gp, [4|8](t0)
> jr t2
> fun1@PLT:
> lui t0, %gprel_hi(FDT[fun1])
> addi t0, %gprel_lo(FDT[fun1])
> add t0, gp
> j @PLT
> fun2@PLT:
> lui t0, %gprel_hi(FDT[fun2])
> addi t0, %gprel_lo(FDT[fun2])
> add t0, gp
> j @PLT

-s

Fangrui Song

Mar 7, 2020, 10:50:47 PM3/7/20
to Maciej W. Rozycki, RISC-V SW Dev, Jim Wilson
I am not subscribed, so I suspect my reply will be eaten by Google Groups... I also guessed your email addresses.

> DBA | Data segment's base address; 0 in static link.

How is the data segment defined? The PT_LOAD segment containing .data,
.sdata, or something else?

> G | The offset from GP of a GOT entry for the symbol referred by
> | the relocation.
>---------+----------------------------------------------------------------
> GP | The value of GP associated with the symbol referred, nominally
> | (DVMA + DBA + 2048).

GNU ld seems to define __global_pointer$ = .sdata + 0x800
In lld, I arbitrarily set it to (exists(.sdata) ? .sdata : __ehdr_start) + 0x800

>---------+----------------------------------------------------------------
> P | The place (offset or address) of the storage unit affected by
> | the relocation.
>---------+----------------------------------------------------------------
> PLTE | The address of a PLT entry associated with the symbol referred.
>---------+----------------------------------------------------------------
> PLTI | The address of a PLT entry designated to make indirect calls.

I am confused by PLTE/PLTI.

Some PLT entries do not need a .symtab/.dyntab entry:

As an example, bl foo (R_PPC64_REL24) can cause the creation of PLT
call stubs. There can be several stubs for one symbol, because each
call stub can only be accessed within +-32MB.

R_AARCH64_{CALL,JUMP}26 can cause the creation of similar call stubs (veneers).

Some PLT entries need a .dynsym entry: canonical PLT entry (st_value>0, st_shndx=0).
Such a PLT entry is caused by non-PIC code and is created by the linker for
non-GOT, non-PLT relocation types referring to an external function.

What are PLTE and PLTI?

> S | The value of the symbol referred by the relocation.
>---------+----------------------------------------------------------------
> TBA | Text segment's base address; 0 in static link.

GNU ld -z separate-code (default on Linux x86 since 2.31) has the following segment layout:

R
RX
R
RW (relro ; non-relro)

lld has the following segment layout (since lld 9):

R
RX
RW(RELRO)
RW(non-RELRO)

The first PT_LOAD is not executable. Does the mandatory 0 in a static
link cause confusion?

>Table 4.2 Relocation Types
>
> Name | Value | Field | Symbol | Calculation
>==========================+=======+=============+===========+=============
> R_RISCV_RELATIVE | 3 | T-word32,64 | n/a | TBA + A
> R_RISCV_REL_TEXT (alias) | | | |
>--------------------------+-------+-------------+-----------+-------------
> R_RISCV_GP | 12 | T-word32,64 | any | GP
>--------------------------+-------+-------------+-----------+-------------
> R_RISCV_REL_DATA | 13 | T-word32,64 | n/a | DBA + A

AFAIK no relocation type uses the start of a segment for calculation.
A concrete section is needed.

> auipc ra, %pcrel_call_hi(@PLT) # R_RISCV_CALL_PLT

%pcrel_call_hi is not defined.

"The R_RISCV_CALL_PLT relocation with no symbol" - does it refer to the
PLT header?

Sam Elliott

Mar 20, 2020, 12:24:00 PM3/20/20
to Maciej W. Rozycki, sw-...@groups.riscv.org
Hi Maciej,

Thank you for this proposal. I realise you have had quite a bit of feedback already, but I would like to add some from lowRISC’s point of view.

Recently I have been investigating Embedded PIC, along with some collaborators at the Oxide Computer Company.

In our system, we do not have an MMU, and want as simple a loader as possible. With this in mind we will be statically linking all our embedded application executables. Thus, we are most interested in ROPI/RWPI, rather than FDPIC.

However, I think that FDPIC is not entirely orthogonal to ROPI/RWPI. It seems very likely that most of the GP-relative relocations and code sequences you propose here for local data addressing would also be useful for ROPI/RWPI (when combined with pc-relative, non-PLT function calls).

I am not convinced the code sequence for taking the address of a local function is correct, for either FDPIC or statically linked ROPI/RWPI executables, because I don’t think you can do the static relocation required for R_RISCV_GPREL_HI20(fun) if you don’t know the distance between the text and data sections (something you only know at runtime). I note you’ve had feedback that the sequences may need to be changed for FDPIC anyway, but I think ROPI/RWPI may just use the conventional pc-relative code sequences.

I am keen to see your revised specification, in light of the feedback so far.

Sam
> --
> You received this message because you are subscribed to the Google Groups "RISC-V SW Dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to sw-dev+un...@groups.riscv.org.
> To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/sw-dev/alpine.LFD.2.21.2002282057330.18621%40redsun52.ssa.fujisawa.hgst.com.

--
Sam Elliott
Software Developer - LLVM and OpenTitan
lowRISC CIC

Sam Elliott

Mar 23, 2020, 1:01:42 PM3/23/20
to Fangrui Song, RISC-V SW Dev
Fangrui,

Something you said on the FDPIC thread has me concerned.

> On 8 Mar 2020, at 3:50 am, Fangrui Song <i...@maskray.me> wrote:
>
> GNU ld seems to define __global_pointer$ = .sdata + 0x800
> In lld, I arbitrarily set it to (exists(.sdata) ? .sdata : __ehdr_start) + 0x800

Isn’t this mismatch a cause for concern?

If I’m linking for an embedded system, and forgot to define `__global_pointer$` (maybe because I don’t know about this special symbol), then GNU ld may perform linker relaxations (and associated relocations) based on the assumption that `gp` (the register) has a value of 0. On lld, it’s going to perform the same relocations based on the assumption that `gp` (the register) has quite a different value. While this isn’t an issue today, I think it may become one if LLD implements linker relaxations.

I presume that lld does not have the concept of a “default linker script”, which is maybe where this mismatch has come from, and why it has to programmatically define these symbols.

As an aside, I will update the psABI to mention `__global_pointer$`.

What are your thoughts on this issue?

Sam

Tommy Murphy

Mar 23, 2020, 1:11:14 PM3/23/20
to RISC-V SW Dev
> then GNU ld may perform linker relaxations (and associated relocations) based on the assumption that `gp` (the register) has a value of 0.

Does it actually assume 0 in that case?

Certainly if relaxations are performed at compile/link time and the (startup) code doesn't initialize $gp appropriately then execution will use whatever garbage/uninitialised value $gp happens to contain and will almost certainly fail/crash.

Maciej W. Rozycki

Mar 23, 2020, 1:13:34 PM3/23/20
to Jim Wilson, RISC-V SW Dev
Hi Jim,

I am now back to this effort after a holiday and a short period to catch
up.

Thank you for your feedback. I will be making amendments to the proposal
as I go through your notes. I have yet to address Stef's extensive input,
so some of this stuff might be iteratively updated.

> > I will appreciate your questions, comments and any other kind of
> > feedback.
>
> The style is different from the existing psABI, though it looks like a
> better style. Maybe you could rewrite our existing psABI to improve
> it?

This has been written with the ELF gABI as a reference, and with some
influence from the style the original MIPS psABIs used that I have found
quite comprehensible and got used to over the years.

I can look into improving the base RISC-V psABI once we have got through
the implementation of this FDPIC extension.

> > ---------+----------------------------------------------------------------
> > GP | The value of GP associated with the symbol referred, nominally
> > | (DVMA + DBA + 2048).
>
> This uses DVMA without defining it.

Now defined, in terms of `p_vaddr'.

> > --------------------------+-------+-------------+-----------+-------------
> > R_RISCV_GPREL_LO12_I | 60 | T-lo12i | local | S - GP + A
>
> This is identical to the existing R_RISCV_GPREL_I reloc.
>
> > --------------------------+-------+-------------+-----------+-------------
> > R_RISCV_GPREL_LO12_S | 61 | T-lo12s | local | S - GP + A
>
> This is identical to the existing R_RISCV_GPREL_S reloc.

Except for overflow detection. The ones I have defined perform no overflow
detection, as they assume the corresponding high part to be present as well.

> Currently, the R_RISCV_GPREL_I and R_RISCV_GPREL_S can only be created
> by linker relaxation, so we don't have assembler support for them, and
> this is maybe also why the names are a little different than what you
> expect.

Well, as long as BFD provides them you can always use `.reloc' to emit
them with GAS. This doesn't solve the issue of link-time overflow
detection however; they do not have a corresponding high-part relocation
so we do expect them to catch overflows to facilitate code that has been
written for `.sdata'/`.sbss' support, don't we? Or otherwise what is the
purpose of their existence?

> > Corresponding FDPIC code, using GP-relative addressing:
> >
> > # Outstanding static relocations
> > lui t0, %gprel_hi(var+addend) # R_RISCV_GPREL_HI20 var+addend
> > c.add t0, gp
> > lbu t1, %gprel_lo(var+addend)(t0) # R_RISCV_GPREL_LO12_I var+addend
> > sb t2, %gprel_lo(var+addend)(t0) # R_RISCV_GPREL_LO12_S var+addend
>
> Not all targets have compressed instructions. The assembler will
> convert regular instructions to compressed instructions if it can, so
> using add instead of c.add is more general with no code size
> optimization loss.

I have deliberately left relaxation out and given the code sequences (in
informational sections) as examples rather than requirements (in normative
sections). You are of course right that some configurations will lack
compressed instructions and code is obviously allowed to use base encoding
equivalents or different sequences e.g. due to compiler optimisations.

Also as a side note I think it is GCC (or any other compiler) that should
produce the intended assembly right from the beginning, so as to get the
code size right and avoid unnecessary longer sequences such as with
branches that seem out of range due to size estimate pessimisation but are
not (of course some sizes are only known at link stage making certain
kinds of optimisations possible in the linker anyway).

> For relaxation purposes, there should be a reloc on the add, so it should be
> add t0,t0,gp,%gprel_add(var+addend)

I don't think we need to invent extra syntax here for this as we have the
`.reloc' pseudo-op for such use cases, e.g. where no instruction operand
refers to a symbol or there's no symbol involved (cf. R_MIPS_JALR). This
could look like:

0:
add t0, t0, gp
.reloc 0b, R_RISCV_GPREL_ADD, var + addend

> With this extra reloc, if %gprel_hi(var+addend) is zero, then we can
> relax the three instruction sequence for the load to one instruction,
> deleting the first two, and modifying the load to
> lbu t1,%gprel_lo(var+addend)(gp)
> and likewise for the store.
>
> See for instance the tprel_add reloc used for TLS which works the same
> way. There is an example in the psABI doc.

Relaxation optimisations like this were considered and comprehensively
implemented with the nanoMIPS target in GOLD, publicly available. I think
we ought to follow suit.

Therefore I think this will be best considered separately, as this is not
strictly necessary for FDPIC support on one hand, and may be used for
other purposes on the other. For this reason I have decided not to
include any relaxation support with the FDPIC psABI addendum.
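
For readers following the relaxation argument, the C sketch below shows how a GP-relative offset splits into the part carried by `lui' and the signed 12-bit part carried by the load or store, and when the high part is zero so that the three-instruction sequence could collapse to one. The function names are hypothetical and only model the arithmetic; they are not binutils code.

```c
#include <stdint.h>

/* Low part: the offset's low 12 bits, sign-extended, as a load/store
 * immediate would interpret them. */
int32_t gprel_lo12(int32_t off)
{
    return ((off & 0xfff) ^ 0x800) - 0x800;
}

/* High part: what lui must supply so that (hi << 12) + lo == off.
 * Because lo is signed, hi is effectively (off + 0x800) >> 12. */
int32_t gprel_hi20(int32_t off)
{
    return (int32_t)(((int64_t)off - gprel_lo12(off)) / 4096);
}

/* The sequence is relaxable to a single gp-relative access when the
 * high part is zero, i.e. when off is within [-2048, 2047]. */
int gprel_relaxable(int32_t off)
{
    return gprel_hi20(off) == 0;
}
```

The rounding in the high part is also why a low-part relocation that expects a matching high part cannot meaningfully check for overflow on its own.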

> > auipc t0, %pcrel_got_hi(var) # R_RISCV_GOT_HI20 var
>
> This should be %got_pcrel_hi(var). It was first added to llvm, and
> then just added to GNU Binutils this week. It is already mentioned in
> riscv-asm-manual, but needs to be mentioned in the psABI. That is on
> my todo list.

Yep, I have now seen the patch posted to the binutils mailing list. I
have updated the document accordingly throughout.

I now actually wonder if we shouldn't have used composed relocations
(e.g. R_RISCV_GOT for GOT references with a corresponding %got operator,
R_RISCV_PCREL for PC-relative calculations with %pcrel, etc.) to avoid
proliferating relocation variants providing repeating patterns.

> > lui t0, %gprel_got_hi(var) # R_RISCV_GPREL_GOT_HI20 var
> > c.add t0, gp
> > l[w|d] t0, %gprel_got_lo(var)(t0) # R_RISCV_GPREL_GOT_LO12_I var
> > lbu t1, addend(t0)
>
> As above, adding a reloc, e.g. %gprel_got_add, to the add makes this
> relaxable.

Likewise, this can be done with `.reloc' like I noted above, and the
relaxation defined separately. I think relaxation support that requires
psABI support (e.g. extra relocations) should be defined in a separate
section of the standard. Perhaps individual sections included in the base
psABI and this addendum.

If you think it is important to have relaxation defined right from the
beginning (why?), then I might consider doing it right away.

Maciej

Sam Elliott

Mar 23, 2020, 1:39:25 PM3/23/20
to RISC-V SW Dev, Tommy Murphy
Judging by this code, yes I believe it assumes zero.

https://github.com/bminor/binutils-gdb/blob/master/bfd/elfnn-riscv.c#L1408-L1420

Yes, I agree that you have problems if you haven’t loaded 0 or `__global_pointer$` into gp, which suggests to me that the linker should raise an error if you try to use these relocations and `__global_pointer$` is not defined. This seems better than getting some unknown run-time failure where a completely wrong address was loaded.

I’m coming from the point of view of working on an embedded project, with a custom linker script (that doesn't define `__global_pointer$`), and not finding any documentation of this special symbol, not even in the psABI (something I will attempt to correct today). Inadvertently, my project has got it right as it zeroes all registers at startup.

Sam

Andrew Waterman

Mar 23, 2020, 2:38:06 PM3/23/20
to Sam Elliott, RISC-V SW Dev, Tommy Murphy
On Mon, Mar 23, 2020 at 10:39 AM Sam Elliott <sell...@lowrisc.org> wrote:
> Judging by this code, yes I believe it assumes zero.

Actually, zero is just used as a sentinel value in that code. The semantics are, if __global_pointer$ is not defined, then the linker won’t perform relaxations against the global pointer.

Sam Elliott

Mar 23, 2020, 5:03:26 PM3/23/20
to Andrew Waterman, RISC-V SW Dev, Tommy Murphy
Oh, now I see how that works. The interaction with `max_alignment` was not so easy to understand until I noticed it was `(bfd_vma) -1` by default.

This has also answered some questions I have about how to prevent `gp`-relative relaxations being used to reference symbols in a different output section (just in case I choose to start moving, for example, the data section).

Thanks for your clarification!

Sam

Fangrui Song

Mar 23, 2020, 6:30:36 PM3/23/20
to Sam Elliott, RISC-V SW Dev
While I was making glibc applications linkable with lld, I noticed that
a linker had to define __global_pointer$ because glibc Scrt1.o
(sysdeps/riscv/start.S) requires it. To be honest I am not too sure why
.sdata and __global_pointer$ are used by RISC-V. I am always wondering
whether it is legacy cruft copied from elsewhere (e.g. MIPS).

MIPS needs GP just because it lacks a PC-relative instruction. It needs
a register to amortize the PIC cost. Similarly, PPC64 does this via a
dedicated TOC register.

I don't follow RISC-V development that closely so I may be wrong. If no
code is using .sdata, then it does not matter that much how the linker
defines __global_pointer$

> x3 gp Global pointer -- (Unallocatable)

This just wastes a register for no good reason for most applications.

Andrew Waterman

Mar 23, 2020, 6:40:47 PM3/23/20
to Fangrui Song, Sam Elliott, RISC-V SW Dev
You know, you could ask why it's there rather than just assuming we don't know what we're doing...

It's not legacy MIPS cruft.  It works quite a bit differently, relying on linker relaxations to opportunistically shorten global-variable accesses.  Earlier RISC-V ABIs didn't have gp; it was added when it was found that it was a better use of that register than another temporary or callee-saved register.



Fangrui Song

Mar 23, 2020, 8:54:42 PM3/23/20
to Andrew Waterman, Sam Elliott, RISC-V SW Dev
Then when is gp actually beneficial? Only in a sub-ABI like FDPIC? I
don't think clang or GCC can generate it.

To use it in source code, some attribute annotation is required.
Alternatively, let a post-link optimizer rewrite some PC relative
load/store instructions.

In addition, it is not clear how gp should be manipulated while calling
an external component (another shared object).

I wish gp were reserved with all the specifications coming with it.

Andrew Waterman

Mar 23, 2020, 9:03:42 PM3/23/20
to Fangrui Song, Sam Elliott, RISC-V SW Dev
Linker relaxation is the only way that gp gets used. For the following trivial program

  int x;
  int main() { return x; }

GCC emits the code sequence

  lui a5,%hi(x)
  lw a0,%lo(x)(a5)
  ret

but after linking, the executable contains

  lw      a0,-104(gp) # 11be0 <x>
  ret
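
The decision the linker makes in this example can be sketched in C as follows. The gp value used in the note after the code, 0x11c48, is inferred from the listing (0x11be0 + 104) rather than stated in the thread, and `gp_relax_offset` is a hypothetical helper: the real GNU ld logic also has to account for alignment and output sections. The `gp_defined` flag models the rule that no gp relaxation happens when `__global_pointer$` is undefined.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decide whether an access to `sym` can become a single gp-relative
 * instruction.  On success the signed 12-bit offset is stored in *off. */
bool gp_relax_offset(bool gp_defined, uint32_t gp, uint32_t sym, int32_t *off)
{
    if (!gp_defined)
        return false;               /* no __global_pointer$: never relax */
    int64_t delta = (int64_t)sym - (int64_t)gp;
    if (delta < -2048 || delta > 2047)
        return false;               /* out of reach of a 12-bit immediate */
    *off = (int32_t)delta;
    return true;
}
```

For `x` at 0x11be0 and gp at 0x11c48 the helper yields -104, matching the `lw a0,-104(gp)` in the linked executable.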


> In addition, it is not clear how gp should be manipulated while calling
> an external component (another shared object).

gp is only used by the main executable; shared libraries do not use it, so it doesn't need to be manipulated when crossing into or between shared objects.

Maciej W. Rozycki

Mar 23, 2020, 9:33:53 PM3/23/20
to Andrew Waterman, Fangrui Song, Sam Elliott, RISC-V SW Dev
On Mon, 23 Mar 2020, Andrew Waterman wrote:

> > In addition, it is not clear how gp should be manipulated while calling
> > an external component (another shared object).
>
> gp is only used by the main executable; shared libraries do not use it, so
> it doesn't need to be manipulated when crossing into or between shared
> objects.

I suppose GP could be used to reduce the number of instructions needed
for accesses to the GOT (combined with `.sdata' and `.sbss' if required)
in DSOs, however the limited 12-bit span of offsets supported by machine
instructions combined with the presence of PC-relative addressing possible
with just two hardware instructions makes such an optimisation somewhat
questionable compared to architectures such as Alpha, MIPS or Power that
have 16-bit offsets and no reasonable PC-relative addressing (at least in
the classic instruction sets).

Maciej

Maciej W. Rozycki

Mar 23, 2020, 9:49:35 PM3/23/20
to Fangrui Song, RISC-V SW Dev, Jim Wilson
Hi Fangrui,

Thank you for your input.

> I am not subscribed, so I suspect my reply will be eaten by Google
> Groups... I also guessed your email addresses.

It went through as I received it at my LMO personal e-mail address too.
Perhaps the mailing list isn't open for posting only by subscribers after
all (I sought advice on that from the list owner, but haven't ever heard
back).

> > Operand | Description
> >=========+================================================================
> > A | Relocation addend.
> >---------+----------------------------------------------------------------
> > DBA | Data segment's base address; 0 in static link.
>
> How is the data segment defined? The PT_LOAD segment containing .data,
> .sdata, or something else?

The data segment here is the set of r/w PT_LOAD segments (i.e. those whose
`p_flags' have the PF_R and PF_W bits set), combined, as opposed to the r/x
PT_LOAD segments (with PF_X and/or PF_R set), also combined, which make up
the text segment.
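
Read literally, this definition amounts to a test on the PF_W bit. A small C sketch, using the standard ELF `p_flags' bit values and a hypothetical `classify_load_segment` helper:

```c
#include <stdint.h>

/* Standard ELF p_flags bits. */
#define PF_X 0x1u
#define PF_W 0x2u
#define PF_R 0x4u

enum seg_kind { SEG_TEXT, SEG_DATA };

/* Writable PT_LOAD segments belong to the data segment; read-only and
 * executable ones (R or RX) belong to the text segment. */
enum seg_kind classify_load_segment(uint32_t p_flags)
{
    return (p_flags & PF_W) ? SEG_DATA : SEG_TEXT;
}
```

Under this rule the R, RX and R segments of a `-z separate-code' layout all land in the text segment, and the RW segments in the data segment.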

> > G | The offset from GP of a GOT entry for the symbol referred by
> > | the relocation.
> >---------+----------------------------------------------------------------
> > GP | The value of GP associated with the symbol referred, nominally
> > | (DVMA + DBA + 2048).
>
> GNU ld seems to define __global_pointer$ = .sdata + 0x800
> In lld, I arbitrarily set it to (exists(.sdata) ? .sdata : __ehdr_start)
> + 0x800

There's no clash here I believe. A distinct linker script will likely be
required for the FDPIC configuration, but that's an implementation detail.

> >---------+----------------------------------------------------------------
> > P | The place (offset or address) of the storage unit affected by
> > | the relocation.
> >---------+----------------------------------------------------------------
> > PLTE | The address of a PLT entry associated with the symbol referred.
> >---------+----------------------------------------------------------------
> > PLTI | The address of a PLT entry designated to make indirect calls.
>
> I am confused by PLTE/PLTI.

It is further described in 4.3 "Procedure Calls (normative)" although the
acronyms are not referred to there, which was a mistake. I have corrected it now.

> Some PLT entries do not need a .symtab/.dyntab entry:
>
> As an example, bl foo (R_PPC64_REL24) can cause the creation of PLT
> call stubs. There can be several stubs for one symbol, because each
> call stub can only be accessed within +-32MB.
>
> R_AARCH64_{CALL,JUMP}26 can cause the creation of similar call stubs
> (veneers).
>
> Some PLT entries need a .dynsym entry: canonical PLT entry (st_value>0,
> st_shndx=0).
> Such a PLT entry is caused by non-PIC code and is created by the linker for
> non-GOT, non-PLT relocation types referring to an external function.
>
> What are PLTE and PLTI?

PLTE entries are individually associated with external function symbols,
calls to which are made directly. There is only one PLTI entry, used for
making indirect calls.

> > S | The value of the symbol referred by the relocation.
> >---------+----------------------------------------------------------------
> > TBA | Text segment's base address; 0 in static link.
>
> GNU ld -z separate-code (default on Linux x86 since 2.31) has the
> following segment layout:
>
> R
> RX
> R

These are the text segment, relocated together at load time.

> RW (relro ; non-relro)

This is the data segment, relocated together at load time.

> lld has the following segment layout (since lld 9):
>
> R
> RX

These are the text segment, relocated together at load time.

> RW(RELRO)
> RW(non-RELRO)

These are the data segment, relocated together at load time.

> The first PT_LOAD is not executable. Does the mandatory 0 in a static
> link cause confusion?

The base address is 0 in static-link calculation, because the
dynamic-load relocation does not yet apply at this stage; the relevant
segment's VMA (as with `p_vaddr') is the actual address used for
calculation. A relative relocation (either R_RISCV_REL_TEXT or
R_RISCV_REL_DATA, as applicable) may have to be attached to the result of
such calculation for dynamic-load relocation.

FAOD I have been using the explicit terms: "static linker" and "dynamic
loader", and derived grammatical forms such as "static linking",
"static-link", etc. and "dynamic loading", "dynamic-load", etc. to avoid
confusion in terminology like with "dynamic linker", which makes lone
"linker" have two meanings. This has nothing to do with static vs dynamic
executables (all FDPIC executables are PIE anyway and go through the
dynamic load stage at run time, although the dynamic loader code may be
embedded within the executable rather than standalone).

> >Table 4.2 Relocation Types
> >
> >           Name           | Value |    Field    |  Symbol   | Calculation
> >==========================+=======+=============+===========+=============
> > R_RISCV_RELATIVE         |   3   | T-word32,64 |    n/a    | TBA + A
> > R_RISCV_REL_TEXT (alias) |       |             |           |
> >--------------------------+-------+-------------+-----------+-------------
> > R_RISCV_GP               |  12   | T-word32,64 |    any    | GP
> >--------------------------+-------+-------------+-----------+-------------
> > R_RISCV_REL_DATA         |  13   | T-word32,64 |    n/a    | DBA + A
>
> AFAIK no relocation type uses the start of a segment for calculation.
> A concrete section is needed.

A relocation whose symbol index in `r_info' is STN_UNDEF does not refer to
a symbol nor, consequently, to a section. Instead a value of 0 is used for
calculation; this has been explicitly defined in the ELF gABI.

This value is still relocated in dynamic loading by the base address as
are all actual symbols (save for SHN_ABS ones); since we have separate
base addresses for text and data in this specification this will be the
text base address and the data base address respectively and distinct
relocations are therefore required.

There are many existing examples of such relocation calculation across
various psABIs, e.g. R_ALPHA_RELATIVE, R_386_RELATIVE, etc.
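
A minimal sketch of the two symbol-less calculations (TBA + A and DBA + A) may make the need for distinct relocations more concrete. The function name and the load addresses below are hypothetical; only the two calculations come from the table quoted above.

```python
def apply_relative(kind, addend, text_base, data_base):
    """Return the relocated value for a symbol-less relative relocation."""
    if kind == "R_RISCV_REL_TEXT":   # alias of R_RISCV_RELATIVE: TBA + A
        return text_base + addend
    if kind == "R_RISCV_REL_DATA":   # DBA + A
        return data_base + addend
    raise ValueError(f"not a relative relocation: {kind}")


# Hypothetical base addresses chosen by a dynamic loader.
TBA, DBA = 0x4000_0000, 0x2000_0000

text_ptr = apply_relative("R_RISCV_REL_TEXT", 0x1234, TBA, DBA)
data_ptr = apply_relative("R_RISCV_REL_DATA", 0x1234, TBA, DBA)

# In static-link calculation both base addresses are taken as 0.
static_value = apply_relative("R_RISCV_REL_DATA", 0x1234, 0, 0)
```

The same addend yields different run-time values depending on which base address applies, which a single relocation type could not express.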

> >FDPIC code, indirect call (to a2):
> >
> > # Outstanding static relocations
> > c.mv t0, a2
> >label:
> > auipc ra, %pcrel_call_hi(@PLT) # R_RISCV_CALL_PLT
> > jalr ra, ra, %pcrel_call_lo(label)
> > l[w|d] gp, <gp_slot>(sp)
> >
> > # The R_RISCV_CALL_PLT relocation with no symbol referred resolves to
> > # the PLT entry associated with indirect calls.
>
> %pcrel_call_hi is not defined.

It's an implementation detail (the section is informative), assemblers
are free to define their own syntax, which is beyond the scope of an ABI.

In the GNU assembler percent-operators indicate relocation, however we
currently have an issue in that several operations have not been defined
and the compiler has no direct way to synthesize them other than with the
`.reloc' pseudo-op.

In particular there is no way (or I haven't found one) for the compiler
to emit an instruction sequence to make a function call. Instead the
`call' assembly macro has to be used, that expands to a pair of
instructions.

So I used this synthetic example instead, using nonexistent percent-ops.
Perhaps this could be expressed in a better way; suggestions are welcome.
Maybe this could be just:

    # Outstanding static relocations
    c.mv    t0, a2
    auipc   ra, %call_plt(@PLT)     # R_RISCV_CALL_PLT
    jalr    ra, ra, 0
    l[w|d]  gp, <gp_slot>(sp)

instead (observing that the R_RISCV_CALL_PLT relocation has its relocated
fields spread across two instructions). I have updated my code examples
accordingly.
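
To illustrate how one relocated value ends up spread across the two instructions, here is a sketch of the usual hi20/lo12 split for an AUIPC/JALR pair. The function names are made up for the example, and it assumes the offset fits the ordinary 32-bit PC-relative range.

```python
def split_hi20_lo12(offset):
    """Split a signed PC-relative offset into AUIPC hi20 and JALR lo12 parts."""
    hi20 = (offset + 0x800) >> 12     # round so that lo12 fits in 12 signed bits
    lo12 = offset - (hi20 << 12)      # always within [-2048, 2047]
    return hi20 & 0xFFFFF, lo12


def join_hi20_lo12(hi20, lo12):
    """Recombine the parts: sign-extended (hi20 << 12) plus lo12."""
    upper = hi20 << 12
    if upper & 0x8000_0000:           # sign-extend the 32-bit upper part
        upper -= 1 << 32
    return upper + lo12


samples = (0, 0x7FF, 0x800, 0x12FFC, -0x1000, -1)
round_trips = [join_hi20_lo12(*split_hi20_lo12(off)) for off in samples]
```

The rounding by 0x800 compensates for JALR sign-extending its 12-bit immediate.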

> "The R_RISCV_CALL_PLT relocation with no symbol" - does it refer to the
> PLT header?

Yes, aka PLTI, according to this definition:

           Name           | Value |    Field    |  Symbol   | Calculation
==========================+=======+=============+===========+=============
                          |       |             |   local   | S - P
 R_RISCV_CALL_PLT         |  19   | V-hi20lo12i | external  | PLTE - P
                          |       |             |    n/a    | PLTI - P
--------------------------+-------+-------------+-----------+-------------

-- no symbol referred here, so the third calculation applies.
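
The selection between the three calculations can be sketched as follows. The function name, the string tags for the symbol kind and the addresses are hypothetical; S, P, PLTE and PLTI are the symbols from the table above.

```python
def call_plt_value(symbol_kind, P, S=None, PLTE=None, PLTI=None):
    """Pick the R_RISCV_CALL_PLT calculation by the kind of symbol referred."""
    if symbol_kind == "local":
        return S - P        # symbol value relative to the place
    if symbol_kind == "external":
        return PLTE - P     # the symbol's PLT entry relative to the place
    if symbol_kind is None:
        return PLTI - P     # no symbol: the PLT entry for indirect calls
    raise ValueError(f"unexpected symbol kind: {symbol_kind}")


# Made-up addresses, purely for illustration.
place = 0x1_0400
local_off = call_plt_value("local", place, S=0x1_0800)
external_off = call_plt_value("external", place, PLTE=0x1_0040)
indirect_off = call_plt_value(None, place, PLTI=0x1_0020)
```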

Do these explanations and corrections clear your concerns?

Maciej

Andrew Waterman

Mar 23, 2020, 10:39:05 PM3/23/20
to Maciej W. Rozycki, Fangrui Song, Sam Elliott, RISC-V SW Dev
Agreed.  Fortunately, if, down the road, that proves to be a good use of gp, it can be done without changing the ABI.



Maciej W. Rozycki

Mar 25, 2020, 3:17:44 PM3/25/20
to Sam Elliott, sw-...@groups.riscv.org
Hi Sam,

Thank you for your input.

> In our system, we do not have a MMU, and want as simple a loader as
> possible. With this in mind we will be statically linking all our
> embedded application executables. Thus, we are most interested in
> ROPI/RWPI, rather than FDPIC.
>
> However, I think that FDPIC is not entirely orthogonal to ROPI/RWPI. It
> seems very likely that most of the GP-relative relocations and code
> sequences you propose here for local data addressing would also be
> useful for ROPI/RWPI (when combined with pc-relative, non-PLT function
> calls).

That is correct, static PIE is just a special case where you have no
additional modules loaded. So all the dynamic relocation processing is
still done as required for text and data segment separation (you still
want to map text once with multiple instances of the PIE running), but you
don't need a PLT or FDT because all symbols by definition resolve locally.

NB if you don't need to run multiple instances of the same executable,
then you can just get away with the flat binary format already supported,
so long as you build your software using the static PIE format.

> I am not convinced the code sequence for taking the address of a local
> function is correct, for either FDPIC or statically linked ROPI/RWPI
> executables, because I don't think you can do the static relocation
> required for R_RISCV_GPREL_HI20(fun) if you don't know the distance
> between the text and data sections (something you only know at runtime).

This is a GP-relative rather than a PC-relative reference so the local
sequence is right (the external one has an editorial mistake, as noticed
by Stef already), as the pointer taken will be to the relevant FDT entry,
which is data (poked at by the dynamic loader) and therefore in the data
segment. The whole point of a separate GP is to have local data offsets
constant with respect to it.
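
A small sketch of why the GP-relative offset can be resolved at static link time while the text-to-data distance cannot. All offsets and base addresses are invented for the example, and `load_module' is a hypothetical helper, not loader code.

```python
def load_module(text_base, data_base):
    """Return run-time addresses for one hypothetical FDPIC module."""
    # Link-time offsets within each segment (made up for illustration).
    fun_text_off = 0x0400   # entry point of fun within the text segment
    fdt_data_off = 0x0010   # FDT entry of fun within the data segment
    gp_data_off = 0x0800    # where GP points within the data segment
    return {
        "fun": text_base + fun_text_off,
        "fdt_fun": data_base + fdt_data_off,
        "gp": data_base + gp_data_off,
    }


# Same text mapping, two different data placements.
a = load_module(text_base=0x1000_0000, data_base=0x2000_0000)
b = load_module(text_base=0x1000_0000, data_base=0x5555_0000)

gprel_a = a["fdt_fun"] - a["gp"]   # FDT entry relative to GP
gprel_b = b["fdt_fun"] - b["gp"]
pc_to_gp_a = a["fun"] - a["gp"]    # text address relative to GP
pc_to_gp_b = b["fun"] - b["gp"]
```

The FDT entry sits at the same offset from GP in both instances, whereas anything in the text segment does not.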

> I note you've had feedback that the sequences may need to be changed for
> FDPIC anyway, but I think ROPI/RWPI may just use the conventional
> pc-relative code sequences.

I think you are right here as in your environment you won't ever pass
function pointers externally, and therefore you don't need to update GP.

This scenario would actually correspond to the STV_INTERNAL export class
(visibility) in the usual dynamic load scenario, including the FDPIC ABI
in particular. So I think it may be worth it to permit function symbols
marked STV_INTERNAL to be referred directly not only for calls, but for
taking their address as well in the FDPIC psABI. In that case no FDT
entry will be required and the address can be taken with a PC-relative
reference.

There is no way that I know of however to verify that such a pointer is
not passed externally (except perhaps by static analysis, which is beyond
the scope of a compiler toolchain), so the onus would be on the software
writer to make sure the restriction has not been violated.

I think this would actually be a useful enhancement to the FDPIC psABI
addendum. By having the semantics of STV_INTERNAL symbols defined like
this in the specification we'll have both the FDPIC and the ROPI/RWPI use
cases covered with a single ABI (the latter as a special case of the more
general FDPIC case). With such semantics to build a static PIE program
for the ROPI/RWPI rather than full-FDPIC case all you'll have to do with
GCC will be using the `-mfdpic -fvisibility=internal' command-line options
(and of course nothing prevents us from making that the default for the
compiler at its build time, based either on target selection or `--with-*'
configuration options if that made people's life easier). Other compilers
may follow suit.

NB regular FDPIC static PIE programs will still require FDT entries to be
created for function pointers passed to modules loaded with dlopen(3).

Does this reply answer your questions and clear your concerns?

Maciej

Maciej W. Rozycki

Mar 26, 2020, 9:26:43 AM3/26/20
to Sam Elliott, sw-...@groups.riscv.org
On Wed, 25 Mar 2020, Maciej W. Rozycki wrote:

> This scenario would actually correspond to the STV_INTERNAL export class
> (visibility) in the usual dynamic load scenario, including the FDPIC ABI
> in particular. So I think it may be worth it to permit function symbols
> marked STV_INTERNAL to be referred directly not only for calls, but for
> taking their address as well in the FDPIC psABI. In that case no FDT
> entry will be required and the address can be taken with a PC-relative
> reference.
>
> There is no way that I know of however to verify that such a pointer is
> not passed externally (except perhaps by static analysis, which is beyond
> the scope of a compiler toolchain), so the onus would be on the software
> writer to make sure the restriction has not been violated.
>
> I think this would actually be a useful enhancement to the FDPIC psABI
> addendum. By having the semantics of STV_INTERNAL symbols defined like
> this in the specification we'll have both the FDPIC and the ROPI/RWPI use
> cases covered with a single ABI (the latter as a special case of the more
> general FDPIC case). With such semantics to build a static PIE program
> for the ROPI/RWPI rather than full-FDPIC case all you'll have to do with
> GCC will be using the `-mfdpic -fvisibility=internal' command-line options
> (and of course nothing prevents us from making that the default for the
> compiler at its build time, based either on target selection or `--with-*'
> configuration options if that made people's life easier). Other compilers
> may follow suit.

Ditch it! There is no way I can think of to actually track visibility at
a pointer's *use* place and we have no way to restrict a function pointer
type/variable to only accept assignments from STV_INTERNAL function
references, which would be a guarantee that only an STV_INTERNAL function
could be pointed at. At least with the high-level-language/toolchain
infrastructure we have, down to the static linker.

Barring that we need to keep the function pointer format uniform whether
for local or external references so that one piece of code works for both,
and for FDPIC that means using a pointer to a function descriptor rather
than the entry point as a function pointer.
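
As a loose model of that uniformity, the sketch below treats a function pointer as a reference to a descriptor pairing an entry point with the GP value of the defining module. The two-field layout and all names are assumptions for illustration; the addendum's actual descriptor format is not restated here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class FunctionDescriptor:
    entry: Callable[..., tuple]   # stands in for the code entry point
    gp: int                       # GP value of the module defining the function


def call_via_descriptor(fd, *args):
    """Indirect call: the callee runs with the GP taken from the descriptor."""
    return fd.entry(fd.gp, *args)


def local_fun(gp):
    return ("local", gp)


def external_fun(gp):
    return ("external", gp)


# Hypothetical GP values for two modules.
fd_local = FunctionDescriptor(local_fun, gp=0x2000_0800)
fd_external = FunctionDescriptor(external_fun, gp=0x7000_0800)

# One piece of calling code handles both kinds of pointer.
results = [call_via_descriptor(fd) for fd in (fd_local, fd_external)]
```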

So while it looked like a nice idea at first unfortunately it does not
appear feasible. Sigh.

Maciej

Sam Elliott

Mar 26, 2020, 9:48:17 AM3/26/20
to Maciej W. Rozycki, sw-...@groups.riscv.org
Thanks for the reply!
Ah, I see, so the R_RISCV_GPREL_HI20(fun) is actually closer to something like R_RISCV_GPREL_HI20(__riscv_fdt_fun). And we know the FDTs are in the data section.

I have two notational points, which I feel add clarity and ensure this specification lines up with the existing assembler conventions:

1. Can I propose we use the fun@FDT notation, given the psABI already using the @PLT notation for the PLT? Thus above, the relocation would be R_RISCV_GPREL_HI20(fun@FDT) - I don't think the symbols need to change, but I think this makes it more obvious that you're really pointing at the FDT entry here.

2. It came up in a different reply of yours, but you stated you would prefer not to add %-based assembly operators in this proposal (specifically gprel_add). I think this proposal is exactly the time to propose these operators, in line with existing conventions, especially as one of the replies has advocated for trying to avoid the explosion of relocations that are needed to cover GOT, non-GOT, PLT relocations etc.

>
>> I note you've had feedback that the sequences may need to be changed for
>> FDPIC anyway, but I think ROPI/RWPI may just use the conventional
>> pc-relative code sequences.
>
> I think you are right here as in your environment you won't ever pass
> function pointers externally, and therefore you don't need to update GP.
>
> This scenario would actually correspond to the STV_INTERNAL export class
> (visibility) in the usual dynamic load scenario, including the FDPIC ABI
> in particular. So I think it may be worth it to permit function symbols
> marked STV_INTERNAL to be referred directly not only for calls, but for
> taking their address as well in the FDPIC psABI. In that case no FDT
> entry will be required and the address can be taken with a PC-relative
> reference.
>
> There is no way that I know of however to verify that such a pointer is
> not passed externally (except perhaps by static analysis, which is beyond
> the scope of a compiler toolchain), so the onus would be on the software
> writer to make sure the restriction has not been violated.

Yeah this does sound like an issue, but there are other not dissimilar issues on ROPI/RWPI anyway to do with where constant vs non-constant data is placed (and issues around "constant" pointers to non-constant data). I would err towards not creating this semantic issue in FDPIC, but on the other hand if it helps us avoid another psABI, I could see the advantage of using STV_INTERNAL in this way.

>
> I think this would actually be a useful enhancement to the FDPIC psABI
> addendum. By having the semantics of STV_INTERNAL symbols defined like
> this in the specification we'll have both the FDPIC and the ROPI/RWPI use
> cases covered with a single ABI (the latter as a special case of the more
> general FDPIC case). With such semantics to build a static PIE program
> for the ROPI/RWPI rather than full-FDPIC case all you'll have to do with
> GCC will be using the `-mfdpic -fvisibility=internal' command-line options
> (and of course nothing prevents us from making that the default for the
> compiler at its build time, based either on target selection or `--with-*'
> configuration options if that made people's life easier). Other compilers
> may follow suit.
>
> NB regular FDPIC static PIE programs will still require FDT entries to be
> created for function pointers passed to modules loaded with dlopen(3).

One of the reasons for us choosing ROPI/RWPI is that it should have lower loading overhead than the full FDPIC implementation, so should be compatible with small embedded systems. Given we want as simple a loader as possible, it's likely the platform will also not provide dlopen(3) in any capacity either.

>
> Does this reply answer your questions and clear your concerns?

I think it does, and I am more satisfied with the "normal FDPIC" proposal. I do need more time to think about how FDPIC+static-PIE may be compatible or not with what I expected to propose for ROPI/RWPI.

Thanks for helping clarify my understanding of how ROPI/RWPI and FDPIC relate to each other

Sam

>
> Maciej

Jim Wilson

Mar 26, 2020, 11:15:21 PM3/26/20
to Maciej W. Rozycki, RISC-V SW Dev
On Mon, Mar 23, 2020 at 10:13 AM Maciej W. Rozycki <ma...@wdc.com> wrote:
> > Currently, the R_RISCV_GPREL_I and R_RISCV_GPREL_S can only be created
> > by linker relaxation, so we don't have assembler support for them, and
> > this is maybe also why the names are a little different than what you
> > expect.
>
> Well, as long as BFD provides them you can always use `.reloc' to emit
> them with GAS. This doesn't solve the issue of link-time overflow
> detection however; they do not have a corresponding high-part relocation
> so we do expect them to catch overflows to facilitate code that has been
> written for `.sdata'/`.sbss' support, don't we? Or otherwise what is the
> purpose of their existence?

Linker relaxation only creates them if they are in range. This is a
code size optimization. So for a testcase

int i;
int main (void) { return i; }

Using riscv64-unknown-linux-gcc -O -c to compile it and running
objdump on the output I see

0000000000000000 <main>:
   0:   000007b7    lui   a5,0x0
            0: R_RISCV_HI20    i
            0: R_RISCV_RELAX   *ABS*
   4:   0007a503    lw    a0,0(a5) # 0 <main>
            4: R_RISCV_LO12_I  i
            4: R_RISCV_RELAX   *ABS*
   8:   8082        ret

Then at link time if the variable i is within range of gp, then linker
relaxation deletes the lui instruction, changes its reloc to
R_RISCV_NONE, changes the lw to use gp as the base address, and
changes the reloc to R_RISCV_GPREL_I. Adding --emit-relocs to the
link, and running objdump I see

0000000000010436 <main>:
   10436:   8341a503    lw    a0,-1996(gp) # 12034 <i>
            10436: R_RISCV_NONE     *ABS*
            10436: R_RISCV_RELAX    *ABS*
            10436: R_RISCV_GPREL_I  i-0x12800
            10436: R_RISCV_RELAX    *ABS*
   1043a:   8082        ret
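
The numbers in the relaxed listing can be cross-checked with a short sketch that decodes the `lw' word and applies the signed 12-bit range test. The helper names are made up for the example; the instruction word and the address of `i' are taken from the listing.

```python
def decode_i_type(insn):
    """Decode the fields of a 32-bit RISC-V I-type instruction word."""
    imm = insn >> 20
    if imm & 0x800:                  # sign-extend the 12-bit immediate
        imm -= 0x1000
    return {
        "opcode": insn & 0x7F,
        "rd": (insn >> 7) & 0x1F,
        "funct3": (insn >> 12) & 0x7,
        "rs1": (insn >> 15) & 0x1F,
        "imm": imm,
    }


def fits_gprel(address, gp):
    """True if `address` is reachable from gp with a signed 12-bit offset."""
    return -2048 <= address - gp <= 2047


lw = decode_i_type(0x8341A503)       # lw a0,-1996(gp) from the listing
gp = 0x12034 - lw["imm"]             # address of i minus the encoded offset
```

The recovered GP value agrees with the `i-0x12800' addend shown on the R_RISCV_GPREL_I line.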

This linker relaxation support is an important part of the RISC-V
toolchain support for reducing code size, and improving performance.
We handle a number of different cases in linker relaxation, and I
expect that we will add more.

Your point about overflows is a good one. If these relaxation relocs
overflow, then it is a linker bug. We have had a few bugs in this
area that I have had to fix. With your proposal where we have both hi
and lo part gprel relocs, overflow should not be a problem. It isn't
immediately obvious to me if that means that they need to be different
reloc numbers though. I suppose it will depend on how the relocs are
represented, but different reloc numbers may be necessary so we can
handle overflow differently for them.

> Also as a side note I think it is GCC (or any other compiler) that should
> produce the intended assembly right from the beginning, so as to get the
> code size right and avoid unnecessary longer sequences such as with
> branches that seem out of range due to size estimate pessimisation but are
> not (of course some sizes are only known at link stage making certain
> kinds of optimisations possible in the linker anyway).

It isn't possible for gcc to produce the smallest code size directly.
Gcc doesn't emit compressed instructions; the assembler does this. So
gcc doesn't know the size of the code. This is fixable in theory, but
doesn't really help. Neither gcc nor the assembler know link time
addresses, and hence some compressed instructions can only be
generated at link time via relaxation. Also, we need link time
address info to perform relaxations like converting lui/add or lui/lw
to a single add or lw instruction off of the gp reg when the address
is in range. There are also other relaxations performed in the
linker. Since code size reduction via relaxation can change function
and variable addresses, we can't know any address until linker
relaxation is done. Like it or not, linker relaxation is a very
important part of the RISC-V toolchain.

> > For relaxation purposes, there should be a reloc on the add, so it should be
> > add t0,t0,gp,%gprel_add(var+addend)
>
> I don't think we need to invent extra syntax here for this as we have the
> `.reloc' pseudo-op for such use cases, e.g. where no instruction operand
> refers to a symbol or there's no symbol involved (cf. R_MIPS_JALR). This
> could look like:

Good point about .reloc. Unfortunately, we already support the four
operand add for the tls reloc, and can't drop that without
compatibility break, but we could consider using .reloc going forward.

Though trying this, I see it gets a little complicated. Given the testcase

__thread int i;
int main (void) { return i; }

riscv64-unknown-linux-gnu-gcc -O -S generates

main:
        lui     a5,%tprel_hi(i)
        add     a5,a5,tp,%tprel_add(i)
        lw      a0,%tprel_lo(i)(a5)
        ret

Note the four operand add for the extra reloc. assembling and running
objdump I get

0000000000000000 <main>:
   0:   000007b7    lui   a5,0x0
            0: R_RISCV_TPREL_HI20    i
            0: R_RISCV_RELAX         *ABS*
   4:   004787b3    add   a5,a5,tp
            4: R_RISCV_TPREL_ADD     i
            4: R_RISCV_RELAX         *ABS*
   8:   0007a503    lw    a0,0(a5) # 0 <main>
            8: R_RISCV_TPREL_LO12_I  i
            8: R_RISCV_RELAX         *ABS*
   c:   8082        ret

and then linking with relaxation and objdump I get

0000000000010466 <main>:
   10466:   00022503    lw    a0,0(tp) # 0 <i>
            10466: R_RISCV_NONE     *ABS*
            10466: R_RISCV_RELAX    *ABS*
            10466: R_RISCV_NONE     *ABS*
            10466: R_RISCV_RELAX    *ABS*
            10466: R_RISCV_TPREL_I  i
            10466: R_RISCV_RELAX    *ABS*
   1046a:   8082        ret

Now trying this with .reloc, I was able to make it work, but I need to
add two relocs to the tp add, the TPREL_ADD reloc and a RELAX reloc.
I then ran into the problem that absent the reloc, the assembler
converts the add into a compressed add, and as a compressed add the
relaxation doesn't work. So I had to disable assembler compression
for the add. That gives me

main:
        lui     a5,%tprel_hi(i)
        .option push
        .option norvc
0:
        add     a5,a5,tp
        .reloc  0b, R_RISCV_TPREL_ADD, i
        .reloc  0b, R_RISCV_RELAX
        .option pop
        lw      a0,%tprel_lo(i)(a5)
        ret

This does work, but it isn't very convenient. The four operand add
got expanded into seven lines of code in the assembly output. Now the
relaxation problem with a compressed add could perhaps be considered a
relaxation bug, and might be fixable. If that is fixable, then the
four operand add only gets expanded to four lines of assembler code,
which isn't as bad as 7, but could still be inconvenient. The current
syntax is much friendlier to people trying to write assembly code.

> Relaxation optimisations like this were considered and comprehensively
> implemented with the nanoMIPS target in GOLD, publicly available. I think
> we ought to follow suit.

FYI there is no RISC-V GOLD support, if someone wants to volunteer to
do that work.

> Therefore I think this will be best considered separately, as this is not
> strictly necessary for FDPIC support on one hand, and may be used for
> other purposes on the other. For this reason I have decided not to
> include any relaxation support with the FDPIC psABI addendum.

I think you will find code size and performance to be disappointing if
linker relaxation is not considered from the start. But yes, it
should be possible to handle relaxation as a separate task.

> I now actually wonder if we shouldn't have used composed relocations
> (e.g. R_RISCV_GOT for GOT references with a corresponding %got operator,
> R_RISCV_PCREL for PC-relative calculations with %pcrel, etc.) to avoid
> proliferating relocation variants providing repeating patterns.

We can't change existing relocs without an ABI break. But as a
general comment, yes, the current scheme is not designed but rather
implemented as necessary.

> Likewise, this can be done with `.reloc' like I noted above, and the
> relaxation defined separately. I think relaxation support that requires
> psABI support (e.g. extra relocations) should be defined in a separate
> section of the standard. Perhaps individual sections included in the base
> psABI and this addendum.
>
> If you think it is important to have relaxation defined right from the
> beginning (why?), then I might consider doing it right away.

Linker relaxation is fundamental to the design of the RISC-V
toolchain, or perhaps I should say the RISC-V GNU toolchain. You
won't get good code size or performance without it. I'm not sure if
separating this stuff out to a separate section makes sense. It may
also be difficult to do that, since some of the relaxations don't
require relocs, and some of the relaxations use the same relocs used
elsewhere, and only some of the relaxations require unique relocs used
only for relaxation.

Jim

Jim Wilson

Mar 27, 2020, 12:05:28 AM3/27/20
to Maciej W. Rozycki, Fangrui Song, RISC-V SW Dev
On Mon, Mar 23, 2020 at 6:49 PM Maciej W. Rozycki <ma...@wdc.com> wrote:
> Hi Fangrui,
> > I am not subscribed, so I suspect my reply will be eaten by Google
> > Groups... I also guessed your email addresses.
>
> It went through as I received it at my LMO personal e-mail address too.
> Perhaps the mailing list isn't open for posting only by subscribers after
> all (I sought advice on that from the list owner, but haven't ever heard
> back).

Maybe the next draft can be done via the
github.com/riscv/riscv-elf-psabi-doc tree as an issue or pull request?
I think most all interested parties are watching that github repo.

> It's an implementation detail (the section is informative), assemblers
> are free to define their own syntax, which is beyond the scope of an ABI.

Well, compatibility between assemblers is useful, and I would hope
that GCC and LLVM at least have compatible assembly syntax.

> In the GNU assembler percent-operators indicate relocation, however we
> currently have an issue in that several operations have not been defined
> and the compiler has no direct way to synthesize them other than with the
> `.reloc' pseudo-op.
>
> In particular there is no way (or I haven't found one) for the compiler
> to emit an instruction sequence to make a function call. Instead the
> `call' assembly macro has to be used, that expands to a pair of
> instructions.

Yes, this is lacking.

We do have an assembler manual, but it is woefully incomplete.
https://github.com/riscv/riscv-asm-manual
and some things like call can't be easily expressed except as a macro
as you mentioned.

I would offer one word of warning, which is that gcc -mcmodel=medany
-mexplicit-relocs is known to fail sometimes. It is a complex problem
that might require an ABI change to fix. It has something like a 1 in
a 1K chance of failing for each risky use. So we should not
accidentally encourage use of explicit relocs in cases when it is
known to fail.
https://groups.google.com/a/groups.riscv.org/forum/#!msg/sw-dev/KnziiZtEJNo/M8Vfbw9UCgAJ

Jim

Maciej W. Rozycki

Apr 4, 2020, 1:07:09 PM4/4/20
to Sam Elliott, sw-...@groups.riscv.org
On Thu, 26 Mar 2020, Sam Elliott wrote:

> >> I am not convinced the code sequence for taking the address of a local
> >> function is correct, for either FDPIC or statically linked ROPI/RWPI
> >> executables, because I don't think you can do the static relocation
> >> required for R_RISCV_GPREL_HI20(fun) if you don't know the distance
> >> between the text and data sections (something you only know at runtime).
> >
> > This is a GP-relative rather than a PC-relative reference so the local
> > sequence is right (the external one has an editorial mistake, as noticed
> > by Stef already), as the pointer taken will be to the relevant FDT entry,
> > which is data (poked at by the dynamic loader) and therefore in the data
> > segment. The whole point of a separate GP is to have local data offsets
> > constant with respect to it.
>
> Ah, I see, so the R_RISCV_GPREL_HI20(fun) is actually closer to
> something like R_RISCV_GPREL_HI20(__riscv_fdt_fun). And we know the FDTs
> are in the data section.

Yes, the static linker can handle the redirection, as it does for various
special cases across some targets. There's no need to create an actual
static `__riscv_fdt_fun' symbol.

Some thought may have to be put though into recording such an arrangement
in debug information. I know cases where it's not done at all, causing
troubles in debugging (GDB may have heuristics or rough static code
analysis implemented to handle some cases).

> I have two notational points, which I feel add clarity and ensure this
> specification lines up with the existing assembler conventions:
>
> 1. Can I propose we use the fun@FDT notation, given the psABI already
> using the @PLT notation for the PLT? Thus above, the relocation would
> be R_RISCV_GPREL_HI20(fun@FDT) - I don't think the symbols need to
> change, but I think this makes it more obvious that you're really
> pointing at the FDT entry here.

Hmm, now that you mention it I don't think this would be right as in my
view `fun@FDT' is just an alternative notation for the same relocation
operation (IOW I shouldn't have used `fun@PLT' either, because %call_plt()
already denotes that operation on the `fun' symbol). I have therefore
removed the `@PLT' symbol suffixes from code examples instead.

> 2. It came up in a different reply of yours, but you stated you would
> prefer not to add %-based assembly operators in this proposal
> (specifically gprel_add). I think this proposal is exactly the time
> to propose these operators, in line with existing conventions,
> especially as one of the replies has advocated for trying to avoid
> the explosion of relocations that are needed to cover GOT, non-GOT,
> PLT relocations etc.

I think assembly source syntax belongs to an assembly language manual or
specification (if we want to have a normative reference on this). While I
am not opposed to having one, I don't think a psABI document is the right
place for this (as it covers the binary format and not any programming
language), and neither is an architecture specification (as it covers the
hardware and again not any programming language syntax, including the
assembly language).

> > This scenario would actually correspond to the STV_INTERNAL export class
> > (visibility) in the usual dynamic load scenario, including the FDPIC ABI
> > in particular. So I think it may be worth it to permit function symbols
> > marked STV_INTERNAL to be referred directly not only for calls, but for
> > taking their address as well in the FDPIC psABI. In that case no FDT
> > entry will be required and the address can be taken with a PC-relative
> > reference.
> >
> > There is no way that I know of however to verify that such a pointer is
> > not passed externally (except perhaps by static analysis, which is beyond
> > the scope of a compiler toolchain), so the onus would be on the software
> > writer to make sure the restriction has not been violated.
>
> Yeah this does sound like an issue, but there are other not dissimilar
> issues on ROPI/RWPI anyway to do with where constant vs non-constant
> data is placed (and issues around "constant" pointers to non-constant
> data). I would err towards not creating this semantic issue in FDPIC,
> but on the other hand if it helps us avoid another psABI, I could see
> the advantage of using STV_INTERNAL in this way.

Well, constant data is typically placed in sections like `.rodata' that
have their SHF_WRITE flag clear and at the static link time are merged
with sections containing code into the text segment.

I can see a problem here with referring to such read-only data as the
referrer may not necessarily know if data is constant or not and therefore
whether to use PC-relative or GP-relative addressing. In that case either
link-time relaxation or copy relocations will be required.

If instead constant data is merged with sections containing writable data
into the data segment, then there is no such issue, but memory is wasted.

I'll have to think about it some more, good point!

> > I think this would actually be a useful enhancement to the FDPIC psABI
> > addendum. By having the semantics of STV_INTERNAL symbols defined like
> > this in the specification we'll have both the FDPIC and the ROPI/RWPI use
> > cases covered with a single ABI (the latter as a special case of the more
> > general FDPIC case). With such semantics to build a static PIE program
> > for the ROPI/RWPI rather than full-FDPIC case all you'll have to do with
> > GCC will be using the `-mfdpic -fvisibility=internal' command-line options
> > (and of course nothing prevents us from making that the default for the
> > compiler at its build time, based either on target selection or `--with-*'
> > configuration options if that made people's life easier). Other compilers
> > may follow suit.
> >
> > NB regular FDPIC static PIE programs will still require FDT entries to be
> > created for function pointers passed to modules loaded with dlopen(3).
>
> One of the reasons for us choosing ROPI/RWPI is that it should have
> lower loading overhead than the full FDPIC implementation, so should be
> compatible with small embedded systems. Given we want as simple a loader
> as possible, it's likely the platform will also not provide dlopen(3) in
> any capacity either.

NB as we need to separate the PC from the GP anyway in GCC's code
generator to have FDPIC implemented I expect to have ROPI/RWPI supported
for RISC-V as a side effect.

> Thanks for helping clarify my understanding of how ROPI/RWPI and FDPIC
> relate to each other

You are welcome!

Maciej

Maciej W. Rozycki

Apr 7, 2020, 6:14:16 PM4/7/20
to Jim Wilson, RISC-V SW Dev
Hi Jim,
Right, but in this context the relocation is informational only really;
the relocation is never present (unless explicitly requested with
`.reloc') in object modules and in a fully-linked binary any static
relocations are never going to be fed back to a linker, so
any overflow semantics does not matter.

So we could perhaps reuse relocation codes after all.

> This linker relaxation support is an important part of the RISC-V
> toolchain support for reducing code size, and improving performance.
> We handle a number of different cases in linker relaxation, and I
> expect that we will add more.

Sure.

> Your point about overflows is a good one. If these relaxation relocs
> overflow, then it is a linker bug. We have had a few bugs in this
> area that I have had to fix. With your proposal where we have both hi
> and lo part gprel relocs, overflow should not be a problem. It isn't
> immediately obvious to me if that means that they need to be different
> reloc numbers though. I suppose it will depend on how the relocs are
> represented, but different reloc numbers may be necessary so we can
> handle overflow differently for them.

As a reference we have this regular GOT vs large GOT (`-mxgot') model in
the MIPS target, where the latter uses R_MIPS_GOT_HI16/R_MIPS_GOT_LO16
relocation pairs to refer to the high 16-bit and the low 16-bit parts of
the GOT offset respectively rather than reusing R_MIPS_GOT16 for the low
part relocation, otherwise used for the former model. If you look at the
MIPS psABI (which I'm sure you're familiar with anyway), you'll notice
that R_MIPS_GOT_LO16 and R_MIPS_GOT16 both have the same calculation and
the same relocatable field and the only difference between them is the
overflow check.
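
To illustrate the point, here is a minimal Python sketch of two low-part
relocations that compute the same value into the same 16-bit field and
differ only in whether an out-of-range value is an error. It is not taken
from any linker; the function names `got16' and `got_lo16' are made up for
this example and only model the behaviour described above.

```python
def got16(got_offset):
    """Hypothetical model of a low-part reloc WITH an overflow check
    (regular GOT model: the whole offset must fit in 16 signed bits)."""
    if not -0x8000 <= got_offset <= 0x7FFF:
        raise OverflowError("GOT offset does not fit in 16 bits")
    return got_offset & 0xFFFF


def got_lo16(got_offset):
    """Hypothetical model of the same reloc WITHOUT an overflow check
    (large GOT model: a separate high-part reloc carries the rest)."""
    return got_offset & 0xFFFF


# Same field contents for a small offset, different behaviour for a big one.
print(hex(got16(0x1234)), hex(got_lo16(0x1234)))
print(hex(got_lo16(0x12345)))
```

Calling got16(0x12345) in this model raises OverflowError, which is the
only observable difference between the two.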

We have the small issue that we have two kinds of low-part relocations to
match different machine instruction encodings, so the number of individual
relocations required doubles, and relocation numbers, being limited to 256
different values in the 32-bit ABI, are not exactly an abundant resource.
These are however repeating patterns, so the limitation can be easily
solved by using composed relocations, as I mentioned.

> > Also as a side note I think it is GCC (or any other compiler) that should
> > produce the intended assembly right from the beginning, so as to get the
> > code size right and avoid unnecessary longer sequences such as with
> > branches that seem out of range due to size estimate pessimisation but are
> > not (of course some sizes are only known at link stage making certain
> > kinds of optimisations possible in the linker anyway).
>
> It isn't possible for gcc to produce the smallest code size directly.
> Gcc doesn't emit compressed instructions; the assembler does this. So
> gcc doesn't know the size of the code. This is fixable in theory, but
> doesn't really help. Neither gcc nor the assembler know link time
> addresses, and hence some compressed instructions can only be
> generated at link time via relaxation. Also, we need link time
> address info to perform relaxations like converting lui/add or lui/lw
> to a single add or lw instruction off of the gp reg when the address
> is in range. There are also other relaxations performed in the
> linker. Since code size reduction via relaxation can change function
> and variable addresses, we can't know any address until linker
> relaxation is done. Like it or not, linker relaxation is a very
> important part of the RISC-V toolchain.

FWIW I have long been in favour of linker relaxation and took my part in
shaping how it has been done in the nanoMIPS effort on the design side.

My FDPIC psABI addendum has been created with a possibility to relax some
code sequences in mind, and in particular removing the high-part
relocations where unnecessary. For instance with a very small GOT the
LUI/ADD instruction pair used to add the upper part of the GOT offset can
be removed. This is why the GP is initialised to (DVMA + 2048); otherwise
the offset would not matter and GP could well be equal to DVMA.
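
The effect of the 2048 bias can be checked with a few lines of Python. This
is only a sketch: it assumes DVMA is the lowest address of the data segment,
so that all data sit at non-negative offsets from it, and the helper name
`reachable_from_dvma' is invented for the example.

```python
IMM_MIN, IMM_MAX = -2048, 2047  # signed 12-bit immediate of ADDI/LW/SW


def reachable_from_dvma(gp_bias):
    """Return the (lowest, highest) byte offsets from DVMA that a single
    GP-relative instruction can reach when GP = DVMA + gp_bias."""
    return gp_bias + IMM_MIN, gp_bias + IMM_MAX


print(reachable_from_dvma(2048))  # GP = DVMA + 2048 -> (0, 4095)
print(reachable_from_dvma(0))     # GP = DVMA        -> (-2048, 2047)
```

With the bias the whole signed 12-bit range lands on the first 4KiB of the
segment, so a GOT that fits there needs no high-part instructions; without
it half of the range would point below DVMA.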

As a side note I think the compiler should actually know instruction
sizes and produce compressed instructions where feasible, as it may make
decisions based on that where functionally equivalent code sequences can
be produced that differ only by their size and not instruction count. I
think we discussed that before in a different context.

To the best of my knowledge the RISC-V assembly language dialect is one
of the only two -- the other being the MIPS one -- where there is no 1:1
correspondence between assembly-language and machine instructions. And
over the years I have heard repeated complaints from people about this
peculiarity with the MIPS assembly language, actually leading to efforts
not to introduce new assembly macros corresponding to instructions added
with later ISA revisions even if existing patterns made one expect such
macros to exist.

Which makes me very wary about repeating the assembly-language design
decisions with the RISC-V dialect let alone making compilers rely on them.
I guess in compiler-generated code the number of assembly lines does not
really matter. I agree the notation with a fourth operand does help with
handcoded assembly though and I guess I'm fine with that as a means to
produce relocations in the syntax of RISC-V assembly.

Has it been generalised though across all the percent-ops and
instructions, or is it just a hack for this single special case?

As a side note I think a notation for individual fixed-width instructions
would be good to have regardless, e.g.:

        r.add   a5, a5, tp

where the assembler would always produce the regular encoding, as there
are scenarios, for instance patchable code, where you want to have full
control over instruction lengths, and you never want to be forced to use
pseudo-ops such as `.half' to handcode machine code. In the nanoMIPS
psABI there's a reloc dedicated to prevent the linker from shortening such
instructions in relaxation (although same-length instructions may still be
substituted), which is inserted by the assembler automatically based on
the size suffix (the MIPS dialect chose to use suffixes rather than
prefixes, but I think our approach with a prefix is marginally cleaner).

I think at this point we could easily reserve the `r.' mnemonic prefix in
the assembly dialect to denote forced regular encoding.

> > Relaxation optimisations like this were considered and comprehensively
> > implemented with the nanoMIPS target in GOLD, publicly available. I think
> > we ought to follow suit.
>
> FYI there is no RISC-V GOLD support, if someone wants to volunteer to
> do that work.

Mentioned for the avoidance of doubt as to whether this information has
been published and under what licence (i.e. there is no trade secret I
would accidentally leak).

> > Therefore I think this will be best considered separately, as this is not
> > strictly necessary for FDPIC support on one hand, and may be used for
> > other purposes on the other. For this reason I have decided not to
> > include any relaxation support with the FDPIC psABI addendum.
>
> I think you will find code size and performance to be disappointing if
> linker relaxation is not considered from the start. But yes, it
> should be possible to handle relaxation as a separate task.

Thank you actually for pointing me at the R_RISCV_TPREL_ADD example. It
looks to me that indeed we ought to define a corresponding relocation for
GP in this FDPIC addendum.

> > I now actually wonder if we shouldn't have used composed relocations
> > (e.g. R_RISCV_GOT for GOT references with a corresponding %got operator,
> > R_RISCV_PCREL for PC-relative calculations with %pcrel, etc.) to avoid
> > proliferating relocation variants providing repeating patterns.
>
> We can't change existing relocs without an ABI break. But as a
> general comment, yes, the current scheme is not designed but rather
> implemented as necessary.

We are still at a relatively early stage of architecture/ABI development
and might be able to bypass some earlier choices by making smart decisions
as to how to build on the existing standard.

For instance we could retain the semantics of the existing relocations
when used standalone, but use them to indicate the relocatable field only
when last in a composed sequence of relocations. In particular these
provisions of the ELF gABI explicitly allow us to do so:

"* In all but the last relocation operation of a composed sequence, the
result of the relocation expression is retained, rather than having
part extracted and placed in the relocated field. The result is
retained at full pointer precision of the applicable ABI processor
supplement.

"* In all but the first relocation operation of a composed sequence, the
addend used is the retained result of the previous relocation
operation, rather than that implied by the relocation type."

And the specific semantics of individual relocations is left to the
relevant psABI.

So we could use say a R_RISCV_GPREL/R_RISCV_TPREL_ADD composition to
indicate GP-relative offset relaxation, and overall produce code like:

                                                # Outstanding static relocations
        lui     t0, %gprel_hi(fun)              # R_RISCV_GPREL fun
                                                # R_RISCV_TPREL_HI20
        add     t0, t0, gp, %gprel_add(fun)     # R_RISCV_GPREL fun
                                                # R_RISCV_TPREL_ADD
        addi    t1, t0, %gprel_lo(fun)          # R_RISCV_GPREL fun
                                                # R_RISCV_TPREL_LO12_I

(I'm not sure how feasible it would be in the relevant tools to implement
printing R_RISCV_HI20, R_RISCV_ADD and R_RISCV_LO12_I aliases to the TPREL
relocs here; I suppose this should be pretty straightforward and a simple
carry-over flag would do to mark the scenario, and is likely present in
linkers already to handle composed relocations in the first place).
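
As a rough illustration of how the two gABI rules quoted above would play
out, here is a small Python model of a composed sequence. Everything in it
is an assumption made for the sketch: the R_RISCV_GPREL calculation is taken
to be S + A - GP, the trailing relocation is treated purely as a field
selector as proposed above, and the GP and symbol values are arbitrary.

```python
GP = 0x20800  # hypothetical GP value, chosen only for this example


def gprel(symbol, addend):
    """Assumed R_RISCV_GPREL calculation: S + A - GP, at full precision."""
    return symbol + addend - GP


def hi20(value):
    """High 20 bits, rounded so the sign-extended low 12 bits add back."""
    return ((value + 0x800) >> 12) & 0xFFFFF


def lo12(value):
    """Low 12 bits as stored in an I-type immediate field."""
    return value & 0xFFF


def apply_composed(calculations, extract_field, symbol, addend):
    """Apply a composed sequence: each non-final result is retained and
    becomes the addend of the next operation; only the final operation
    extracts bits for the relocated field."""
    retained = addend
    for calculate in calculations:
        retained = calculate(symbol, retained)
    return extract_field(retained)


fun = 0x23456  # hypothetical address of `fun'
print(hex(apply_composed([gprel], hi20, fun, 0)))  # LUI field:  0x3
print(hex(apply_composed([gprel], lo12, fun, 0)))  # ADDI field: 0xc56
```

In this model the same R_RISCV_GPREL calculation feeds either field, which
is the repeating pattern composed relocations would avoid duplicating.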

> > Likewise, this can be done with `.reloc' like I noted above, and the
> > relaxation defined separately. I think relaxation support that requires
> > psABI support (e.g. extra relocations) should be defined in a separate
> > section of the standard. Perhaps individual sections included in the base
> > psABI and this addendum.
> >
> > If you think it is important to have relaxation defined right from the
> > beginning (why?), then I might consider doing it right away.
>
> Linker relaxation is fundamental to the design of the RISC-V
> toolchain, or perhaps I should say the RISC-V GNU toolchain. You
> won't get good code size or performance without it. I'm not sure if
> separating this stuff out to a separate section make sense. It may
> also be difficult to do that, since some of the relaxations don't
> require relocs, and some of the relaxations use the same relocs used
> elsewhere, and only some of the relaxations require unique relocs used
> only for relaxation.

I think we only need to define relaxation as a part of the RISC-V psABI,
be it this FDPIC addendum or any other piece, as far as it actually
affects the ABI and leave anything else up to linker implementers.

For instance if a specific new relocation is required, such as with the:

        add     t0, t0, gp, %gprel_add(fun)

instruction above, then we ought to standardise it (please note however,
had we used composed relocations from the beginning, nothing specific to
the RISC-V psABI FDPIC addendum would be required as R_RISCV_GPREL is a
general relocation and R_RISCV_ADD would have been previously defined in
the RISC-V psABI proper).

Conversely if relocations defined elsewhere are needed for some kind of
relaxation or none are required, then naturally provisions for such
relaxations have no place in this document as they are either defined by
the other document or are implementation specific.

Overall thank you for your feedback. Please let me know if you find
anything I wrote unclear or you have any other comments or questions.

Maciej

Maciej W. Rozycki

Apr 9, 2020, 8:07:52 PM
to Jim Wilson, Fangrui Song, RISC-V SW Dev
On Thu, 26 Mar 2020, Jim Wilson wrote:

> > > I am not subscribed, so I suspect my reply will be eaten by Google
> > > Groups... I also guessed your email addresses.
> >
> > It went through as I received it at my LMO personal e-mail address too.
> > Perhaps the mailing list isn't open for posting only by subscribers after
> > all (I sought advice on that from the list owner, but haven't ever heard
> > back).
>
> Maybe the next draft can be done via the
> github.com/riscv/riscv-elf-psabi-doc tree as an issue or pull request?
> I think most all interested parties are watching that github repo.

As we discussed before off the list, I'm sceptical about the use of
GitHub for our project as they require anyone wishing to have write access
to accept their T&Cs, which they may vary according to their requirements
at any time. That may be OK to a newcomer wanting to gain some reach with
their software experiment, however for a major project like the RISC-V ISA
that does not sound good to me.

OTOH using a mailing list is safe in that even if the list server and
associated archives go down (NB we can have many, e.g. `marc.info' might
agree to add us to their archive if we ask nicely), past messages will
have been archived by at least some recipients and can be recovered.

Some essential FOSS projects like the GNU toolchain and especially the
Linux kernel have relied on mailing lists for technical reviews since
forever and while they keep an eye on alternatives they have concluded
that no better medium has appeared so far.

So I'd rather stick to e-mail for this effort, and below I have included
the current version of the document. I will try to address Stef O'Rear's
concerns next.

> > It's an implementation detail (the section is informative), assemblers
> > are free to define their own syntax, which is beyond the scope of an ABI.
>
> Well, compatibility between assemblers is useful, and I would hope
> that GCC and LLVM at least have compatible assembly syntax.

I think we cannot force everyone to use the same syntax, however if we
want to encourage doing that, then we need to give people a chance and
provide a normative reference.

> > In the GNU assembler percent-operators indicate relocation, however we
> > currently have an issue in that several operations have not been defined
> > and the compiler has no direct way to synthesize them other than with the
> > `.reloc' pseudo-op.
> >
> > In particular there is no way (or I haven't found one) for the compiler
> > to emit an instruction sequence to make a function call. Instead the
> > `call' assembly macro has to be used, that expands to a pair of
> > instructions.
>
> Yes. this is lacking.
>
> We do have an assembler manual, but it is woefully incomplete.
> https://github.com/riscv/riscv-asm-manual
> and some things like call can't be easily expressed except as a macro
> as you mentioned.

FWIW using macros looks to me like repeating old MIPS assembly language's
mistakes. While having assembly idioms for individual instructions such
as NOP or MV does appear both useful and harmless, and does not conflict
with the spirit of an assembly dialect being a human-writable way of
directly expressing machine code, providing no way but with complex macros
to produce some instruction sequences does not seem the right way to me.

> I would offer one word of warning, which is that gcc -mcmodel=medany
> -mexplicit-relocs is known to fail sometimes. it is a complex problem
> that might require an ABI change to fix. It has something like a 1 in
> a 1K chance of failing for each risky use. So we should not
> accidentally encourage use of explicit relocs in cases when it is
> known to fail.
> https://groups.google.com/a/groups.riscv.org/forum/#!msg/sw-dev/KnziiZtEJNo/M8Vfbw9UCgAJ

Ah, it is a known issue with PC-relative addressing overall, caused by
the misalignment (with respect to the data type referred, `long long' in
this case) of the PC used in a calculation made by AUIPC causing a
carry/borrow to/from the high part to occur in the PC-relative offset when
accessing subsequent words of a multi-word data type (or multi-dword data
in the RV64 case) that crosses the boundary of the 12-bit range spanned by
the low part. And I don't actually think that the use, or the lack, of
explicit relocations is going to change anything here: if a carry/borrow
happens at the static link time, the issue will strike.
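
The carry can be reproduced with a short Python sketch. It uses the usual
hi20/lo12 split, where the offset equals (hi20 << 12) plus a sign-extended
12-bit low part; the addresses are arbitrary values picked so that the
failing case shows up, and `pcrel_split' is a name made up for the example.

```python
def pcrel_split(target, pc):
    """Split target - pc into (hi20, lo12) such that
    target - pc == (hi20 << 12) + lo12 with -2048 <= lo12 <= 2047."""
    offset = target - pc
    hi20 = (offset + 0x800) >> 12
    lo12 = offset - (hi20 << 12)
    return hi20, lo12


ll = 0x10800                   # hypothetical 8-byte aligned address of `ll'
for pc in (0x10004, 0x10008):  # AUIPC at a 4-byte and an 8-byte aligned PC
    word0 = pcrel_split(ll, pc)
    word1 = pcrel_split(ll + 4, pc)
    verdict = "same high part" if word0[0] == word1[0] else "high part differs"
    print(hex(pc), word0, word1, verdict)
```

With the AUIPC at 0x10004 the first word splits as (0, 2044) and the second
as (1, -2048), so a single AUIPC cannot serve both loads; at 0x10008 both
words share the high part 0.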

This could be solved in hardware by masking off a number of low-order
bits of the PC in the AUIPC calculation. It might be hard to determine
what number would be right though: if too low it would only support
narrower data types, if too high it would waste some text memory by the
heightened alignment requirement (although this would be per-segment rather
than per-object or per-function, so perhaps not a big deal). Anyway, we
don't have it, so we need to address it via software means.

There are a couple of easy ways to tackle it, some without and some with
a need to update the psABI. Taking the code from your example we have:

sub:
.LA0:   auipc   a5, %pcrel_hi(ll)
        lw      a0, %pcrel_lo(.LA0)(a5)
        lw      a1, %pcrel_lo(.LA0 + 4)(a5)
        ret

as it stands.

One way is to make sure the PC is correctly aligned WRT data referred, so
taking RV32 as the target (from now on) for a 64-bit access, like a DImode
integer or a DFmode real type we can instead emit:

sub:
        .balign 8
.LA0:   auipc   a5, %pcrel_hi(ll)
        lw      a0, %pcrel_lo(.LA0)(a5)
        lw      a1, %pcrel_lo(.LA0 + 4)(a5)
        ret

although at the cost of 2 bytes of code wasted on average for the sequence
itself (plus any alignment increase for functions causing extra padding).
Likewise with a 128-bit access, like a TImode integer, a DFmode complex
real or some kind of a vector type:

sub:
        .balign 16
.LA0:   auipc   a5, %pcrel_hi(ll)
        lw      a0, %pcrel_lo(.LA0)(a5)
        lw      a1, %pcrel_lo(.LA0 + 4)(a5)
        lw      a2, %pcrel_lo(.LA0 + 8)(a5)
        lw      a3, %pcrel_lo(.LA0 + 12)(a5)
        ret

at the cost of 6 bytes of code wasted on average (plus function
alignment). Of course in both cases there will be extra execution time
required for any alignment NOPs inserted. This however does not require
any psABI update and will work as it stands.
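
The averages quoted above can be reproduced with the following Python
sketch. It assumes the location counter before `.balign' is a multiple of
the instruction granule (4 bytes without the C extension) and that every
such position modulo the alignment is equally likely; both are assumptions
made for the calculation, not measurements.

```python
def average_padding(alignment, granule=4):
    """Mean number of padding bytes inserted by `.balign alignment' when
    the location counter is a multiple of `granule' and all residues
    modulo `alignment' are equally likely."""
    paddings = [(-pos) % alignment for pos in range(0, alignment, granule)]
    return sum(paddings) / len(paddings)


print(average_padding(8))   # 2.0 bytes for a 64-bit access
print(average_padding(16))  # 6.0 bytes for a 128-bit access
```

Under the same assumptions a 2-byte granule, as with the C extension, gives
3 and 7 bytes respectively.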

Another way is to preload the address of the data accessed and offset it
separately:

sub:
.LA0:   auipc   a5, %pcrel_hi(ll)
        addi    a5, a5, %pcrel_lo(.LA0)
        lw      a0, 0(a5)
        lw      a1, 4(a5)
        ret

This takes the same amount of space as the original on RV32C, however
takes one cycle more on scalar implementations and takes a fixed amount
of 4 bytes extra in the absence of the C extension. Similarly:

sub:
.LA0:   auipc   a5, %pcrel_hi(ll)
        addi    a5, a5, %pcrel_lo(.LA0)
        lw      a0, 0(a5)
        lw      a1, 4(a5)
        lw      a2, 8(a5)
        lw      a3, 12(a5)
        ret

takes 4 bytes less on RV32C, one cycle more on scalar implementations and
a fixed amount of 4 bytes extra in the absence of the C extension.
Neither requires any psABI update either.
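
For reference, the byte counts behind these comparisons can be tallied with
a small Python sketch. It assumes that AUIPC, ADDI and any LW carrying a
%pcrel_lo relocation stay 4 bytes long, that an LW with a small constant
offset shrinks to a 2-byte C.LW when the C extension is available, and it
leaves the final RET out as it is the same in both variants.

```python
def sizes(words, compressed):
    """Return (original, preloaded) byte sizes of the load sequences
    above for a `words'-word access, excluding the final RET."""
    reloc_lw = 4                       # LW with a %pcrel_lo relocation
    const_lw = 2 if compressed else 4  # LW with a small constant offset
    original = 4 + words * reloc_lw       # AUIPC + LWs
    preloaded = 4 + 4 + words * const_lw  # AUIPC + ADDI + LWs
    return original, preloaded


for words in (2, 4):
    for compressed in (True, False):
        print(words, compressed, sizes(words, compressed))
```

This gives (12, 12) and (20, 16) with the C extension and (12, 16) and
(20, 24) without it, matching the differences stated above.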

Finally we can emit full individual load sequences:

sub:
.LA0:   auipc   a5, %pcrel_hi(ll)
        lw      a0, %pcrel_lo(.LA0)(a5)
.LA1:   auipc   a5, %pcrel_hi(ll + 4), %pcrel_auipc(ll)
        lw      a1, %pcrel_lo(.LA1)(a5)
        ret

and:

sub:
.LA0:   auipc   a5, %pcrel_hi(ll)
        lw      a0, %pcrel_lo(.LA0)(a5)
.LA1:   auipc   a5, %pcrel_hi(ll + 4), %pcrel_auipc(ll)
        lw      a1, %pcrel_lo(.LA1)(a5)
.LA2:   auipc   a5, %pcrel_hi(ll + 8), %pcrel_auipc(ll + 4)
        lw      a2, %pcrel_lo(.LA2)(a5)
.LA3:   auipc   a5, %pcrel_hi(ll + 12), %pcrel_auipc(ll + 8)
        lw      a3, %pcrel_lo(.LA3)(a5)
        ret

where %pcrel_auipc emits an R_RISCV_PCREL_AUIPC relocation used in linker
relaxation to remove an AUIPC instruction with an R_RISCV_PCREL_HI20
relocation attached iff the calculation of both expressions associated
with these relocations works out at the same value as far as the high
20-bit part is concerned.

These sequences do require a psABI update and preclude the use of
compressed instructions (which may actually be a good idea at `-Os' even
if we have this implemented), but at the static link time they only leave
an extra AUIPC instruction (at most once per sequence) if it is indeed
required and cause no wasted extra execution cycles.

Please note however that this issue only affects PC-relative addressing,
because the PC changes as execution proceeds, whereas GP remains constant
and aligned according to the alignment of the data segment, which has to
be no smaller than the largest alignment of all the data types used
within.

BTW, why has such peculiar (and possibly limiting) semantics of the
low-part relocation been chosen rather than the obvious:

sub:
0: