[RFC] RISC-V ELF FDPIC psABI addendum

Maciej W. Rozycki

Mar 1, 2020, 3:57:10 AM
to sw-...@groups.riscv.org
Hi,

I am currently working on FDPIC support for RISC-V/Linux targets in the
GNU toolchain (GCC and GNU binutils) and a couple of runtimes (uClibc,
musl, possibly glibc). While at this time I only intend to implement the
pieces I named above, the psABI extension I am going to base this stuff on
is meant to become a part of the RISC-V ELF psABI, once proved with the
implementation, available for everyone to suit their requirements.

Therefore below I am sending a preliminary document that specifies the
technical details of the extension, in the hope that someone finds it
useful or would like to comment on it at this early stage of development.

This design was originally presented at LCA 2020 and a recording is
available here: <https://www.youtube.com/watch?v=GydyykyNjxs>.

I will appreciate your questions, comments and any other kind of
feedback.

Maciej

--------------------------------------------------------------------------
RISC-V FDPIC ELF psABI Addendum


Chapter 4 Object Files

4.1 Machine Information (normative)

A bit in the `e_flags' member of the ELF header shall identify, when set,
a file that conforms to this ABI:

#define EF_RISCV_FDPIC 0x0010
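
As an illustration only, a tool could test for this bit roughly as
follows. This is a minimal Python sketch and not part of the proposal:
the helper names `read_e_flags' and `is_fdpic' are made up for the
example, the flag value is the one proposed above and may still change,
and the input is assumed to be a well-formed ELF32 or ELF64 file header.

```python
import struct

EF_RISCV_FDPIC = 0x0010  # value proposed in this draft; an assumption, not a ratified constant


def read_e_flags(header: bytes) -> int:
    """Extract e_flags from the leading bytes of an ELF file header."""
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    is_64bit = header[4] == 2                 # EI_CLASS: 1 = ELFCLASS32, 2 = ELFCLASS64
    endian = "<" if header[5] == 1 else ">"   # EI_DATA: 1 = little-endian, 2 = big-endian
    offset = 48 if is_64bit else 36           # byte offset of e_flags in the header
    return struct.unpack_from(endian + "I", header, offset)[0]


def is_fdpic(e_flags: int) -> bool:
    """True if the file claims conformance to the FDPIC ABI."""
    return bool(e_flags & EF_RISCV_FDPIC)
```

The check is a plain bit test, so it is independent of the other bits
already defined for `e_flags'.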


4.2 Relocation Types (normative)

The following relocation types have been defined to support this ABI.

Figure 4.1 Relocatable Fields, Relocated Bits Marked With X's

 15 12          0     15             0
+----+-----------+   +----------------+
|XXXX|           |   |XXXXXXXXXXXXXXXX|
+----+-----------+   +----------------+
 hi20[15:12]          hi20[31:16]

 15             0     15         4   0
+----------------+   +------------+---+
|                |   |XXXXXXXXXXXX|   |
+----------------+   +------------+---+
                      lo12i[11:0]

 15  11  7      0     15    9        0
+---+-----+------+   +-------+--------+
|   |XXXXX|      |   |XXXXXXX|        |
+---+-----+------+   +-------+--------+
     lo12s[4:0]       lo12s[11:5]

 31                             0
+--------------------------------+
|XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX|
+--------------------------------+
 word32[31:0]

 15 12          0     15             0     15             0     15         4   0
+----+-----------+   +----------------+   +----------------+   +------------+---+
|XXXX|           |   |XXXXXXXXXXXXXXXX|   |                |   |XXXXXXXXXXXX|   |
+----+-----------+   +----------------+   +----------------+   +------------+---+
 hi20lo12i[15:12]     hi20lo12i[31:16]                          hi20lo12i[11:0]

 63                                                             0
+----------------------------------------------------------------+
|XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX|
+----------------------------------------------------------------+
 word64[63:0]

Note: For the value inserted into these fields T specifies truncation and
V specifies a signed overflow check on a relocation by relocation basis.
In the T case any high-order bits that extend beyond the width of the
field and are not equal to the highest-order bit that still fits are
silently ignored. In the V case the presence of such high-order bits
causes the static linker to produce a link error.
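
The difference between the two treatments can be sketched as follows.
This is an illustrative Python model only; the function names are
hypothetical and the field is treated as a plain signed bit-field of the
given width.

```python
def fits_signed(value: int, width: int) -> bool:
    """True if value is representable as a signed width-bit integer."""
    return -(1 << (width - 1)) <= value < (1 << (width - 1))


def field_bits(value: int, width: int, mode: str) -> int:
    """Return the bit pattern stored in a T (truncate) or V (verify) field."""
    if mode not in ("T", "V"):
        raise ValueError("mode must be 'T' or 'V'")
    if mode == "V" and not fits_signed(value, width):
        # The V case: the static linker would report a link error here.
        raise OverflowError(f"value {value:#x} does not fit in {width} bits")
    return value & ((1 << width) - 1)    # the T case: excess high-order bits are dropped
```

For example `field_bits(0x100000001, 32, "T")' silently yields 1, whereas
`field_bits(2048, 12, "V")' raises an error because 2048 does not fit in
a signed 12-bit field.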

Table 4.1 Relocation Operands

 Operand | Description
=========+================================================================
 A       | Relocation addend.
---------+----------------------------------------------------------------
 DBA     | Data segment's base address; 0 in static link.
---------+----------------------------------------------------------------
 G       | The offset from GP of a GOT entry for the symbol referred by
         | the relocation.
---------+----------------------------------------------------------------
 GP      | The value of GP associated with the symbol referred, nominally
         | (DVMA + DBA + 2048).
---------+----------------------------------------------------------------
 P       | The place (offset or address) of the storage unit affected by
         | the relocation.
---------+----------------------------------------------------------------
 PLTE    | The address of a PLT entry associated with the symbol referred.
---------+----------------------------------------------------------------
 PLTI    | The address of a PLT entry designated to make indirect calls.
---------+----------------------------------------------------------------
 S       | The value of the symbol referred by the relocation.
---------+----------------------------------------------------------------
 TBA     | Text segment's base address; 0 in static link.

Table 4.2 Relocation Types

 Name                     | Value | Field       | Symbol    | Calculation
==========================+=======+=============+===========+=============
 R_RISCV_RELATIVE         |     3 | T-word32,64 | n/a       | TBA + A
 R_RISCV_REL_TEXT (alias) |       |             |           |
--------------------------+-------+-------------+-----------+-------------
 R_RISCV_GP               |    12 | T-word32,64 | any       | GP
--------------------------+-------+-------------+-----------+-------------
 R_RISCV_REL_DATA         |    13 | T-word32,64 | n/a       | DBA + A
==========================+=======+=============+===========+=============
                          |       |             | local     | S - P
 R_RISCV_CALL_PLT         |    19 | V-hi20lo12i | external  | PLTE - P
                          |       |             | n/a       | PLTI - P
--------------------------+-------+-------------+-----------+-------------
 R_RISCV_GPREL_HI20       |    59 | V-hi20      | local     | S - GP + A
--------------------------+-------+-------------+-----------+-------------
 R_RISCV_GPREL_LO12_I     |    60 | T-lo12i     | local     | S - GP + A
--------------------------+-------+-------------+-----------+-------------
 R_RISCV_GPREL_LO12_S     |    61 | T-lo12s     | local     | S - GP + A
--------------------------+-------+-------------+-----------+-------------
 R_RISCV_GPREL_GOT_HI20   |    62 | V-hi20      | any       | G
--------------------------+-------+-------------+-----------+-------------
 R_RISCV_GPREL_GOT_LO12_I |    63 | T-lo12i     | any       | G

Local symbols are never preempted and therefore they can be addressed
with relative addressing in PIC code. For text symbols PC-relative
addressing can be used both in ordinary PIC and FDPIC code and therefore
the same relocations are used in both cases.

PC-relative addressing cannot however be used in FDPIC code for data
symbols as the relative position of text and data with respect to each
other is not fixed and therefore a separate global pointer (GP) has to be
maintained. This ABI designates the x3 register to hold the value of the
GP and defines gp as an alias ABI name of this register. This register
is used to access local data using direct GP-relative addressing.

The R_RISCV_GPREL_HI20, R_RISCV_GPREL_LO12_I and R_RISCV_GPREL_LO12_S
static relocations are defined to support direct GP-relative addressing
suitable for local data access.
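
How the value S - GP + A might be divided between the hi20 and lo12
fields is sketched below in Python. The sketch assumes the usual RISC-V
%hi/%lo convention, in which the low part is a sign-extended 12-bit
immediate and the high part compensates for it; this document does not
spell that rule out, and the function names are hypothetical.

```python
def gprel_offset(sym: int, gp: int, addend: int = 0) -> int:
    """S - GP + A, as in the Calculation column of Table 4.2."""
    return sym - gp + addend


def split_hi20_lo12(offset: int):
    """Split a signed offset so that hi20 * 4096 + lo12 == offset."""
    lo12 = ((offset & 0xFFF) ^ 0x800) - 0x800   # low 12 bits, sign-extended
    hi20 = (offset - lo12) >> 12                # absorbs the borrow when lo12 is negative
    return hi20, lo12
```

With S = 0x12345, GP = 0x10800 and A = 4 the offset is 0x1b49, which
splits into hi20 = 2 and lo12 = -1207, since 2 * 4096 - 1207 = 6985.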

External symbols can be preempted and therefore have to be addressed
indirectly. The Global Offset Table (GOT) is used to hold the addresses
of external data symbols. GOT itself is local data and can therefore be
accessed with GP-relative addressing.

The R_RISCV_GPREL_GOT_HI20 and R_RISCV_GPREL_GOT_LO12_I static
relocations are defined to support indirect GP-relative addressing
suitable for external data access.

Occasionally a GOT entry will be created for local data to satisfy the
use of R_RISCV_GPREL_GOT_HI20 and R_RISCV_GPREL_GOT_LO12_I relocations in
code referring to such data. The R_RISCV_REL_DATA dynamic relocation is
defined to support GP-relative relocation of such GOT entries at program
load time.


4.3 Procedure Calls (normative)

Local procedure calls use the same code sequence as with ordinary PIC
code. PC-relative addressing can be used as all code locations are fixed
with respect to each other and the address is not interpreted beyond
making the jump itself. GP does not change in the process of making a
local procedure call as control remains in the same module.

External calls need to set both the PC and the GP at the same time. This is
because external symbols can be preempted, in which case a call will pass
control to another module, which will usually require access to its local
data.

A data structure called Function Descriptor Table (FDT) is created by the
static linker to hold PC/GP pairs used in external procedure calls.
Addresses of individual FDT entries serve as pointers to the respective
procedures. An FDT entry is therefore created for each function symbol
that is external, whether defined or not, or whose address is taken for
a purpose other than making a call.

As the ultimate values of the PC and the GP are only determined at load
time the static linker attaches dynamic relocations to data in the FDT.
For external function symbols the R_RISCV_JUMP_SLOT and R_RISCV_GP
relocations are used for the PC and GP respectively, both referring to
the function symbol. For local function symbols whose address is taken
the R_RISCV_REL_TEXT and R_RISCV_GP relocations are used with no symbol
referred.

Figure 4.2 Function Descriptor Table

                               FDT          Outstanding dynamic relocations
__riscv_fdt_func1 ---> +------------------+
                       | Text Pointer 1   | R_RISCV_JUMP_SLOT func1
                       +------------------+
                       | Global Pointer 1 | R_RISCV_GP        func1
__riscv_fdt_func2 ---> +==================+
                       | Text Pointer 2   | R_RISCV_JUMP_SLOT func2
                       +------------------+
                       | Global Pointer 2 | R_RISCV_GP        func2
__riscv_fdt_func3 ---> +==================+
                       | Text Pointer 3   | R_RISCV_REL_TEXT
                       +------------------+
                       | Global Pointer 3 | R_RISCV_GP
                       +==================+
                       |      . . .       |
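
The load-time resolution of one FDT entry, as described above, can be
modelled with the following Python sketch. It is a toy model under
stated assumptions: `Module' and `resolve_fdt_entry' are hypothetical
names, symbol values are link-time text addresses, and symbol lookup
simply walks the modules in order so that an earlier definition preempts
a later one.

```python
from dataclasses import dataclass


@dataclass
class Module:
    name: str
    tba: int        # text base address of this mapping
    gp: int         # GP value of this mapping
    symbols: dict   # function name -> link-time text address


def resolve_fdt_entry(owner, search_order, symbol=None, addend=0):
    """Return the (text pointer, global pointer) pair for one FDT entry."""
    if symbol is None:
        # Local function: R_RISCV_REL_TEXT and R_RISCV_GP with no symbol referred.
        return owner.tba + addend, owner.gp
    for module in search_order:
        # External function: R_RISCV_JUMP_SLOT and R_RISCV_GP, both naming the symbol.
        if symbol in module.symbols:
            return module.tba + module.symbols[symbol], module.gp
    raise LookupError(f"undefined function symbol: {symbol}")
```

Both words of an entry always come from the same module, which is what
lets a call through the entry switch the PC and the GP together.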

A Procedure Linkage Table (PLT) is created to handle calls via the FDT,
so that the same code sequence is used in the program proper to make
direct procedure calls regardless of whether the function symbol called
is local or external. Since the PLT is local to the module its entries
can be reached with PC-relative addressing. Individual PLT entries are
created and called into for each external procedure called.

For direct calls an FDT entry is used that corresponds to the procedure
called and has been created in the module making the call. Therefore
code in the PLT can access the FDT entry directly as local data, using
GP-relative addressing.

For indirect calls the PLT is also used and an FDT entry is used that
corresponds to the procedure called and has been created in the module
providing the function symbol of the procedure.

If a function symbol is local, then the GP-relative address of the FDT
entry is directly used by the static linker as the value retrieved in
taking a function's address.

If a function symbol is external, then an external dynamic data symbol is
created that refers to that FDT entry and whose name is constructed by
prepending `__riscv_fdt_' to the function's symbol name.

If the address of an external function symbol is taken, then a GOT entry
is created for the corresponding `__riscv_fdt_' dynamic data symbol and
used to satisfy the reference.

When making an indirect call a dedicated PLT entry is used that is common
to all indirect calls and upon invocation of that PLT entry the x5 (t0)
register holds the address of the FDT entry in the module providing the
function symbol of the procedure to call.

Since the GP is different for each module the value held in the x3 (gp)
register can change in the course of making a procedure call. Therefore
under the FDPIC calling convention the x3 (gp) register is considered
call-clobbered and it has to be preserved by the caller when making a
call to an external function symbol, unless it is known that the call does
not return or that the GP is no longer referred to after the return from
the procedure called. Typically a stack slot has to be allocated and
initialized in a function's prologue to preserve the x3 (gp) register
across calls.


4.4 Typical Code Sequences (informative)

In the sequences below expressions on the right-hand side of relocation
names denote the symbol and the addend specified with the relocation. In
the absence of a `+' operator only a symbol is specified, otherwise the
left-hand side of the addition is a symbol and the right-hand side is an
addend. If a symbol is specified as `*ABS*', then the value is 0 (the
symbol index is STN_UNDEF in the relocation). The value of ABS() is the
absolute (static-link-time) value of the expression in the parentheses.

4.4.1 Local Data Addressing

Ordinary PIC code, using PC-relative addressing:

                                            # Outstanding static relocations
label:
    auipc   t0, %pcrel_hi(var+addend)       # R_RISCV_PCREL_HI20 var+addend
    lbu     t1, %pcrel_lo(label)(t0)        # R_RISCV_PCREL_LO12_I label
    sb      t2, %pcrel_lo(label)(t0)        # R_RISCV_PCREL_LO12_S label

Corresponding FDPIC code, using GP-relative addressing:

                                            # Outstanding static relocations
    lui     t0, %gprel_hi(var+addend)       # R_RISCV_GPREL_HI20 var+addend
    c.add   t0, gp
    lbu     t1, %gprel_lo(var+addend)(t0)   # R_RISCV_GPREL_LO12_I var+addend
    sb      t2, %gprel_lo(var+addend)(t0)   # R_RISCV_GPREL_LO12_S var+addend


4.4.2 External Data Addressing

Ordinary PIC code, using GOT and PC-relative addressing:

                                            # Outstanding static relocations
label:
    auipc   t0, %pcrel_got_hi(var)          # R_RISCV_GOT_HI20 var
    l[w|d]  t0, %pcrel_lo(label)(t0)        # R_RISCV_PCREL_LO12_I label
    lb      t1, addend(t0)
    sb      t2, addend(t0)

    # Outstanding dynamic relocations for the GOT entry
    # R_RISCV_32,64 var

    # or if the data symbol turns out local at static link time
    # R_RISCV_REL_DATA *ABS*+ABS(var)

Corresponding FDPIC code, using GOT and GP-relative addressing:

                                            # Outstanding static relocations
    lui     t0, %gprel_got_hi(var)          # R_RISCV_GPREL_GOT_HI20 var
    c.add   t0, gp
    l[w|d]  t0, %gprel_got_lo(var)(t0)      # R_RISCV_GPREL_GOT_LO12_I var
    lbu     t1, addend(t0)
    sb      t2, addend(t0)

    # Outstanding dynamic relocations for the GOT entry
    # R_RISCV_32,64 var

    # or if the data symbol turns out local at static link time
    # R_RISCV_REL_DATA *ABS*+ABS(var)


4.4.3 Taking a Function's Address

FDPIC code, local function:

                                            # Outstanding static relocations
    lui     t0, %gprel_hi(fun)              # R_RISCV_GPREL_HI20 fun
    c.add   t0, gp
    addi    t1, t0, %gprel_lo(fun)          # R_RISCV_GPREL_LO12_I fun

FDPIC code, external function:

                                            # Outstanding static relocations
    lui     t0, %gprel_got_hi(fun)          # R_RISCV_GPREL_GOT_HI20 fun
    c.add   t0, gp
    addi    t1, t0, %gprel_got_lo(fun)      # R_RISCV_GPREL_GOT_LO12_I fun

    # Outstanding dynamic relocations for the GOT entry
    # R_RISCV_32,64 __riscv_fdt_fun

    # or if the function symbol turns out local at static link time
    # R_RISCV_REL_DATA *ABS*+ABS(__riscv_fdt_fun)


4.4.4 Procedure Calls Using the PLT

FDPIC code, direct call:

                                            # Outstanding static relocations
label:
    auipc   ra, %pcrel_call_hi(fun@PLT)     # R_RISCV_CALL_PLT fun
    jalr    ra, ra, %pcrel_call_lo(label)
    l[w|d]  gp, <gp_slot>(sp)

FDPIC code, indirect call (to a2):

                                            # Outstanding static relocations
    c.mv    t0, a2
label:
    auipc   ra, %pcrel_call_hi(@PLT)        # R_RISCV_CALL_PLT
    jalr    ra, ra, %pcrel_call_lo(label)
    l[w|d]  gp, <gp_slot>(sp)

    # The R_RISCV_CALL_PLT relocation with no symbol referred resolves to
    # the PLT entry associated with indirect calls.


Chapter 5 Program Loading

5.1 Base Addresses (normative)

The ELF gABI defines a single base address for a module being loaded,
which determines the amount to relocate the module by. This
is unsuitable for FDPIC modules, which need to have their text segments
and data segments mapped in memory separately. This is so that where a
module is mapped multiple times in a no-MMU system, only a single copy of
its text segments is present in memory and serves all the mappings, while
a separate copy of its data segments is present in memory for each of the
mappings. Consequently the distance between text and data segments is no
longer constant between mappings and there is no single base address.

Instead, a separate text base address and data base address are defined
as the difference between the load address and the link address of the
text segment and the data segment respectively. These two base addresses
are used by the dynamic loader to relocate text and data respectively.
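
The arithmetic can be illustrated with a small Python sketch. The
function names and all addresses are hypothetical; the two relocation
calculations are the TBA + A and DBA + A entries from Table 4.2, and the
result is truncated to the word size as for a T-word32,64 field.

```python
def base_address(load_addr: int, link_addr: int) -> int:
    """Base address = load address minus link address of a segment."""
    return load_addr - link_addr


def apply_rel_text(tba: int, addend: int, xlen: int = 32) -> int:
    """R_RISCV_RELATIVE / R_RISCV_REL_TEXT: TBA + A."""
    return (tba + addend) & ((1 << xlen) - 1)


def apply_rel_data(dba: int, addend: int, xlen: int = 32) -> int:
    """R_RISCV_REL_DATA: DBA + A."""
    return (dba + addend) & ((1 << xlen) - 1)


# One module linked with text at 0x0 and data at 0x10000, mapped twice:
# both mappings share the text at 0x400000 but have private data copies.
tba = base_address(0x400000, 0x0)
dba_first = base_address(0x800000, 0x10000)
dba_second = base_address(0x900000, 0x10000)
```

A pointer to the link-time data address 0x10010 then relocates to
0x800010 in the first mapping and to 0x900010 in the second, while any
pointer into text relocates identically in both.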

For the initial module loaded by an OS or other executive runtime, such
as a program interpreter, the text base address can be determined at run
time as the difference between two values for a given location, such as
the beginning of the text segment: the actual value of the PC, obtained
with a PC-relative reference to a symbol associated with that location,
and the value of a corresponding absolute symbol associated with the same
location. The way to determine the data base
address and therefore the value of GP of the initial module is specific
to the individual OS or other executive runtime and therefore beyond the
scope of this specification. Possibilities include passing suitable
information via the initial stack, such as in the auxiliary vector,
preinitializing a processor register, providing a system call to retrieve
it, etc.

The presence of a separate text base address and a data base address also
means that ET_EXEC images cannot be supported with the FDPIC psABI as it
is not possible to make multiple copies of such image's data segment in a
no-MMU system without the ability to relocate it at load time.


5.2 Lazy Binding (normative)

Lazy binding can be optionally implemented by the dynamic loader. If it
is implemented, then the run-time processing of R_RISCV_JUMP_SLOT
relocations and their associated R_RISCV_GP relocations present in the
FDT is done in two stages.

In the first stage, which is done by the dynamic loader at the time a
module is loaded, R_RISCV_JUMP_SLOT and R_RISCV_GP relocations are
resolved respectively to the address of the lazy resolver and the value
of the global pointer associated with the module providing the lazy
resolver.

In the second stage, which is done when the lazy resolver is reached by
means of making a call through an FDT entry referring to it,
R_RISCV_JUMP_SLOT and R_RISCV_GP relocations are resolved respectively
to the address of the function symbol associated with the FDT entry and
the value of the global pointer associated with the module providing the
function symbol. To be able to do its work the lazy resolver is called
with certain registers containing values as follows:

* x3 (gp) holds the dynamic loader's GP value as with an ordinary FDT
entry (this is a consequence of the first stage of run-time relocation)

* x5 (t0) holds a pointer to the FDT entry to relocate

* x6 (t1) holds the caller's GP value

Registers have been assigned so as to work with the RV32E instruction
set as well.

Upon completion of the second stage the lazy resolver makes a jump to the
newly resolved address of the function symbol.
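
The two stages can be modelled with the following Python sketch. It is
only a simulation under stated assumptions: the addresses, the
`definitions' lookup table and all function names are hypothetical, and
it is assumed that the resolver installs the callee's GP in x3 (gp)
before jumping, which this section does not state explicitly.

```python
from dataclasses import dataclass

RESOLVER_ADDR = 0x70000000  # hypothetical address of the lazy resolver
LOADER_GP = 0x71000800      # hypothetical GP of the module providing it


@dataclass
class FdtEntry:
    symbol: str
    text: int = 0
    gp: int = 0


def bind_stage1(entry: FdtEntry) -> None:
    """Load time: point the entry at the lazy resolver and its GP."""
    entry.text, entry.gp = RESOLVER_ADDR, LOADER_GP


def lazy_resolver(regs: dict, definitions: dict) -> int:
    """First call: patch the entry passed in t0, then return the jump target."""
    entry = regs["t0"]
    entry.text, entry.gp = definitions[entry.symbol]
    regs["gp"] = entry.gp    # assumption: the callee runs with its own GP
    return entry.text


def call_through(entry: FdtEntry, caller_gp: int, definitions: dict):
    """Model of a call via the FDT entry; returns the final (PC, GP)."""
    regs = {"t0": entry, "t1": caller_gp, "gp": entry.gp}
    pc = entry.text
    if pc == RESOLVER_ADDR:
        pc = lazy_resolver(regs, definitions)
    return pc, regs["gp"]
```

After the first call the entry holds the final PC/GP pair, so later
calls through the same entry no longer reach the resolver.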


5.3 Example PLT Code (informative)

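@PLT:
    l[w|d]  t2, 0(t0)                       # text pointer from the FDT entry
    mv      t1, gp                          # pass the caller's GP in t1
    l[w|d]  gp, [4|8](t0)                   # global pointer from the FDT entry
    jr      t2
fun1@PLT:
    lui     t0, %gprel_hi(FDT[fun1])
    addi    t0, t0, %gprel_lo(FDT[fun1])
    add     t0, t0, gp                      # t0 = address of the FDT entry
    j       @PLT
fun2@PLT:
    lui     t0, %gprel_hi(FDT[fun2])
    addi    t0, t0, %gprel_lo(FDT[fun2])
    add     t0, t0, gp                      # t0 = address of the FDT entry
    j       @PLT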
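
The two loads in the common PLT entry read the FDT entry as two
consecutive pointer-sized words, the text pointer at offset 0 and the
global pointer at offset 4 (RV32) or 8 (RV64). A minimal Python sketch
of that layout follows; the function name is hypothetical and a
little-endian memory image is assumed.

```python
import struct


def read_fdt_entry(memory: bytes, addr: int, xlen: int = 32):
    """Return (text pointer, global pointer) stored at addr in memory."""
    if xlen not in (32, 64):
        raise ValueError("xlen must be 32 or 64")
    fmt = "<II" if xlen == 32 else "<QQ"   # two words, little-endian (assumed)
    return struct.unpack_from(fmt, memory, addr)
```

For example, an RV32 entry holding 0x50200 and 0x90800 occupies 8 bytes
and decodes to the pair (0x50200, 0x90800).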

Jim Wilson

Mar 6, 2020, 7:51:53 PM
to Maciej W. Rozycki, RISC-V SW Dev
On Sun, Mar 1, 2020 at 12:57 AM Maciej W. Rozycki <ma...@wdc.com> wrote:
> I will appreciate your questions, comments and any other kind of
> feedback.

The style is different from the existing psABI, though it looks like a
better style. Maybe you could rewrite our existing psABI to improve
it?

This was mentioned in the RISC-V software meeting, and in a
riscv-elf-psabi-doc issue, so if there are others with opinions they
should comment soon. And if not, I think we should just go forward
with this plan.

> ---------+----------------------------------------------------------------
> GP | The value of GP associated with the symbol referred, nominally
> | (DVMA + DBA + 2048).

This uses DVMA without defining it.

> --------------------------+-------+-------------+-----------+-------------
> R_RISCV_GPREL_LO12_I | 60 | T-lo12i | local | S - GP + A

This is identical to the existing R_RISCV_GPREL_I reloc.

> --------------------------+-------+-------------+-----------+-------------
> R_RISCV_GPREL_LO12_S | 61 | T-lo12s | local | S - GP + A

This is identical to the existing R_RISCV_GPREL_S reloc.

Currently, the R_RISCV_GPREL_I and R_RISCV_GPREL_S can only be created
by linker relaxation, so we don't have assembler support for them, and
this is maybe also why the names are a little different than what you
expect.

> Corresponding FDPIC code, using GP-relative addressing:
>
> # Outstanding static relocations
> lui t0, %gprel_hi(var+addend) # R_RISCV_GPREL_HI20 var+addend
> c.add t0, gp
> lbu t1, %gprel_lo(var+addend)(t0) # R_RISCV_GPREL_LO12_I var+addend
> sb t2, %gprel_lo(var+addend)(t0) # R_RISCV_GPREL_LO12_S var+addend

Not all targets have compressed instructions. The assembler will
convert regular instructions to compressed instructions if it can, so
using add instead of c.add is more general with no code size
optimization loss.

For relaxation purposes, there should be a reloc on the add, so it should be
add t0,t0,gp,%gprel_add(var+addend)
With this extra reloc, if %gprel_hi(var+addend) is zero, then we can
relax the three instruction sequence for the load to one instruction,
deleting the first two, and modifying the load to
lbu t1,%gprel_lo(var+addend)(gp)
and likewise for the store.

See for instance the tprel_add reloc used for TLS which works the same
way. There is an example in the psABI doc.
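
The relaxation condition can be sketched in Python as follows. This is
an illustration under assumptions, not linker code: it uses the usual
%hi/%lo split with a sign-extended low part, the function names are
hypothetical, and the emitted strings show already-resolved immediates
rather than relocation operators.

```python
def split_hi20_lo12(offset: int):
    """Split a signed offset so that hi20 * 4096 + lo12 == offset."""
    lo12 = ((offset & 0xFFF) ^ 0x800) - 0x800   # low 12 bits, sign-extended
    return (offset - lo12) >> 12, lo12


def gprel_load_sequence(offset: int, relax: bool = True):
    """Instructions for a byte load from gp + offset, relaxed if possible."""
    hi20, lo12 = split_hi20_lo12(offset)
    if relax and hi20 == 0:
        # The high part is zero: lui and add can be deleted.
        return [f"lbu t1, {lo12}(gp)"]
    return [
        f"lui t0, {hi20 & 0xFFFFF:#x}",
        "add t0, t0, gp",
        f"lbu t1, {lo12}(t0)",
    ]
```

An offset of 0x100 relaxes to the single instruction `lbu t1, 256(gp)',
while 0x1b49 keeps all three instructions.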

> auipc t0, %pcrel_got_hi(var) # R_RISCV_GOT_HI20 var

This should be %got_pcrel_hi(var). It was first added to llvm, and
then just added to GNU Binutils this week. It is already mentioned in
riscv-asm-manual, but needs to be mentioned in the psABI. That is on
my todo list.

> lui t0, %gprel_got_hi(var) # R_RISCV_GPREL_GOT_HI20 var
> c.add t0, gp
> l[w|d] t0, %gprel_got_lo(var)(t0) # R_RISCV_GPREL_GOT_LO12_I var
> lbu t1, addend(t0)

As above, adding a reloc, e.g. %gprel_got_add, to the add makes this relaxable.

Jim

Stef O'Rear

Mar 7, 2020, 10:48:56 PM
to Maciej W. Rozycki, RISC-V SW Dev, i...@maskray.me, dal...@aerifal.cx
On Sunday, March 1, 2020 at 3:57:10 AM UTC-5, Maciej W. Rozycki wrote:
> Hi,
>
> I am currently working on FDPIC support for RISC-V/Linux targets in the
> GNU toolchain (GCC and GNU binutils) and a couple of runtimes (uClibc,
> musl, possibly glibc). While at this time I only intend to implement the
> pieces I named above, the psABI extension I am going to base this stuff on
> is meant to become a part of the RISC-V ELF psABI, once proved with the
> implementation, available for everyone to suit their requirements.
>
> Therefore below I am sending a preliminary document that specifies the
> technical details of the extension in hope someone finds it useful or
> would like to comment on it at this early stage of development.
>
> This design has been originally presented at LCA 2020 and a recording is
> available here: <https://www.youtube.com/watch?v=GydyykyNjxs>.
>
> I will appreciate your questions, comments and any other kind of
> feedback.

I've discussed this proposal with Rich Felker and Fangrui Song (CCed) in
#musl; the following comments are exclusively mine.

The register usage is exactly what I had in mind, and most of the code
sequences seem approximately fine (several are not), but the relocation
structure is extremely different from the other existing FDPIC ABIs[1][2][3],
in a way which will make it difficult to support in generic code such as musl;
I believe the ABI should be made as consistent as possible to avoid surprises
like what we went through with TLS copy relocs.

[1]: http://ftp.redhat.com/pub/redhat/gnupro/FRV/FDPIC-ABI.txt
[2]: https://j-core.org/downloads/fdpic-sh.txt
[3]: https://github.com/mickael-guene/fdpic_doc/blob/master/abi.txt

None of the SH, FRV, or ARM FDPIC ABIs define anything equivalent to REL_DATA
or GP. Why is it there?

"Data segment base address" does not seem to be defined anywhere?

> ==========================+=======+=============+===========+=============
> | | | local | S - P
> R_RISCV_CALL_PLT | 19 | V-hi20lo12i | external | PLTE - P
> | | | n/a | PLTI - P

None of the SH, FRV, or ARM ABIs use anything like PLTI.

> --------------------------+-------+-------------+-----------+-------------
> R_RISCV_GPREL_HI20 | 59 | V-hi20 | local | S - GP + A
> --------------------------+-------+-------------+-----------+-------------
> R_RISCV_GPREL_LO12_I | 60 | T-lo12i | local | S - GP + A
> --------------------------+-------+-------------+-----------+-------------
> R_RISCV_GPREL_LO12_S | 61 | T-lo12s | local | S - GP + A
> --------------------------+-------+-------------+-----------+-------------
> R_RISCV_GPREL_GOT_HI20 | 62 | V-hi20 | any | G
> --------------------------+-------+-------------+-----------+-------------
> R_RISCV_GPREL_GOT_LO12_I | 63 | T-lo12i | any | G

The GPREL and GPREL_GOT relocations look correct. We also need assembler
syntax for them, and to decide whether they are %functions or @MODIFIERS.

We also need R_RISCV_FUNCDESC (canonical function descriptor),
R_RISCV_FUNCDESC_VALUE (copy of function descriptor),
R_RISCV_GPREL_GOTFUNCDESC_(HI20, LO12_I) (offset within GOT of a pointer-sized
slot which will receive a pointer to the canonical function descriptor),
R_RISCV_GPREL_FUNCDESC_(HI20, LO12).

R_RISCV_FUNCDESC and R_RISCV_FUNCDESC_VALUE are dynamic relocations.
Why do you need REL_DATA when ARM, FRV, and SH don't?

> 4.3 Procedure Calls (normative)
>
> Local procedure calls use the same code sequence as with ordinary PIC
> code. PC-relative addressing can be used as all code locations are fixed
> with respect to each other and the address is not interpreted beyond
> making the jump itself. GP does not change in the process of making a
> local procedure call as control remains in the same module.

Should clarify that while GP does not change as part of the call instruction
itself, the called procedure is allowed to clobber GP (this is necessary for
external tail calls).

> External calls need to set the PC and the GP both at a time. This is
> because external symbols can be preempted, in which case a call will pass
> control to another module, which will usually require access to its local
> data.
>
> A data structure called Function Descriptor Table (FDT) is created by the
> static linker to hold PC/GP pairs used in external procedure calls.
> Addresses of individual FDT entries serve as pointers to the respective
> procedures. An FDT entry is therefore created for each function symbol
> that is external, whether defined or not, or whose address is taken for
> a purpose other than making a call.

Canonical function descriptors are created by the *dynamic* linker, not ld,
and they exist outside of any load segment (except possibly when static
linking). Every function which is referred to gets a single canonical
function descriptor. Other FDPIC ABIs don't use the "FDT" term and I
think it detracts from clarity to use it here.

R_arch_FUNCDESC_VALUE can create a copy of a function descriptor at any
two-word aligned address in the load segment, but there is no "descriptor
table" as a cohesive entity.

> As the ultimate values of the PC and the GP are only determined at load
> time the static linker attaches dynamic relocations to data in the FDT.
> For external function symbols the R_RISCV_JUMP_SLOT and R_RISCV_GP
> relocations are used for the PC and GP respectively, both referring to
> the function symbol. For local function symbols whose address is taken
> the R_RISCV_REL_TEXT and R_RISCV_GP relocations are used with no symbol
> referred.

Every other FDPIC ABI uses a R_ARCH_FUNCDESC_VALUE relocation to fill in both
words of a function descriptor copy at once.

>
> Figure 4.2 Function Description Table
>
> FDT Outstanding dynamic relocations
> __riscv_fdt_func1 ---> +------------------+
> | Text Pointer 1 | R_RISCV_JUMP_SLOT func1
> +------------------+
> | Global Pointer 1 | R_RISCV_GP func1
> __riscv_fdt_func2 ---> +==================+
> | Text Pointer 2 | R_RISCV_JUMP_SLOT func2
> +------------------+
> | Global Pointer 2 | R_RISCV_GP func2
> __riscv_fdt_func3 ---> +==================+
> | Text Pointer 3 | R_RISCV_REL_TEXT
> +------------------+
> | Global Pointer 3 | R_RISCV_GP
> +==================+
> | . . . |

Again, this is gratuitously different from what every other arch does.

Other arches use 1 relocation per function descriptor copy, and they don't
create duplicate symbols.

> A Procedure Linkage Table (PLT) is created to handle calls via the FDT,
> so that the same code sequence is used in the program proper to make
> direct procedure calls regardless of whether the function symbol called
> is local or external. Since the PLT is local to the module its entries
> can be reached with PC-relative addressing. Individual PLT entries are
> created and called into for each external procedure called.
>
> For direct calls an FDT entry is used that corresponds to the procedure
> called and has been created in the module making the call. Therefore
> code in the PLT can access the FDT entry directly as local data, using
> GP-relative addressing.

Again, "FDT" is misleading about how function descriptors are created.

> For indirect calls the PLT is also used and an FDT entry is used that
> corresponds to the procedure called and has been created in the module
> providing the function symbol of the procedure.

This seems a bad idea and gratuitously different from every other FDPIC ABI.
Other FDPIC ABIs use code at the call site for indirect calls. If you are
doing this for code size reasons, a compiler generated function in a
.gnu.linkonce section is a much better idea because it does not create an ABI
constraint.

> If a function symbol is local, then the GP-relative address of the FDT
> entry is directly used by the static linker as the value retrieved in
> taking a function's address.

> If a function symbol is external, then an external dynamic data symbol is
> created that refers to that FDT entry and whose name is constructed by
> prepending `__riscv_fdt_' to the function's symbol name.

This is gratuitously different from other FDPIC ABIs, which use *FUNCDESC*
relocations to generate function descriptors.

It is also very inefficient since it doubles the number of symbols and symbol
names in a library.

> If the address of an external function symbol is taken, then a GOT entry
> is created for the corresponding `__riscv_fdt_' dynamic data symbol and
> used to satisfy the reference.

The compiler should generate an @GOTFUNCDESC reference and the linker should
generate a R_RISCV_FUNCDESC relocation, not create a new symbol.

> When making an indirect call a dedicated PLT entry is used that is common
> to all indirect calls and upon invocation of that PLT entry the x5 (t0)
> register holds the address of the FDT entry in the module providing the
> function symbol of the procedure to call.

No other FDPIC ABI does this.

This is good, subject to Jim's point about add vs c.add.

> 4.4.2 External Data Addressing
>
> Ordinary PIC code, using GOT and PC-relative addressing:
>
> # Outstanding static relocations
> label:
> auipc t0, %pcrel_got_hi(var) # R_RISCV_GOT_HI20 var
> l[w|d] t0, %pcrel_lo(label)(t0) # R_RISCV_PCREL_LO12_I label
> lb t1, addend(t0)
> sb t2, addend(t0)

> # Outstanding dynamic relocations for the GOT entry
> # R_RISCV_32,64 var

So far so good

> # or if the data symbol turns out local at static link time
> # R_RISCV_REL_DATA *ABS*+ABS(var)

I don't think this actually works. For one thing, var might be in rodata;
there could also be multiple data segments. I don't see anything like
REL_DATA in other FDPIC ABIs; I think it always has to be R_RISCV_{32,64},
or whatever the other arches do.

> Corresponding FDPIC code, using GOT and GP-relative addressing:
>
> # Outstanding static relocations
> lui t0, %gprel_got_hi(var) # R_RISCV_GPREL_GOT_HI20 var
> c.add t0, gp
> l[w|d] t0, %gprel_got_lo(var)(t0) # R_RISCV_GPREL_GOT_LO12_I var
> lbu t1, addend(t0)
> sb t2, addend(t0)
>
> # Outstanding dynamic relocations for the GOT entry
> # R_RISCV_32,64 var
>
> # or if the function turns out local at static link time
> # R_RISCV_REL_DATA *ABS*+ABS(var)

Code looks good, same concern about REL_DATA.

> 4.4.3 Taking a Function's Address
>
> FDPIC code, local function:
>
> # Outstanding static relocations
> lui t0, %gprel_hi(fun) # R_RISCV_GPREL_HI20 fun
> c.add t0, gp
> addi t1, t0, %gprel_lo(fun) # R_RISCV_GPREL_LO12_I fun
>
> FDPIC code, external function:
>
> # Outstanding static relocations
> lui t0, %gprel_got_hi(fun) # R_RISCV_GPREL_GOT_HI20 fun
> c.add t0, gp
> addi t1, t0, %gprel_got_lo(fun) # R_RISCV_GPREL_GOT_LO12_I fun

These are, unfortunately, not compatible with dynamic linking semantics. A
function needs to have the same address regardless of which module its address
is taken in, so you have to always get the canonical function descriptor, which
has to come from the GOT because canonical function descriptors are created by
the dynamic linker. This should be something like (same for both local and
external):

lui t0, %gprel_got_hi(fun@FUNCDESC) # R_RISCV_GPREL_GOTFUNCDESC_HI20 fun
add t0, t0, gp
l[w|d] t0, %gprel_got_lo(fun@FUNCDESC)(t0) # R_RISCV_GPREL_GOTFUNCDESC_LO12 fun

eventually resulting in dynamic relocations for the GOT entry:

R_RISCV_FUNCDESC fun
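
To illustrate why the descriptor has to be canonical, here is a minimal C sketch of what a dynamic linker could do when it processes such a relocation: every request for the same (entry, GOT) pair returns the same descriptor address, so pointer comparisons agree across modules. The struct layout, the fixed-size table and the name canonical_funcdesc are assumptions made up for this example, not part of any existing loader.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor layout: entry point, then the callee's GOT/GP. */
struct funcdesc {
    uintptr_t entry;
    uintptr_t got;
};

#define MAX_DESCS 16

static struct funcdesc canon_table[MAX_DESCS];
static size_t canon_count;

/* Return the single canonical descriptor for (entry, got), creating it on
   first use.  Returns NULL when this illustrative fixed-size table is full. */
const struct funcdesc *canonical_funcdesc(uintptr_t entry, uintptr_t got)
{
    for (size_t i = 0; i < canon_count; i++)
        if (canon_table[i].entry == entry && canon_table[i].got == got)
            return &canon_table[i];
    if (canon_count == MAX_DESCS)
        return NULL;
    canon_table[canon_count].entry = entry;
    canon_table[canon_count].got = got;
    return &canon_table[canon_count++];
}
```

Two modules taking the address of the same function would both end up with the pointer returned by the first call, which is the property the GOT-based sequence above relies on.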

> # Outstanding dynamic relocations for the GOT entry
> # R_RISCV_32,64 __riscv_fdt_fun
>
> # or if the function symbol turns out local at static link time
> # R_RISCV_REL_DATA *ABS*+ABS(__riscv_fdt_fun)
>
>
> 4.4.4 Procedure Calls Using the PLT
>
> FDPIC code, direct call:
>
> # Outstanding static relocations
> label:
> auipc ra, %pcrel_call_hi(fun@PLT) # R_RISCV_CALL_PLT fun
> jalr ra, ra, %pcrel_call_lo(label)
> l[w|d] gp, <gp_slot>(sp)

This is the same as the local call case and looks correct.

> FDPIC code, indirect call (to a2):
>
> # Outstanding static relocations
> c.mv t0, a2
> label:
> auipc ra, %pcrel_call_hi(@PLT) # R_RISCV_CALL_PLT
> jalr ra, ra, %pcrel_call_lo(label)
> l[w|d] gp, <gp_slot>(sp)
>
> # The R_RISCV_CALL_PLT relocation with no symbol referred resolves to
> # the PLT entry associated with indirect calls.

As above I don't think it makes sense to handle this as a PLT entry. The call
should be generated inline:

lw t1, 0(a2)
lw gp, 4(a2)
jalr ra, t1
lw gp, <gp_slot>(sp)
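
In C terms the inline sequence amounts to roughly the following. This is only a model: gp_reg is a plain variable standing in for the x3 register, the saved copy stands in for <gp_slot>, and all names are invented for the sketch.

```c
#include <stdint.h>

/* Stand-in for the gp register; real code keeps this in x3. */
static uintptr_t gp_reg;

struct funcdesc {
    int (*entry)(int);   /* word 0: code address      */
    uintptr_t got;       /* word 1: callee's GP value */
};

/* Mirrors the four-instruction sequence: load the entry point, load the
   callee's GP, make the call, then restore the caller's GP. */
int call_via_descriptor(const struct funcdesc *fd, int arg)
{
    uintptr_t saved_gp = gp_reg;     /* caller's <gp_slot>    */
    int (*target)(int) = fd->entry;  /* lw t1, 0(a2)          */
    gp_reg = fd->got;                /* lw gp, 4(a2)          */
    int result = target(arg);        /* jalr ra, t1           */
    gp_reg = saved_gp;               /* lw gp, <gp_slot>(sp)  */
    return result;
}

/* Example callee that records the GP value it was entered with. */
static uintptr_t seen_gp;

int sample_callee(int x)
{
    seen_gp = gp_reg;
    return x + 1;
}

uintptr_t last_seen_gp(void) { return seen_gp; }
uintptr_t current_gp(void) { return gp_reg; }
void set_gp(uintptr_t v) { gp_reg = v; }
```

The callee sees its own GP for the duration of the call and the caller gets its GP back afterwards, without a shared PLT entry in between.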

> Chapter 5 Program Loading
>
> 5.1 Base Addresses (normative)
>
> A single individual base address is defined by the ELF gABI for a module
> being loaded that determines the amount to relocate the module by. This
> is unsuitable for FDPIC modules, which need to have their text segments
> and data segments mapped in memory separately. This is so that where a
> module is mapped multiple times in a no-MMU system, only a single copy of
> its text segments is present in memory and serves all the mappings, while
> a separate copy of its data segments is present in memory for each of the
> mappings. Consequently the distance between text and data segments is no
> longer constant between mappings and there is no single base address.
>
> Instead a separate text base address and a data base address is defined
> as a difference between the load address and the link address of the text
> segment and the data segment respectively. These two base addresses are
> used by the dynamic loader to relocate text and data respectively.

FDPIC does not have a "data base address"; there are one or more load segments,
relocated independently using a load map.
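
As a rough sketch of what relocating through a load map looks like, assuming a 32-bit layout modelled on Linux's struct elf32_fdpic_loadmap (the type and function names below are local to this example, and the fixed-size array is only there to keep it short):

```c
#include <stdint.h>

struct loadseg {
    uint32_t addr;     /* run-time address the segment was mapped at */
    uint32_t p_vaddr;  /* link-time address from the program header  */
    uint32_t p_memsz;  /* size of the segment in memory              */
};

struct loadmap {
    uint16_t version;
    uint16_t nsegs;
    struct loadseg segs[4];  /* fixed size only for this sketch */
};

/* Translate a link-time address to its run-time address by finding the
   segment that contains it.  Returns 0 if no segment covers the address. */
uint32_t loadmap_reloc(const struct loadmap *map, uint32_t vaddr)
{
    for (uint16_t i = 0; i < map->nsegs; i++) {
        const struct loadseg *s = &map->segs[i];
        if (vaddr >= s->p_vaddr && vaddr - s->p_vaddr < s->p_memsz)
            return s->addr + (vaddr - s->p_vaddr);
    }
    return 0;
}
```

Each segment carries its own displacement, so nothing in this scheme requires a single "data base address" shared by all writable segments.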

> In the initial module, such as a program interpreter, loaded by an OS or
> other executive runtime the text base address of said initial module can
> be determined by calculating a run-time difference between the actual
> value of the PC for a given location, such as the beginning of the text
> segment, obtained with a PC-relative reference to a symbol associated
> with that location and the value of a corresponding absolute symbol
> associated with the same location. The way to determine the data base
> address and therefore the value of GP of the initial module is specific
> to the individual OS or other executive runtime and therefore beyond the
> scope of this specification. Possibilities include passing suitable

Every other FDPIC ABI has a normative Start up section that specifies how
Linux will pass an elf32_fdpic_loadmap struct; it's in scope here.

Note that the Linux FDPIC support currently has 32-bit assumptions and
64-bit FDPIC will need to be documented here, much as the FRV ABI
supplement defined 32-bit FDPIC ptrace calls.

> information via the initial stack, such as in the auxiliary vector,
> preinitializing a processor register, providing a system call to retrieve
> it, etc.
>
> The presence of a separate text base address and a data base address also
> means that ET_EXEC images cannot be supported with the FDPIC psABI as it
> is not possible to make multiple copies of such image's data segment in a
> no-MMU system without the ability to relocate it at load time.
>
>
> 5.2 Lazy Binding (normative)
>
> Lazy binding can be optionally implemented by the dynamic loader. If it
> is implemented, then the run-time relocation of R_RISCV_JUMP_SLOT and
> their associated R_RISCV_GP relocations present in the FDT is done in two
> stages.

These should be a single relocation for consistency with other FDPIC ABIs.

Properly supporting lazy binding on FDPIC is very difficult for multithreaded
programs because it is impossible (on baseline RV*IA) to atomically update both
words that compose a function descriptor copy. Lazy binding is disabled on
modern distros as a hardening measure and not supported by musl as a matter of
policy, so it is likely not worth trying to make it work.

If you were to attempt to do this, it would be necessary to specify the order
of loads in PLT entries (always load the entry point first and the GOT second);
updates would write the correct GOT, issue a membarrier() syscall (a no-op on
uniprocessor or sequentially consistent systems, required for ordering
otherwise), and then write the new entry point.

This guarantees that the entry point can only be reached with the corresponding
GOT, however, it allows the lazy resolver to be called with _either_ the
initial GOT value for the lazy descriptor, _or_ the final symbol's GOT. As
such, the lazy resolver cannot depend(!) on the GOT register it receives.
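
The ordering can be sketched with C11 atomics. Note the assumption: portable C has no membarrier(), so a release store paired with an acquire load stands in for it here, whereas the scheme described above lets the PLT side use plain loads. The struct and function names are invented for the sketch.

```c
#include <stdatomic.h>
#include <stdint.h>

struct lazy_funcdesc {
    _Atomic uintptr_t entry;  /* word 0: entry point */
    _Atomic uintptr_t got;    /* word 1: GOT / GP    */
};

/* Resolver side: write the GOT first, then publish the entry point.
   The release store stands in for "membarrier(), then write the entry". */
void funcdesc_resolve(struct lazy_funcdesc *fd, uintptr_t entry, uintptr_t got)
{
    atomic_store_explicit(&fd->got, got, memory_order_relaxed);
    atomic_store_explicit(&fd->entry, entry, memory_order_release);
}

/* Caller (PLT) side: load the entry point first, the GOT second.  The
   acquire load replaces the ordering membarrier() would impose on hardware. */
void funcdesc_read(struct lazy_funcdesc *fd, uintptr_t *entry, uintptr_t *got)
{
    *entry = atomic_load_explicit(&fd->entry, memory_order_acquire);
    *got = atomic_load_explicit(&fd->got, memory_order_relaxed);
}
```

A reader that observes the new entry point is guaranteed to observe the new GOT as well; a reader that still observes the old entry point (the lazy resolver) may see either GOT, which is exactly the case the resolver has to tolerate.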

> In the first stage, which is done by the dynamic loader at the time a
> module is loaded, R_RISCV_JUMP_SLOT and R_RISCV_GP relocations are
> resolved respectively to the address of the lazy resolver and the value
> of the global pointer associated with the module providing the lazy
> resolver.

> In the second stage, which is done when the lazy resolver is reached by
> means of making a call through an FDT entry referring to it,
> R_RISCV_JUMP_SLOT and R_RISCV_GP relocations are resolved respectively
> to the address of the function symbol associated with the FDT entry and
> the value of the global pointer associated with the module providing the
> function symbol. To be able to do its work the lazy resolver is called
> with certain registers containing values as follows:
>
> * x3 (gp) holds the dynamic loader's GP value as with an ordinary FDT
> entry (this is a consequence of the first stage of run-time relocation)

The dynamic loader needs to be able to tolerate _any_ valid gp value. This
could be achieved by reserving a few words near gp and having the dynamic
loader store a pointer to its own state at a known offset from every GOT.

> * x5 (t0) holds a pointer to the FDT entry to relocate
>
> * x6 (t1) holds the caller's GP value

I don't think this is actually needed - the SH and ARM FDPIC ABIs
unconditionally clobber the caller's GP. Given a pointer to a function
descriptor copy (which is within one of the caller's data segments) the dynamic
linker can easily find the caller by walking a list of loaded libraries.
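
A minimal sketch of that lookup, assuming for brevity one writable range per module (a real loader would walk every segment in each module's load map); struct module and module_for_data_addr are hypothetical names:

```c
#include <stddef.h>
#include <stdint.h>

struct module {
    const char *name;
    uintptr_t data_start;   /* start of the module's writable range */
    uintptr_t data_size;    /* size of that range in bytes          */
    struct module *next;
};

/* Return the module whose data range contains addr, or NULL if none does. */
const struct module *module_for_data_addr(const struct module *head,
                                          uintptr_t addr)
{
    for (const struct module *m = head; m != NULL; m = m->next)
        if (addr >= m->data_start && addr - m->data_start < m->data_size)
            return m;
    return NULL;
}
```

Passing the descriptor pointer from t0 to such a function identifies the caller, so a separate register for the caller's GP is not strictly required.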

> Registers have been assigned such as to work with the RV32E instruction
> set as well.
>
> Upon completion of the second stage the lazy resolver makes a jump to the
> newly resolved address of the function symbol.

> 5.3 Example PLT Code (informative)
>
> @PLT:
> l[w|d] t2, 0(t0)
> mv t1, gp

We don't need to save t1 here; we could save 2 bytes per PLT entry by moving
the adds into this function.

> l[w|d] gp, [4|8](t0)
> jr t2
> fun1@PLT:
> lui t0, %gprel_hi(FDT[fun1])
> addi t0, %gprel_lo(FDT[fun1])
> add t0, gp
> j @PLT
> fun2@PLT:
> lui t0, %gprel_hi(FDT[fun2])
> addi t0, %gprel_lo(FDT[fun2])
> add t0, gp
> j @PLT

-s

Fangrui Song

Mar 7, 2020, 10:50:47 PM3/7/20
to Maciej W. Rozycki, RISC-V SW Dev, Jim Wilson
I am not subscribed, so I suspect my reply will be eaten by Google Groups... I also guessed your email addresses.

> Operand | Description
>=========+================================================================
> A | Relocation addend.
>---------+----------------------------------------------------------------
> DBA | Data segment's base address; 0 in static link.

How is the data segment defined? The PT_LOAD segment containing .data,
.sdata, or something else?

> G | The offset from GP of a GOT entry for the symbol referred by
> | the relocation.
>---------+----------------------------------------------------------------
> GP | The value of GP associated with the symbol referred, nominally
> | (DVMA + DBA + 2048).

GNU ld seems to define __global_pointer$ = .sdata + 0x800
In lld, I arbitrarily set it to (exists(.sdata) ? .sdata : __ehdr_start) + 0x800
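
Spelled out as C, purely to make the two rules easy to compare (the function names are made up, and the inputs are whatever addresses the linker has assigned):

```c
#include <stdbool.h>
#include <stdint.h>

/* Bias that centres the 12-bit signed immediate window on the first
   4 KiB above the base address. */
#define GP_BIAS 0x800u

/* GNU ld, as described above: .sdata + 0x800. */
uint64_t gp_gnu_ld(uint64_t sdata_addr)
{
    return sdata_addr + GP_BIAS;
}

/* lld, as described above: fall back to __ehdr_start when there is no
   .sdata output section. */
uint64_t gp_lld(bool has_sdata, uint64_t sdata_addr, uint64_t ehdr_start)
{
    return (has_sdata ? sdata_addr : ehdr_start) + GP_BIAS;
}
```

The two only differ when .sdata is absent. With the 0x800 bias, offsets from -2048 to 2047 reach the 4 KiB starting at the chosen base.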

>---------+----------------------------------------------------------------
> P | The place (offset or address) of the storage unit affected by
> | the relocation.
>---------+----------------------------------------------------------------
> PLTE | The address of a PLT entry associated with the symbol referred.
>---------+----------------------------------------------------------------
> PLTI | The address of a PLT entry designated to make indirect calls.

I am confused by PLTE/PLTI.

Some PLT entries do not need a .symtab/.dyntab entry:

As an example, bl foo (R_PPC64_REL24) can cause the creation of PLT
call stubs. There can be several stubs for one symbol, because each
call stub can only be accessed within +-32MB.

R_AARCH64_{CALL,JUMP}26 can cause the creation of similar call stubs (veneers).

Some PLT entries need a .dynsym entry: canonical PLT entry (st_value>0, st_shndx=0).
Such a PLT entry is caused by non-PIC code; the linker creates it for non-GOT, non-PLT
relocation types that refer to an external function.

What are PLTE and PLTI?

> S | The value of the symbol referred by the relocation.
>---------+----------------------------------------------------------------
> TBA | Text segment's base address; 0 in static link.

GNU ld -z separate-code (default on Linux x86 since 2.31) has the following segment layout:

R
RX
R
RW (relro ; non-relro)

lld has the following segment layout (since lld 9):

R
RX
RW(RELRO)
RW(non-RELRO)

The first PT_LOAD is not executable. Does the mandatory 0 in a static
link cause confusion?

>Table 4.2 Relocation Types
>
> Name | Value | Field | Symbol | Calculation
>==========================+=======+=============+===========+=============
> R_RISCV_RELATIVE | 3 | T-word32,64 | n/a | TBA + A
> R_RISCV_REL_TEXT (alias) | | | |
>--------------------------+-------+-------------+-----------+-------------
> R_RISCV_GP | 12 | T-word32,64 | any | GP
>--------------------------+-------+-------------+-----------+-------------
> R_RISCV_REL_DATA | 13 | T-word32,64 | n/a | DBA + A

AFAIK no relocation type uses the start of a segment for calculation.
A concrete section is needed.

%pcrel_call_hi is not defined.

"The R_RISCV_CALL_PLT relocation with no symbol" - does it refer to the
PLT header?

Sam Elliott

Mar 20, 2020, 12:24:00 PM3/20/20
to Maciej W. Rozycki, sw-...@groups.riscv.org
Hi Maciej,

Thank you for this proposal. I realise you have had quite a bit of feedback already, but I would like to add some from lowRISC’s point of view.

Recently I have been investigating Embedded PIC, along with some collaborators at the Oxide Computer Company.

In our system, we do not have an MMU, and want as simple a loader as possible. With this in mind we will be statically linking all our embedded application executables. Thus, we are most interested in ROPI/RWPI, rather than FDPIC.

However, I think that FDPIC is not entirely orthogonal to ROPI/RWPI. It seems very likely that most of the GP-relative relocations and code sequences you propose here for local data addressing would also be useful for ROPI/RWPI (when combined with pc-relative, non-PLT function calls).

I am not convinced the code sequence for taking the address of a local function is correct, for either FDPIC or statically linked ROPI/RWPI executables, because I don’t think you can do the static relocation required for R_RISCV_GPREL_HI20(fun) if you don’t know the distance between the text and data sections (something you only know at runtime). I note you’ve had feedback that the sequences may need to be changed for FDPIC anyway, but I think ROPI/RWPI may just use the conventional pc-relative code sequences.

I am keen to see your revised specification, in light of the feedback so far.

Sam

--
Sam Elliott
Software Developer - LLVM and OpenTitan
lowRISC CIC

Sam Elliott

Mar 23, 2020, 1:01:42 PM3/23/20
to Fangrui Song, RISC-V SW Dev
Fangrui,

Something you said on the FDPIC thread has me concerned.

> On 8 Mar 2020, at 3:50 am, Fangrui Song <i...@maskray.me> wrote:
>
> GNU ld seems to define __global_pointer$ = .sdata + 0x800
> In lld, I arbitrarily set it to (exists(.sdata) ? .sdata : __ehdr_start) + 0x800

Isn’t this mismatch a cause for concern?

If I’m linking for an embedded system, and forgot to define `__global_pointer$` (maybe because I don’t know about this special symbol), then GNU ld may perform linker relaxations (and associated relocations) based on the assumption that `gp` (the register) has a value of 0. On lld, it’s going to perform the same relocations based on the assumption that `gp` (the register) has quite a different value. While this isn’t an issue today, I think it may become one if LLD implements linker relaxations.

I presume that lld does not have the concept of a “default linker script”, which is maybe where this mismatch has come from, and why it has to programmatically define these symbols.

As an aside, I will update the psABI to mention `__global_pointer$`.

What are your thoughts on this issue?

Sam

Tommy Murphy

Mar 23, 2020, 1:11:14 PM3/23/20
to RISC-V SW Dev
> then GNU ld may perform linker relaxations (and associated relocations) based on the assumption that `gp` (the register) has a value of 0.

Does it actually assume 0 in that case?

Certainly if relaxations are performed at compile/link time and the (startup) code doesn't initialize $gp appropriately then execution will use whatever garbage/uninitialised value $gp happens to contain and will almost certainly fail/crash.

Maciej W. Rozycki

Mar 23, 2020, 1:13:34 PM3/23/20
to Jim Wilson, RISC-V SW Dev
Hi Jim,

I am now back to this effort after a holiday and a short period to catch
up.

Thank you for your feedback. I will be making amendments to the proposal
as I go through your notes. I have yet to address Stef's extensive input,
so some of this stuff might be iteratively updated.

> > I will appreciate your questions, comments and any other kind of
> > feedback.
>
> The style is different from the existing psABI, though it looks like a
> better style. Maybe you could rewrite our existing psABI to improve
> it?

This has been written with the ELF gABI as a reference, and with some
influence from the style the original MIPS psABIs used that I have found
quite comprehensible and got used to over the years.

I can look into improving the base RISC-V psABI once we have got through
the implementation of this FDPIC extension.

> > ---------+----------------------------------------------------------------
> > GP | The value of GP associated with the symbol referred, nominally
> > | (DVMA + DBA + 2048).
>
> This uses DVMA without defining it.

Now defined, in terms of `p_vaddr'.

> > --------------------------+-------+-------------+-----------+-------------
> > R_RISCV_GPREL_LO12_I | 60 | T-lo12i | local | S - GP + A
>
> This is identical to the existing R_RISCV_GPREL_I reloc.
>
> > --------------------------+-------+-------------+-----------+-------------
> > R_RISCV_GPREL_LO12_S | 61 | T-lo12s | local | S - GP + A
>
> This is identical to the existing R_RISCV_GPREL_S reloc.

Except for overflow detection. Ones I have defined cause no overflow
detection as they assume the corresponding high part to also be present.
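
To show why the low part needs no check of its own when a high part is present, here is a sketch of the usual split for a 32-bit value (hi20, lo12 and recombine are names chosen for this example): the high part is rounded so that the sign-extended low part always lands in the 12-bit immediate range.

```c
#include <stdint.h>

/* Upper 20 bits, rounded so that adding the sign-extended low part
   gives back the original value. */
uint32_t hi20(uint32_t v)
{
    return (v + 0x800u) >> 12;
}

/* Low 12 bits, sign-extended; always in [-2048, 2047] by construction. */
int32_t lo12(uint32_t v)
{
    int32_t lo = (int32_t)(v & 0xfffu);
    return lo >= 0x800 ? lo - 0x1000 : lo;
}

/* What the lui/add + load/store pair computes, modulo 2^32. */
uint32_t recombine(uint32_t hi, int32_t lo)
{
    return (hi << 12) + (uint32_t)lo;
}
```

A lone low-part relocation has no high part to absorb the excess, which is why the existing R_RISCV_GPREL_I and R_RISCV_GPREL_S have to diagnose values outside that range.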

> Currently, the R_RISCV_GPREL_I and R_RISCV_GPREL_S can only be created
> by linker relaxation, so we don't have assembler support for them, and
> this is maybe also why the names are a little different than what you
> expect.

Well, as long as BFD provides them you can always use `.reloc' to emit
them with GAS. This doesn't solve the issue of link-time overflow
detection however; they do not have a corresponding high-part relocation
so we do expect them to catch overflows to facilitate code that has been
written for `.sdata'/`.sbss' support, don't we? Or otherwise what is the
purpose of their existence?

> > Corresponding FDPIC code, using GP-relative addressing:
> >
> > # Outstanding static relocations
> > lui t0, %gprel_hi(var+addend) # R_RISCV_GPREL_HI20 var+addend
> > c.add t0, gp
> > lbu t1, %gprel_lo(var+addend)(t0) # R_RISCV_GPREL_LO12_I var+addend
> > sb t2, %gprel_lo(var+addend)(t0) # R_RISCV_GPREL_LO12_S var+addend
>
> Not all targets have compressed instructions. The assembler will
> convert regular instructions to compressed instructions if it can, so
> using add instead of c.add is more general with no code size
> optimization loss.

I have deliberately left relaxation out and given the code sequences (in
informational sections) as examples rather than requirements (in normative
sections). You are of course right that some configurations will lack
compressed instructions and code is obviously allowed to use base encoding
equivalents or different sequences e.g. due to compiler optimisations.

Also as a side note I think it is GCC (or any other compiler) that should
produce the intended assembly right from the beginning, so as to get the
code size right and avoid unnecessary longer sequences such as with
branches that seem out of range due to size estimate pessimisation but are
not (of course some sizes are only known at link stage making certain
kinds of optimisations possible in the linker anyway).

> For relaxation purposes, there should be a reloc on the add, so it should be
> add t0,t0,gp,%gprel_add(var+addend)

I don't think we need to invent extra syntax here for this as we have the
`.reloc' pseudo-op for such use cases, e.g. where no instruction operand
refers to a symbol or there's no symbol involved (cf. R_MIPS_JALR). This
could look like:

0:
add t0, t0, gp
.reloc 0b, R_RISCV_GPREL_ADD, var + addend

> With this extra reloc, if %gprel_hi(var+addend) is zero, then we can
> relax the three instruction sequence for the load to one instruction,
> deleting the first two, and modifying the load to
> lbu t1,%gprel_lo(var+addend)(gp)
> and likewise for the store.
>
> See for instance the tprel_add reloc used for TLS which works the same
> way. There is an example in the psABI doc.

Relaxation optimisations like this were considered and comprehensively
implemented with the nanoMIPS target in GOLD, publicly available. I think
we ought to follow suit.

Therefore I think this will be best considered separately, as this is not
strictly necessary for FDPIC support on one hand, and may be used for
other purposes on the other. For this reason I have decided not to
include any relaxation support with the FDPIC psABI addendum.

> > auipc t0, %pcrel_got_hi(var) # R_RISCV_GOT_HI20 var
>
> This should be %got_pcrel_hi(var). It was first added to llvm, and
> then just added to GNU Binutils this week. It is already mentioned in
> riscv-asm-manual, but needs to be mentioned in the psABI. That is on
> my todo list.

Yep, I have now seen the patch posted to the binutils mailing list. I
have updated the document accordingly throughout.

I now actually wonder if we shouldn't have used composed relocations
(e.g. R_RISCV_GOT for GOT references with a corresponding %got operator,
R_RISCV_PCREL for PC-relative calculations with %pcrel, etc.) to avoid
proliferating relocation variants providing repeating patterns.

> > lui t0, %gprel_got_hi(var) # R_RISCV_GPREL_GOT_HI20 var
> > c.add t0, gp
> > l[w|d] t0, %gprel_got_lo(var)(t0) # R_RISCV_GPREL_GOT_LO12_I var
> > lbu t1, addend(t0)
>
> As above, adding a reloc, e.g.. %gprel_got_add, to the add makes this
> relaxable.

Likewise, this can be done with `.reloc' like I noted above, and the
relaxation defined separately. I think relaxation support that requires
psABI support (e.g. extra relocations) should be defined in a separate
section of the standard. Perhaps individual sections included in the base
psABI and this addendum.

If you think it is important to have relaxation defined right from the
beginning (why?), then I might consider doing it right away.

Maciej

Sam Elliott

Mar 23, 2020, 1:39:25 PM3/23/20
to RISC-V SW Dev, Tommy Murphy
Judging by this code, yes I believe it assumes zero.

https://github.com/bminor/binutils-gdb/blob/master/bfd/elfnn-riscv.c#L1408-L1420

Yes, I agree that you have problems if you haven’t loaded 0 or `__global_pointer$` into gp, which suggests to me that the linker should raise an error if you try to use these relocations and `__global_pointer$` is not defined. This seems better than getting some unknown run-time failure where a completely wrong address was loaded.

I’m coming from the point of view of working on an embedded project, with a custom linker script (that doesn't define `__global_pointer$`), and not finding any documentation of this special symbol, not even in the psABI (something I will attempt to correct today). Inadvertently, my project has got it right as it zeroes all registers at startup.

Sam

Andrew Waterman

Mar 23, 2020, 2:38:06 PM3/23/20
to Sam Elliott, RISC-V SW Dev, Tommy Murphy
On Mon, Mar 23, 2020 at 10:39 AM Sam Elliott <sell...@lowrisc.org> wrote:
> Judging by this code, yes I believe it assumes zero.

Actually, zero is just used as a sentinel value in that code. The semantics are, if __global_pointer$ is not defined, then the linker won’t perform relaxations against the global pointer.

Sam Elliott

Mar 23, 2020, 5:03:26 PM3/23/20
to Andrew Waterman, RISC-V SW Dev, Tommy Murphy
Oh, now I see how that works. The interaction with `max_alignment` was not so easy to understand until I noticed it was `(bfd_vma) -1` by default.

This has also answered some questions I have about how to prevent `gp`-relative relaxations being used to reference symbols in a different output section (just in case I choose to start moving, for example, the data section).

Thanks for your clarification!

Sam

Fangrui Song

Mar 23, 2020, 6:30:36 PM3/23/20
to Sam Elliott, RISC-V SW Dev
While I was making glibc applications linkable with lld, I noticed that
a linker had to define __global_pointer$ because glibc Scrt1.o
(sysdeps/riscv/start.S) requires it. To be honest I am not too sure why
.sdata and __global_pointer$ are used by RISC-V. I am always wondering
whether it is legacy cruft copied from elsewhere (e.g. MIPS).

MIPS needs GP just because it lacks a PC-relative instruction. It needs
a register to amortize the PIC cost. Similarly, PPC64 does this via a
dedicated TOC register.

I don't follow RISC-V development that closely so I may be wrong. If no
code is using .sdata, then it does not matter that much how the linker
defines __global_pointer$

> x3 gp Global pointer -- (Unallocatable)

This just wastes a register for no good reason for most applications.

Andrew Waterman

Mar 23, 2020, 6:40:47 PM3/23/20
to Fangrui Song, Sam Elliott, RISC-V SW Dev
You know, you could ask why it's there rather than just assuming we don't know what we're doing...

It's not legacy MIPS cruft.  It works quite a bit differently, relying on linker relaxations to opportunistically shorten global-variable accesses.  Earlier RISC-V ABIs didn't have gp; it was added when it was found that it was a better use of that register than another temporary or callee-saved register.



Fangrui Song

Mar 23, 2020, 8:54:42 PM3/23/20
to Andrew Waterman, Sam Elliott, RISC-V SW Dev
Then when is gp actually beneficial? Only in a sub-ABI like FDPIC? I
don't think clang or GCC can generate it.

To use it in source code, some attribute annotation is required.
Alternatively, let a post-link optimizer rewrite some PC relative
load/store instructions.

In addition, it is not clear how gp should be manipulated while calling
an external component (another shared object).

I wish gp were reserved with all the specifications coming with it.

Andrew Waterman

Mar 23, 2020, 9:03:42 PM3/23/20
to Fangrui Song, Sam Elliott, RISC-V SW Dev
Linker relaxation is the only way that gp gets used. For the following trivial program

  int x;
  int main() { return x; }

GCC emits the code sequence

  lui a5,%hi(x)
  lw a0,%lo(x)(a5)
  ret

but after linking, the executable contains

  lw      a0,-104(gp) # 11be0 <x>
  ret
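
The arithmetic behind that relaxation can be sketched as below. It assumes gp = 0x11c48, which is what x at 0x11be0 with a displacement of -104 implies; the function names are made up, and the real linker checks more than this (alignment and output sections, for instance).

```c
#include <stdbool.h>
#include <stdint.h>

/* True if off fits the 12-bit signed immediate of a load or store. */
bool fits_simm12(int64_t off)
{
    return off >= -2048 && off <= 2047;
}

/* Signed displacement of sym from gp. */
int64_t gp_offset(uint64_t sym, uint64_t gp)
{
    return (int64_t)(sym - gp);
}

/* Simplified relaxation test: gp must be defined (that is, the linker
   found __global_pointer$) and the symbol must be within reach of it. */
bool can_relax_to_gp(bool gp_defined, uint64_t sym, uint64_t gp)
{
    return gp_defined && fits_simm12(gp_offset(sym, gp));
}
```

When the test fails, the lui/lw pair is simply left in place, so code is correct either way and gp-relative access is purely an optimisation.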


> In addition, it is not clear how gp should be manipulated while calling
> an external component (another shared object).

gp is only used by the main executable; shared libraries do not use it, so it doesn't need to be manipulated when crossing into or between shared objects.

Maciej W. Rozycki

Mar 23, 2020, 9:33:53 PM3/23/20
to Andrew Waterman, Fangrui Song, Sam Elliott, RISC-V SW Dev
On Mon, 23 Mar 2020, Andrew Waterman wrote:

> > In addition, it is not clear how gp should be manipulated while calling
> > an external component (another shared object).
>
> gp is only used by the main executable; shared libraries do not use it, so
> it doesn't need to be manipulated when crossing into or between shared
> objects.

I suppose GP could be used to reduce the number of instructions needed
for accesses to the GOT (combined with `.sdata' and `.sbss' if required)
in DSOs, however the limited 12-bit span of offsets supported by machine
instructions combined with the presence of PC-relative addressing possible
with just two hardware instructions makes such an optimisation somewhat
questionable compared to architectures such as Alpha, MIPS or Power that
have 16-bit offsets and no reasonable PC-relative addressing (at least in
the classic instruction sets).

Maciej

Maciej W. Rozycki

Mar 23, 2020, 9:49:35 PM3/23/20
to Fangrui Song, RISC-V SW Dev, Jim Wilson
Hi Fangrui,

Thank you for your input.

> I am not subscribed, so I suspect my reply will be eaten by Google
> Groups... I also guessed your email addresses.

It went through, as I received it at my LMO personal e-mail address too.
Perhaps the mailing list isn't restricted to posting by subscribers after
all (I sought advice on that from the list owner, but never heard back).

> > Operand | Description
> >=========+================================================================
> > A | Relocation addend.
> >---------+----------------------------------------------------------------
> > DBA | Data segment's base address; 0 in static link.
>
> How is the data segment defined? The PT_LOAD segment containing .data,
> .sdata, or something else?

The data segment here is all the r/w PT_LOAD segments (i.e. those whose
`p_flags' have the PF_R and PF_W bits set), combined. The text segment,
by contrast, is all the r/x PT_LOAD segments (with PF_X and/or PF_R set),
also combined.
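
Put as code, the split comes down to whether PF_W is set. This is a sketch only: the PF_* values are the standard ELF ones, while the enum and function name are invented here.

```c
#include <stdint.h>

/* Standard ELF segment permission bits. */
#define PF_X 0x1u
#define PF_W 0x2u
#define PF_R 0x4u

enum seg_kind { SEG_TEXT, SEG_DATA };

/* Writable PT_LOAD segments belong to the data segment; read-only ones,
   executable or not, belong to the text segment. */
enum seg_kind classify_load_segment(uint32_t p_flags)
{
    return (p_flags & PF_W) ? SEG_DATA : SEG_TEXT;
}
```

Under this rule a leading read-only, non-executable PT_LOAD still counts as part of the text segment.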

> > G | The offset from GP of a GOT entry for the symbol referred by
> > | the relocation.
> >---------+----------------------------------------------------------------
> > GP | The value of GP associated with the symbol referred, nominally
> > | (DVMA + DBA + 2048).
>
> GNU ld seems to define __global_pointer$ = .sdata + 0x800
> In lld, I arbitrarily set it to (exists(.sdata) ? .sdata : __ehdr_start)
> + 0x800

There's no clash here I believe. A distinct linker script will likely be
required for the FDPIC configuration, but that's an implementation detail.

> >---------+----------------------------------------------------------------
> > P | The place (offset or address) of the storage unit affected by
> > | the relocation.
> >---------+----------------------------------------------------------------
> > PLTE | The address of a PLT entry associated with the symbol referred.
> >---------+----------------------------------------------------------------
> > PLTI | The address of a PLT entry designated to make indirect calls.
>
> I am confused by PLTE/PLTI.

It is further described in 4.3 "Procedure Calls (normative)" although the
acronyms are not referred to there, which was a mistake. I have corrected it now.

> Some PLT entries do not need a .symtab/.dyntab entry:
>
> As an example, bl foo (R_PPC64_REL24) can cause the creation of PLT
> call stubs. There can be several stubs for one symbol, because each
> call stub can only be accessed within +-32MB.
>
> R_AARCH64_{CALL,JUMP}26 can cause the creation of similar call stubs
> (veneers).
>
> Some PLT entries need a .dynsym entry: canonical PLT entry (st_value>0,
> st_shndx=0).
> Such a PLT is caused by non-pic code, create by the linker for
> non-GOT-non-PLT relocation
> types to an external function.
>
> What are PLTE and PLTI?

Each PLTE entry is associated with an individual external function symbol
to which direct calls are made. There is only one PLTI entry, used for
making indirect calls.

> > S | The value of the symbol referred by the relocation.
> >---------+----------------------------------------------------------------
> > TBA | Text segment's base address; 0 in static link.
>
> GNU ld -z separate-code (default on Linux x86 since 2.31) has the
> following segment layout:
>
> R
> RX
> R

These are the text segment, relocated together at load time.

> RW (relro ; non-relro)

This is the data segment, relocated together at load time.

> lld has the following segment layout (since lld 9):
>
> R
> RX

These are the text segment, relocated together at load time.

> RW(RELRO)
> RW(non-RELRO)

These are the data segment, relocated together at load time.

> The first PT_LOAD is not executable. Does the mandatory 0 in a static
> link cause confusion?

The base address is 0 in static-link calculation, because the
dynamic-load relocation does not yet apply at this stage; the relevant
segment's VMA (as with `p_vaddr') is the actual address used for
calculation. A relative relocation (either R_RISCV_REL_TEXT or
R_RISCV_REL_DATA, as applicable) may have to be attached to the result of
such calculation for dynamic-load relocation.

FAOD I have been using the explicit terms: "static linker" and "dynamic
loader", and derived grammatical forms such as "static linking",
"static-link", etc. and "dynamic loading", "dynamic-load", etc. to avoid
confusion in terminology like with "dynamic linker", which makes lone
"linker" have two meanings. This has nothing to do with static vs dynamic
executables (all FDPIC executables are PIE anyway and go through the
dynamic load stage at run time, although the dynamic loader code may be
embedded within the executable rather than standalone).

> >Table 4.2 Relocation Types
> >
> > Name                     | Value | Field       | Symbol    | Calculation
> >==========================+=======+=============+===========+=============
> > R_RISCV_RELATIVE         | 3     | T-word32,64 | n/a       | TBA + A
> > R_RISCV_REL_TEXT (alias) |       |             |           |
> >--------------------------+-------+-------------+-----------+-------------
> > R_RISCV_GP               | 12    | T-word32,64 | any       | GP
> >--------------------------+-------+-------------+-----------+-------------
> > R_RISCV_REL_DATA         | 13    | T-word32,64 | n/a       | DBA + A
>
> AFAIK no relocation type uses the start of a segment for calculation.
> A concrete section is needed.

A relocation whose symbol index in `r_info' is STN_UNDEF does not refer to
a symbol nor, consequently, to a section. Instead a value of 0 is used for
calculation; this has been explicitly defined in the ELF gABI.

This value is still relocated in dynamic loading by the base address as
are all actual symbols (save for SHN_ABS ones); since we have separate
base addresses for text and data in this specification this will be the
text base address and the data base address respectively and distinct
relocations are therefore required.

There are many existing examples of such relocation calculation across
various psABIs, e.g. R_ALPHA_RELATIVE, R_386_RELATIVE, etc.
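
As an illustration only, the difference between the two relative
relocations can be sketched in C. The relocation numbers (3 and 13) and the
TBA + A / DBA + A calculations are taken from the quoted Table 4.2; the
function name fdpic_apply_relative and the way the bases are passed in are
hypothetical and not part of the proposal.

```c
#include <assert.h>
#include <stdint.h>

/* Relocation numbers as listed in the draft's Table 4.2. */
#define R_RISCV_RELATIVE 3   /* alias R_RISCV_REL_TEXT: TBA + A */
#define R_RISCV_REL_DATA 13  /* DBA + A */

/*
 * Hypothetical helper: the value a dynamic loader would store for a
 * relative relocation.  tba and dba are the load-time text and data base
 * addresses; the addend carries the link-time VMA, because both bases
 * are 0 during the static link.
 */
uintptr_t fdpic_apply_relative(unsigned type, uintptr_t tba,
                               uintptr_t dba, uintptr_t addend)
{
    assert(type == R_RISCV_RELATIVE || type == R_RISCV_REL_DATA);
    return (type == R_RISCV_RELATIVE ? tba : dba) + addend;
}
```

With both bases at 0 the helper returns the addend unchanged, which is the
static-link case described above.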

> >FDPIC code, indirect call (to a2):
> >
> >    # Outstanding static relocations
> >    c.mv    t0, a2
> >label:
> >    auipc   ra, %pcrel_call_hi(@PLT)    # R_RISCV_CALL_PLT
> >    jalr    ra, ra, %pcrel_call_lo(label)
> >    l[w|d]  gp, <gp_slot>(sp)
> >
> > # The R_RISCV_CALL_PLT relocation with no symbol referred resolves to
> > # the PLT entry associated with indirect calls.
>
> %pcrel_call_hi is not defined.

It's an implementation detail (the section is informative); assemblers
are free to define their own syntax, which is beyond the scope of an ABI.

In the GNU assembler percent-operators indicate relocation, however we
currently have an issue in that several operations have not been defined
and the compiler has no direct way to synthesize them other than with the
`.reloc' pseudo-op.

In particular there is no way (or I haven't found one) for the compiler
to emit an instruction sequence to make a function call. Instead the
`call' assembly macro has to be used, which expands to a pair of
instructions.

So I used this synthetic example instead, using nonexistent percent-ops.
Perhaps this could be expressed in a better way; suggestions are welcome.
Maybe this could be just:

    # Outstanding static relocations
    c.mv    t0, a2
    auipc   ra, %call_plt(@PLT)     # R_RISCV_CALL_PLT
    jalr    ra, ra, 0
    l[w|d]  gp, <gp_slot>(sp)

instead (observing that the R_RISCV_CALL_PLT relocation has its relocated
fields spread across two instructions). I have updated my code examples
accordingly.

> "The R_RISCV_CALL_PLT relocation with no symbol" - does it refer to the
> PLT header?

Yes, aka PLTI, according to this definition:

 Name                     | Value | Field       | Symbol    | Calculation
==========================+=======+=============+===========+=============
                          |       |             | local     | S - P
 R_RISCV_CALL_PLT         | 19    | V-hi20lo12i | external  | PLTE - P
                          |       |             | n/a       | PLTI - P
--------------------------+-------+-------------+-----------+-------------

-- no symbol referred here, so the third calculation applies.
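
A rough C sketch of how a static linker might pick between the three
calculations and then split the result across the auipc/jalr pair. The
S - P, PLTE - P and PLTI - P formulas come from the table above, and the
hi20/lo12 split is the usual RISC-V one; the enum, the function names and
the argument layout are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Which of the three table rows applies to a given relocation. */
enum call_target { CALL_LOCAL, CALL_EXTERNAL, CALL_NO_SYMBOL };

/*
 * Hypothetical helper: s is the symbol value, plte the symbol's PLT
 * entry, plti the PLT entry used for indirect calls, p the address of
 * the relocated auipc instruction.
 */
int64_t call_plt_value(enum call_target kind, uint64_t s, uint64_t plte,
                       uint64_t plti, uint64_t p)
{
    switch (kind) {
    case CALL_LOCAL:    return (int64_t)(s - p);     /* S - P */
    case CALL_EXTERNAL: return (int64_t)(plte - p);  /* PLTE - P */
    default:            return (int64_t)(plti - p);  /* PLTI - P */
    }
}

/* Low 12 bits, sign-extended the way jalr treats its immediate. */
int32_t lo12(int64_t v)
{
    return (int32_t)(((v & 0xfff) ^ 0x800) - 0x800);
}

/* Upper part for auipc, adjusted so that hi20 * 4096 + lo12 == v. */
int32_t hi20(int64_t v)
{
    int64_t hi = (v - lo12(v)) / 4096;
    /* The pair can only reach displacements with a 20-bit upper part. */
    assert(hi >= -0x80000 && hi <= 0x7ffff);
    return (int32_t)hi;
}
```

Because the low part is sign-extended, a displacement such as 0x12345fff
splits into an upper part of 0x12346 and a low part of -1.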

Do these explanations and corrections clear your concerns?

Maciej

Andrew Waterman

Mar 23, 2020, 10:39:05 PM
to Maciej W. Rozycki, Fangrui Song, Sam Elliott, RISC-V SW Dev
Agreed.  Fortunately, if, down the road, that proves to be a good use of gp, it can be done without changing the ABI.



Maciej W. Rozycki

Mar 25, 2020, 3:17:44 PM
to Sam Elliott, sw-...@groups.riscv.org
Hi Sam,

Thank you for your input.

> In our system, we do not have a MMU, and want as simple a loader as
> possible. With this in mind we will be statically linking all our
> embedded application executables. Thus, we are most interested in
> ROPI/RWPI, rather than FDPIC.
>
> However, I think that FDPIC is not entirely orthogonal to ROPI/RWPI. It
> seems very likely that most of the GP-relative relocations and code
> sequences you propose here for local data addressing would also be
> useful for ROPI/RWPI (when combined with pc-relative, non-PLT function
> calls).

That is correct, static PIE is just a special case where you have no
additional modules loaded. So all the dynamic relocation processing is
still done as required for text and data segment separation (you still
want to map text once with multiple instances of the PIE running), but you
don't need a PLT or FDT because all symbols by definition resolve locally.

NB if you don't need to run multiple instances of the same executable,
then you can just get away with the flat binary format already supported,
so long as you build your software using the static PIE format.

> I am not convinced the code sequence for taking the address of a local
> function is correct, for either FDPIC or statically linked ROPI/RWPI
> executables, because I don't think you can do the static relocation
> required for R_RISCV_GPREL_HI20(fun) if you don't know the distance
> between the text and data sections (something you only know at runtime).

This is a GP-relative rather than a PC-relative reference so the local
sequence is right (the external one has an editorial mistake, as noticed
by Stef already), as the pointer taken will be to the relevant FDT entry,
which is data (poked at by the dynamic loader) and therefore in the data
segment. The whole point of a separate GP is to have local data offsets
constant with respect to it.

> I note you've had feedback that the sequences may need to be changed for
> FDPIC anyway, but I think ROPI/RWPI may just use the conventional
> pc-relative code sequences.

I think you are right here as in your environment you won't ever pass
function pointers externally, and therefore you don't need to update GP.

This scenario would actually correspond to the STV_INTERNAL export class
(visibility) in the usual dynamic load scenario, including the FDPIC ABI
in particular. So I think it may be worth it to permit function symbols
marked STV_INTERNAL to be referred directly not only for calls, but
for taking their address as well in the FDPIC psABI. In that case no FDT
entry will be required and the address can be taken with a PC-relative
reference.

There is no way that I know of however to verify that such a pointer is
not passed externally (except perhaps by static analysis, which is beyond
the scope of a compiler toolchain), so the onus would be on the software
writer to make sure the restriction has not been violated.

I think this would actually be a useful enhancement to the FDPIC psABI
addendum. By having the semantics of STV_INTERNAL symbols defined like
this in the specification we'll have both the FDPIC and the ROPI/RWPI use
cases covered with a single ABI (the latter as a special case of the more
general FDPIC case). With such semantics to build a static PIE program
for the ROPI/RWPI rather than full-FDPIC case all you'll have to do with
GCC will be using the `-mfdpic -fvisibility=internal' command-line options
(and of course nothing prevents us from making that the default for the
compiler at its build time, based either on target selection or `--with-*'
configuration options if that made people's life easier). Other compilers
may follow suit.

NB regular FDPIC static PIE programs will still require FDT entries to be
created for function pointers passed to modules loaded with dlopen(3).

Does this reply answer your questions and clear your concerns?

Maciej

Maciej W. Rozycki

Mar 26, 2020, 9:26:43 AM
to Sam Elliott, sw-...@groups.riscv.org
On Wed, 25 Mar 2020, Maciej W. Rozycki wrote:

> This scenario would actually correspond to the STV_INTERNAL export class
> (visibility) in the usual dynamic load scenario, including the FDPIC ABI
> in particular. So I think it may be worth it to permit function symbols
> marked STV_INTERNAL to be referred directly not only for calls, but
> for taking their address as well in the FDPIC psABI. In that case no FDT
> entry will be required and the address can be taken with a PC-relative
> reference.
>
> There is no way that I know of however to verify that such a pointer is
> not passed externally (except perhaps by static analysis, which is beyond
> the scope of a compiler toolchain), so the onus would be on the software
> writer to make sure the restriction has not been violated.
>
> I think this would actually be a useful enhancement to the FDPIC psABI
> addendum. By having the semantics of STV_INTERNAL symbols defined like
> this in the specification we'll have both the FDPIC and the ROPI/RWPI use
> cases covered with a single ABI (the latter as a special case of the more
> general FDPIC case). With such semantics to build a static PIE program
> for the ROPI/RWPI rather than full-FDPIC case all you'll have to do with
> GCC will be using the `-mfdpic -fvisibility=internal' command-line options
> (and of course nothing prevents us from making that the default for the
> compiler at its build time, based either on target selection or `--with-*'
> configuration options if that made people's life easier). Other compilers
> may follow suit.

Ditch it! There is no way I can think of to actually track visibility at
a pointer's *use* place and we have no way to restrict a function pointer
type/variable to only accept assignments from STV_INTERNAL function
references, which would be a guarantee that only an STV_INTERNAL function
could be pointed at. At least with the high-level-language/toolchain
infrastructure we have, down to the static linker.

Barring that we need to keep the function pointer format uniform whether
for local or external references so that one piece of code works for both,
and for FDPIC that means using a pointer to a function descriptor rather
than the entry point as a function pointer.
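
To make the uniformity argument concrete, here is a rough C model of a
call through a function descriptor. The two-word layout (entry point plus
the callee's GP value) is the one commonly used by FDPIC ABIs on other
architectures and is only assumed here; struct fdpic_desc, fdpic_call and
passing GP as an explicit argument are hypothetical stand-ins for what the
compiler and the gp register would do.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed descriptor layout: entry point plus the callee's GP value. */
struct fdpic_desc {
    long (*entry)(const void *gp, long index);  /* GP modelled as an argument */
    const void *gp;
};

/*
 * A function pointer is a pointer to a descriptor, so the caller runs
 * the same sequence whether the callee is local or external.
 */
long fdpic_call(const struct fdpic_desc *fp, long index)
{
    assert(fp != NULL && fp->entry != NULL);
    return fp->entry(fp->gp, index);
}

/* Example callee: reads its module's data at a fixed offset from GP. */
long read_global(const void *gp, long index)
{
    return ((const long *)gp)[index];
}

/* Two "modules" sharing the same code but owning separate data. */
static const long module_a_data[] = { 10, 11 };
static const long module_b_data[] = { 20, 21 };
const struct fdpic_desc desc_a = { read_global, module_a_data };
const struct fdpic_desc desc_b = { read_global, module_b_data };
```

Calling through desc_a and desc_b runs the same code against different
data, which a bare entry-point pointer could not express.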

So while it looked like a nice idea at first unfortunately it does not
appear feasible. Sigh.

Maciej

Sam Elliott

Mar 26, 2020, 9:48:17 AM
to Maciej W. Rozycki, sw-...@groups.riscv.org
Thanks for the reply!
Ah, I see, so the R_RISCV_GPREL_HI20(fun) is actually closer to something like R_RISCV_GPREL_HI20(__riscv_fdt_fun). And we know the FDTs are in the data section.

I have two notational points, which I feel add clarity and ensure this specification lines up with the existing assembler conventions:

1. Can I propose we use the fun@FDT notation, given the psABI already using the @PLT notation for the PLT? Thus above, the relocation would be R_RISCV_GPREL_HI20(fun@FDT) - I don't think the symbols need to change, but I think this makes it more obvious that you're really pointing at the FDT entry here.

2. It came up in a different reply of yours, but you stated you would prefer not to add %-based assembly operators in this proposal (specifically gprel_add). I think this proposal is exactly the time to propose these operators, in line with existing conventions, especially as one of the replies has advocated for trying to avoid the explosion of relocations that are needed to cover GOT, non-GOT, PLT relocations etc.

>
>> I note you've had feedback that the sequences may need to be changed for
>> FDPIC anyway, but I think ROPI/RWPI may just use the conventional
>> pc-relative code sequences.
>
> I think you are right here as in your environment you won't ever pass
> function pointers externally, and therefore you don't need to update GP.
>
> This scenario would actually correspond to the STV_INTERNAL export class
> (visibility) in the usual dynamic load scenario, including the FDPIC ABI
> in particular. So I think it may be worth it to permit function symbols
> marked STV_INTERNAL to be referred directly not only for calls, but
> for taking their address as well in the FDPIC psABI. In that case no FDT
> entry will be required and the address can be taken with a PC-relative
> reference.
>
> There is no way that I know of however to verify that such a pointer is
> not passed externally (except perhaps by static analysis, which is beyond
> the scope of a compiler toolchain), so the onus would be on the software
> writer to make sure the restriction has not been violated.

Yeah this does sound like an issue, but there are other not dissimilar issues on ROPI/RWPI anyway to do with where constant vs non-constant data is placed (and issues around "constant" pointers to non-constant data). I would err towards not creating this semantic issue in FDPIC, but on the other hand if it helps us avoid another psABI, I could see the advantage of using STV_INTERNAL in this way.

>
> I think this would actually be a useful enhancement to the FDPIC psABI
> addendum. By having the semantics of STV_INTERNAL symbols defined like
> this in the specification we'll have both the FDPIC and the ROPI/RWPI use
> cases covered with a single ABI (the latter as a special case of the more
> general FDPIC case). With such semantics to build a static PIE program
> for the ROPI/RWPI rather than full-FDPIC case all you'll have to do with
> GCC will be using the `-mfdpic -fvisibility=internal' command-line options
> (and of course nothing prevents us from making that the default for the
> compiler at its build time, based either on target selection or `--with-*'
> configuration options if that made people's life easier). Other compilers
> may follow suit.
>
> NB regular FDPIC static PIE programs will still require FDT entries to be
> created for function pointers passed to modules loaded with dlopen(3).

One of the reasons for us choosing ROPI/RWPI is that it should have lower loading overhead than the full FDPIC implementation, so should be compatible with small embedded systems. Given we want as simple a loader as possible, it's likely the platform will also not provide dlopen(3) in any capacity either.

>
> Does this reply answer your questions and clear your concerns?

I think it does, and I am more satisfied with the "normal FDPIC" proposal. I do need more time to think about how FDPIC+static-PIE may be compatible or not with what I expected to propose for ROPI/RWPI.

Thanks for helping clarify my understanding of how ROPI/RWPI and FDPIC relate to each other

Sam

>
> Maciej

Jim Wilson

Mar 26, 2020, 11:15:21 PM
to Maciej W. Rozycki, RISC-V SW Dev
On Mon, Mar 23, 2020 at 10:13 AM Maciej W. Rozycki <ma...@wdc.com> wrote:
> > Currently, the R_RISCV_GPREL_I and R_RISCV_GPREL_S can only be created
> > by linker relaxation, so we don't have assembler support for them, and
> > this is maybe also why the names are a little different than what you
> > expect.
>
> Well, as long as BFD provides them you can always use `.reloc' to emit
> them with GAS. This doesn't solve the issue of link-time overflow
> detection however; they do not have a corresponding high-part relocation
> so we do expect them to catch overflows to facilitate code that has been
> written for `.sdata'/`.sbss' support, don't we? Or otherwise what is the
> purpose of their existence?

Linker relaxation only creates them if they are in range. This is a
code size optimization. So for a testcase

int i;
int main (void) { return i; }

Using riscv64-unknown-linux-gcc -O -c to compile it and running
objdump on the output I see

0000000000000000 <main>:
   0:   000007b7        lui     a5,0x0
                        0: R_RISCV_HI20         i
                        0: R_RISCV_RELAX        *ABS*
   4:   0007a503        lw      a0,0(a5) # 0 <main>
                        4: R_RISCV_LO12_I       i
                        4: R_RISCV_RELAX        *ABS*
   8:   8082            ret

Then at link time if the variable i is within range of gp, then linker
relaxation deletes the lui instruction, changes its reloc to
R_RISCV_NONE, changes the lw to use gp as the base address, and
changes the reloc to R_RISCV_GPREL_I. Adding --emit-relocs to the
link, and running objdump I see

0000000000010436 <main>:
   10436:   8341a503    lw      a0,-1996(gp) # 12034 <i>
                        10436: R_RISCV_NONE     *ABS*
                        10436: R_RISCV_RELAX    *ABS*
                        10436: R_RISCV_GPREL_I  i-0x12800
                        10436: R_RISCV_RELAX    *ABS*
   1043a:   8082        ret

This linker relaxation support is an important part of the RISC-V
toolchain support for reducing code size, and improving performance.
We handle a number of different cases in linker relaxation, and I
expect that we will add more.
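
The range test behind this relaxation can be sketched in C. The numbers
used to check it come from the listing above (i at 0x12034, and a gp value
of 0x12800 inferred from the "i-0x12800" addend); the function names are
hypothetical, and a real linker applies further conditions that this
sketch ignores.

```c
#include <assert.h>
#include <stdint.h>

/* True if addr can be reached from gp with a signed 12-bit immediate. */
int gprel_in_range(uint64_t addr, uint64_t gp)
{
    int64_t d = (int64_t)(addr - gp);
    return d >= -2048 && d <= 2047;
}

/* The immediate the relaxed load would use; only valid when in range. */
int32_t gprel_offset(uint64_t addr, uint64_t gp)
{
    assert(gprel_in_range(addr, gp));
    return (int32_t)(int64_t)(addr - gp);
}

/* Encode "lw rd, imm(rs1)": I-type, opcode 0x03, funct3 = 2. */
uint32_t encode_lw(unsigned rd, unsigned rs1, int32_t imm)
{
    assert(rd < 32 && rs1 < 32 && imm >= -2048 && imm <= 2047);
    return (((uint32_t)imm & 0xfffu) << 20) | ((uint32_t)rs1 << 15) |
           (2u << 12) | ((uint32_t)rd << 7) | 0x03u;
}
```

With a0 = x10 and gp = x3, encoding the load with the computed offset
reproduces the 8341a503 word shown in the relaxed output.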

Your point about overflows is a good one. If these relaxation relocs
overflow, then it is a linker bug. We have had a few bugs in this
area that I have had to fix. With your proposal where we have both hi
and lo part gprel relocs, overflow should not be a problem. It isn't
immediately obvious to me if that means that they need to be different
reloc numbers though. I suppose it will depend on how the relocs are
represented, but different reloc numbers may be necessary so we can
handle overflow differently for them.

> Also as a side note I think it is GCC (or any other compiler) that should
> produce the intended assembly right from the beginning, so as to get the
> code size right and avoid unnecessary longer sequences such as with
> branches that seem out of range due to size estimate pessimisation but are
> not (of course some sizes are only known at link stage making certain
> kinds of optimisations possible in the linker anyway).

It isn't possible for gcc to produce the smallest code size directly.
Gcc doesn't emit compressed instructions; the assembler does this. So
gcc doesn't know the size of the code. This is fixable in theory, but
doesn't really help. Neither gcc nor the assembler know link time
addresses, and hence some compressed instructions can only be
generated at link time via relaxation. Also, we need link time
address info to perform relaxations like converting lui/add or lui/lw
to a single add or lw instruction off of the gp reg when the address
is in range. There are also other relaxations performed in the
linker. Since code size reduction via relaxation can change function
and variable addresses, we can't know any address until linker
relaxation is done. Like it or not, linker relaxation is a very
important part of the RISC-V toolchain.

> > For relaxation purposes, there should be a reloc on the add, so it should be
> > add t0,t0,gp,%gprel_add(var+addend)
>
> I don't think we need to invent extra syntax here for this as we have the
> `.reloc' pseudo-op for such use cases, e.g. where no instruction operand
> refers to a symbol or there's no symbol involved (cf. R_MIPS_JALR). This
> could look like:

Good point about .reloc. Unfortunately, we already support the four
operand add for the tls reloc, and can't drop that without
compatibility break, but we could consider using .reloc going forward.

Though trying this, I see it gets a little complicated. Given the testcase

__thread int i;
int main (void) { return i; }

riscv64-unknown-linux-gnu-gcc -O -S generates

main:
        lui     a5,%tprel_hi(i)
        add     a5,a5,tp,%tprel_add(i)
        lw      a0,%tprel_lo(i)(a5)
        ret

Note the four operand add for the extra reloc. Assembling and running
objdump I get

0000000000000000 <main>:
   0:   000007b7        lui     a5,0x0
                        0: R_RISCV_TPREL_HI20   i
                        0: R_RISCV_RELAX        *ABS*
   4:   004787b3        add     a5,a5,tp
                        4: R_RISCV_TPREL_ADD    i
                        4: R_RISCV_RELAX        *ABS*
   8:   0007a503        lw      a0,0(a5) # 0 <main>
                        8: R_RISCV_TPREL_LO12_I i
                        8: R_RISCV_RELAX        *ABS*
   c:   8082            ret

and then linking with relaxation and objdump I get

0000000000010466 <main>:
   10466:   00022503    lw      a0,0(tp) # 0 <i>
                        10466: R_RISCV_NONE     *ABS*
                        10466: R_RISCV_RELAX    *ABS*
                        10466: R_RISCV_NONE     *ABS*
                        10466: R_RISCV_RELAX    *ABS*
                        10466: R_RISCV_TPREL_I  i
                        10466: R_RISCV_RELAX    *ABS*
   1046a:   8082        ret

Now trying this with .reloc, I was able to make it work, but I need to
add two relocs to the tp add, the TPREL_ADD reloc and a RELAX reloc.
I then ran into the problem that absent the reloc, the assembler
converts the add into a compressed add, and as a compressed add the
relaxation doesn't work. So I had to disable assembler compression
for the add. That gives me

main:
        lui     a5,%tprel_hi(i)
        .option push
        .option norvc
0:
        add     a5,a5,tp
        .reloc  0b, R_RISCV_TPREL_ADD, i
        .reloc  0b, R_RISCV_RELAX
        .option pop
        lw      a0,%tprel_lo(i)(a5)
        ret

This does work, but it isn't very convenient. The four operand add
got expanded into seven lines of code in the assembly output. Now the
relaxation problem with a compressed add could perhaps be considered a
relaxation bug, and might be fixable. If that is fixable, then the
four operand add only gets expanded to four lines of assembler code,
which isn't as bad as 7, but could still be inconvenient. The current
syntax is much friendlier to people trying to write assembly code.

> Relaxation optimisations like this were considered and comprehensively
> implemented with the nanoMIPS target in GOLD, publicly available. I think
> we ought to follow suit.

FYI there is no RISC-V GOLD support, if someone wants to volunteer to
do that work.

> Therefore I think this will be best considered separately, as this is not
> strictly necessary for FDPIC support on one hand, and may be used for
> other purposes on the other. For this reason I have decided not to
> include any relaxation support with the FDPIC psABI addendum.

I think you will find code size and performance to be disappointing if
linker relaxation is not considered from the start. But yes, it
should be possible to handle relaxation as a separate task.

> I now actually wonder if we shouldn't have used composed relocations
> (e.g. R_RISCV_GOT for GOT references with a corresponding %got operator,
> R_RISCV_PCREL for PC-relative calculations with %pcrel, etc.) to avoid
> proliferating relocation variants providing repeating patterns.

We can't change existing relocs without an ABI break. But as a
general comment, yes, the current scheme is not designed but rather
implemented as necessary.

> Likewise, this can be done with `.reloc' like I noted above, and the
> relaxation defined separately. I think relaxation support that requires
> psABI support (e.g. extra relocations) should be defined in a separate
> section of the standard. Perhaps individual sections included in the base
> psABI and this addendum.
>
> If you think it is important to have relaxation defined right from the
> beginning (why?), then I might consider doing it right away.

Linker relaxation is fundamental to the design of the RISC-V
toolchain, or perhaps I should say the RISC-V GNU toolchain. You
won't get good code size or performance without it. I'm not sure if
separating this stuff out to a separate section makes sense. It may
also be difficult to do that, since some of the relaxations don't
require relocs, and some of the relaxations use the same relocs used
elsewhere, and only some of the relaxations require unique relocs used
only for relaxation.

Jim

Jim Wilson

Mar 27, 2020, 12:05:28 AM
to Maciej W. Rozycki, Fangrui Song, RISC-V SW Dev
On Mon, Mar 23, 2020 at 6:49 PM Maciej W. Rozycki <ma...@wdc.com> wrote:
> Hi Fangrui,
> > I am not subscribed, so I suspect my reply will be eaten by Google
> > Groups... I also guessed your email addresses.
>
> It went through as I received it at my LMO personal e-mail address too.
> Perhaps the mailing list isn't open for posting only by subscribers after
> all (I sought advice on that from the list owner, but haven't ever heard
> back).

Maybe the next draft can be done via the
github.com/riscv/riscv-elf-psabi-doc tree as an issue or pull request?
I think most all interested parties are watching that github repo.

> It's an implementation detail (the section is informative); assemblers
> are free to define their own syntax, which is beyond the scope of an ABI.

Well, compatibility between assemblers is useful, and I would hope
that GCC and LLVM at least have compatible assembly syntax.

> In the GNU assembler percent-operators indicate relocation, however we
> currently have an issue in that several operations have not been defined
> and the compiler has no direct way to synthesize them other than with the
> `.reloc' pseudo-op.
>
> In particular there is no way (or I haven't found one) for the compiler
> to emit an instruction sequence to make a function call. Instead the
> `call' assembly macro has to be used, which expands to a pair of
> instructions.

Yes, this is lacking.

We do have an assembler manual, but it is woefully incomplete.
https://github.com/riscv/riscv-asm-manual
and some things like call can't be easily expressed except as a macro
as you mentioned.

I would offer one word of warning, which is that gcc -mcmodel=medany
-mexplicit-relocs is known to fail sometimes. It is a complex problem
that might require an ABI change to fix. It has something like a 1 in
a 1K chance of failing for each risky use. So we should not
accidentally encourage use of explicit relocs in cases when it is
known to fail.
https://groups.google.com/a/groups.riscv.org/forum/#!msg/sw-dev/KnziiZtEJNo/M8Vfbw9UCgAJ

Jim

Maciej W. Rozycki

Apr 4, 2020, 1:07:09 PM
to Sam Elliott, sw-...@groups.riscv.org
On Thu, 26 Mar 2020, Sam Elliott wrote:

> >> I am not convinced the code sequence for taking the address of a local
> >> function is correct, for either FDPIC or statically linked ROPI/RWPI
> >> executables, because I don't think you can do the static relocation
> >> required for R_RISCV_GPREL_HI20(fun) if you don't know the distance
> >> between the text and data sections (something you only know at runtime).
> >
> > This is a GP-relative rather than a PC-relative reference so the local
> > sequence is right (the external one has an editorial mistake, as noticed
> > by Stef already), as the pointer taken will be to the relevant FDT entry,
> > which is data (poked at by the dynamic loader) and therefore in the data
> > segment. The whole point of a separate GP is to have local data offsets
> > constant with respect to it.
>
> Ah, I see, so the R_RISCV_GPREL_HI20(fun) is actually closer to
> something like R_RISCV_GPREL_HI20(__riscv_fdt_fun). And we know the FDTs
> are in the data section.

Yes, the static linker can handle the redirection, as it does for various
special cases across some targets. There's no need to create an actual
static `__riscv_fdt_fun' symbol.

Some thought may have to be put though into recording such an arrangement
in debug information. I know cases where it's not done at all, causing
troubles in debugging (GDB may have heuristics or rough static code
analysis implemented to handle some cases).

> I have two notational points, which I feel add clarity and ensure this
> specification lines up with the existing assembler conventions:
>
> 1. Can I propose we use the fun@FDT notation, given the psABI already
> using the @PLT notation for the PLT? Thus above, the relocation would
> be R_RISCV_GPREL_HI20(fun@FDT) - I don't think the symbols need to
> change, but I think this makes it more obvious that you're really
> pointing at the FDT entry here.

Hmm, now that you mention it I don't think this would be right as in my
view `fun@FDT' is just an alternative notation for the same relocation
operation (IOW I shouldn't have used `fun@PLT' either, because %call_plt()
already denotes that operation on the `fun' symbol). I have therefore
removed the `@PLT' symbol suffixes from code examples instead.

> 2. It came up in a different reply of yours, but you stated you would
> prefer not to add %-based assembly operators in this proposal
> (specifically gprel_add). I think this proposal is exactly the time
> to propose these operators, in line with existing conventions,
> especially as one of the replies has advocated for trying to avoid
> the explosion of relocations that are needed to cover GOT, non-GOT,
> PLT relocations etc.

I think assembly source syntax belongs to an assembly language manual or
specification (if we want to have a normative reference on this). While I
am not opposed to having one, I don't think a psABI document is the right
place for this (as it covers the binary format and not any programming
language), and neither is an architecture specification (as it covers the
hardware and again not any programming language syntax, including the
assembly language).

> > This scenario would actually correspond to the STV_INTERNAL export class
> > (visibility) in the usual dynamic load scenario, including the FDPIC ABI
> > in particular. So I think it may be worth it to permit function symbols
> > marked STV_INTERNAL to be referred directly not only for calls, but
> > for taking their address as well in the FDPIC psABI. In that case no FDT
> > entry will be required and the address can be taken with a PC-relative
> > reference.
> >
> > There is no way that I know of however to verify that such a pointer is
> > not passed externally (except perhaps by static analysis, which is beyond
> > the scope of a compiler toolchain), so the onus would be on the software
> > writer to make sure the restriction has not been violated.
>
> Yeah this does sound like an issue, but there are other not dissimilar
> issues on ROPI/RWPI anyway to do with where constant vs non-constant
> data is placed (and issues around "constant" pointers to non-constant
> data). I would err towards not creating this semantic issue in FDPIC,
> but on the other hand if it helps us avoid another psABI, I could see
> the advantage of using STV_INTERNAL in this way.

Well, constant data is typically placed in sections like `.rodata' that
have their SHF_WRITE flag clear and at the static link time are merged
with sections containing code into the text segment.

I can see a problem here with referring to such read-only data as the
referrer may not necessarily know if data is constant or not and therefore
whether to use PC-relative or GP-relative addressing. In that case either
link-time relaxation or copy relocations will be required.

If instead constant data is merged with sections containing writable data
into the data segment, then there is no such issue, but memory is wasted.
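
As a small sketch of the decision involved, assuming the defining section
is known: once the static linker has seen the section flags the choice is
mechanical, and the difficulty described above is that the referrer has to
commit to a code sequence before that. The SHF_* values are those of the
ELF gABI; the enum and the function name are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Section flag values from the ELF gABI. */
#define SHF_WRITE     0x1
#define SHF_ALLOC     0x2
#define SHF_EXECINSTR 0x4

enum addressing { ADDR_PC_RELATIVE, ADDR_GP_RELATIVE };

/*
 * Hypothetical helper: read-only sections end up in the text segment
 * and are at a fixed distance from the PC; writable sections end up in
 * the data segment and are at a fixed distance from GP.
 */
enum addressing addressing_for_section(uint64_t sh_flags)
{
    assert(sh_flags & SHF_ALLOC);  /* only loaded sections can be addressed */
    return (sh_flags & SHF_WRITE) ? ADDR_GP_RELATIVE : ADDR_PC_RELATIVE;
}
```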

I'll have to think about it some more, good point!

> > I think this would actually be a useful enhancement to the FDPIC psABI
> > addendum. By having the semantics of STV_INTERNAL symbols defined like
> > this in the specification we'll have both the FDPIC and the ROPI/RWPI use
> > cases covered with a single ABI (the latter as a special case of the more
> > general FDPIC case). With such semantics to build a static PIE program
> > for the ROPI/RWPI rather than full-FDPIC case all you'll have to do with
> > GCC will be using the `-mfdpic -fvisibility=internal' command-line options
> > (and of course nothing prevents us from making that the default for the
> > compiler at its build time, based either on target selection or `--with-*'
> > configuration options if that made people's life easier). Other compilers
> > may follow suit.
> >
> > NB regular FDPIC static PIE programs will still require FDT entries to be
> > created for function pointers passed to modules loaded with dlopen(3).
>
> One of the reasons for us choosing ROPI/RWPI is that it should have
> lower loading overhead than the full FDPIC implementation, so should be
> compatible with small embedded systems. Given we want as simple a loader
> as possible, it's likely the platform will also not provide dlopen(3) in
> any capacity either.

NB as we need to separate the PC from the GP anyway in GCC's code
generator to have FDPIC implemented I expect to have ROPI/RWPI supported
for RISC-V as a side effect.

> Thanks for helping clarify my understanding of how ROPI/RWPI and FDPIC
> relate to each other

You are welcome!

Maciej

Maciej W. Rozycki

Apr 7, 2020, 6:14:16 PM
to Jim Wilson, RISC-V SW Dev
Hi Jim,
Right, but in this context the relocation is informational only really;
the relocation is never present (unless explicitly requested with
`.reloc') in object modules and in a fully-linked binary any static
relocations are never going to be fed back to a linker, so
any overflow semantics does not matter.

So we could perhaps reuse relocation codes after all.

> This linker relaxation support is an important part of the RISC-V
> toolchain support for reducing code size, and improving performance.
> We handle a number of different cases in linker relaxation, and I
> expect that we will add more.

Sure.

> Your point about overflows is a good one. If these relaxation relocs
> overflow, then it is a linker bug. We have had a few bugs in this
> area that I have had to fix. With your proposal where we have both hi
> and lo part gprel relocs, overflow should not be a problem. It isn't
> immediately obvious to me if that means that they need to be different
> reloc numbers though. I suppose it will depend on how the relocs are
> represented, but different reloc numbers may be necessary so we can
> handle overflow differently for them.

As a reference we have this regular GOT vs large GOT (`-mxgot') model in
the MIPS target, where the latter uses R_MIPS_GOT_HI16/R_MIPS_GOT_LO16
relocation pairs to refer to the high 16-bit and the low 16-bit parts of
the GOT offset respectively rather than reusing R_MIPS_GOT16 for the low
part relocation, otherwise used for the former model. If you look at the
MIPS psABI (which I'm sure you're familiar with anyway), you'll notice
that R_MIPS_GOT_LO16 and R_MIPS_GOT16 both have the same calculation and
the same relocatable field and the only difference between them is the
overflow check.

We have the small issue that we have two kinds of low-part relocations to
match different machine instruction encodings, so the number of individual
relocations required doubles, and relocation numbers, being limited to 256
different values in the 32-bit ABI, are not exactly an abundant resource.
These are however repeating patterns, so the limitation can be easily
solved by using composed relocations, as I mentioned.
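
For reference, the 256-value limit comes from how the generic ELF format packs `r_info': ELF32 leaves 8 bits for the relocation type, whereas ELF64 leaves 32. A minimal sketch of that packing follows; the function names merely mirror the ELF32_R_* and ELF64_R_* macros and are not taken from any particular library.

```python
# Standard ELF r_info packing; function names mirror the ELF macros.

def elf32_r_info(sym: int, rtype: int) -> int:
    """Pack a 24-bit symbol index and an 8-bit relocation type (ELF32_R_INFO)."""
    if not 0 <= rtype <= 0xFF:
        raise ValueError("ELF32 relocation type must fit in 8 bits")
    if not 0 <= sym <= 0xFFFFFF:
        raise ValueError("ELF32 symbol index must fit in 24 bits")
    return (sym << 8) | rtype

def elf32_r_type(info: int) -> int:
    """Extract the relocation type (ELF32_R_TYPE)."""
    return info & 0xFF

def elf32_r_sym(info: int) -> int:
    """Extract the symbol index (ELF32_R_SYM)."""
    return info >> 8

def elf64_r_info(sym: int, rtype: int) -> int:
    """Pack a 32-bit symbol index and a 32-bit relocation type (ELF64_R_INFO)."""
    return (sym << 32) | (rtype & 0xFFFFFFFF)

def elf64_r_type(info: int) -> int:
    """Extract the relocation type (ELF64_R_TYPE)."""
    return info & 0xFFFFFFFF
```

A relocation number such as 62 round-trips in both formats, while anything above 255 can only be represented in the 64-bit one.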

> > Also as a side note I think it is GCC (or any other compiler) that should
> > produce the intended assembly right from the beginning, so as to get the
> > code size right and avoid unnecessary longer sequences such as with
> > branches that seem out of range due to size estimate pessimisation but are
> > not (of course some sizes are only known at link stage making certain
> > kinds of optimisations possible in the linker anyway).
>
> It isn't possible for gcc to produce the smallest code size directly.
> Gcc doesn't emit compressed instructions; the assembler does this. So
> gcc doesn't know the size of the code. This is fixable in theory, but
> doesn't really help. Neither gcc nor the assembler know link time
> addresses, and hence some compressed instructions can only be
> generated at link time via relaxation. Also, we need link time
> address info to perform relaxations like converting lui/add or lui/lw
> to a single add or lw instruction off of the gp reg when the address
> is in range. There are also other relaxations performed in the
> linker. Since code size reduction via relaxation can change function
> and variable addresses, we can't know any address until linker
> relaxation is done. Like it or not, linker relaxation is a very
> important part of the RISC-V toolchain.

FWIW I have long been in favour of linker relaxation and took my part in
shaping how it has been done in the nanoMIPS effort on the design side.

My FDPIC psABI addendum has been created with a possibility to relax some
code sequences in mind, and in particular removing the high-part
relocations where unnecessary. For instance with a very small GOT the
LUI/ADD instruction pair used to add the upper part of the GOT offset can
be removed. This is why the GP is initialised to (DVMA + 2048); otherwise
the offset would not matter and GP could well be equal to DVMA.
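
To illustrate the point about the 2048 bias, here is a rough sketch that counts how many bytes of a data segment can be reached with a single signed 12-bit immediate off GP, i.e. with no LUI/ADD pair. It assumes the segment starts at DVMA and that nothing addressable lies below it; `reachable_bytes' and its parameters are names made up for this example.

```python
IMM_MIN, IMM_MAX = -2048, 2047  # signed 12-bit immediate of I-type/S-type instructions

def reachable_bytes(gp_bias: int, segment_size: int) -> int:
    """Count bytes in [DVMA, DVMA + segment_size) reachable as GP + imm12,
    where GP = DVMA + gp_bias (offsets here are relative to DVMA)."""
    lowest = max(0, gp_bias + IMM_MIN)
    highest = min(segment_size - 1, gp_bias + IMM_MAX)
    return max(0, highest - lowest + 1)
```

With an 8 KiB segment, `reachable_bytes(2048, 8192)' gives 4096 while `reachable_bytes(0, 8192)' gives 2048, so the bias doubles what can be addressed once the high part has been relaxed away.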

As a side note I think the compiler should actually know instruction
sizes and produce compressed instructions where feasible, as it may make
decisions based on that where functionally equivalent code sequences can
be produced that differ only by their size and not instruction count. I
think we discussed that before in a different context.

To the best of my knowledge the RISC-V assembly language dialect is one
of the only two -- the other being the MIPS one -- where there is no 1:1
correspondence between assembly-language and machine instructions. And
over the years I have heard repeated complaints from people about this
peculiarity with the MIPS assembly language, actually leading to efforts
not to introduce new assembly macros corresponding to instructions added
with later ISA revisions even if existing patterns led one to expect such
macros to exist.

Which makes me very wary about repeating the assembly-language design
decisions with the RISC-V dialect let alone making compilers rely on them.
I guess in compiler-generated code the number of assembly lines does not
really matter. I agree the notation with a fourth operand does help with
handcoded assembly though and I guess I'm fine with that as a means to
produce relocations in the syntax of RISC-V assembly.

Has it been generalised though across all the percent-ops and
instructions, or is it just a hack for this single special case?

As a side note I think a notation for individual fixed-width instructions
would be good to have regardless, e.g.:

r.add a5, a5, tp

where the assembler would always produce the regular encoding, as there
are scenarios, for instance patchable code, where you want to have full
control over instruction lengths, and you never want to be forced to use
pseudo-ops such as `.half' to handcode machine code. In the nanoMIPS
psABI there's a reloc dedicated to prevent the linker from shortening such
instructions in relaxation (although same-length instructions may still be
substituted), which is inserted by the assembler automatically based on
the size suffix (the MIPS dialect chose to use suffixes rather than
prefixes, but I think our approach with a prefix is marginally cleaner).

I think at this point we could easily reserve the `r.' mnemonic prefix in
the assembly dialect to denote forced regular encoding.

> > Relaxation optimisations like this were considered and comprehensively
> > implemented with the nanoMIPS target in GOLD, publicly available. I think
> > we ought to follow suite.
>
> FYI there is no RISC-V GOLD support, if someone wants to volunteer to
> do that work.

Mentioned for the avoidance of doubt as to whether this information has
been published and under what licence (i.e. there is no trade secret I
would accidentally leak).

> > Therefore I think this will be best considered separately, as this is not
> > strictly necessary for FDPIC support on one hand, and may be used for
> > other purposes on the other. For this reason I have decided not to
> > include any relaxation support with the FDPIC psABI addendum.
>
> I think you will find code size and performance to be disappointing if
> linker relaxation is not considered from the start. But yes, it
> should be possible to handle relaxation as a separate task.

Thank you actually for pointing me at the R_RISCV_TPREL_ADD example. It
looks to me that indeed we ought to define a corresponding relocation for
GP in this FDPIC addendum.

> > I now actually wonder if we shouldn't have used composed relocations
> > (e.g. R_RISCV_GOT for GOT references with a corresponding %got operator,
> > R_RISCV_PCREL for PC-relative calculations with %pcrel, etc.) to avoid
> > proliferating relocation variants providing repeating patterns.
>
> We can't change existing relocs without an ABI break. But as a
> general comment, yes, the current scheme is not designed but rather
> implemented as necessary.

We are still at a relatively early stage of architecture/ABI development
and might be able to bypass some earlier choices by making smart decisions
as to how to build on the existing standard.

For instance we could retain the semantics of the existing relocations
when used standalone, but use them to indicate the relocatable field only
when last in a composed sequence of relocations. In particular these
provisions of the ELF gABI explicitly allow us to do so:

"* In all but the last relocation operation of a composed sequence, the
result of the relocation expression is retained, rather than having
part extracted and placed in the relocated field. The result is
retained at full pointer precision of the applicable ABI processor
supplement.

"* In all but the first relocation operation of a composed sequence, the
addend used is the retained result of the previous relocation
operation, rather than that implied by the relocation type."

And the specific semantics of individual relocations is left to the
relevant psABI.
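
The two gABI rules quoted above can be modelled roughly as follows. This is only a sketch: each relocation is represented as a hypothetical (calculation, field extractor) pair, and the `gprel', `passthrough', `hi20_field' and `lo12_field' helpers are assumptions made for illustration, not existing psABI definitions.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[Callable[[int], int], Optional[Callable[[int], int]]]

def apply_composed(steps: List[Step], first_addend: int) -> int:
    """Apply a composed sequence of relocations to one place."""
    value = first_addend
    for calc, _extract in steps:
        # The result is retained at full precision and becomes the
        # addend of the next relocation in the sequence.
        value = calc(value)
    extract = steps[-1][1]
    if extract is None:
        raise ValueError("last relocation must define a relocatable field")
    # Only the last relocation extracts part of the result.
    return extract(value)

def gprel(sym_value: int, gp: int) -> Callable[[int], int]:
    """Hypothetical general GP-relative calculation: S - GP + A."""
    return lambda addend: sym_value - gp + addend

def passthrough(addend: int) -> int:
    """A relocation used only to select the field: the result is the addend."""
    return addend

def hi20_field(value: int) -> int:
    return ((value + 0x800) >> 12) & 0xFFFFF

def lo12_field(value: int) -> int:
    return value & 0xFFF
```

For example, with S = 0x12345 and GP = 0x10800 the sequence `[(gprel(S, GP), None), (passthrough, hi20_field)]' yields the high part and the same sequence ending in `lo12_field' yields the low part, with the GP-relative calculation written only once.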

So we could use say a R_RISCV_GPREL/R_RISCV_TPREL_ADD composition to
indicate GP-relative offset relaxation, and overall produce code like:

# Outstanding static relocations
lui t0, %gprel_hi(fun) # R_RISCV_GPREL fun
# R_RISCV_TPREL_HI20
add t0, t0, gp, %gprel_add(fun) # R_RISCV_GPREL fun
# R_RISCV_TPREL_ADD
addi t1, t0, %gprel_lo(fun) # R_RISCV_GPREL fun
# R_RISCV_TPREL_LO12_I

(I'm not sure how feasible it would be in the relevant tools to implement
printing R_RISCV_HI20, R_RISCV_ADD and R_RISCV_LO12_I aliases to the TPREL
relocs here; I suppose this should be pretty straightforward and a simple
carry-over flag would do to mark the scenario, and is likely present in
linkers already to handle composed relocations in the first place).

> > Likewise, this can be done with `.reloc' like I noted above, and the
> > relaxation defined separately. I think relaxation support that requires
> > psABI support (e.g. extra relocations) should be defined in a separate
> > section of the standard. Perhaps individual sections included in the base
> > psABI and this addendum.
> >
> > If you think it is important to have relaxation defined right from the
> > beginning (why?), then I might consider doing it right away.
>
> Linker relaxation is fundamental to the design of the RISC-V
> toolchain, or perhaps I should say the RISC-V GNU toolchain. You
> won't get good code size or performance without it. I'm not sure if
> separating this stuff out to a separate section make sense. It may
> also be difficult to do that, since some of the relaxations don't
> require relocs, and some of the relaxations use the same relocs used
> elsewhere, and only some of the relaxations require unique relocs used
> only for relaxation.

I think we only need to define relaxation as a part of the RISC-V psABI,
be it this FDPIC addendum or any other piece, as far as it actually
affects the ABI and leave anything else up to linker implementers.

For instance if a specific new relocation is required, such as with the:

add t0, t0, gp, %gprel_add(fun)

instruction above, then we ought to standardise it (please note however,
had we used composed relocations from the beginning, nothing specific to
the RISC-V psABI FDPIC addendum would be required as R_RISCV_GPREL is a
general relocation and R_RISCV_ADD would have been previously defined in
the RISC-V psABI proper).

Conversely if relocations defined elsewhere are needed for some kind of
relaxation or none are required, then naturally provisions for such
relaxations have no place in this document as they are either defined by
the other document or are implementation specific.

Overall thank you for your feedback. Please let me know if you find
anything I wrote unclear or you have any other comments or questions.

Maciej

Maciej W. Rozycki

Apr 9, 2020, 8:07:52 PM4/9/20
to Jim Wilson, Fangrui Song, RISC-V SW Dev
On Thu, 26 Mar 2020, Jim Wilson wrote:

> > > I am not subscribed, so I suspect my reply will be eaten by Google
> > > Groups... I also guessed your email addresses.
> >
> > It went through as I received it at my LMO personal e-mail address too.
> > Perhaps the mailing list isn't open for posting only by subscribers after
> > all (I sought advice on that from the list owner, but haven't ever heard
> > back).
>
> Maybe the next draft can be done via the
> github.com/riscv/riscv-elf-psabi-doc tree as an issue or pull request?
> I think most all interested parties are watching that github repo.

As we discussed before off the list, I'm sceptical about the use of
GitHub for our project as they require anyone wishing to have write access
to accept their T&Cs, which they may vary according to their requirements
at any time. That may be OK to a newcomer wanting to gain some reach with
their software experiment, however for a major project like the RISC-V ISA
that does not sound good to me.

OTOH using a mailing list is safe in that even if the list server and
associated archives go down (NB we can have many, e.g. `marc.info' might
agree to add us to their archive if we ask nicely), past messages will
have been archived by at least some recipients and can be recovered.

Some essential FOSS projects like the GNU toolchain and especially the
Linux kernel have relied on mailing lists for technical reviews since
forever and while they keep an eye on alternatives they have concluded that
no better medium has appeared so far.

So I'd rather stick to e-mail for this effort, and below I have included
the current version of the document. I will try to address Stef O'Rear's
concerns next.

> > It's an implementation detail (the section is informative), assemblers
> > are free to define their own syntax, which is beyond the scope of an ABI.
>
> Well, compatibility between assemblers is useful, and I would hope
> that GCC and LLVM at least have compatible assembly syntax.

I think we cannot force everyone to use the same syntax, however if we
want to encourage doing that, then we need to give people a chance and
provide a normative reference.

> > In the GNU assembler percent-operators indicate relocation, however we
> > currently have an issue in that several operations have not been defined
> > and the compiler has no direct way to synthesize them other than with the
> > `.reloc' pseudo-op.
> >
> > In particular there is no way (or I haven't found one) for the compiler
> > to emit an instruction sequence to make a function call. Instead the
> > `call' assembly macro has to be used, that expands to a pair of
> > instructions.
>
> Yes. this is lacking.
>
> We do have an assembler manual, but it is woefully incomplete.
> https://github.com/riscv/riscv-asm-manual
> and some things like call can't be easily expressed except as a macro
> as you mentioned.

FWIW using macros looks to me like repeating old MIPS assembly language's
mistakes. While having assembly idioms for individual instructions such
as NOP or MV does appear both useful and harmless, and does not conflict
with the spirit of an assembly dialect being a human-writable way of
directly expressing machine code, providing no way but with complex macros
to produce some instruction sequences does not seem the right way to me.

> I would offer one word of warning, which is that gcc -mcmodel=medany
> -mexplicit-relocs is known to fail sometimes. it is a complex problem
> that might require an ABI change to fix. It has something like a 1 in
> a 1K chance of failing for each risky use. So we should not
> accidentally encourage use of explicit relocs in cases when it is
> known to fail.
> https://groups.google.com/a/groups.riscv.org/forum/#!msg/sw-dev/KnziiZtEJNo/M8Vfbw9UCgAJ

Ah, it is a known issue with PC-relative addressing overall, caused by
the misalignment (with respect to the data type referred, `long long' in
this case) of the PC used in a calculation made by AUIPC causing a
carry/borrow to/from the high part to occur in the PC-relative offset when
accessing subsequent words of a multi-word data type (or multi-dword data
in the RV64 case) that crosses the boundary of the 12-bit range spanned by
the low part. And I don't actually think that the use, or the lack, of
explicit relocations is going to change anything here: if a carry/borrow
happens at the static link time, the issue will strike.
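
A small numeric sketch of the failure, using the usual split where the high part is the offset rounded to the nearest multiple of 4096 and the low part is the signed remainder. The addresses are made up and `second_word_ok' is a hypothetical helper name.

```python
def pcrel_hi(target: int, pc: int) -> int:
    """High 20-bit part of the PC-relative offset, as placed in AUIPC."""
    return (target - pc + 0x800) >> 12

def pcrel_lo(target: int, pc: int) -> int:
    """Signed low 12-bit part matching pcrel_hi() for the same target."""
    return (target - pc) - (pcrel_hi(target, pc) << 12)

def second_word_ok(target: int, pc: int, step: int = 4) -> bool:
    """True if the word at target + step can reuse the AUIPC result
    computed for target, i.e. no carry into the high part occurs."""
    return pcrel_hi(target + step, pc) == pcrel_hi(target, pc)

# An 8-byte aligned `ll' at 0x10800 accessed from an AUIPC at 0x10004:
# the first word has a low part of 2044, so the second word would need
# 2048, which does not fit in a signed 12-bit immediate.
broken = not second_word_ok(0x10800, 0x10004)

# The same data accessed from an 8-byte aligned AUIPC at 0x10008 is fine.
fine = second_word_ok(0x10800, 0x10008)
```

Sweeping a 4-byte aligned PC over one 4 KiB window against a fixed 8-byte aligned target hits exactly one bad position out of 1024 in this model, which seems consistent with the odds quoted above.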

This could be solved in hardware by masking off a number of low-order
bits of the PC in the AUIPC calculation. It might be hard to determine
what number would be right though: if too low it would only support
narrower data types, if too high it would waste some text memory by the
heightened alignment requirement (although this would be per-segment rather
than per-object or per-function, so perhaps not a big deal). Anyway, we
don't have it, so we need to address it via software means.

There are a couple of easy ways to tackle it, some without and some with
a need to update the psABI. Taking the code from your example we have:

sub:
.LA0: auipc a5, %pcrel_hi(ll)
lw a0, %pcrel_lo(.LA0)(a5)
lw a1, %pcrel_lo(.LA0 + 4)(a5)
ret

as it stands.

One way is to make sure the PC is correctly aligned WRT data referred, so
taking RV32 as the target (from now on) for a 64-bit access, like a DImode
integer or a DFmode real type we can instead emit:

sub:
.balign 8
.LA0: auipc a5, %pcrel_hi(ll)
lw a0, %pcrel_lo(.LA0)(a5)
lw a1, %pcrel_lo(.LA0 + 4)(a5)
ret

although at the cost of 2 bytes of code wasted on average for the sequence
itself (plus any alignment increase for functions causing extra padding).
Likewise with a 128-bit access, like a TImode integer, a DFmode complex
real or some kind of a vector type:

sub:
.balign 16
.LA0: auipc a5, %pcrel_hi(ll)
lw a0, %pcrel_lo(.LA0)(a5)
lw a1, %pcrel_lo(.LA0 + 4)(a5)
lw a2, %pcrel_lo(.LA0 + 8)(a5)
lw a3, %pcrel_lo(.LA0 + 12)(a5)
ret

at the cost of 6 bytes of code wasted on average (plus function
alignment). Of course in both cases there will be extra execution time
required for any alignment NOPs inserted. This however does not require
any psABI update and will work as it stands.
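
The reason the alignment approach works can be checked exhaustively: if both the AUIPC and the data are aligned to the access size, the offset between them is a multiple of that size, so the whole object stays on one side of the 4 KiB rounding boundary. A sketch under that model follows; `always_safe' is a name made up for this check, and the data is assumed to be naturally aligned to its size.

```python
def pcrel_hi(target: int, pc: int) -> int:
    """High 20-bit part of the PC-relative offset, as placed in AUIPC."""
    return (target - pc + 0x800) >> 12

def same_hi_for_all_words(target: int, pc: int, size: int) -> bool:
    """True if every 32-bit word of a size-byte object shares one high part."""
    return len({pcrel_hi(target + off, pc) for off in range(0, size, 4)}) == 1

def always_safe(pc_align: int, size: int) -> bool:
    """Check every AUIPC position with the given alignment in one 4 KiB
    window against a fixed, naturally aligned target; only the offset
    modulo 4096 matters, so one window covers all cases."""
    target = 1 << 20
    return all(same_hi_for_all_words(target, pc, size)
               for pc in range(0, 4096, pc_align))
```

Here `always_safe(8, 8)' and `always_safe(16, 16)' hold, whereas `always_safe(4, 8)' and `always_safe(8, 16)' do not, matching the `.balign 8' and `.balign 16' choices in the sequences above.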

Another way is to preload the address of the data accessed and offset it
separately:

sub:
.LA0: auipc a5, %pcrel_hi(ll)
addi a5, a5, %pcrel_lo(.LA0)
lw a0, 0(a5)
lw a1, 4(a5)
ret

This takes the same amount of space as the original on RV32C, however
takes one cycle more on scalar implementations and takes a fixed amount
of 4 bytes extra in the absence of the C extension. Similarly:

sub:
.LA0: auipc a5, %pcrel_hi(ll)
addi a5, a5, %pcrel_lo(.LA0)
lw a0, 0(a5)
lw a1, 4(a5)
lw a2, 8(a5)
lw a3, 12(a5)
ret

takes 4 bytes less on RV32C, one cycle more on scalar implementations and
a fixed amount of 4 bytes extra in the absence of the C extension.
Neither requires any psABI update either.

Finally we can emit full individual load sequences:

sub:
.LA0: auipc a5, %pcrel_hi(ll)
lw a0, %pcrel_lo(.LA0)(a5)
.LA1: auipc a5, %pcrel_hi(ll + 4), %pcrel_auipc(ll)
lw a1, %pcrel_lo(.LA1)(a5)
ret

and:

sub:
.LA0: auipc a5, %pcrel_hi(ll)
lw a0, %pcrel_lo(.LA0)(a5)
.LA1: auipc a5, %pcrel_hi(ll + 4), %pcrel_auipc(ll)
lw a1, %pcrel_lo(.LA1)(a5)
.LA2: auipc a5, %pcrel_hi(ll + 8), %pcrel_auipc(ll + 4)
lw a2, %pcrel_lo(.LA2)(a5)
.LA3: auipc a5, %pcrel_hi(ll + 12), %pcrel_auipc(ll + 8)
lw a3, %pcrel_lo(.LA3)(a5)
ret

where %pcrel_auipc emits an R_RISCV_PCREL_AUIPC relocation used in linker
relaxation to remove an AUIPC instruction with an R_RISCV_PCREL_HI20
relocation attached iff the calculation of both expressions associated
with these relocations works out at the same value as far as the high
20-bit part is concerned.

These sequences do require a psABI update and preclude the use of
compressed instructions (which may actually be a good idea at `-Os' even
if we have this implemented), but at the static link time they only leave
an extra AUIPC instruction (at most once per sequence) if it is indeed
required and cause no wasted extra execution cycles.

Please note however that this issue only affects PC-relative addressing,
because the PC changes as execution goes. Whereas GP remains constant and
aligned according to the alignment of the data segment, which has to be no
smaller than the largest alignment of all the data types used within.

BTW, why has such peculiar (and possibly limiting) semantics of the
low-part relocation been chosen rather than the obvious:

sub:
0: auipc a5, %pcrel_hi(ll)
1: lw a0, %pcrel_lo(ll + 1b - 0b)(a5)
2: lw a1, %pcrel_lo(ll + 2b - 0b)(a5)
ret

? Is that because linker relaxation could make such label difference
expressions not to be assembly-time constants if there were intervening
instructions?

Maciej

--------------------------------------------------------------------------
RISC-V FDPIC ELF psABI Addendum (Apr, 10th 2020)
Operand | Description
=========+================================================================
A | Relocation addend.
---------+----------------------------------------------------------------
| Data segment's base address; the difference between the actual
DBA | data segment's load address and DVMA in dynamic load, 0 in
| static link.
---------+----------------------------------------------------------------
DVMA | Data segment's virtual memory address as in `p_vaddr'.
---------+----------------------------------------------------------------
G | The offset from GP of a GOT entry for the symbol referred by
| the relocation.
---------+----------------------------------------------------------------
GP | The value of GP associated with the symbol referred, nominally
| (DVMA + DBA + 2048).
---------+----------------------------------------------------------------
P | The place (offset or address) of the storage unit affected by
| the relocation.
---------+----------------------------------------------------------------
PLTE | The address of a PLT entry associated with the symbol referred.
---------+----------------------------------------------------------------
PLTI | The address of a PLT entry designated to make indirect calls.
---------+----------------------------------------------------------------
S | The value of the symbol referred by the relocation.
---------+----------------------------------------------------------------
| Text segment's base address; the difference between the actual
TBA | text segment's load address and TVMA in dynamic load, 0 in
| static link.
---------+----------------------------------------------------------------
TVMA | Text segment's virtual memory address as in `p_vaddr'.

Table 4.2 Relocation Types

Name | Value | Field | Symbol | Calculation
==========================+=======+=============+===========+=============
R_RISCV_RELATIVE | 3 | T-word32,64 | n/a | TBA + A
R_RISCV_REL_TEXT (alias) | | | |
--------------------------+-------+-------------+-----------+-------------
R_RISCV_GP | 12 | T-word32,64 | any | GP
--------------------------+-------+-------------+-----------+-------------
R_RISCV_REL_DATA | 13 | T-word32,64 | n/a | DBA + A
==========================+=======+=============+===========+=============
| | | local | S - P
R_RISCV_CALL_PLT | 19 | V-hi20lo12i | external | PLTE - P
| | | n/a | PLTI - P
--------------------------+-------+-------------+-----------+-------------
R_RISCV_GPREL_HI20 | 59 | V-hi20 | local | S - GP + A
--------------------------+-------+-------------+-----------+-------------
R_RISCV_GPREL_LO12_I | 60 | T-lo12i | local | S - GP + A
--------------------------+-------+-------------+-----------+-------------
R_RISCV_GPREL_LO12_S | 61 | T-lo12s | local | S - GP + A
--------------------------+-------+-------------+-----------+-------------
R_RISCV_GPREL_ADD | 62 | n/a | local | S - GP + A
--------------------------+-------+-------------+-----------+-------------
R_RISCV_GOT_GPREL_HI20 | 63 | V-hi20 | any | G
--------------------------+-------+-------------+-----------+-------------
R_RISCV_GOT_GPREL_LO12_I | 64 | T-lo12i | any | G
--------------------------+-------+-------------+-----------+-------------
R_RISCV_GOT_GPREL_ADD | 65 | n/a | any | G

Local symbols are never preempted and therefore they can be addressed
with relative addressing in PIC code. For text symbols PC-relative
addressing can be used both in ordinary PIC and FDPIC code and therefore
the same relocations are used in both cases.

PC-relative addressing cannot however be used in FDPIC code for data
symbols as the relative position of text and data with respect to each
other is not fixed and therefore a separate global pointer (GP) has to be
maintained. This ABI designates the x3 register to hold the value of the
GP and defines gp as an alias ABI name of this register. This register
is used to access local data using direct GP-relative addressing.

The R_RISCV_GPREL_HI20, R_RISCV_GPREL_LO12_I and R_RISCV_GPREL_LO12_S
static relocations are defined to support direct GP-relative addressing
suitable for local data access. Additionally an R_RISCV_GPREL_ADD static
relocation can be optionally produced to denote an ADD instruction used
in the calculation of such GP-relative offset that can be removed in
linker relaxation.
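
As an informal illustration only (not part of the normative text), the following sketch shows how the S - GP + A value might be split between the high-part and low-part relocations and when the LUI/ADD pair becomes redundant. It assumes the split follows the existing %hi/%lo convention of the base psABI; the function names are hypothetical.

```python
def split_gprel(s: int, gp: int, a: int = 0):
    """Split S - GP + A into a high part and a signed 12-bit low part."""
    value = s - gp + a
    hi20 = (value + 0x800) >> 12
    lo12 = value - (hi20 << 12)
    return hi20, lo12

def effective_address(gp: int, hi20: int, lo12: int) -> int:
    """Address formed by LUI t0, hi20; ADD t0, t0, gp; L/S lo12(t0)."""
    t0 = (hi20 << 12) + gp
    return t0 + lo12

def add_removable(s: int, gp: int, a: int = 0) -> bool:
    """True if the high part is zero, so the access could use gp directly
    and the LUI/ADD pair could be dropped in relaxation."""
    return split_gprel(s, gp, a)[0] == 0
```

With DVMA = 0x10000, DBA = 0 and hence GP = 0x10800, every symbol from 0x10000 through 0x10fff has a zero high part, while a symbol at 0x11000 needs a high part of 1.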

External symbols can be preempted and therefore have to be addressed
indirectly. The Global Offset Table (GOT) is used to hold the addresses
of external data symbols. GOT itself is local data and can therefore be
accessed with GP-relative addressing.

The R_RISCV_GOT_GPREL_HI20 and R_RISCV_GOT_GPREL_LO12_I static
relocations are defined to support indirect GP-relative addressing
suitable for external data access. Additionally an R_RISCV_GOT_GPREL_ADD
static relocation can be optionally produced to denote an ADD instruction
used in the calculation of such GP-relative offset that can be removed in
linker relaxation.

Occasionally a GOT entry will be created for local data to satisfy the
use of R_RISCV_GOT_GPREL_HI20 and R_RISCV_GOT_GPREL_LO12_I relocations in
code referring to such data. The R_RISCV_REL_DATA dynamic relocation is
defined to support GP-relative relocation of such GOT entries at program
load time.

Optionally linker relaxation is supported by emitting R_RISCV_GPREL_ADD
or R_RISCV_GOT_GPREL_ADD relocations
created and called into for each external procedure called. Addresses
of these PLT entries are referred to as PLTE in relocation calculation.

For direct calls an FDT entry is used that corresponds to the procedure
called and has been created in the module making the call. Therefore
code in the PLT can access the FDT entry directly as local data, using
GP-relative addressing.

For indirect calls the PLT is also used and an FDT entry is used that
corresponds to the procedure called and has been created in the module
providing the function symbol of the procedure.

If a function symbol is local, then the GP-relative address of the FDT
entry is directly used by the static linker as the value retrieved in
taking a function's address.

If a function symbol is external, then an external dynamic data symbol is
created that refers to that FDT entry and whose name is constructed by
prepending `__riscv_fdt_' to the function's symbol name.

If the address of an external function symbol is taken, then a GOT entry
is created for the corresponding `__riscv_fdt_' dynamic data symbol and
used to satisfy the reference.

When making an indirect call a dedicated PLT entry is used that is common
to all indirect calls and upon invocation of that PLT entry the x5 (t0)
register holds the address of the FDT entry in the module providing the
function symbol of the procedure to call. The address of this PLT entry
is referred to as PLTI in relocation calculation.
add t0, t0, gp, %gprel_add(var+addend) # R_RISCV_GPREL_ADD var+addend
lbu t1, %gprel_lo(var+addend)(t0) # R_RISCV_GPREL_LO12_I var+addend
sb t2, %gprel_lo(var+addend)(t0) # R_RISCV_GPREL_LO12_S var+addend


4.4.2 External Data Addressing

Ordinary PIC code, using GOT and PC-relative addressing:

# Outstanding static relocations
label:
auipc t0, %got_pcrel_hi(var) # R_RISCV_GOT_HI20 var
l[w|d] t0, %pcrel_lo(label)(t0) # R_RISCV_PCREL_LO12_I label
lb t1, addend(t0)
sb t2, addend(t0)

# Outstanding dynamic relocations for the GOT entry
# R_RISCV_32,64 var

# or if the data symbol turns out local at static link time
# R_RISCV_REL_DATA *ABS*+ABS(var)

Corresponding FDPIC code, using GOT and GP-relative addressing:

# Outstanding static relocations
lui t0, %got_gprel_hi(var) # R_RISCV_GOT_GPREL_HI20 var
add t0, t0, gp, %got_gprel_add(var) # R_RISCV_GOT_GPREL_ADD var
l[w|d] t0, %got_gprel_lo(var)(t0) # R_RISCV_GOT_GPREL_LO12_I var
lbu t1, addend(t0)
sb t2, addend(t0)

# Outstanding dynamic relocations for the GOT entry
# R_RISCV_32,64 var

# or if the data symbol turns out local at static link time
# R_RISCV_REL_DATA *ABS*+ABS(var)


4.4.3 Taking a Function's Address

FDPIC code, local function:

# Outstanding static relocations
lui t0, %gprel_hi(fun) # R_RISCV_GPREL_HI20 fun
add t0, t0, gp, %gprel_add(fun) # R_RISCV_GPREL_ADD fun
addi t1, t0, %gprel_lo(fun) # R_RISCV_GPREL_LO12_I fun

FDPIC code, external function:

# Outstanding static relocations
lui t0, %got_gprel_hi(fun) # R_RISCV_GOT_GPREL_HI20 fun
add t0, t0, gp, %got_gprel_add(fun) # R_RISCV_GOT_GPREL_ADD fun
l[w|d] t1, %got_gprel_lo(fun)(t0) # R_RISCV_GOT_GPREL_LO12_I fun

# Outstanding dynamic relocations for the GOT entry
# R_RISCV_32,64 __riscv_fdt_fun

# or if the function symbol turns out local at static link time
# R_RISCV_REL_DATA *ABS*+ABS(__riscv_fdt_fun)


4.4.4 Procedure Calls Using the PLT

FDPIC code, direct call:

# Outstanding static relocations
auipc ra, %call_plt(fun) # R_RISCV_CALL_PLT fun
jalr ra, ra, 0
l[w|d] gp, <gp_slot>(sp)

FDPIC code, indirect call (to a2):

# Outstanding static relocations
mv t0, a2
auipc ra, %call_plt(0) # R_RISCV_CALL_PLT
jalr ra, ra, 0
l[w|d] gp, <gp_slot>(sp)

# The R_RISCV_CALL_PLT relocation with no symbol referred resolves to
# the PLT entry associated with indirect calls.


Evandro Menezes

Apr 10, 2020, 3:48:31 PM4/10/20
to Maciej W. Rozycki, Jim Wilson, Fangrui Song, RISC-V SW Dev
The problem seems to only be evident when referring to a variable larger than XLEN. Then, the compiler should emit the correct code. For the assembler, perhaps it needs a pseudo instruction for RV32G to make sure that double words are correctly loaded from and stored to.

> BTW, why has such peculiar (and possibly limiting) semantics of the
> low-part relocation been chosen rather than the obvious:
>
> sub:
> 0: auipc a5, %pcrel_hi(ll)
> 1: lw a0, %pcrel_lo(ll + 1b - 0b)(a5)
> 2: lw a1, %pcrel_lo(ll + 2b - 0b)(a5)
> ret
>
> ? Is that because linker relaxation could make such label difference
> expressions not to be assembly-time constants if there were intervening
> instructions?

Yes, but mostly because the value of the `pc` which the offset is from is the one for the address of the `auipc` instruction, not of the `lw` or of the `add`.

Maciej W. Rozycki

Apr 10, 2020, 7:31:27 PM4/10/20
to Evandro Menezes, Jim Wilson, Fangrui Song, RISC-V SW Dev
On Fri, 10 Apr 2020, Evandro Menezes wrote:

> > Please note however that this issue only affects PC-relative addressing,
> > because the PC changes as execution goes. Whereas GP remains constant and
> > aligned according to the alignment of the data segment, which has to be no
> > smaller than the largest alignment of all the data types used within.
>
> The problem seems to only be evident when referring to a variable larger
> than XLEN. Then, the compiler should emit the correct code. For the
> assembler, perhaps it needs a pseudo instruction for RV32G to make sure
> that double words are correctly loaded from and stored to.

Or rather when multiple instructions are needed to access data at varying
offsets, which may or may not be tied to XLEN (e.g. FP loads/stores used
for complex data will be tied to FLEN). I gave examples of assembly code
a compiler can produce to get correct results even with our psABI as it
stands.

As to handwritten assembly a macro such as LD could be provided for RV32G
that would expand to one of these sequences -- though as I noted I am
doubtful as to whether this is the right direction. Instead you can just
handcode this correctly as with other assembly targets that use no macros.

As I say the use or the lack of explicit relocations has nothing to do
with it.

> > BTW, why has such peculiar (and possibly limiting) semantics of the
> > low-part relocation been chosen rather than the obvious:
> >
> > sub:
> > 0: auipc a5, %pcrel_hi(ll)
> > 1: lw a0, %pcrel_lo(ll + 1b - 0b)(a5)
> > 2: lw a1, %pcrel_lo(ll + 2b - 0b)(a5)
> > ret
> >
> > ? Is that because linker relaxation could make such label difference
> > expressions not to be assembly-time constants if there were intervening
> > instructions?
>
> Yes, but mostly because the value of the `pc` which the offset is from
> is the one for the address of the `auipc` instruction, not of the `lw`
> or of the `add`.

Normally this PC adjustment for PC-relative relocations located not at
the PC they refer to (as expressed with the label subtractions in the
example above) is handled either with the addend (if the subtraction works
out to an assembly-time constant) or with additional relocations (if the
calculation has to be deferred to the link time; GAS has infrastructure to
do this automatically). This seems most straightforward to me and it is
how other targets do it.

Instead we have this complicated indirection where the symbol referred by
the low-part relocation is not used in calculation and instead points at
the location of the high-part relocation, which refers the symbol to use
with the low-part relocation. And then the addends of both relocations
are combined in the calculation of the low-part relocation.

It couldn't be more convoluted in my opinion, so your explanation doesn't
really answer my question I am afraid.
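
For readers following along, the indirection described above can be modelled in a few lines. This is only a sketch under stated assumptions: the `symbols` and `relocs` dictionaries, the helper names and the addresses are made up for illustration and do not mirror any linker's data structures, and the low-part addend is taken as zero for simplicity.

```python
def split_hi_lo(offset):
    """Split a signed offset into an auipc hi20 part and a signed lo12 part."""
    hi20 = (offset + 0x800) >> 12      # round so that lo12 fits [-2048, 2047]
    lo12 = offset - (hi20 << 12)
    return hi20, lo12

# Hypothetical layout: auipc at 0x10000, lw right after it, data at 0x12345.
symbols = {"ll": 0x12345, ".LA0": 0x10000}

# Relocation records keyed by the place (address) they apply to.
relocs = {
    0x10000: {"type": "R_RISCV_PCREL_HI20", "symbol": "ll", "addend": 0},
    0x10004: {"type": "R_RISCV_PCREL_LO12_I", "symbol": ".LA0", "addend": 0},
}

def resolve_pcrel_hi(place):
    r = relocs[place]
    offset = symbols[r["symbol"]] + r["addend"] - place
    return split_hi_lo(offset)[0]

def resolve_pcrel_lo(place):
    lo = relocs[place]
    # The low-part symbol is not the data symbol: it names the auipc location.
    hi_place = symbols[lo["symbol"]]
    hi = relocs[hi_place]
    assert hi["type"] == "R_RISCV_PCREL_HI20"
    # The offset is measured from the auipc, not from the lw itself.
    offset = symbols[hi["symbol"]] + hi["addend"] - hi_place
    return split_hi_lo(offset)[1]

hi20 = resolve_pcrel_hi(0x10000)
lo12 = resolve_pcrel_lo(0x10004)
```

The point of the sketch is the extra lookup in `resolve_pcrel_lo`: the data symbol is only reachable by first following the low-part symbol to the high-part relocation.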

Maciej

Evandro Menezes

Apr 13, 2020, 5:09:58 PM4/13/20
to Maciej W. Rozycki, Jim Wilson, Fangrui Song, RISC-V SW Dev
Hi, Maciej.

> On Apr 10, 2020, at 18:31, Maciej W. Rozycki <ma...@wdc.com> wrote:
>
> On Fri, 10 Apr 2020, Evandro Menezes wrote:
>
>>> Please note however that this issue only affects PC-relative addressing,
>>> because the PC changes as execution goes. Whereas GP remains constant and
>>> aligned according to the alignment of the data segment, which has to be no
>>> smaller than the largest alignment of all the data types used within.
>>
>> The problem seems to only be evident when referring to a variable larger
>> than XLEN. Then, the compiler should emit the correct code. For the
>> assembler, perhaps it needs a pseudo instruction for RV32G to make sure
>> that double words are correctly loaded from and stored to.
>
> Or rather when multiple instructions are needed to access data at varying
> offsets, which may or may not be tied to XLEN (e.g. FP loads/stores used
> for complex data will be tied to FLEN). I gave examples of assembly code
> a compiler can produce to get correct results even with our psABI as it
> stands.
>
> As to handwritten assembly a macro such as LD could be provided for RV32G
> that would expand to one of these sequences -- though as I noted I am
> doubtful as to whether this is the right direction. Instead you can just
> handcode this correctly as with other assembly targets that use no macros.

Macros help to code once for both RV32 and RV64.

I believe that I answered your question whether the reason for this
convoluted method is in place to facilitate the linker relaxation.
However, it seems that you meant it to be about something else, but I
failed to catch your meaning.

Cheers,

__
Evandro Menezes ◊ SiFive ◊ Austin, TX


Maciej W. Rozycki

Apr 14, 2020, 10:59:56 AM4/14/20
to Evandro Menezes, Jim Wilson, Fangrui Song, RISC-V SW Dev
On Mon, 13 Apr 2020, Evandro Menezes wrote:

> > Or rather when multiple instructions are needed to access data at varying
> > offsets, which may or may not be tied to XLEN (e.g. FP loads/stores used
> > for complex data will be tied to FLEN). I gave examples of assembly code
> > a compiler can produce to get correct results even with our psABI as it
> > stands.
> >
> > As to handwritten assembly a macro such as LD could be provided for RV32G
> > that would expand to one of these sequences -- though as I noted I am
> > doubtful as to whether this is the right direction. Instead you can just
> > handcode this correctly as with other assembly targets that use no macros.
>
> Macros help to code once for both RV32 and RV64.

It seemed an attractive idea back in the early 1990s, but over the years it
has turned out not to be sufficient anyway on one hand, and it complicated
processing on the other. I am not sure if the historical mistake is worth
repeating with the RISC-V ISA, especially as handcoded assembly is more of
a corner case nowadays than ever, and in a compiler we want to have full
control over machine code generated anyway.

> > Normally this PC adjustment for PC-relative relocations located not at
> > the PC they refer to (as expressed with the label subtractions in the
> > example above) is handled either with the addend (if the subtraction works
> > out to an assembly-time constant) or with additional relocations (if the
> > calculation has to be deferred to the link time; GAS has infrastructure to
> > do this automatically). This seems most straightforward to me and it is
> > how other targets do it.
> >
> > Instead we have this complicated indirection where the symbol referred by
> > the low-part relocation is not used in calculation and instead points at
> > the location of the high-part relocation, which refers the symbol to use
> > with the low-part relocation. And then the addends of both relocations
> > are combined in the calculation of the low-part relocation.
> >
> > It couldn't be more convoluted in my opinion, so your explanation doesn't
> > really answer my question I am afraid.
>
> I believe that I answered your question whether the reason for this
> convoluted method is in place to facilitate the linker relaxation.
> However, it seems that you meant it to be about something else, but I
> failed to catch your meaning.

Hmm, in my understanding your answer implies that a peculiar and
complicated solution was chosen for a simple if not routine case, for
which simple solutions have been designed into the ELF relocation system
since forever (well, maybe some 20+ years ago). Therefore I gather there
must be a hidden reason here, and I'm still interested in finding it, i.e.
why the simple solution from the ELF gABI was found inadequate and the
complex one chosen instead.

Having to traverse the symbol table to find the other relocation while
all could be kept in the relocation table, and then locally within, seems
really awkward to me.

Maciej

Jim Wilson

Apr 14, 2020, 1:21:25 PM4/14/20
to Maciej W. Rozycki, Fangrui Song, RISC-V SW Dev
On Thu, Apr 9, 2020 at 5:07 PM Maciej W. Rozycki <ma...@wdc.com> wrote:
> As we discussed before off the list, I'm sceptical about the use of
> GitHub for our project as they require anyone wishing to have write access
> to accept their T&Cs, which they may vary according to their requirements
> at any time. That may be OK to a newcomer wanting to gain some reach with
> their software experiment, however for a major project like the RISC-V ISA
> that does not sound good for me.

Github is what we use. You need to accept that.

> OTOH using a mailing list is safe in that even if the list server and
> associated archives go down (NB we can have many, e.g. `marc.info' might
> agree to add us to their archive if we ask nicely), past messages will
> have been archived by at least some recipients and can be recovered.

At least one major contributor has already indicated that he isn't on
this mailing list, and hence it isn't the best place for some of these
discussions. You need to accept that.

> Some essential FOSS projects like the GNU toolchain and especially the
> Linux kernel have relied on mailing lists for technical reviews since
> forever and while they keep an eye on alternatives they have concluded no
> better medium to have appeared so far.

Over 99% of the world uses email in a different way than the Linux
kernel and GNU toolchain projects do. This is a losing battle. You
need to accept that.

> > Well, compatibility between assemblers is useful, and I would hope
> > that GCC and LLVM at least have compatible assembly syntax.
>
> I think we cannot force everyone to use the same syntax, however if we
> want to encourage doing that, then we need to give people a chance and
> provide a normative reference.

All major ISAs have a standard assembly syntax that is shared across
compilers. You need to accept that.

> > We do have an assembler manual, but it is woefully incomplete.
> > https://github.com/riscv/riscv-asm-manual
> > and some things like call can't be easily expressed except as a macro
> > as you mentioned.
>
> FWIW using macros looks to me like repeating old MIPS assembly language's
> mistakes. While having assembly idioms for individual instructions such
> as NOP or MV does appear both useful and harmless, and does not conflict
> with the spirit of an assembly dialect being a human-writable way of
> directly expressing machine code, providing no way but with complex macros
> to produce some instruction sequences does not seem the right way to me.

Yes, macros aren't ideal, but it is how the RISC-V tools were
designed. You need to accept that.

Yes, there is some syntax missing. Contributions to fix this are welcome.

> > I would offer one word of warning, which is that gcc -mcmodel=medany
> > -mexplicit-relocs is known to fail sometimes. It is a complex problem
> > that might require an ABI change to fix. It has something like a 1 in
> > a 1K chance of failing for each risky use. So we should not
> > accidentally encourage use of explicit relocs in cases when it is
> > known to fail.
> > https://groups.google.com/a/groups.riscv.org/forum/#!msg/sw-dev/KnziiZtEJNo/M8Vfbw9UCgAJ

> This could be solved in hardware by masking off a number of low-order
> bits of the PC in the AUIPC calculation. It might be hard to determine
> what number would be right though: if too low it would only support
> narrower data types, if too high it would waste some text memory by the
> heightened alignment requirement (although this would be per-segment rather
> than per-object or per-function, so perhaps not a big deal). Anyway, we
> don't have it, so we need to address it via software means.

Yes, a hardware change could fix it, but the hardware design was fixed
years before I found this bug. There was a suggestion that maybe we
could fix this by emitting an and instruction after the auipc, and
then add relaxation support to remove the extra and instruction. I
tried looking at this once but it got complicated and I didn't have
enough time to determine if it could work or not.

> One way is to make sure the PC is correctly aligned WRT data referred, so
> taking RV32 as the target (from now on) for a 64-bit access, like a DImode
> integer or a DFmode real type we can instead emit:
> sub:
> .balign 8
> .LA0: auipc a5, %pcrel_hi(ll)
> lw a0, %pcrel_lo(.LA0)(a5)
> lw a1, %pcrel_lo(.LA0 + 4)(a5)
> ret
> although at the cost of 2 bytes of code wasted on average for the sequence
> itself (plus any alignment increase for functions causing extra padding).
> Likewise with a 128-bit access, like a TImode integer, a DFmode complex
> real or some kind of a vector type:

Too many nops added, and no way to remove them with relaxation. Code
size is very important for embedded processors, and the RISC-V market
is mostly embedded currently, so any solution that isn't relaxable is
going to cause problems.
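
To make the failure mode and the alignment workaround concrete, here is a small worked example. It is a simplified model, not a description of what GCC or the linker actually does: it assumes the second `lw` reuses the `auipc` result and simply adds 4 to the low part, and the helper names are hypothetical.

```python
def split_hi_lo(offset):
    """Split a signed offset into an auipc hi20 part and a signed lo12 part."""
    hi20 = (offset + 0x800) >> 12
    lo12 = offset - (hi20 << 12)
    return hi20, lo12

def second_word_reachable(offset, step=4):
    """True if lo12 + step still fits the signed 12-bit immediate."""
    _, lo12 = split_hi_lo(offset)
    return lo12 + step <= 2047

def failing_offsets(alignment, window=4096):
    """Offsets in one 4 KiB window (multiples of `alignment`) for which the
    second lw cannot reuse the auipc result."""
    return [off for off in range(0, window, alignment)
            if not second_word_reachable(off)]

bad_4 = failing_offsets(4)   # auipc and data only mutually 4-byte aligned
bad_8 = failing_offsets(8)   # both 8-byte aligned, as with `.balign 8`
```

Under this model exactly one of the 1024 word-aligned offsets in a 4 KiB window fails (the one whose low part is 2044), which is consistent with the rough "1 in 1K" figure quoted earlier, while no 8-byte-aligned offset fails.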

> Another way is to preload the address of the data accessed and offset it
> separately:
>
> sub:
> .LA0: auipc a5, %pcrel_hi(ll)
> addi a5, a5, %pcrel_lo(.LA0)
> lw a0, 0(a5)
> lw a1, 4(a5)
> ret

This requires an inconvenient hook in target independent optimizers
for gcc. It was somewhere in expand_expr that I had to change. I
then ran into other problems after doing that, though I don't remember
exactly what anymore. And again there is the problem that it is
increasing code size, but we might be able to fix this with
relaxations.
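
As a sanity check of why the `addi` variant is safe regardless of alignment, the sketch below emulates the four-instruction sequence arithmetically. It is an illustration only; `load_addresses_via_addi` is a hypothetical helper and the addresses are arbitrary.

```python
def split_hi_lo(offset):
    """Split a signed offset into an auipc hi20 part and a signed lo12 part."""
    hi20 = (offset + 0x800) >> 12
    lo12 = offset - (hi20 << 12)
    return hi20, lo12

def load_addresses_via_addi(pc, target):
    """Addresses accessed by auipc/addi/lw/lw for a 64-bit object."""
    hi20, lo12 = split_hi_lo(target - pc)
    a5 = pc + (hi20 << 12)      # auipc a5, %pcrel_hi(ll)
    a5 = a5 + lo12              # addi  a5, a5, %pcrel_lo(.LA0)
    return a5 + 0, a5 + 4       # lw a0, 0(a5) ; lw a1, 4(a5)

pc = 0x10002                    # deliberately only 2-byte aligned
offsets = range(0, 8192, 2)
all_ok = all(load_addresses_via_addi(pc, pc + off) == (pc + off, pc + off + 4)
             for off in offsets)
```

Because the full address is materialized before any load, the offsets 0 and 4 are plain immediates and never interact with the hi/lo split; the cost is the extra instruction noted above.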

> sub:
> .LA0: auipc a5, %pcrel_hi(ll)
> lw a0, %pcrel_lo(.LA0)(a5)
> .LA1: auipc a5, %pcrel_hi(ll + 4), %pcrel_auipc(ll)
> lw a1, %pcrel_lo(.LA1)(a5)
> ret

I don't think we tried this solution. Again, we will need relaxations
to avoid a code size increase.

> BTW, why has such peculiar (and possibly limiting) semantics of the
> low-part relocation been chosen rather than the obvious:
>
> sub:
> 0: auipc a5, %pcrel_hi(ll)
> 1: lw a0, %pcrel_lo(ll + 1b - 0b)(a5)
> 2: lw a1, %pcrel_lo(ll + 2b - 0b)(a5)
> ret
>
> ? Is that because linker relaxation could make such label difference
> expressions not to be assembly-time constants if there were intervening
> instructions?

Yes. The lw instructions may be 2 or 4 bytes which can't be known
until link time. The compiler may do instruction scheduling and place
other instructions in the middle of this sequence, which themselves
may be relaxable. In general, the aggressive linker relaxation means
we can never compute a text label subtraction at assembly time.
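
A toy illustration of this point, assuming a made-up instruction stream in which the linker later shrinks one intervening instruction from 4 to 2 bytes (the sizes and the `label_offsets` helper are hypothetical):

```python
def label_offsets(insn_sizes):
    """Return the byte offset of each instruction given their sizes."""
    offsets, pos = [], 0
    for size in insn_sizes:
        offsets.append(pos)
        pos += size
    return offsets

# auipc, one intervening relaxable instruction, lw, lw
before = label_offsets([4, 4, 4, 4])   # as seen by the assembler
after = label_offsets([4, 2, 4, 4])    # after the linker shrank one insn

# A `1b - 0b` style difference between the first lw and the auipc.
diff_before = before[2] - before[0]
diff_after = after[2] - after[0]
```

The difference is 8 at assembly time but 6 after relaxation, so the assembler cannot fold it into a constant addend.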

This also shows up in our dwarf output, which doesn't use leb128 as
much as other targets, and hence ends up larger than other targets.
There is a proposal to fix this by adding special relaxation relocs
for leb128 but there are holes in the proposal and it has been stalled
for a while.

There are too many different issues you are trying to discuss in a
single email thread. And there are too many different controversies
you are creating at the same time.

Jim

Jim Wilson

Apr 14, 2020, 1:41:27 PM4/14/20
to Maciej W. Rozycki, Evandro Menezes, Fangrui Song, RISC-V SW Dev
On Fri, Apr 10, 2020 at 4:31 PM Maciej W. Rozycki <ma...@wdc.com> wrote:
> Instead we have this complicated indirection where the symbol referred by
> the low-part relocation is not used in calculation and instead points at
> the location of the high-part relocation, which refers the symbol to use
> with the low-part relocation. And then the addends of both relocations
> are combined in the calculation of the low-part relocation.
>
> It couldn't be more convoluted in my opinion, so your explanation doesn't
> really answer my question I am afraid.

We either use this convoluted system, or we have a reloc with multiple
operands. Currently, all relocs have only one operand. If we have to
have relocs with multiple operands that is a major change. Or maybe
we use multiple relocs on the same instruction to hold the 3 operands
fields.

I did get a message once from Michael Eager, quoting some text from
the ELF standard saying that the operand to a reloc must always be
related to the symbol that the reloc is for, which we are technically
in violation of, because in our case the reloc operand for pcrel_lo
points at the auipc not the final symbol. I haven't tried closely
studying the ELF standard to see if we are actually violating it or
not. If so, this would be an argument for changing the current
approach. But any solution looks like it is going to be as
inconvenient as the current scheme.

Jim

Jim Wilson

Apr 14, 2020, 1:47:00 PM4/14/20
to Maciej W. Rozycki, Evandro Menezes, Fangrui Song, RISC-V SW Dev
On Tue, Apr 14, 2020 at 7:59 AM Maciej W. Rozycki <ma...@wdc.com> wrote:
> > Macros help to code once for both RV32 and RV64.
>
> It seemed an attractive idea back in early 1990s, but over the years it
> has turned out not to be sufficient anyway on one hand, and it complicated
> processing on the other. I am not sure if the historical mistake is worth
> repeating with the RISC-V ISA, especially as handcoded assembly is more of
> a corner case nowadays than ever, and in a compiler we want to have full
> control over machine code generated anyway.

Macros are very convenient for people writing assembly code by hand.
Macros are not convenient for compilers, as they inhibit optimization.
We should support both approaches.

There are some complications for RISC-V though. The way that the
auipc instruction is defined makes it difficult for the compiler to
represent it in IL, and very hard to optimize it. There are a number
of cases where gcc has to give up, and just use the macros because it
is too hard to optimize the code. So we need the macros for the
compiler too, but really only the ones that use auipc.

Jim

Karsten Merker

Apr 14, 2020, 2:42:00 PM4/14/20
to Jim Wilson, Maciej W. Rozycki, RISC-V SW Dev, Karsten Merker
On Tue, Apr 14, 2020 at 10:21:10AM -0700, Jim Wilson wrote:
> On Thu, Apr 9, 2020 at 5:07 PM Maciej W. Rozycki <ma...@wdc.com> wrote:
> > As we discussed before off the list, I'm sceptical about the use of
> > GitHub for our project as they require anyone wishing to have write access
> > to accept their T&Cs, which they may vary according to their requirements
> > at any time. That may be OK to a newcomer wanting to gain some reach with
> > their software experiment, however for a major project like the RISC-V ISA
> > that does not sound good for me.
>
> Github is what we use. You need to accept that.
>
> > OTOH using a mailing list is safe in that even if the list server and
> > associated archives go down (NB we can have many, e.g. `marc.info' might
> > agree to add us to their archive if we ask nicely), past messages will
> > have been archived by at least some recipients and can be recovered.
>
> At least one major contributor has already indicated that he isn't on
> this mailing list, and hence it isn't the best place for some of these
> discussions. You need to accept that.
>
> > Some essential FOSS projects like the GNU toolchain and especially the
> > Linux kernel have relied on mailing lists for technical reviews since
> > forever and while they keep an eye on alternatives they have concluded no
> > better medium to have appeared so far.
>
> Over 99% of the world uses email in a different way than the Linux
> kernel and GNU toolchain projects do. This is a losing battle. You
> need to accept that.

Hello Jim,

I beg to differ with you in this case and agree with Maciej.

The RISC-V sw-dev list was created exactly as _the_ canonical medium for
the discussion and review of all software-development- and
toolchain-related topics around the RISC-V ISA for which no more specific
mailinglist exists (such as e.g. the linux-riscv list for kernel-related
topics or the opensbi list for SBI-related topics), and discussing the
psABI spec is perfectly on-topic for this list. The github terms of
service are a no-go for quite a number of people and we already have
precedent in the development of OpenSBI (which is hosted under the same
RISC-V foundation github umbrella as the psABI spec) that the mailinglist
is the primary place for discussion and review and not the github issues
(cf. https://github.com/riscv/opensbi/blob/master/docs/contributing.md).
Discussions about the existing ELF psABI have also been taking place on
the sw-dev list and patches to it have been posted, reviewed and accepted
here, so there is IMHO no reason to insist on having the discussion and
the review of further additions to it on github instead of on this list,
quite the contrary.

The argument "at least one contributor is not subscribed to this
mailinglist" isn't any more or any less of a valid argument than "at
least one other contributor isn't subscribed to github" - this works
equally in both ways.

Regards,
Karsten
--
I hereby expressly object to the use and disclosure of my personal
data for the purposes of advertising and of market or opinion
research.

Maciej W. Rozycki

Apr 14, 2020, 3:14:19 PM4/14/20
to Jim Wilson, Evandro Menezes, Fangrui Song, RISC-V SW Dev
On Tue, 14 Apr 2020, Jim Wilson wrote:

> > It couldn't be more convoluted in my opinion, so your explanation doesn't
> > really answer my question I am afraid.
>
> We either use this convoluted system, or we have a reloc with multiple
> operands. Currently, all relocs have only one operand. If we have to
> have relocs with multiple operands that is a major change. Or maybe
> we use multiple relocs on the same instruction to hold the 3 operands
> fields.

As I mentioned previously composed relocations can be used, in this case
to refer to multiple symbols. The calculation result of a given relocation
in a composed sequence is carried over to the following one as the addend
to use (from the ELF gABI it seems that any addend provided by subsequent
relocations themselves is ignored in the calculation; this is I suppose to
keep the definition consistent between REL and RELA relocation formats,
and I also suppose a psABI is free to use that addend for a different
purpose). And this is already handled by generic ELF linker code.

BTW, technically the addend is a second operand to a relocation, the
first being the symbol referred (if any).

> I did get a message once from Michael Eager, quoting some text from
> the ELF standard saying that the operand to a reloc must always be
> related to the symbol that the reloc is for, which we are technically
> in violation of, because in our case the reloc operand for pcrel_lo
> points at the auipc not the final symbol. I haven't tried closely
> studying the ELF standard to see if we are actually violating it or
> not.

I think it is implied directly by the `r_info' field definition:

"r_info
This member gives both the symbol table index with respect to which
the relocation must be made, and the type of relocation to apply.
[...]"

> If so, this would be an argument for changing the current
> approach. But any solution looks like it is going to be as
> inconvenient as the current scheme.

I think it would not be at all inconvenient. Given the sequence I quoted
previously (modified slightly to use a local symbol instead of a label,
and the dot symbol):

sub:
.LA0: auipc a5, %pcrel_hi(ll)
lw a0, %pcrel_lo(ll + (. - .LA0))(a5)
lw a1, %pcrel_lo(ll + (. - .LA0) + 4)(a5)
ret

we'd end up with object code like this:

0000000000000000 <sub>:
0: 00000797 auipc a5,0x0
0: R_RISCV_PCREL_HI20 ll
0: R_RISCV_RELAX *ABS*
4: 0007a503 lw a0,0(a5) # 0 <sub>
4: R_RISCV_PCREL_LO12_I ll
4: R_RISCV_PCREL_DIFF .LA0
4: R_RISCV_RELAX *ABS*
8: 0007a583 lw a1,0(a5)
8: R_RISCV_PCREL_LO12_I ll+0x4
8: R_RISCV_PCREL_DIFF .LA0
8: R_RISCV_RELAX *ABS*+0x4
c: 8082 ret

(I'm not sure what the semantics of the R_RISCV_RELAX relocation is here,
so I left these relocations intact; they may have to be moved next to the
original symbol reference, i.e. ahead of R_RISCV_PCREL_DIFF, or refer
`ll'). Of course the source syntax could be simplified as:

sub:
.LA0: auipc a5, %pcrel_hi(ll)
lw a0, %pcrel_lo(ll + %pcrel_diff(.LA0))(a5)
lw a1, %pcrel_lo(ll + %pcrel_diff(.LA0) + 4)(a5)
ret

This is a bit problematic with respect to LO12_I vs LO12_S relocations
(we might need to have separate DIFF_I and DIFF_S relocations), but using
fully composed relocations (where %hi/%lo only denote the field to
relocate) we could have:

sub:
.LA0: auipc a5, %hi(%pcrel(ll))
lw a0, %lo(%pcrel(ll + %pcrel_diff(.LA0)))(a5)
lw a1, %lo(%pcrel(ll + %pcrel_diff(.LA0) + 4))(a5)
ret

and:

0000000000000000 <sub>:
0: 00000797 auipc a5,0x0
0: R_RISCV_PCREL ll
0: R_RISCV_HI20 *ABS*
0: R_RISCV_RELAX *ABS*
4: 0007a503 lw a0,0(a5) # 0 <sub>
4: R_RISCV_PCREL ll
4: R_RISCV_PCREL_DIFF .LA0
4: R_RISCV_LO12_I *ABS*
4: R_RISCV_RELAX *ABS*
8: 0007a583 lw a1,0(a5)
8: R_RISCV_PCREL ll+0x4
8: R_RISCV_PCREL_DIFF .LA0
8: R_RISCV_LO12_I *ABS*
8: R_RISCV_RELAX *ABS*+0x4
c: 8082 ret

where there is no such issue as the HI20/LO12 relocation, which determines
the relocatable field, is last, as expected. Yes, the source-level syntax
appears more elaborate, but ultimately both the binary representation and
linker processing are more straightforward (especially as it's been already
implemented by generic ELF linker code). And I think the intent as seen
in the source code is also quite clear.
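
The way such a composed sequence would be evaluated can be sketched as follows. The relocation semantics here are assumptions made for illustration (R_RISCV_PCREL and R_RISCV_PCREL_DIFF are part of the proposal above, not existing relocation types), and `apply_chain` is a hypothetical helper, not generic ELF linker code.

```python
def sign_extend_12(value):
    """Interpret the low 12 bits of `value` as a signed immediate."""
    value &= 0xFFF
    return value - 0x1000 if value & 0x800 else value

# Assumed semantics; `prev` is the result carried over from the previous
# relocation in the chain (the first relocation's own addend initially).
HANDLERS = {
    "PCREL":      lambda prev, sym, place: sym + prev - place,
    "PCREL_DIFF": lambda prev, sym, place: prev + (place - sym),
    "HI20":       lambda prev, sym, place: (sym + prev + 0x800) >> 12,
    "LO12_I":     lambda prev, sym, place: sign_extend_12(sym + prev),
}

def apply_chain(chain, place, symbols):
    """Evaluate consecutive relocations that share one place."""
    value = chain[0][2]                  # addend of the first relocation
    for rtype, symname, _ignored_addend in chain:
        value = HANDLERS[rtype](value, symbols[symname], place)
    return value

symbols = {"ll": 0x2345, ".LA0": 0x0, "*ABS*": 0}

hi = apply_chain([("PCREL", "ll", 0), ("HI20", "*ABS*", 0)], 0x0, symbols)
lo0 = apply_chain([("PCREL", "ll", 0), ("PCREL_DIFF", ".LA0", 0),
                   ("LO12_I", "*ABS*", 0)], 0x4, symbols)
lo4 = apply_chain([("PCREL", "ll", 4), ("PCREL_DIFF", ".LA0", 0),
                   ("LO12_I", "*ABS*", 0)], 0x8, symbols)
```

With these assumed semantics every step works only on the carried value and its own symbol, so no lookup of a sibling relocation through the symbol table is needed.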

And in handcoded assembly both you and other people suggest using macros
instead, in which case you won't use any of these percent-ops anyway,
e.g.:

sub:
ld a0, ll
ret

and the assembler will expand the LD macro into the right machine
instructions with appropriate relocations attached that are required to
access the `ll' symbol.

FWIW,

Maciej

Jim Wilson

Apr 14, 2020, 10:41:49 PM4/14/20
to Maciej W. Rozycki, Evandro Menezes, Fangrui Song, RISC-V SW Dev
On Tue, Apr 14, 2020 at 12:14 PM Maciej W. Rozycki <ma...@wdc.com> wrote:
> (I'm not sure what the semantics of the R_RISCV_RELAX relocation is here,
> so I left these relocations intact; they may have to be moved next to the
> original symbol reference, i.e. ahead of R_RISCV_PCREL_DIFF, or refer
> `ll').

The bfd linker can relax pc-relative addressing sequences to
gp-relative. This reduces the usual 2 instruction auipc/lw sequence
to a single lw instruction. This is critically important for reducing
code size and improving performance. The linker can also relax
pc-relative addressing to x0-relative (i.e. zero-page addressing).
This is useful for making undefined weak symbols work, and for some embedded
targets that have zero-page memory again reducing code size and
improving performance.

This relaxation support also works for double XLEN loads, changing
auipc/lw/lw into lw/lw.
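
A rough sketch of the kind of range check such a relaxation implies is given below. It is a simplified model and not the bfd linker's actual logic, which has further conditions; `relax_base` and the addresses are hypothetical, and x0-relative addressing is modelled only for the low end of the address space.

```python
def fits_simm12(value):
    """True if `value` fits a signed 12-bit load/store immediate."""
    return -2048 <= value <= 2047

def relax_base(sym_addr, gp, access_bytes=4):
    """Pick a base register for a relaxed access, or None to keep auipc."""
    last = sym_addr + access_bytes - 4   # address of the last 32-bit word
    if fits_simm12(sym_addr - gp) and fits_simm12(last - gp):
        return "gp"
    if fits_simm12(sym_addr) and fits_simm12(last):
        return "x0"
    return None

gp = 0x11800
near = relax_base(0x11000, gp)            # within 2 KiB below GP
far = relax_base(0x12000, gp)             # 2048 bytes above GP, out of range
zero_page = relax_base(0x100, gp)         # reachable from x0
split = relax_base(gp + 2044, gp, 8)      # second word would be out of range
```

For a double XLEN access both words have to be in range for the base register to be shared, which is why the last case is left unrelaxed in this model.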

Jim

Jim Wilson

Apr 14, 2020, 11:01:57 PM4/14/20
to Maciej W. Rozycki, Evandro Menezes, Fangrui Song, RISC-V SW Dev
On Tue, Apr 14, 2020 at 12:14 PM Maciej W. Rozycki <ma...@wdc.com> wrote:
> sub:
> .LA0: auipc a5, %pcrel_hi(ll)
> lw a0, %pcrel_lo(ll + (. - .LA0))(a5)
> lw a1, %pcrel_lo(ll + (. - .LA0) + 4)(a5)
> ret

There are practical issues here.

We are desperately short of binutils developer time. You want a
non-trivial change to fix something that isn't actually broken. That
is hard to justify and unlikely to happen unless you volunteer to do
the work yourself. There are a lot of more important problems to fix,
like adding support for proposed extensions (v, b, zfinx, zfh, etc),
and fixing bugs that have already been filed against binutils, like
the problem with relocs against text sections in PIE programs.

If we do succeed in adding this feature to binutils, and modifying
gcc, then we have created a dependency between binutils and gcc and
need to update the minimum required binutils version for gcc.

If we modify gcc, and don't modify llvm/lld then gcc compiled code can
no longer link with lld and we have a major problem. And vice versa
if llvm/lld is updated before binutils/gcc. So we now also have a
dependency between binutils/gcc and llvm/lld here, in that we need
support for the feature in both before we can safely enable it in
either one.

So to do this without causing problems, we need staged releases
(assembler/linker then compiler) and coordination between various
compilers (gcc and llvm primarily), and we need to make sure all of
the work happens in the right order and all of the linux distros
update to compatible gcc/llvm versions or we end up breaking
something. So it probably takes 2 or 3 years to do this right without
causing any problems, assuming we have engineers available to do the
work. This is an awful lot of work for something that isn't actually
broken.

I agree that it is a better design in theory. I'm just not convinced
that it is better in practice because ABI changes are always hard to
deal with.

And I'm only thinking about compilers here. There are other tools
that also deal with relocations. The linux kernel for instance has
support to load modules, handling relocs at kernel module load time,
which means you will need a linux kernel patch to handle the new
relocs also. That is another complication. A new compiler won't work
with old kernels if used to compile a kernel module, unless you can
update the kernel with a patch, or disable the compiler feature. We
could add an option to new kernel versions to turn the feature off,
but old kernels won't have that option in their Makefiles.

Jim

Maciej W. Rozycki

Apr 19, 2020, 2:18:34 PM4/19/20
to Stef O'Rear, RISC-V SW Dev, Damien Le Moal, i...@maskray.me, dal...@aerifal.cx
Stef --

> > This design has been originally presented at LCA 2020 and a recording is
> > available here: <https://www.youtube.com/watch?v=GydyykyNjxs>.
> >
> > I will appreciate your questions, comments and any other kind of
> > feedback.
>
> I've discussed this proposal with Rich Felker and Fangrui Song (CCed)
> in #musl; the
> following comments are exclusively mine.

Thank you for your input and your involvement with this effort.

> The register usage is exactly what I had in mind, and most of the code
> sequences seem approximately fine (several are not), but the relocation
> structure is extremely different from the other existing FDPIC ABIs[1][2][3],
> in a way which will make it difficult to support in generic code such as musl;
> I believe the ABI should be made as consistent as possible to avoid surprises
> like what we went through with TLS copy relocs.

I have deliberately avoided going through any other architecture's psABI
under the observation that while I can do it at any time after the initial
design proposal, doing it right at the beginning would put me at risk
of becoming negatively primed with respect to ways to solve the problem.
If interested, please watch: <https://www.youtube.com/watch?v=Yv4tI6939q0>
to see why negative priming can lead one to make bad decisions.

Of course, like everyone, I might make a bad design decision from time to
time as well, and the purpose of a peer review is to catch those early so
as to avoid any damage they might create otherwise, which could be
difficult to repair. This is one reason for my posting this proposal.

The ultimate goal is to design the psABI the best way possible given the
properties of the architecture and in particular exploiting any competitive
advantage it may have over other architectures. Therefore any choices
made for other architectures ought not to influence it unless they are
beneficial or at least neutral.

> [1]: http://ftp.redhat.com/pub/redhat/gnupro/FRV/FDPIC-ABI.txt
> [2]: https://j-core.org/downloads/fdpic-sh.txt
> [3]: https://github.com/mickael-guene/fdpic_doc/blob/master/abi.txt

Thank you for the references.

> > Table 4.1 Relocation Operands
> >
> > Operand | Description
> > =========+================================================================
> > A | Relocation addend.
> > ---------+----------------------------------------------------------------
> > DBA | Data segment's base address; 0 in static link.
> > ---------+----------------------------------------------------------------
> > G | The offset from GP of a GOT entry for the symbol referred by
> > | the relocation.
> > ---------+----------------------------------------------------------------
> > GP | The value of GP associated with the symbol referred, nominally
> > | (DVMA + DBA + 2048).
> > ---------+----------------------------------------------------------------
> > P | The place (offset or address) of the storage unit affected by
> > | the relocation.
> > ---------+----------------------------------------------------------------
> > PLTE | The address of a PLT entry associated with the symbol referred.
> > ---------+----------------------------------------------------------------
> > PLTI | The address of a PLT entry designated to make indirect calls.
> > ---------+----------------------------------------------------------------
> > S | The value of the symbol referred by the relocation.
> > ---------+----------------------------------------------------------------
> > TBA | Text segment's base address; 0 in static link.
> >
> > Table 4.2 Relocation Types
> >
> > Name | Value | Field | Symbol | Calculation
> > ==========================+=======+=============+===========+=============
> > R_RISCV_RELATIVE | 3 | T-word32,64 | n/a | TBA + A
> > R_RISCV_REL_TEXT (alias) | | | |
> > --------------------------+-------+-------------+-----------+-------------
> > R_RISCV_GP | 12 | T-word32,64 | any | GP
> > --------------------------+-------+-------------+-----------+-------------
> > R_RISCV_REL_DATA | 13 | T-word32,64 | n/a | DBA + A
>
> None of the SH, FRV, or ARM FDPIC ABIs define anything equivalent to REL_DATA
> or GP. Why is it there?

The REL_DATA relocation is needed for static data references to local
symbols. Those symbols are necessarily not present in the dynamic symbol
table, yet the data references have to be relocated by the data segment's
base address at load time, because the final load-time address of the
respective symbols is not known at the static link time.

Separate REL_DATA and REL_TEXT relocations are required rather than a
single RELATIVE relocation, because unlike with the regular ABI,
which only has a single base address defined, we have a separate data
segment base address and text segment base address for every program.

The GP relocation resolves as per its definition, to the value of the
global pointer associated with the function called. It is required
because the callee can no longer determine the value of the GP from the
PC (if required), as there is no fixed offset between the two like in the
regular ABI.

> "Data segment base address" does not seem to be defined anywhere?

Now corrected. As per the ELF gABI the base address is the difference
between the load address and the corresponding virtual memory address
(`p_vaddr') of the segment loaded lowest in memory. Since in the FDPIC
ABI we necessarily treat text and data segments as separate areas in
memory, each of them has a base address of its own: a text segment base
address and a data segment base address respectively.

I feel it is sort of obvious to anyone familiar with the ELF gABI, but
you are right in that in a formal document even seemingly obvious terms
are best explicitly defined for the avoidance of doubt.
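
To make the definition concrete, here is a small sketch, assuming a
module whose text is shared between two processes while each process gets
its own copy of the data. All addresses and names are invented for the
example.

```python
def base_address(load_addr, p_vaddr):
    """Base address per the ELF gABI: load address minus link-time p_vaddr."""
    return load_addr - p_vaddr


# Hypothetical module: text linked at 0x0000, data linked at 0x8000.
TEXT_VADDR, DATA_VADDR = 0x0000, 0x8000

# Two processes share one copy of the text but get their own data copies.
mappings = {
    "proc_a": {"text": 0x10000000, "data": 0x20000000},
    "proc_b": {"text": 0x10000000, "data": 0x20400000},
}

bases = {
    name: (base_address(m["text"], TEXT_VADDR), base_address(m["data"], DATA_VADDR))
    for name, m in mappings.items()
}
for name, (tba, dba) in bases.items():
    print(f"{name}: TBA={tba:#x} DBA={dba:#x}")
```

The text base address is the same for both mappings while the data base
address differs, which is why a single base address cannot describe an
FDPIC module.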

> > ==========================+=======+=============+===========+=============
> > | | | local | S - P
> > R_RISCV_CALL_PLT | 19 | V-hi20lo12i | external | PLTE - P
> > | | | n/a | PLTI - P
>
> None of the SH, FRV, or ARM ABIs use anything like PLTI.

Ack.

> > --------------------------+-------+-------------+-----------+-------------
> > R_RISCV_GPREL_HI20 | 59 | V-hi20 | local | S - GP + A
> > --------------------------+-------+-------------+-----------+-------------
> > R_RISCV_GPREL_LO12_I | 60 | T-lo12i | local | S - GP + A
> > --------------------------+-------+-------------+-----------+-------------
> > R_RISCV_GPREL_LO12_S | 61 | T-lo12s | local | S - GP + A
> > --------------------------+-------+-------------+-----------+-------------
> > R_RISCV_GPREL_GOT_HI20 | 62 | V-hi20 | any | G
> > --------------------------+-------+-------------+-----------+-------------
> > R_RISCV_GPREL_GOT_LO12_I | 63 | T-lo12i | any | G
>
> The GPREL and GPREL_GOT relocations look correct. We also need assembler
> syntax for them, and to decide whether they are %functions or @MODIFIERS.

Given the established practice with RISC-V assembly syntax and also the
discussion elsewhere in this thread about composed relocations I think
using percent-ops is the way forward as they give more flexibility (in
particular you can use parentheses around expressions to indicate the
addend to include with the relocation involved).

But none of this is a part of the psABI, which (as the name implies) only
discusses the binary interface. Any source-level syntax belongs to the
respective language involved, including the assembly language.

> We also need R_RISCV_FUNCDESC (canonical function descriptor),
> R_RISCV_FUNCDESC_VALUE (copy of function descriptor),
> R_RISCV_GPREL_GOTFUNCDESC_(HI20, LO12_I) (offset within GOT of a pointer-sized
> slot which will receive a pointer to the canonical function descriptor),
> R_RISCV_GPREL_FUNCDESC_(HI20, LO12).

Given the analysis of the problem so far it does not appear to me that
these relocations are strictly required, as you can infer the access type
(data vs code, the latter implying a function descriptor) of a GP-relative
reference from the referred symbol's type (STT_OBJECT vs STT_FUNC).
After all the static linker always has to create a function descriptor
whenever an address of a function is taken or a call is made to a
preemptible symbol (which goes through the PLT), so it is not the access
(relocation) type that determines it.

Also I find it cleaner when the compiler has to know less about linkage
peculiarities, but that might be seen as a matter of style.
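
As a rough sketch of what I mean, assuming the static linker classifies
a GP-relative reference purely by the symbol type (STT_FUNC in the gABI
headers). The helper is hypothetical and not taken from any existing
linker.

```python
# Symbol type values from the ELF gABI.
STT_OBJECT = 1
STT_FUNC = 2


def gprel_reference_kind(sym_type):
    """Hypothetical static-linker helper: classify a GP-relative reference."""
    if sym_type == STT_FUNC:
        # The reference is satisfied with a function descriptor.
        return "function descriptor"
    if sym_type == STT_OBJECT:
        # The reference is satisfied with the data object itself.
        return "data"
    raise ValueError(f"unexpected symbol type {sym_type}")


kinds = [gprel_reference_kind(t) for t in (STT_OBJECT, STT_FUNC)]
print(kinds)
```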

That noted I guess it would not be a big deal if we had such separate
relocations, although the redundancy introduced this way would imply
consistency checks and the rejection by the static linker of invalid
relocation vs symbol combinations. Or was that consistency check the
motivation for the design you refer to?

Also if we were to adopt these separate relocations, which obviously
multiply relocation kinds that follow a similar pattern, based on the
observations made in the discussion elsewhere in this thread I would be
leaning towards using composed relocations rather than individual
relocation types, disentangling the relocation calculation from the layout
of the field to relocate.

This way we'd only have one extra R_RISCV_FUNCDESC relocation type for
function descriptor references rather than five or six individual ones.
That single relocation could be composed by an implementation as required
to represent the link-time operation (expression) requested without the
need to expand the psABI whenever a new combined expression is required,
and the model would overall be cleaner in my opinion.

Same with the R_RISCV_GPREL relocations I already proposed (we may have
to figure out the namespace to use to avoid a semantics clash with the
regular RISC-V psABI as defined already); I'll look into it.

> R_RISCV_FUNCDESC and R_RISCV_FUNCDESC_VALUE are dynamic relocations.

The former relocation would presumably be used instead of R_RISCV_64 or
R_RISCV_32 for preemptible function references from static data?
Likewise the dynamic loader could resolve that based on the referred
symbol's type, so the same observation as I made above applies.

I'm not sure what the use scenario for the latter relocation would be,
please elaborate.

> > Occasionally a GOT entry will be created for local data to satisfy the
> > use of R_RISCV_GPREL_GOT_HI20 and R_RISCV_GPREL_GOT_LO12_I relocations in
> > code referring to such data. The R_RISCV_REL_DATA dynamic relocation is
> > defined to support GP-relative relocation of such GOT entries at program
> > load time.
>
> Why do you need REL_DATA when ARM, FRV, and SH don't?

What relocation do you use for local GOT entries referring to data rather
than text? Do you always produce a function descriptor for function calls
made to a local symbol? That would be a waste of memory and cycles for
quite a common scenario: shared libraries often use restrictive ELF export
classes or a linker script to avoid exporting symbols meant not to be a
part of the API; also symbols in the main executable are typically not
exported to shared libraries. All these symbols can be called with a
direct PC-relative reference (no PLT involved) as with the regular RISC-V
psABI.

> > 4.3 Procedure Calls (normative)
> >
> > Local procedure calls use the same code sequence as with ordinary PIC
> > code. PC-relative addressing can be used as all code locations are fixed
> > with respect to each other and the address is not interpreted beyond
> > making the jump itself. GP does not change in the process of making a
> > local procedure call as control remains in the same module.
>
> Should clarify that while GP does not change as part of the call instruction
> itself, the called procedure is allowed to clobber GP (this is necessary for
> external tail calls).

That is an interesting point, thanks. I don't have numbers to hand, but
intuitively the saving from allowing tail calls to be made will be higher
than that from relaxing away GP restoration (and possibly also the
arrangement of a save slot for it).

Therefore I have, provisionally, updated the specification, however I
think it will have to be evaluated in implementation before it is finally
decided.

> > A data structure called Function Descriptor Table (FDT) is created by the
> > static linker to hold PC/GP pairs used in external procedure calls.
> > Addresses of individual FDT entries serve as pointers to the respective
> > procedures. An FDT entry is therefore created for each function symbol
> > that is external, whether defined or not, or whose address is taken for
> > a purpose other than making a call.
>
> Canonical function descriptors are created by the *dynamic* linker, not ld,
> and they exist outside of any load segment (except possibly when static
> linking). Every function which is referred to gets a single canonical
> function descriptor. Other FDPIC ABIs don't use the "FDT" term and I
> think it detracts from clarity to use it here.

There's no need I believe to use an assertion in a discussion about
something that hasn't been finalised yet.

Your proposal to build what you call canonical function descriptors on
demand in the dynamic loader rather than precreating them in the static
linker sounds interesting to me, as it seems to solve some issues in my
design, although at the price of some heap consumption and processing
complication in the dynamic loader.

It's not clear to me where the term "canonical" comes from though, as
those will only be occasionally created, as most functions do not have
their address taken for purposes other than making a call; and to qualify
for dynamic creation of a function descriptor they need to be external
too.

Note however that building function descriptors in the dynamic loader has
an issue with protected function symbols, which need to resolve locally
within the defining module even in the presence of an earlier external
definition, and yet satisfy pointer equality requirements. There may be
multiple protected function symbols of the same name involved in a given
dynamic load, plus optionally one non-protected external symbol of that
name. This has to be handled correctly.

> R_arch_FUNCDESC_VALUE can create a copy of a function descriptor at any
> two-word aligned address in the load segment, but there is no "descriptor
> table" as a cohesive entity.

Surely one is needed to handle PLT calls effectively, like the PLTGOT
is used with the regular ABI.

I think it makes sense to put function descriptors of non-preemptible
functions whose address is taken for a purpose other than making a call
here as well; those will typically have no dynamic symbol associated at
all (except for protected symbols), and therefore there is no way even to
have them arranged by the dynamic loader (to say nothing of any point).

> > As the ultimate values of the PC and the GP are only determined at load
> > time the static linker attaches dynamic relocations to data in the FDT.
> > For external function symbols the R_RISCV_JUMP_SLOT and R_RISCV_GP
> > relocations are used for the PC and GP respectively, both referring to
> > the function symbol. For local function symbols whose address is taken
> > the R_RISCV_REL_TEXT and R_RISCV_GP relocations are used with no symbol
> > referred.
>
> Every other FDPIC ABI uses a R_ARCH_FUNCDESC_VALUE relocation to fill in both
> words of a function descriptor copy at once.

Well, it makes it more difficult for the dynamic loader to tell entries
apart that correspond to functions whose address has been taken for a
purpose other than making a call and those that can be lazily bound.
Consequently, depending on the order of dynamic relocations in the
relocation table, it may happen that the lazy resolver is called for calls
to function symbols that have already been eagerly resolved. It also
actually precludes the static linker from arranging some references to
never be lazily bound if required for whatever reason, as there is no
relocation defined to express that requirement.

Otherwise that seems largely a matter of style to me: relocations with
the STN_UNDEF symbol index correspond to R_RISCV_REL_TEXT and the
remaining ones correspond to R_RISCV_JUMP_SLOT, with the GP relocation of
the following address word implied.
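
A sketch of that reading, assuming a single two-word descriptor
relocation were adopted. The function and its arguments are hypothetical;
only the mapping from the symbol index to the implied relocations follows
the paragraph above.

```python
STN_UNDEF = 0  # ELF gABI: the undefined symbol index


def implied_relocations(offset, sym_index, word_size):
    """Expand one hypothetical two-word descriptor relocation.

    Returns (offset, relocation name) pairs for the text pointer word and
    the GP word that follows it.
    """
    pc_reloc = "R_RISCV_REL_TEXT" if sym_index == STN_UNDEF else "R_RISCV_JUMP_SLOT"
    return [(offset, pc_reloc), (offset + word_size, "R_RISCV_GP")]


local = implied_relocations(0x100, STN_UNDEF, 4)
external = implied_relocations(0x108, 7, 4)  # 7: made-up dynamic symbol index
print(local, external)
```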

> > Figure 4.2 Function Description Table
> >
> > FDT Outstanding dynamic relocations
> > __riscv_fdt_func1 ---> +------------------+
> > | Text Pointer 1 | R_RISCV_JUMP_SLOT func1
> > +------------------+
> > | Global Pointer 1 | R_RISCV_GP func1
> > __riscv_fdt_func2 ---> +==================+
> > | Text Pointer 2 | R_RISCV_JUMP_SLOT func2
> > +------------------+
> > | Global Pointer 2 | R_RISCV_GP func2
> > __riscv_fdt_func3 ---> +==================+
> > | Text Pointer 3 | R_RISCV_REL_TEXT
> > +------------------+
> > | Global Pointer 3 | R_RISCV_GP
> > +==================+
> > | . . . |
>
> again, this is gratuitously different from what every other arch does.
>
> Other arches use 1 relocation per function descriptor copy, and they don't
> create duplicate symbols.

See above for the discussion on using individual relocations. You have
not raised any concern about the increase of memory consumption caused by
using individual relocations for addresses held in the function descriptor
table, but if that was your intent, then I agree that it would be a valid
concern, and it could be addressed by defining R_RISCV_FUNCDESC_JUMP_SLOT,
R_RISCV_FUNCDESC_GLOBAL and R_RISCV_FUNCDESC_RELATIVE relocations instead.

Also we need to provide symbols for function descriptors created for
protected symbols so that other modules in a dynamic load can refer to
them when taking such a function's address for a purpose other than making
a call. I agree that the extra symbols can go in your proposed model,
where the dynamic loader makes the function descriptors for external,
non-protected symbols whose address is taken for a purpose other than
making a call.

Being different from solutions chosen for other architectures does not
automatically make a solution wrong, so this is a weak argument.

> > A Procedure Linkage Table (PLT) is created to handle calls via the FDT,
> > so that the same code sequence is used in the program proper to make
> > direct procedure calls regardless of whether the function symbol called
> > is local or external. Since the PLT is local to the module its entries
> > can be reached with PC-relative addressing. Individual PLT entries are
> > created and called into for each external procedure called.
> >
> > For direct calls an FDT entry is used that corresponds to the procedure
> > called and has been created in the module making the call. Therefore
> > code in the PLT can access the FDT entry directly as local data, using
> > GP-relative addressing.
>
> Again, "FDT" is misleading about how function descriptors are created.

It just matches reality. It's not that function descriptors are going to
be randomly scattered across the data segment, it's natural for the static
linker to group them into a table like GOT entries.

> > For indirect calls the PLT is also used and an FDT entry is used that
> > corresponds to the procedure called and has been created in the module
> > providing the function symbol of the procedure.
>
> This seems a bad idea and gratuitously different from every other FDPIC ABI.
> Other FDPIC ABIs use code at the call site for indirect calls. If you are
> doing this for code size reasons, a compiler generated function in a
> .gnu.linkonce section is a much better idea because it does not create an ABI
> constraint.

Since we need code to load the GP/PC pair in the PLT anyway I found it
attractive to reuse it. Do you have any counter-arguments besides that
nobody else has decided to do so? It seems a weak argument to me, and
there's nothing in the psABI document that forbids a code generator from
expanding the sequence inline if speed is preferred to space, which is
what I had in mind when developing this part; I can clarify that in the
document.

> > If a function symbol is external, then an external dynamic data symbol is
> > created that refers to that FDT entry and whose name is constructed by
> > prepending `__riscv_fdt_' to the function's symbol name.
>
> This is gratuitously different from other FDPIC ABIs, which use *FUNCDESC*
> relocations to generate function descriptors.

As I noted above function descriptors for protected symbols whose address
is taken by another module for a purpose other than making a call cannot
be constructed like you propose or pointer equality would not be
guaranteed.

> It is also very inefficient since it doubles the number of symbols and symbol
> names in a library.

A shared library normally exports a limited number of symbols as its API,
but you are right this is inefficient if an alternative exists. I think
we still need to do this for protected symbols, so I will update the
document accordingly.

> > If the address of an external function symbol is taken, then a GOT entry
> > is created for the corresponding `__riscv_fdt_' dynamic data symbol and
> > used to satisfy the reference.
>
> The compiler should generate an @GOTFUNCDESC reference and the linker should
> generate a R_RISCV_FUNCDESC relocation, not create a new symbol.

As I noted above, it's a matter of convention whether we want to have
distinct relocation types or examine the referred symbol's type. Overall
I find it cleaner when the compiler knows less about linkage peculiarities.

> > When making an indirect call a dedicated PLT entry is used that is common
> > to all indirect calls and upon invocation of that PLT entry the x5 (t0)
> > register holds the address of the FDT entry in the module providing the
> > function symbol of the procedure to call.
>
> No other FDPIC ABI does this.

Ack.

> > 4.4 Typical Code Sequences (informative)
> >
> > In the sequences below expressions on the right-hand side of relocation
> > names denote the symbol and the addend specified with the relocation. In
> > the absence of a `+' operator only a symbol is specified, otherwise the
> > left-hand side of the addition is a symbol and the right-hand side is an
> > addend. If a symbol is specified as `*ABS*', then the value is 0 (the
> > symbol index is STN_UNDEF in the relocation). The value of ABS() is the
> > absolute (static-link-time) value of the expression in the parentheses.
> >
> > 4.4.1 Local Data Addressing
> >
> > Ordinary PIC code, using PC-relative addressing:
> >
> > # Outstanding static relocations
> > label:
> > auipc t0, %pcrel_hi(var+addend) # R_RISCV_PCREL_HI20 var+addend
> > lbu t1, %pcrel_lo(label)(t0) # R_RISCV_PCREL_LO12_I label
> > sb t2, %pcrel_lo(label)(t0) # R_RISCV_PCREL_LO12_S label
> >
> > Corresponding FDPIC code, using GP-relative addressing:
> >
> > # Outstanding static relocations
> > lui t0, %gprel_hi(var+addend) # R_RISCV_GPREL_HI20 var+addend
> > c.add t0, gp
> > lbu t1, %gprel_lo(var+addend)(t0) # R_RISCV_GPREL_LO12_I var+addend
> > sb t2, %gprel_lo(var+addend)(t0) # R_RISCV_GPREL_LO12_S var+addend
>
> This is good, subject to Jim's point about add vs c.add

This part is informative and there's technically nothing wrong with the
sequence quoted as it will produce the correct result at run time, however
to avoid potential confusion I have already edited this code according to
Jim's suggestion.
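
For reference, a minimal sketch of how the S - GP + A value is split
between the HI20 and LO12 fields and then recombined by the
LUI/ADD/load sequence. It models RV32 arithmetic only; the function names
and sample addresses are made up.

```python
def split_hi20_lo12(offset):
    """Split a signed 32-bit offset into LUI and I/S-type immediates."""
    lo12 = ((offset & 0xFFF) ^ 0x800) - 0x800  # sign-extend the low 12 bits
    hi20 = ((offset - lo12) >> 12) & 0xFFFFF   # compensate for a negative lo12
    return hi20, lo12


def gprel_address(gp, hi20, lo12):
    """Model: lui t0, hi20; add t0, t0, gp; lbu t1, lo12(t0) on RV32."""
    t0 = (hi20 << 12) & 0xFFFFFFFF
    t0 = (t0 + gp) & 0xFFFFFFFF
    return (t0 + lo12) & 0xFFFFFFFF


gp = 0x20004000          # made-up run-time GP value
target = 0x20004ABC      # made-up run-time address of var+addend
hi, lo = split_hi20_lo12(target - gp)
assert gprel_address(gp, hi, lo) == target
print(hex(hi), lo)
```

The low part is sign-extended by the load or store, so the high part has
to be rounded up whenever bit 11 of the offset is set.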

As discussed elsewhere this sequence does not work for read-only data
sections merged with the text segment. Offhand I think this can only be
solved with linker relaxation as it may not be known up until the static
link time that a symbol referred will be placed there; fortunately the
presence of AUIPC guarantees that the corresponding code sequence will be
shorter, so even trivial processing in the static linker with NOP padding
will do.

Alternatively a simple static linker implementation can choose to merge
read-only data sections with the data segment; there's no MMU anyway to
enforce write protection for the corresponding memory area in systems
typically targeted by FDPIC code, although run-time memory consumption
will rise once the data segment is copied for multiple processes.

> > 4.4.2 External Data Addressing
> >
> > Ordinary PIC code, using GOT and PC-relative addressing:
> >
> > # Outstanding static relocations
> > label:
> > auipc t0, %pcrel_got_hi(var) # R_RISCV_GOT_HI20 var
> > l[w|d] t0, %pcrel_lo(label)(t0) # R_RISCV_PCREL_LO12_I label
> > lb t1, addend(t0)
> > sb t2, addend(t0)
>
> > # Outstanding dynamic relocations for the GOT entry
> > # R_RISCV_32,64 var
>
> So far so good
>
> > # or if the data symbol turns out local at static link time
> > # R_RISCV_REL_DATA *ABS*+ABS(var)
>
> I don't think this actually works, for one thing var might be in rodata, there
> could also be multiple data segments. I don't see anything like REL_DATA in
> other FDPIC ABIs, I think it always has to be R_RISCV_{32,64}, or whatever the
> other arches do.

There is no issue with read-only data merged with the text segment here
(which is what I gather you refer to) as we can still use a GOT entry for
a local access to data that is not addressable in a GP-relative manner.
I guess you meant to refer to code I proposed above in section 4.4.1.

We cannot handle a run-time scenario where a single module has pieces of
its text or data segment scattered across multiple memory areas located at
arbitrary positions with respect to each other, because we have only one
PC and one GP. Of course a single text or data segment can each be
represented by multiple ELF file segments, which can map to memory in a
discontiguous manner. I see no practical reason to do so and holes in the
resulting memory allocations may make it difficult to use available memory
effectively, but yes, technically it is doable and is going to be
supported with the model I propose.

Finally you raise an interesting point with respect to the nomenclature
of relocations that I haven't considered before. Technically there is no
need for any architecture to have dedicated R_*_RELATIVE relocations with
their regular psABI, as the corresponding R_*_{32,64} relocations provide
the same semantics where the index of the symbol referred is STN_UNDEF.
Therefore I have no idea why separate R_*_RELATIVE relocations have been
invented with the same semantics (or for that matter why there are
separate R_*_32 and R_*_64 relocations where the relocation used has to
match the ELF file's address width, but no separate R_*_RELATIVE_32 and
R_*_RELATIVE_64 relocations).

So technically you are right we can use R_RISCV_RELATIVE for PC-relative
dynamic relocations and R_RISCV_{32,64} relocations for GP-relative
dynamic relocations. Or vice versa. Either way we need to be explicit
which one is which though, to avoid unnecessary confusion for humans, and
also I dislike the asymmetry where we have one relocation for one purpose
regardless of the ELF file's address width and a pair of relocations,
chosen individually according to the ELF file's address width, for the
other. I'd prefer to have a new single code for the other case.

> > Corresponding FDPIC code, using GOT and GP-relative addressing:
> >
> > # Outstanding static relocations
> > lui t0, %gprel_got_hi(var) # R_RISCV_GPREL_GOT_HI20 var
> > c.add t0, gp
> > l[w|d] t0, %gprel_got_lo(var)(t0) # R_RISCV_GPREL_GOT_LO12_I var
> > lbu t1, addend(t0)
> > sb t2, addend(t0)
> >
> > # Outstanding dynamic relocations for the GOT entry
> > # R_RISCV_32,64 var
> >
> > # or if the function turns out local at static link time
> > # R_RISCV_REL_DATA *ABS*+ABS(var)
>
> Code looks good, same concern about REL_DATA.

Discussed above.

> > 4.4.3 Taking a Function's Address
> >
> > FDPIC code, local function:
> >
> > # Outstanding static relocations
> > lui t0, %gprel_hi(fun) # R_RISCV_GPREL_HI20 fun
> > c.add t0, gp
> > addi t1, t0, %gprel_lo(fun) # R_RISCV_GPREL_LO12_I fun
> >
> > FDPIC code, external function:
> >
> > # Outstanding static relocations
> > lui t0, %gprel_got_hi(fun) # R_RISCV_GPREL_GOT_HI20 fun
> > c.add t0, gp
> > addi t1, t0, %gprel_got_lo(fun) # R_RISCV_GPREL_GOT_LO12_I fun
>
> These are, unfortunately, not compatible with dynamic linking semantics. A
> function needs to have the same address regardless of which module its address
> is taken in, so you have to always get the canonical function descriptor, which
> has to come from the GOT because canonical function descriptors are created by
> the dynamic linker.

Yes, this was a silly editorial mistake, sorry about that. The sequence
for an external reference has to read the relevant GOT entry rather than
taking its address. The local sequence is of course fine, as the offset
from GP is constant at link time. These sequences were meant to be:

FDPIC code, local function:

# Outstanding static relocations
lui t0, %gprel_hi(fun) # R_RISCV_GPREL_HI20 fun
c.add t0, gp
addi t1, t0, %gprel_lo(fun) # R_RISCV_GPREL_LO12_I fun

FDPIC code, external function:

# Outstanding static relocations
lui t0, %gprel_got_hi(fun) # R_RISCV_GPREL_GOT_HI20 fun
c.add t0, gp
l[w|d] t1, %gprel_got_lo(fun)(t0) # R_RISCV_GPREL_GOT_LO12_I fun

-- as presented at LCA (with C.ADD then updated to ADD as per Jim's
suggestion). Thank you for your meticulousness.

> This should be something like (same for both local and
> external):
>
> lui t0, %gprel_got_hi(fun@FUNCDESC) #
> R_RISCV_GPREL_GOTFUNCDESC_HI20 fun
> add t0, t0, gp
> l[w|d] t0, %gprel_got_lo(fun@FUNCDESC)(t0) #
> R_RISCV_GPREL_GOTFUNCDESC_LO12 fun
>
> eventually resulting in dynamic relocations for the GOT entry:
>
> R_RISCV_FUNCDESC fun

Discussed above already.

> > FDPIC code, indirect call (to a2):
> >
> > # Outstanding static relocations
> > c.mv t0, a2
> > label:
> > auipc ra, %pcrel_call_hi(@PLT) # R_RISCV_CALL_PLT
> > jalr ra, ra, %pcrel_call_lo(label)
> > l[w|d] gp, <gp_slot>(sp)
> >
> > # The R_RISCV_CALL_PLT relocation with no symbol referred resolves to
> > # the PLT entry associated with indirect calls.
>
> As above I don't think it makes sense to handle this as a PLT entry. The call
> should be generated inline:
>
> lw t1, 0(a2)
> lw gp, 4(a2)
> jalr ra, t1
> lw gp, <gp_slot>(sp)

Discussed above already.

> > Chapter 5 Program Loading
> >
> > 5.1 Base Addresses (normative)
> >
> > A single individual base address is defined by the ELF gABI for a module
> > being loaded that determines the amount to relocate the module by. This
> > is unsuitable for FDPIC modules, which need to have their text segments
> > and data segments mapped in memory separately. This is so that where a
> > module is mapped multiple times in a no-MMU system, only a single copy of
> > its text segments is present in memory and serves all the mappings, while
> > a separate copy of its data segments is present in memory for each of the
> > mappings. Consequently the distance between text and data segments is no
> > longer constant between mappings and there is no single base address.
> >
> > Instead a separate text base address and a data base address is defined
> > as a difference between the load address and the link address of the text
> > segment and the data segment respectively. These two base addresses are
> > used by the dynamic loader to relocate text and data respectively.
>
> FDPIC does not have a "data base address"; there are one or more load segments,
> relocated independently using a load map.

Regrettably we cannot support multiple global pointers to address each of
the load segments independently, so even if the presence of multiple ELF
segments makes the data segment (or indeed the text segment) discontiguous
the relative position of the individual pieces of the data (text) segment
with respect to one another has to remain constant, and therefore together
they all form a single sparse logical segment.
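
The constraint can be sketched as a check a loader might make, assuming
each piece of the data segment is described by its link-time `p_vaddr'
and its load address. The helper and the addresses are hypothetical.

```python
def common_displacement(pieces):
    """Return the shared (load - link) displacement of all pieces, or None.

    pieces: list of (p_vaddr, load_addr) pairs for one logical segment.
    """
    deltas = {load_addr - p_vaddr for p_vaddr, load_addr in pieces}
    return deltas.pop() if len(deltas) == 1 else None


# Two pieces with a hole between them, moved together: acceptable.
sparse = [(0x8000, 0x20008000), (0xC000, 0x2000C000)]
# Second piece placed independently: not addressable with a single GP.
scattered = [(0x8000, 0x20008000), (0xC000, 0x30000000)]

print(common_displacement(sparse), common_displacement(scattered))
```

Only the first layout can be served by one GP value, because every
GP-relative offset computed at static link time stays valid.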

> > In the initial module, such as a program interpreter, loaded by an OS or
> > other executive runtime the text base address of said initial module can
> > be determined by calculating a run-time difference between the actual
> > value of the PC for a given location, such as the beginning of the text
> > segment, obtained with a PC-relative reference to a symbol associated
> > with that location and the value of a corresponding absolute symbol
> > associated with the same location. The way to determine the data base
> > address and therefore the value of GP of the initial module is specific
> > to the individual OS or other executive runtime and therefore beyond the
> > scope of this specification. Possibilities include passing suitable
>
> Every other FDPIC ABI has a normative Start up section that specifies how
> Linux will pass a elf32_fdpic_loadmap struct; it's in scope here.

I think OS-specific startup is beyond the scope of this specification, as
the specification itself is not OS-specific. For example FreeBSD or some
bare-metal RTOS may do this differently, e.g. use the auxiliary vector,
preset the GP to the load address of the lowest-mapped writable ELF file
segment, define a syscall, or whatever. The OS-specific runtime is meant
to set up the GP somehow to match this specification's requirements.

> Note that the Linux FDPIC support currently has 32-bit assumptions and
> 64-bit FDPIC will need to be documented here, much as the FRV ABI
> supplement defined 32-bit FDPIC ptrace calls.

I advise discussing such details with each interested OS's developers at
the relevant forum. Actually the 64-bit Linux part has already been done
by Damien (cc-ed).

I suppose we could accept submissions from OS developers documenting
their interfaces as informational appendices, so that there is a single
reference point.

> > information via the initial stack, such as in the auxiliary vector,
> > preinitializing a processor register, providing a system call to retrieve
> > it, etc.
> >
> > The presence of a separate text base address and a data base address also
> > means that ET_EXEC images cannot be supported with the FDPIC psABI as it
> > is not possible to make multiple copies of such image's data segment in a
> > no-MMU system without the ability to relocate it at load time.
> >
> >
> > 5.2 Lazy Binding (normative)
> >
> > Lazy binding can be optionally implemented by the dynamic loader. If it
> > is implemented, then the run-time relocation of R_RISCV_JUMP_SLOT and
> > their associated R_RISCV_GP relocations present in the FDT is done in two
> > stages.
>
> these should be a single relocation for consistency with other FDPIC ABIs.
>
> Properly supporting lazy binding on FDPIC is very difficult for multithreaded
> programs because it is impossible (on baseline RV*IA) to atomically update both
> words that compose a function descriptor copy. Lazy binding is disabled on
> modern distros as a hardening measure and not supported by musl as a matter of
> policy, so it is likely not worth trying to make it work.

Good point about atomicity.

> If you were to attempt to do this, it would be necessary to specify the order
> of loads in PLT entries (always load the entry point first and the GOT second);
> updates would write the correct GOT, issue a membarrier() syscall (a no-op on
> uniprocessor or sequentially consistent systems, required for ordering
> otherwise), and then write the new entry point.

Does the RISC-V ISA support weak memory ordering? I thought it did not,
having learnt from all the software engineering challenges it caused with
DEC Alpha systems (and to some extent MIPS systems). Some 25 years on and
some Linux kernel bugs still haven't been sorted in this area, and people
keep appearing who cannot even understand there is a problem there.

Anyway, that does not seem to be a big deal to me, and then it is an
implementation detail.

Also why do we need such a heavyweight mechanism as a syscall for an
ordering barrier? Borrowing your argument: all the other ISAs that
support weak memory ordering have an unprivileged hardware instruction for
synchronisation: Alpha has MB, MIPS has SYNC and even Intel x86 (which had
some weak bus ordering properties in its Pentium Pro implementation; not
sure if any were carried to any later microarchitectures) has CPUID.

> This guarantees that the entry point can only be reached with the corresponding
> GOT, however, it allows the lazy resolver to be called with _either_ the
> initial GOT value for the lazy descriptor, _or_ the final symbol's GOT. As
> such, the lazy resolver cannot depend(!) on the GOT register it receives.

Indeed, but as you note the dynamic loader's GP can be stored at a place
uniformly reachable from any valid GP value corresponding to one of the
modules loaded, e.g. in the link map. As such it does not have to be
standardised at the psABI level and can be left to the implementation.
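
The ordering argument quoted above can be checked by brute force. The
sketch below assumes sequential consistency and a single caller racing a
single resolver; it enumerates every interleaving in which the PLT loads
the entry point and then the GP while the resolver stores the new GP and
then the new entry point. The step and value names are invented for the
example.

```python
from itertools import permutations


def observed_pairs():
    """Return every (entry point, GP) pair a racing PLT entry can load."""
    steps = ["r_entry", "r_gp", "w_gp", "w_entry"]
    seen = set()
    for order in permutations(steps):
        # Keep only interleavings that respect each side's program order.
        if order.index("r_entry") > order.index("r_gp"):
            continue
        if order.index("w_gp") > order.index("w_entry"):
            continue
        desc = {"entry": "resolver", "gp": "loader_gp"}  # first-stage state
        got = {}
        for step in order:
            if step == "w_gp":
                desc["gp"] = "callee_gp"
            elif step == "w_entry":
                desc["entry"] = "callee"
            elif step == "r_entry":
                got["entry"] = desc["entry"]
            else:  # r_gp
                got["gp"] = desc["gp"]
        seen.add((got["entry"], got["gp"]))
    return seen


pairs = observed_pairs()
print(sorted(pairs))
```

The callee is never paired with the loader's GP, but the resolver can be
reached with either GP value, which is why it must not rely on the GP it
receives.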

> > In the first stage, which is done by the dynamic loader at the time a
> > module is loaded, R_RISCV_JUMP_SLOT and R_RISCV_GP relocations are
> > resolved respectively to the address of the lazy resolver and the value
> > of the global pointer associated with the module providing the lazy
> > resolver.
>
> > In the second stage, which is done when the lazy resolver is reached by
> > means of making a call through an FDT entry referring to it,
> > R_RISCV_JUMP_SLOT and R_RISCV_GP relocations are resolved respectively
> > to the address of the function symbol associated with the FDT entry and
> > the value of the global pointer associated with the module providing the
> > function symbol. To be able to do its work the lazy resolver is called
> > with certain registers containing values as follows:
> >
> > * x3 (gp) holds the dynamic loader's GP value as with an ordinary FDT
> > entry (this is a consequence of the first stage of run-time relocation)
>
> The dynamic loader needs to be able to tolerate _any_ valid gp value. This
> could be achieved by reserving a few words near gp and having the dynamic
> loader store a pointer to its own state at a known offset from every GOT.

Discussed above.

> > * x5 (t0) holds a pointer to the FDT entry to relocate
> >
> > * x6 (t1) holds the caller's GP value
>
> I don't think this is actually needed - the SH and ARM FDPIC ABIs
> unconditionally clobber the caller's GP. Given a pointer to a function
> descriptor copy (which is within one of the caller's data segments) the dynamic
> linker can easily find the caller by walking a list of loaded libraries.

It simplifies the lazy resolver's processing at the cost of one instruction,
which is however executed every time a call via the PLT is made, even once
the symbol has been resolved. Perhaps it's not worth it.
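
For comparison, a minimal sketch of the lookup the lazy resolver would need if t1 were dropped: given the FDT entry pointer from t0, walk the list of loaded modules and pick the one whose data segment contains it. The `struct module` fields and the function name are assumptions for this example and do not correspond to any existing loader.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-module record kept by the dynamic loader. */
struct module {
    uintptr_t data_start;   /* first byte of the data segment */
    uintptr_t data_end;     /* one past the last byte */
    uintptr_t gp;           /* global pointer value for this module */
    struct module *next;
};

/* Return the module whose data segment contains addr, or NULL if none. */
static const struct module *module_containing(const struct module *head,
                                              uintptr_t addr)
{
    for (const struct module *m = head; m != NULL; m = m->next) {
        assert(m->data_start <= m->data_end);
        if (addr >= m->data_start && addr < m->data_end)
            return m;
    }
    return NULL;
}
```

The walk is linear in the number of loaded modules, but it runs only once per lazily resolved symbol, whereas the extra PLT instruction runs on every call.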

> > Registers have been assigned such as to work with the RV32E instruction
> > set as well.
> >
> > Upon completion of the second stage the lazy resolver makes a jump to the
> > newly resolved address of the function symbol.
>
> > 5.3 Example PLT Code (informative)
> >
> > @PLT:
> > l[w|d] t2, 0(t0)
> > mv t1, gp
>
> We don't need to save t1 here; we could save 2 bytes per PLT entry by moving
> the adds into this function.

Except it would surely break cache line alignment for every other entry,
causing an execution penalty. Anyway, the structure of the PLT is private
to the containing module and can therefore be left to the implementation.
This is an example for illustration only.

Again, thank you for your input. If you have any further comments or
questions, then I'll be happy to address them. Otherwise I will factor in
what has been observed here.

Maciej

Drew Fustini

Jun 28, 2020, 11:46:43 AM
to Maciej W. Rozycki, Stef O'Rear, RISC-V SW Dev, Damien Le Moal, i...@maskray.me, dal...@aerifal.cx
Hello, I'm trying to determine the current status of the FDPIC / NOMMU
solution.

I would like to mention it when talking about Linux on RISC-V during the
Embedded Linux Conference tomorrow. [0] My understanding is that this is
needed for any practical usage of Linux userspace on the limited SRAM
in the Kendryte K210.

Thanks,
Drew

[0] https://ossna2020.sched.com/event/c3Yq

Maciej W. Rozycki

Jun 28, 2020, 4:47:17 PM
to Drew Fustini, Stef O'Rear, RISC-V SW Dev, Damien Le Moal, i...@maskray.me, dal...@aerifal.cx
Hi Drew,

> Hello, I'm trying to determine the current status of the FDPIC / NOMMU
> solution.
>
> I would like to mention it when talking about Linux on RISC-V during the
> Embedded Linux Conference tomorrow. [0] My understanding is that this is
> needed for any practical usage of Linux userspace on the limited SRAM
> in the Kendryte K210.

This is still at an early stage owing to distractions and the overall
COVID-19 impact. I'm working on it though and aim for externally visible
progress in several weeks' time. Sorry if this disappoints you, and thanks
for your interest.

Maciej

Khem Raj

Jun 30, 2020, 5:30:45 PM
to Maciej W. Rozycki, Drew Fustini, Stef O'Rear, RISC-V SW Dev, Damien Le Moal, i...@maskray.me, dal...@aerifal.cx