Stef --
> > This design has been originally presented at LCA 2020 and a recording is
> > available here: <https://www.youtube.com/watch?v=GydyykyNjxs>.
> >
> > I will appreciate your questions, comments and any other kind of
> > feedback.
>
> I've discussed this proposal with Rich Felker and Fangrui Song (CCed)
> in #musl; the following comments are exclusively mine.
Thank you for your input and your involvement with this effort.
> The register usage is exactly what I had in mind, and most of the code
> sequences seem approximately fine (several are not), but the relocation
> structure is extremely different from the other existing FDPIC ABIs[1][2][3],
> in a way which will make it difficult to support in generic code such as musl;
> I believe the ABI should be made as consistent as possible to avoid surprises
> like what we went through with TLS copy relocs.
I have deliberately avoided going through any other architecture's psABI,
on the grounds that while I can do so at any time after the initial design
proposal, doing it right at the beginning would put me at risk of becoming
negatively primed with respect to ways to solve the problem.
If interested, please watch: <https://www.youtube.com/watch?v=Yv4tI6939q0>
to see why negative priming can lead one to make bad decisions.
Of course, like everyone, I might make a bad design decision from time to
time as well, and the purpose of a peer review is to catch those early so
as to avoid any damage they might otherwise create, which could be
difficult to repair. This is one reason for my posting this proposal.
The ultimate goal is to design the psABI the best way possible given the
properties of the architecture, in particular exploiting any competitive
advantage it may have over other architectures. Therefore any choices
made for other architectures ought not to influence it unless they are
beneficial or at least neutral.
> [1]: http://ftp.redhat.com/pub/redhat/gnupro/FRV/FDPIC-ABI.txt
> [2]: https://j-core.org/downloads/fdpic-sh.txt
> [3]: https://github.com/mickael-guene/fdpic_doc/blob/master/abi.txt
Thank you for the references.
> > Table 4.1 Relocation Operands
> >
> > Operand | Description
> > =========+================================================================
> > A | Relocation addend.
> > ---------+----------------------------------------------------------------
> > DBA | Data segment's base address; 0 in static link.
> > ---------+----------------------------------------------------------------
> > G | The offset from GP of a GOT entry for the symbol referred by
> > | the relocation.
> > ---------+----------------------------------------------------------------
> > GP | The value of GP associated with the symbol referred, nominally
> > | (DVMA + DBA + 2048).
> > ---------+----------------------------------------------------------------
> > P | The place (offset or address) of the storage unit affected by
> > | the relocation.
> > ---------+----------------------------------------------------------------
> > PLTE | The address of a PLT entry associated with the symbol referred.
> > ---------+----------------------------------------------------------------
> > PLTI | The address of a PLT entry designated to make indirect calls.
> > ---------+----------------------------------------------------------------
> > S | The value of the symbol referred by the relocation.
> > ---------+----------------------------------------------------------------
> > TBA | Text segment's base address; 0 in static link.
> >
> > Table 4.2 Relocation Types
> >
> > Name | Value | Field | Symbol | Calculation
> > ==========================+=======+=============+===========+=============
> > R_RISCV_RELATIVE | 3 | T-word32,64 | n/a | TBA + A
> > R_RISCV_REL_TEXT (alias) | | | |
> > --------------------------+-------+-------------+-----------+-------------
> > R_RISCV_GP | 12 | T-word32,64 | any | GP
> > --------------------------+-------+-------------+-----------+-------------
> > R_RISCV_REL_DATA | 13 | T-word32,64 | n/a | DBA + A
>
> None of the SH, FRV, or ARM FDPIC ABIs define anything equivalent to REL_DATA
> or GP. Why is it there?
The REL_DATA relocation is needed for static data references to local
symbols. Those symbols are necessarily not present in the dynamic symbol
table, yet the data references have to be relocated by the data segment's
base address at load time, because the final load-time address of the
respective symbols is not known at the static link time.
Separate REL_DATA and REL_TEXT relocations are required rather than a
single RELATIVE relocation, because unlike with the regular ABI,
which only has a single base address defined, we have a separate data
segment base address and text segment base address for every program.
The GP relocation resolves, as per its definition, to the value of the
global pointer associated with the function called. It is required
because the callee can no longer determine the value of the GP from the PC
(if required), as there is no fixed offset between the two like in the
regular ABI.
> "Data segment base address" does not seem to be defined anywhere?
Now corrected. As per the ELF gABI the base address is the difference
between the load address and the corresponding virtual memory address
(`p_vaddr') of the segment loaded lowest in memory. Since in the FDPIC
ABI we necessarily treat text and data segments as separate areas in
memory, each has its own base address: a text segment base address and a
data segment base address respectively.
I feel it is sort of obvious to anyone familiar with the ELF gABI, but
you are right in that in a formal document even seemingly obvious terms
are best explicitly defined for the avoidance of doubt.
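To make the definition concrete, here is a minimal sketch in C of how a
loader might compute the two base addresses and apply the REL_TEXT and
REL_DATA calculations from Table 4.2. The structure and function names are
hypothetical and used for illustration only; they are not part of the
proposal.

```c
#include <stdint.h>

/* Hypothetical per-module record holding the two base addresses. */
struct fdpic_bases {
    uintptr_t tba; /* text segment base address (TBA) */
    uintptr_t dba; /* data segment base address (DBA) */
};

/* Base address: actual load address minus the link-time p_vaddr. */
static uintptr_t base_address(uintptr_t load_addr, uintptr_t p_vaddr)
{
    return load_addr - p_vaddr;
}

/* R_RISCV_REL_TEXT (alias of R_RISCV_RELATIVE): TBA + A. */
static uintptr_t reloc_rel_text(const struct fdpic_bases *b, uintptr_t addend)
{
    return b->tba + addend;
}

/* R_RISCV_REL_DATA: DBA + A. */
static uintptr_t reloc_rel_data(const struct fdpic_bases *b, uintptr_t addend)
{
    return b->dba + addend;
}
```

With text linked at 0x10000 and loaded at 0x40010000, and data linked at
0x20000 and loaded at 0x80025000, the two bases differ, which is why a
single RELATIVE calculation cannot serve both segments.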
> > ==========================+=======+=============+===========+=============
> > | | | local | S - P
> > R_RISCV_CALL_PLT | 19 | V-hi20lo12i | external | PLTE - P
> > | | | n/a | PLTI - P
>
> None of the SH, FRV, or ARM ABIs use anything like PLTI.
Ack.
> > --------------------------+-------+-------------+-----------+-------------
> > R_RISCV_GPREL_HI20 | 59 | V-hi20 | local | S - GP + A
> > --------------------------+-------+-------------+-----------+-------------
> > R_RISCV_GPREL_LO12_I | 60 | T-lo12i | local | S - GP + A
> > --------------------------+-------+-------------+-----------+-------------
> > R_RISCV_GPREL_LO12_S | 61 | T-lo12s | local | S - GP + A
> > --------------------------+-------+-------------+-----------+-------------
> > R_RISCV_GPREL_GOT_HI20 | 62 | V-hi20 | any | G
> > --------------------------+-------+-------------+-----------+-------------
> > R_RISCV_GPREL_GOT_LO12_I | 63 | T-lo12i | any | G
>
> The GPREL and GPREL_GOT relocations look correct. We also need assembler
> syntax for them, and to decide whether they are %functions or @MODIFIERS.
Given the established practice with RISC-V assembly syntax and also the
discussion elsewhere in this thread about composed relocations I think
using percent-ops is the way forward as they give more flexibility (in
particular you can use parentheses around expressions to indicate the
addend to include with the relocation involved).
But none of this is a part of the psABI, which (as the name implies) only
discusses the binary interface. Any source-level syntax belongs to the
respective language involved, including the assembly language.
> We also need R_RISCV_FUNCDESC (canonical function descriptor),
> R_RISCV_FUNCDESC_VALUE (copy of function descriptor),
> R_RISCV_GPREL_GOTFUNCDESC_(HI20, LO12_I) (offset within GOT of a pointer-sized
> slot which will receive a pointer to the canonical function descriptor),
> R_RISCV_GPREL_FUNCDESC_(HI20, LO12).
Given the analysis of the problem so far it does not appear to me that
these relocations are strictly required, as you can infer the access type
(data vs code, the latter implying a function descriptor) of a GP-relative
reference from the referred symbol's type (STT_OBJECT vs STT_FUNC).
After all the static linker always has to create a function descriptor
whenever an address of a function is taken or a call made to a preemptible
symbol (which goes through the PLT), so it is not the access (relocation)
type that determines it.
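As an illustration of the inference I have in mind, a rough sketch in C
follows. The enumerators and function names are hypothetical; only the
numeric symbol type values and the position of the type within `st_info'
come from the ELF gABI.

```c
#include <stdint.h>

/* Symbol type values as defined by the ELF gABI (STT_OBJECT, STT_FUNC). */
enum { SYM_STT_OBJECT = 1, SYM_STT_FUNC = 2 };

/* What a GOT entry for the symbol would hold (hypothetical names). */
enum got_kind { GOT_DATA_ADDRESS, GOT_FUNCTION_DESCRIPTOR };

/* The low four bits of st_info hold the symbol type. */
static unsigned sym_type(uint8_t st_info)
{
    return st_info & 0xf;
}

/* Choose the kind of entry from the symbol type, not the relocation type. */
static enum got_kind got_entry_kind(uint8_t st_info)
{
    return sym_type(st_info) == SYM_STT_FUNC ? GOT_FUNCTION_DESCRIPTOR
                                             : GOT_DATA_ADDRESS;
}
```

The same R_RISCV_GPREL_GOT_HI20 relocation would then yield a pointer to a
function descriptor for a function symbol and a plain data address
otherwise.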
Also I find it cleaner when the compiler has to know less about linkage
peculiarities, but that might be seen as a matter of style.
That noted I guess it would not be a big deal if we had such separate
relocations, although the redundancy introduced this way would imply
consistency checks and the rejection by the static linker of invalid
relocation vs symbol combinations. Or was that consistency check the
motivation for the design you refer to?
Also if we were to adopt these separate relocations, which obviously
multiply relocation kinds that follow a similar pattern, based on the
observations made in the discussion elsewhere in this thread I would be
leaning towards using composed relocations rather than individual
relocation types, disentangling the relocation calculation from the layout
of the field to relocate.
This way we'd only have one extra R_RISCV_FUNCDESC relocation type for
function descriptor references rather than five or six individual ones.
That single relocation could be composed by an implementation as required
to represent the link-time operation (expression) requested without the
need to expand the psABI whenever a new combined expression is required,
and the model would overall be cleaner in my opinion.
Same with the R_RISCV_GPREL relocations I already proposed (we may have
to figure out the namespace to use to avoid a semantics clash with the
regular RISC-V psABI as defined already); I'll look into it.
> R_RISCV_FUNCDESC and R_RISCV_FUNCDESC_VALUE are dynamic relocations.
The former relocation would presumably be used instead of R_RISCV_64 or
R_RISCV_32 for preemptible function references from static data?
Likewise the dynamic loader could resolve that based on the referred
symbol's type, so the same observation as I made above applies.
I'm not sure what the use scenario for the latter relocation would be,
please elaborate.
> > Occasionally a GOT entry will be created for local data to satisfy the
> > use of R_RISCV_GPREL_GOT_HI20 and R_RISCV_GPREL_GOT_LO12_I relocations in
> > code referring to such data. The R_RISCV_REL_DATA dynamic relocation is
> > defined to support GP-relative relocation of such GOT entries at program
> > load time.
>
> Why do you need REL_DATA when ARM, FRV, and SH don't?
What relocation do you use for local GOT entries referring to data rather
than text? Do you always produce a function descriptor for function calls
made to a local symbol? That would be a waste of memory and cycles for
quite a common scenario: shared libraries often use restrictive ELF export
classes or a linker script to avoid exporting symbols meant not to be a
part of the API; also symbols in the main executable are typically not
exported to shared libraries. All these symbols can be called with a
direct PC-relative reference (no PLT involved) as with the regular RISC-V
psABI.
> > 4.3 Procedure Calls (normative)
> >
> > Local procedure calls use the same code sequence as with ordinary PIC
> > code. PC-relative addressing can be used as all code locations are fixed
> > with respect to each other and the address is not interpreted beyond
> > making the jump itself. GP does not change in the process of making a
> > local procedure call as control remains in the same module.
>
> Should clarify that while GP does not change as part of the call instruction
> itself, the called procedure is allowed to clobber GP (this is necessary for
> external tail calls).
That is an interesting point, thanks. I don't have numbers to hand, but
intuitively the saving from allowing tail calls to be made will be higher
than that from relaxing away GP restoration (and possibly also the
arrangement of a save slot for it).
Therefore I have provisionally updated the specification; however, I think
this will have to be evaluated in an implementation before it is finally
decided.
> > A data structure called Function Descriptor Table (FDT) is created by the
> > static linker to hold PC/GP pairs used in external procedure calls.
> > Addresses of individual FDT entries serve as pointers to the respective
> > procedures. An FDT entry is therefore created for each function symbol
> > that is external, whether defined or not, or whose address is taken for
> > a purpose other than making a call.
>
> Canonical function descriptors are created by the *dynamic* linker, not ld,
> and they exist outside of any load segment (except possibly when static
> linking). Every function which is referred to gets a single canonical
> function descriptor. Other FDPIC ABIs don't use the "FDT" term and I
> think it detracts from clarity to use it here.
There's no need I believe to use an assertion in a discussion about
something that hasn't been finalised yet.
Your proposal to build what you call canonical function descriptors on
demand in the dynamic loader rather than precreating them in the static
linker sounds interesting to me, as it seems to solve some issues in my
design, although at the price of some heap consumption and processing
complication in the dynamic loader.
It's not clear to me where the term "canonical" comes from though, as
those will only be occasionally created, as most functions do not have
their address taken for purposes other than making a call; and to qualify
for dynamic creation of a function descriptor they need to be external
too.
Note however that building function descriptors in the dynamic loader has
an issue with protected function symbols, which need to resolve locally
within the defining module even in the presence of an earlier external
definition, and yet satisfy pointer equality requirements. There may be
multiple protected function symbols of the same name involved in a given
dynamic load, plus optionally one non-protected external symbol of that
name. This has to be handled correctly.
> R_arch_FUNCDESC_VALUE can create a copy of a function descriptor at any
> two-word aligned address in the load segment, but there is no "descriptor
> table" as a cohesive entity.
Surely one is needed to handle PLT calls effectively, like the PLTGOT
is used with the regular ABI.
I think it makes sense to put function descriptors of non-preemptible
functions whose address is taken for a purpose other than making a call
here as well; those will typically have no dynamic symbol associated at
all (except for protected symbols), and therefore there is no way even to
have them arranged by the dynamic loader (to say nothing of any point).
> > As the ultimate values of the PC and the GP are only determined at load
> > time the static linker attaches dynamic relocations to data in the FDT.
> > For external function symbols the R_RISCV_JUMP_SLOT and R_RISCV_GP
> > relocations are used for the PC and GP respectively, both referring to
> > the function symbol. For local function symbols whose address is taken
> > the R_RISCV_REL_TEXT and R_RISCV_GP relocations are used with no symbol
> > referred.
>
> Every other FDPIC ABI uses a R_ARCH_FUNCDESC_VALUE relocation to fill in both
> words of a function descriptor copy at once.
Well, it makes it more difficult for the dynamic loader to tell entries
apart that correspond to functions whose address has been taken for a
purpose other than making a call and those that can be lazily bound.
Consequently, depending on the order of dynamic relocations in the
relocation table, it may happen that the lazy resolver is called for calls
to function symbols that have already been eagerly resolved. It also
actually precludes the static linker from arranging some references to
never be lazily bound if required for whatever reason, as there is no
relocation defined to express that requirement.
Otherwise that seems largely a matter of style to me: relocations with
the STN_UNDEF symbol index correspond to R_RISCV_REL_TEXT and the
remaining ones correspond to R_RISCV_JUMP_SLOT, with the GP relocation of
the following address word implied.
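To show the correspondence, here is a minimal sketch in C of how a dynamic
loader might fill in a descriptor from a single combined relocation. All
names are hypothetical, symbol lookup is assumed to have happened
elsewhere, and lazy binding is left out entirely.

```c
#include <stdint.h>

#define SKETCH_STN_UNDEF 0u /* same value as the gABI's STN_UNDEF */

/* A PC/GP pair as held in the function descriptor table. */
struct func_desc {
    uintptr_t pc;
    uintptr_t gp;
};

/*
 * sym_index == STN_UNDEF: local function, treated like R_RISCV_REL_TEXT
 * for the PC word with the module's own GP in the following word.
 * Otherwise: treated like R_RISCV_JUMP_SLOT plus R_RISCV_GP, using the
 * PC/GP pair already resolved for the symbol.
 */
static void fill_descriptor(struct func_desc *fd, uint32_t sym_index,
                            uintptr_t addend, uintptr_t tba,
                            uintptr_t own_gp,
                            const struct func_desc *resolved)
{
    if (sym_index == SKETCH_STN_UNDEF) {
        fd->pc = tba + addend;
        fd->gp = own_gp;
    } else {
        *fd = *resolved;
    }
}
```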
> > Figure 4.2 Function Description Table
> >
> > FDT Outstanding dynamic relocations
> > __riscv_fdt_func1 ---> +------------------+
> > | Text Pointer 1 | R_RISCV_JUMP_SLOT func1
> > +------------------+
> > | Global Pointer 1 | R_RISCV_GP func1
> > __riscv_fdt_func2 ---> +==================+
> > | Text Pointer 2 | R_RISCV_JUMP_SLOT func2
> > +------------------+
> > | Global Pointer 2 | R_RISCV_GP func2
> > __riscv_fdt_func3 ---> +==================+
> > | Text Pointer 3 | R_RISCV_REL_TEXT
> > +------------------+
> > | Global Pointer 3 | R_RISCV_GP
> > +==================+
> > | . . . |
>
> again, this is gratuitously different from what every other arch does.
>
> Other arches use 1 relocation per function descriptor copy, and they don't
> create duplicate symbols.
See above for the discussion on using individual relocations. You have
not raised any concern about the increase of memory consumption caused by
using individual relocations for addresses held in the function descriptor
table, but if that was your intent, then I agree that it would be a valid
concern, and it could be addressed by defining R_RISCV_FUNCDESC_JUMP_SLOT,
R_RISCV_FUNCDESC_GLOBAL and R_RISCV_FUNCDESC_RELATIVE relocations instead.
Also we need to provide symbols for function descriptors created for
protected symbols so that other modules in a dynamic load can refer to
them when taking such a function's address for a purpose other than making
a call. I agree that in your proposed model where function descriptors
for external symbols that are not protected whose address is taken for a
purpose other than making a call are made by the dynamic loader the extra
symbols can go.
Being different from solutions chosen for other architectures does not
automatically make a solution wrong, so this is a weak argument.
> > A Procedure Linkage Table (PLT) is created to handle calls via the FDT,
> > so that the same code sequence is used in the program proper to make
> > direct procedure calls regardless of whether the function symbol called
> > is local or external. Since the PLT is local to the module its entries
> > can be reached with PC-relative addressing. Individual PLT entries are
> > created and called into for each external procedure called.
> >
> > For direct calls an FDT entry is used that corresponds to the procedure
> > called and has been created in the module making the call. Therefore
> > code in the PLT can access the FDT entry directly as local data, using
> > GP-relative addressing.
>
> Again, "FDT" is misleading about how function descriptors are created.
It just matches reality. It's not that function descriptors are going to
be randomly scattered across the data segment, it's natural for the static
linker to group them into a table like GOT entries.
> > For indirect calls the PLT is also used and an FDT entry is used that
> > corresponds to the procedure called and has been created in the module
> > providing the function symbol of the procedure.
>
> This seems a bad idea and gratuitously different from every other FDPIC ABI.
> Other FDPIC ABIs use code at the call site for indirect calls. If you are
> doing this for code size reasons, a compiler generated function in a
> .gnu.linkonce section is a much better idea because it does not create an ABI
> constraint.
Since we need code to load the GP/PC pair in the PLT anyway I found it
attractive to reuse it. Do you have any counter-arguments besides the fact
that nobody else has decided to do so? It seems a weak argument to me, and
there's nothing in the psABI document that forbids a code generator from
expanding the sequence inline if speed is preferred to space, which is what I
had in mind when developing this part; I can clarify that in the document.
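For reference, what the shared sequence has to do, whether placed in the
PLT or expanded inline, can be modelled in C roughly as below. This is only
a model: `model_gp' stands in for the GP register, the local variable
stands in for the caller's GP save slot, and all names are hypothetical.

```c
#include <stdint.h>

/* Model of a function descriptor: entry point plus the callee's GP. */
struct model_func_desc {
    void (*pc)(void);
    uintptr_t gp;
};

static uintptr_t model_gp;          /* stand-in for the GP register */
static uintptr_t gp_seen_by_callee; /* recorded by the sample callee */

/* Sample callee that records the GP value it was entered with. */
static void sample_callee(void)
{
    gp_seen_by_callee = model_gp;
}

/* Load the GP/PC pair, make the call, then restore the caller's GP. */
static void call_via_descriptor(const struct model_func_desc *fd)
{
    uintptr_t gp_slot = model_gp; /* caller's GP save slot */

    model_gp = fd->gp;
    fd->pc();
    model_gp = gp_slot;
}
```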
> > If a function symbol is external, then an external dynamic data symbol is
> > created that refers to that FDT entry and whose name is constructed by
> > prepending `__riscv_fdt_' to the function's symbol name.
>
> This is gratuitously different from other FDPIC ABIs, which use *FUNCDESC*
> relocations to generate function descriptors.
As I noted above function descriptors for protected symbols whose address
is taken by another module for a purpose other than making a call cannot
be constructed like you propose or pointer equality would not be
guaranteed.
> It is also very inefficient since it doubles the number of symbols and symbol
> names in a library.
A shared library normally exports a limited number of symbols as its API,
but you are right that this is inefficient if an alternative exists. I
think we still need to do this for protected symbols, so I will update the
document accordingly.
> > If the address of an external function symbol is taken, then a GOT entry
> > is created for the corresponding `__riscv_fdt_' dynamic data symbol and
> > used to satisfy the reference.
>
> The compiler should generate an @GOTFUNCDESC reference and the linker should
> generate a R_RISCV_FUNCDESC relocation, not create a new symbol.
As I noted above, it's a matter of the convention whether we want to have
distinct relocation types or examine the referred symbol's type. Overall
I find it cleaner when the compiler knows less about linkage peculiarities.
> > When making an indirect call a dedicated PLT entry is used that is common
> > to all indirect calls and upon invocation of that PLT entry the x5 (t0)
> > register holds the address of the FDT entry in the module providing the
> > function symbol of the procedure to call.
>
> No other FDPIC ABI does this.
Ack.
> > 4.4 Typical Code Sequences (informative)
> >
> > In the sequences below expressions on the right-hand side of relocation
> > names denote the symbol and the addend specified with the relocation. In
> > the absence of a `+' operator only a symbol is specified, otherwise the
> > left-hand side of the addition is a symbol and the right-hand side is an
> > addend. If a symbol is specified as `*ABS*', then the value is 0 (the
> > symbol index is STN_UNDEF in the relocation). The value of ABS() is the
> > absolute (static-link-time) value of the expression in the parentheses.
> >
> > 4.4.1 Local Data Addressing
> >
> > Ordinary PIC code, using PC-relative addressing:
> >
> > # Outstanding static relocations
> > label:
> > auipc t0, %pcrel_hi(var+addend) # R_RISCV_PCREL_HI20 var+addend
> > lbu t1, %pcrel_lo(label)(t0) # R_RISCV_PCREL_LO12_I label
> > sb t2, %pcrel_lo(label)(t0) # R_RISCV_PCREL_LO12_S label
> >
> > Corresponding FDPIC code, using GP-relative addressing:
> >
> > # Outstanding static relocations
> > lui t0, %gprel_hi(var+addend) # R_RISCV_GPREL_HI20 var+addend
> > c.add t0, gp
> > lbu t1, %gprel_lo(var+addend)(t0) # R_RISCV_GPREL_LO12_I var+addend
> > sb t2, %gprel_lo(var+addend)(t0) # R_RISCV_GPREL_LO12_S var+addend
>
> This is good, subject to Jim's point about add vs c.add
This part is informative and there's technically nothing wrong with the
sequence quoted as it will produce the correct result at run time, however
to avoid potential confusion I have already edited this code according to
Jim's suggestion.
As discussed elsewhere this sequence does not work for read-only data
sections merged with the text segment. Offhand I think this can only be
solved with linker relaxation as it may not be known up until the static
link time that a symbol referred will be placed there; fortunately the
presence of AUIPC guarantees that the corresponding code sequence will be
shorter, so even trivial processing in the static linker with NOP padding
will do.
Alternatively a simple static linker implementation can choose to merge
read-only data sections with the data segment; there's no MMU anyway to
enforce write protection for the corresponding memory area in systems
typically targeted by FDPIC code, although run-time memory consumption
will rise once the data segment is copied for multiple processes.
> > 4.4.2 External Data Addressing
> >
> > Ordinary PIC code, using GOT and PC-relative addressing:
> >
> > # Outstanding static relocations
> > label:
> > auipc t0, %pcrel_got_hi(var) # R_RISCV_GOT_HI20 var
> > l[w|d] t0, %pcrel_lo(label)(t0) # R_RISCV_PCREL_LO12_I label
> > lb t1, addend(t0)
> > sb t2, addend(t0)
>
> > # Outstanding dynamic relocations for the GOT entry
> > # R_RISCV_32,64 var
>
> So far so good
>
> > # or if the data symbol turns out local at static link time
> > # R_RISCV_REL_DATA *ABS*+ABS(var)
>
> I don't think this actually works, for one thing var might be in rodata, there
> could also be multiple data segments. I don't see anything like REL_DATA in
> other FDPIC ABIs, I think it always has to be R_RISCV_{32,64}, or whatever the
> other arches do.
There is no issue with read-only data merged with the text segment here
(which is what I gather you refer to) as we can still use a GOT entry for
a local access to data that is not addressable in a GP-relative manner.
I guess you meant to refer to code I proposed above in section 4.4.1.
We cannot handle a run-time scenario where a single module has pieces of
its text or data segment scattered across multiple memory areas located at
arbitrary positions with respect to each other, because we have only one
PC and one GP. Of course a single text or data segment can each be
represented by multiple ELF file segments, which can map to memory in a
discontiguous manner. I see no practical reason to do so and holes in the
resulting memory allocations may make it difficult to use available memory
effectively, but yes, technically it is doable and is going to be
supported with the model I propose.
Finally you raise an interesting point with respect to the nomenclature
of relocations that I haven't considered before. Technically there is no
need for any architecture to have dedicated R_*_RELATIVE relocations with
their regular psABI, as the corresponding R_*_{32,64} relocations provide
the same semantics where the index of the symbol referred is STN_UNDEF.
Therefore I have no idea why separate R_*_RELATIVE relocations have been
invented with the same semantics (or for that matter why there are
separate R_*_32 and R_*_64 relocations where the relocation used has to
match the ELF file's address width, but no separate R_*_RELATIVE_32 and
R_*_RELATIVE_64 relocations).
So technically you are right we can use R_RISCV_RELATIVE for PC-relative
dynamic relocations and R_RISCV_{32,64} relocations for GP-relative
dynamic relocations. Or vice versa. Either way we need to be explicit
which one is which though, to avoid unnecessary confusion for humans, and
also I dislike the asymmetry where we have one relocation for one purpose
regardless of the ELF file's address width and a pair of relocations,
chosen individually according to the ELF file's address width, for the
other. I'd prefer to have a new single code for the other case.
> > Corresponding FDPIC code, using GOT and GP-relative addressing:
> >
> > # Outstanding static relocations
> > lui t0, %gprel_got_hi(var) # R_RISCV_GPREL_GOT_HI20 var
> > c.add t0, gp
> > l[w|d] t0, %gprel_got_lo(var)(t0) # R_RISCV_GPREL_GOT_LO12_I var
> > lbu t1, addend(t0)
> > sb t2, addend(t0)
> >
> > # Outstanding dynamic relocations for the GOT entry
> > # R_RISCV_32,64 var
> >
> > # or if the function turns out local at static link time
> > # R_RISCV_REL_DATA *ABS*+ABS(var)
>
> Code looks good, same concern about REL_DATA.
Discussed above.
> > 4.4.3 Taking a Function's Address
> >
> > FDPIC code, local function:
> >
> > # Outstanding static relocations
> > lui t0, %gprel_hi(fun) # R_RISCV_GPREL_HI20 fun
> > c.add t0, gp
> > addi t1, t0, %gprel_lo(fun) # R_RISCV_GPREL_LO12_I fun
> >
> > FDPIC code, external function:
> >
> > # Outstanding static relocations
> > lui t0, %gprel_got_hi(fun) # R_RISCV_GPREL_GOT_HI20 fun
> > c.add t0, gp
> > addi t1, t0, %gprel_got_lo(fun) # R_RISCV_GPREL_GOT_LO12_I fun
>
> These are, unfortunately, not compatible with dynamic linking semantics. A
> function needs to have the same address regardless of which module its address
> is taken in, so you have to always get the canonical function descriptor, which
> has to come from the GOT because canonical function descriptors are created by
> the dynamic linker.
Yes, this was a silly editorial mistake, sorry about that. The sequence
for an external reference has to read the relevant GOT entry rather than
taking its address. The local sequence is of course fine, the offset from
GP is constant at link time. These sequences were meant to be:
FDPIC code, local function:
# Outstanding static relocations
lui t0, %gprel_hi(fun) # R_RISCV_GPREL_HI20 fun
c.add t0, gp
addi t1, t0, %gprel_lo(fun) # R_RISCV_GPREL_LO12_I fun
FDPIC code, external function:
# Outstanding static relocations
lui t0, %gprel_got_hi(fun) # R_RISCV_GPREL_GOT_HI20 fun
c.add t0, gp
l[w|d] t1, %gprel_got_lo(fun)(t0) # R_RISCV_GPREL_GOT_LO12_I fun
-- as presented at LCA (with C.ADD then updated to ADD as per Jim's
suggestion). Thank you for your meticulousness.
> This should be something like (same for both local and
> external):
>
> lui t0, %gprel_got_hi(fun@FUNCDESC) # R_RISCV_GPREL_GOTFUNCDESC_HI20 fun
> add t0, t0, gp
> l[w|d] t0, %gprel_got_lo(fun@FUNCDESC)(t0) # R_RISCV_GPREL_GOTFUNCDESC_LO12 fun
>
> eventually resulting in dynamic relocations for the GOT entry:
>
> R_RISCV_FUNCDESC fun
Discussed above already.
> > FDPIC code, indirect call (to a2):
> >
> > # Outstanding static relocations
> > c.mv t0, a2
> > label:
> > auipc ra, %pcrel_call_hi(@PLT) # R_RISCV_CALL_PLT
> > jalr ra, ra, %pcrel_call_lo(label)
> > l[w|d] gp, <gp_slot>(sp)
> >
> > # The R_RISCV_CALL_PLT relocation with no symbol referred resolves to
> > # the PLT entry associated with indirect calls.
>
> As above I don't think it makes sense to handle this as a PLT entry. The call
> should be generated inline:
>
> lw t1, 0(a2)
> lw gp, 4(a2)
> jalr ra, t1
> lw gp, <gp_slot>(sp)
Discussed above already.
> > Chapter 5 Program Loading
> >
> > 5.1 Base Addresses (normative)
> >
> > A single individual base address is defined by the ELF gABI for a module
> > being loaded that determines the amount to relocate the module by. This
> > is unsuitable for FDPIC modules, which need to have their text segments
> > and data segments mapped in memory separately. This is so that where a
> > module is mapped multiple times in a no-MMU system, only a single copy of
> > its text segments is present in memory and serves all the mappings, while
> > a separate copy of its data segments is present in memory for each of the
> > mappings. Consequently the distance between text and data segments is no
> > longer constant between mappings and there is no single base address.
> >
> > Instead a separate text base address and a data base address is defined
> > as a difference between the load address and the link address of the text
> > segment and the data segment respectively. These two base addresses are
> > used by the dynamic loader to relocate text and data respectively.
>
> FDPIC does not have a "data base address"; there are one or more load segments,
> relocated independently using a load map.
Regrettably we cannot support multiple global pointers to address each of
the load segments independently, so even if the presence of multiple ELF
segments makes the data segment (or indeed the text segment) discontiguous
the relative position of the individual pieces of the data (text) segment
with respect to one another has to remain constant, and therefore together
they all form a single sparse logical segment.
> > In the initial module, such as a program interpreter, loaded by an OS or
> > other executive runtime the text base address of said initial module can
> > be determined by calculating a run-time difference between the actual
> > value of the PC for a given location, such as the beginning of the text
> > segment, obtained with a PC-relative reference to a symbol associated
> > with that location and the value of a corresponding absolute symbol
> > associated with the same location. The way to determine the data base
> > address and therefore the value of GP of the initial module is specific
> > to the individual OS or other executive runtime and therefore beyond the
> > scope of this specification. Possibilities include passing suitable
>
> Every other FDPIC ABI has a normative Start up section that specifies how
> Linux will pass an elf32_fdpic_loadmap struct; it's in scope here.
I think OS-specific startup is beyond the scope of this specification, as
the specification itself is not OS-specific. For example FreeBSD or some
bare-metal RTOS may do
this differently, e.g. use the auxiliary vector, preset the GP to the load
address of the lowest-mapped writable ELF file segment, define a syscall,
or whatever. The OS-specific runtime is meant to set up the GP somehow to
match this specification's requirements.
> Note that the Linux FDPIC support currently has 32-bit assumptions and
> 64-bit FDPIC will need to be documented here, much as the FRV ABI
> supplement defined 32-bit FDPIC ptrace calls.
I advise discussing such details with each interested OS's developers at
the relevant forum. Actually the 64-bit Linux part has already been done
by Damien (cc-ed).
I suppose we could accept submissions from OS developers documenting
their interfaces as informational appendices, so that there is a single
reference point.
> > information via the initial stack, such as in the auxiliary vector,
> > preinitializing a processor register, providing a system call to retrieve
> > it, etc.
> >
> > The presence of a separate text base address and a data base address also
> > means that ET_EXEC images cannot be supported with the FDPIC psABI as it
> > is not possible to make multiple copies of such image's data segment in a
> > no-MMU system without the ability to relocate it at load time.
> >
> >
> > 5.2 Lazy Binding (normative)
> >
> > Lazy binding can be optionally implemented by the dynamic loader. If it
> > is implemented, then the run-time relocation of R_RISCV_JUMP_SLOT and
> > their associated R_RISCV_GP relocations present in the FDT is done in two
> > stages.
>
> These should be a single relocation for consistency with other FDPIC ABIs.
>
> Properly supporting lazy binding on FDPIC is very difficult for multithreaded
> programs because it is impossible (on baseline RV*IA) to atomically update both
> words that compose a function descriptor copy. Lazy binding is disabled on
> modern distros as a hardening measure and not supported by musl as a matter of
> policy, so it is likely not worth trying to make it work.
Good point about atomicity.
> If you were to attempt to do this, it would be necessary to specify the order
> of loads in PLT entries (always load the entry point first and the GOT second);
> updates would write the correct GOT, issue a membarrier() syscall (a no-op on
> uniprocessor or sequentially consistent systems, required for ordering
> otherwise), and then write the new entry point.
Does the RISC-V ISA support weak memory ordering? I thought it did not,
having learnt from all the software engineering challenges it caused with
DEC Alpha systems (and to some extent MIPS systems). Some 25 years on and
some Linux kernel bugs still haven't been sorted in this area, and people
keep appearing who cannot even understand there is a problem there.
Either way, that does not seem to be a big deal to me, and it is an
implementation detail.
Also why do we need such a heavyweight mechanism as a syscall for an
ordering barrier? Borrowing your argument: all the other ISAs that
support weak memory ordering have an unprivileged hardware instruction for
synchronisation: Alpha has MB, MIPS has SYNC and even Intel x86 (which had
some weak bus ordering properties in its Pentium Pro implementation; not
sure if any were carried over to later microarchitectures) has CPUID.
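For illustration, the update order you describe can be sketched in portable
C11 without a syscall, leaving it to the compiler to emit whatever
unprivileged barrier the target provides. This is only a sketch under
assumptions: the descriptor layout and the function names are hypothetical,
and the reader below includes an acquire fence, which an actual PLT entry
would presumably not have.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical two-word function descriptor as held in the FDT. */
struct fdesc {
    _Atomic uintptr_t entry;  /* function entry point */
    _Atomic uintptr_t gp;     /* callee's global pointer */
};

/* Writer (lazy resolver): publish the GP first, then the entry point. */
static void fdesc_resolve(struct fdesc *fd, uintptr_t entry, uintptr_t gp)
{
    atomic_store_explicit(&fd->gp, gp, memory_order_relaxed);
    /* Release fence: the GP store is ordered before the entry store. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&fd->entry, entry, memory_order_relaxed);
}

/* Reader (stand-in for a PLT entry): entry point first, GP second. */
static void fdesc_read(struct fdesc *fd, uintptr_t *entry, uintptr_t *gp)
{
    *entry = atomic_load_explicit(&fd->entry, memory_order_relaxed);
    /* Acquire fence: the GP load is ordered after the entry load. */
    atomic_thread_fence(memory_order_acquire);
    *gp = atomic_load_explicit(&fd->gp, memory_order_relaxed);
}
```

A reader that observes the new entry point is then guaranteed to observe the
new GP as well, while a reader that still observes the old entry point may
see either GP value.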
> This guarantees that the entry point can only be reached with the corresponding
> GOT, however, it allows the lazy resolver to be called with _either_ the
> initial GOT value for the lazy descriptor, _or_ the final symbol's GOT. As
> such, the lazy resolver cannot depend(!) on the GOT register it receives.
Indeed, but as you note, the dynamic loader's GP can be stored at a place
uniformly reachable from any valid GP value corresponding to one of the
loaded modules, e.g. in the link map. As such it does not have to be
standardised at the psABI level and can be left to the implementation.
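One possible implementation, purely as a sketch: the loader reserves a word
at a fixed offset from every module's GP and stores there a pointer to its
own state, from which its own GP can be recovered. The slot index, the
structure and the function names below are all hypothetical.

```c
#include <stdint.h>

/* Hypothetical loader-private state, e.g. part of the link map. */
struct loader_state {
    uintptr_t loader_gp;  /* the dynamic loader's own GP value */
};

/* Hypothetical: index of the word reserved at every module's GP. */
#define LOADER_STATE_SLOT 0

/* Done by the loader for each module at load time. */
static void module_init_gp_area(uintptr_t *gp_area, struct loader_state *st)
{
    gp_area[LOADER_STATE_SLOT] = (uintptr_t)st;
}

/* Done by the lazy resolver: recover the loader's GP from any valid GP. */
static uintptr_t loader_gp_from(const uintptr_t *any_gp)
{
    const struct loader_state *st =
        (const struct loader_state *)any_gp[LOADER_STATE_SLOT];
    return st->loader_gp;
}
```

The resolver then works the same whichever module's GP it is entered with.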
> > In the first stage, which is done by the dynamic loader at the time a
> > module is loaded, R_RISCV_JUMP_SLOT and R_RISCV_GP relocations are
> > resolved respectively to the address of the lazy resolver and the value
> > of the global pointer associated with the module providing the lazy
> > resolver.
>
> > In the second stage, which is done when the lazy resolver is reached by
> > means of making a call through an FDT entry referring to it,
> > R_RISCV_JUMP_SLOT and R_RISCV_GP relocations are resolved respectively
> > to the address of the function symbol associated with the FDT entry and
> > the value of the global pointer associated with the module providing the
> > function symbol. To be able to do its work the lazy resolver is called
> > with certain registers containing values as follows:
> >
> > * x3 (gp) holds the dynamic loader's GP value as with an ordinary FDT
> > entry (this is a consequence of the first stage of run-time relocation)
>
> The dynamic loader needs to be able to tolerate _any_ valid gp value. This
> could be achieved by reserving a few words near gp and having the dynamic
> loader store a pointer to its own state at a known offset from every GOT.
Discussed above.
> > * x5 (t0) holds a pointer to the FDT entry to relocate
> >
> > * x6 (t1) holds the caller's GP value
>
> I don't think this is actually needed - the SH and ARM FDPIC ABIs
> unconditionally clobber the caller's GP. Given a pointer to a function
> descriptor copy (which is within one of the caller's data segments) the dynamic
> linker can easily find the caller by walking a list of loaded libraries.
It simplifies the lazy resolver's processing at the cost of one instruction,
which is however executed every time a call via the PLT is made, even once
the symbol has been resolved. Perhaps it's not worth it.
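For reference, the lookup you suggest could be sketched along these lines.
The module list and its field names are hypothetical, and each module is
assumed to have a single contiguous data range for simplicity.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-module record kept by the dynamic loader. */
struct module {
    uintptr_t data_start;  /* first address of the data segment */
    uintptr_t data_end;    /* one past the last address */
    uintptr_t gp;          /* the module's GP value */
    struct module *next;
};

/* Find the module whose data segment holds the given FDT entry. */
static const struct module *find_caller(const struct module *head,
                                        uintptr_t fdt_entry)
{
    for (const struct module *m = head; m != NULL; m = m->next)
        if (fdt_entry >= m->data_start && fdt_entry < m->data_end)
            return m;
    return NULL;  /* not within any loaded module */
}
```

The walk is linear in the number of loaded modules, but it only happens once
per lazily bound symbol rather than on every call.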
> > Registers have been assigned such as to work with the RV32E instruction
> > set as well.
> >
> > Upon completion of the second stage the lazy resolver makes a jump to the
> > newly resolved address of the function symbol.
>
> > 5.3 Example PLT Code (informative)
> >
> > @PLT:
> > l[w|d] t2, 0(t0)
> > mv t1, gp
>
> We don't need to save t1 here; we could save 2 bytes per PLT entry by moving
> the adds into this function.
Except it would surely break cache line alignment for every other entry,
causing an execution penalty. Anyway, the structure of the PLT is private
to the containing module and can therefore be left to the implementation.
This is an example for illustration only.
Again, thank you for your input. If you have any further comments or
questions, then I'll be happy to address them. Otherwise I will factor in
what has been observed here.
Maciej