Hi Folks,
I've been doing some study of the RISC-V Memory Model as outlined in The
RISC-V Instruction Set Manual Volume I: User-level ISA, Appendix A —
RVWMO Explanatory Material, Section A.5 Code Porting and Mapping
Guidelines, Table A.6: Mappings from C/C++ primitives to RISC-V primitives.
# Standard Atomics
This work started out as the desire to create a stdatomic.h header for
use in bare-metal programs such as benchmarks, and possibly even
firmware. It is part of riscv-probe, which I derived from riscv-pk when
working on riscv-qemu for SiFive. It's in my personal GitHub account,
because I don't have permissions either at SiFive or in RISC-V
Foundation. In any case, here are the headers:
- https://github.com/michaeljclark/riscv-probe/blob/master/libfemto/include/stdatomic.h
- https://github.com/michaeljclark/riscv-probe/blob/master/libfemto/include/arch/riscv/stdatomic.h
The desire here is to use standard types, in this case the standard C11
types and interfaces, versus what I would refer to as "cooked" types or
interfaces. For interfaces that are unique, or which have special
requirements, it makes sense to invent new ones, but in cases where
existing interfaces are satisfactory, I think it makes more sense to
adopt standard interfaces.
# GCC Builtin Atomics Verification
In any case, during the course of my work on this, towards creating
threads on bare metal (something Palmer asked for), I ended up going
depth-first. Upon scratching the surface of the GCC code-gen, it
became apparent that the RISC-V Memory Model, as recently added to the
ISA Manual, is simply not present in GCC. So the differences I have
found are not so much GCC defects as a case of missing engineering. I
decided depth-first was the best approach in this case because we are
actually missing the compiler infrastructure to create atomics that take
advantage of the full extent of the memory model with correct acquire
and release hints. I understand a naive implementation can function by
using "fence" to imply total order irrespective of read and write hints
(the current case). That is not wrong, but it does not give hardware
enough information to do more precise reordering.
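To make the difference concrete, here is a minimal C11 sketch of acquire/release message passing. The names payload, ready, publish and try_consume are hypothetical and are not taken from riscv-probe. The instruction sequences in the comments are the fence-based mappings given in Table A.6 of the ISA Manual; they are annotations of what one would hope to see, not verified compiler output.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names, for illustration only. */
static int payload;        /* plain (non-atomic) data                 */
static atomic_int ready;   /* publication flag, zero = not published  */

/* Producer: write the data, then publish it with a release store.
 * Table A.6 mapping for the store: fence rw,w ; sw */
static void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Consumer: an acquire load that observes the flag also observes the data.
 * Table A.6 mapping for the load: lw ; fence r,rw */
static bool try_consume(int *out)
{
    assert(out != NULL);
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}
```

With only release and acquire requested, the hardware is told exactly which earlier or later accesses must stay ordered, rather than being asked for a total order.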
I spent time to write up my findings and share them here:
- https://github.com/michaeljclark/riscv-atomics
I have written replacement versions for all of the C11 atomics, so that
today I am able to generate code from C that matches the ISA Manual
Memory Model guidelines.
The rationale is that someone else out there might be working on
an Out-of-Order core, might find these useful, and may like to know that
GCC (and LLVM?) is not yet emitting code that conforms with the ISA
Manual. In addition to comparing GCC to the ISA Manual guidelines,
I have created a set of drop-in headers that allow one to use the C11
atomics and emit (hopefully) correct assembly. I haven't looked at LLVM
code-gen in this area yet.
# RISC-V Atomics Compiler Analysis
The atomics test repo contains a number of simple code-gen tests. I am
not aware of a tool like LLVM's Lit for this, so I am currently using
visual inspection to compare the expected assembly in comments against
the code-gen from the GCC intrinsics and the code-gen from my replacement
versions. The results are here:
https://github.com/michaeljclark/riscv-atomics/tree/master/results
- Logic bug in atomic_flag_clear
- Missing or incorrect .aqrl flags on atomic operations (add, and, or, xor)
- Missing or incorrect .aqrl flags on load-reserved and store-conditional
- Use of IO flags on fence instructions generated by intrinsics
- Use of amoswap with the zero register versus plain stores plus fence rw,w
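As an illustration of the kind of code-gen test involved, here is a small sketch (not copied from the riscv-atomics repo; the function names are hypothetical). Each function isolates one operation and one memory order, and the comment gives the instruction that Table A.6 of the ISA Manual suggests for it, so the emitted assembly can be compared by eye.

```c
#include <assert.h>
#include <stdatomic.h>

/* AMOs are only expected when int atomics are lock-free on the target. */
static_assert(ATOMIC_INT_LOCK_FREE == 2, "int atomics must be lock-free");

/* One function per operation/order pair keeps each emitted instruction
 * easy to locate in the assembly listing. */
int add_relaxed(atomic_int *p, int v)   /* expect: amoadd.w      */
{
    return atomic_fetch_add_explicit(p, v, memory_order_relaxed);
}

int or_acquire(atomic_int *p, int v)    /* expect: amoor.w.aq    */
{
    return atomic_fetch_or_explicit(p, v, memory_order_acquire);
}

int and_release(atomic_int *p, int v)   /* expect: amoand.w.rl   */
{
    return atomic_fetch_and_explicit(p, v, memory_order_release);
}

int xor_acq_rel(atomic_int *p, int v)   /* expect: amoxor.w.aqrl */
{
    return atomic_fetch_xor_explicit(p, v, memory_order_acq_rel);
}

int add_seq_cst(atomic_int *p, int v)   /* expect: amoadd.w.aqrl */
{
    return atomic_fetch_add_explicit(p, v, memory_order_seq_cst);
}
```

Assuming a RISC-V cross compiler is available, compiling this file to assembly with optimization enabled and reading the instruction in each function body is enough to spot a missing or extra .aq/.rl suffix.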
## IO Fence attributes
I believe IO flags should be used explicitly in kernel code or with
special attributes. IO fence flags may incur higher overhead on some
micro-architectures. One would expect these flags in inline assembly
(e.g. eieio on PowerPC, and inside readl/writel type macros), or to
be generated only if special attributes are present. One wouldn't expect
these in typical user-level code, which in most cases is going to
be ordering memory, not IO.
It is possible to use science here, i.e. look at other architectures.
While that is not completely scientific, and is partially subjective, it
does mean one would be using an "orthodox approach" vs a "heterodox
approach". There are times when it makes sense to follow the
heterodoxy, although I think it should be justified. I think using
compiler attributes lets one get the best of both worlds, i.e. no
surprises. This is similar to other cases where we don't expect the
compiler to emit anything close to a system instruction unless specific
attributes are used; __attribute__((interrupt)) on ARM and
__attribute__((address_space(258))) for SS relative on x86 come to mind.
I certainly was surprised that regular memory synchronization emits
flags that signal IO ordering. Whether that is an indication of its
correctness is subjective and needs further sampling.
## AMOs in place of load/store+fence
Favor the use of simple loads and stores with FENCE:
a) allows atomic_load and atomic_store without the A extension.
b) avoids incompatible access control requirements (RW vs W).
c) code density is not an argument for synchronisation primitives.
d) consistency with stores of other widths, e.g. byte and half.
e) micro-architectural cost of an atomic operation versus a FENCE
instruction, the former requiring special bus operations versus the
latter, which just enforces memory access ordering.
I am thinking of a spec subset of just LR/SC+FENCE. It would be quite useful.
# LARS - Load-Acquire Release-Store
Fences take quite a bit of mental gymnastics, and can even confuse
long-time kernel developers who have got their fences around the wrong
way (evidenced by a few searches of the linux-kernel archives; no need
for names; just knowledge that nobody is immune from making mistakes,
which is why we have verification). It took me a little time to
construct a mental mnemonic that has helped me remember fence position
and flags. By all means, please verify it against the ISA Manual
guidelines.
It is LARS (Load-Acquire Release-Store). The positions of acquire and
release are the barriers/fences, and it expands with modifications inside
of the barriers, like so:
    lw a0,0(a1)
    fence r,rw
    --------------
    <modify>
    --------------
    fence rw,w
    sw a0,0(a1)
Traditionally folk say Store-Release, but for the sake of the mental
mnemonic, the inversion to Release-Store lets the word order serve as a
way to remember the barrier/fence locations (substitute for memory
operations from all threads at that point in time). The fence
flags are obvious once we have the fence locations: the fence following
the load-acquire has r (the load) as the predecessor and
rw (the modifications) as the successor; conversely, the Release-Store
barrier/fence has the prior rw (modifications) as the predecessor and
w (the store) as the successor. This assumes we are thinking of the
load-acquire and store-release as ordering the modifications.
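The same picture can be written in portable C11 using explicit fences, which keeps the fence positions visible. This is a sketch under assumptions: lars_update and add_one are hypothetical names, and the comments show the instruction each line is hoped to map to per Table A.6 (an acquire atomic_thread_fence as fence r,rw, a release one as fence rw,w). Note that the load/modify/store sequence is not an atomic read-modify-write; it only illustrates ordering.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Load-Acquire ... <modify> ... Release-Store, spelled with C11 fences. */
static int lars_update(atomic_int *cell, int (*modify)(int))
{
    assert(cell != NULL && modify != NULL);

    int v = atomic_load_explicit(cell, memory_order_relaxed);  /* lw          */
    atomic_thread_fence(memory_order_acquire);                 /* fence r,rw  */

    v = modify(v);                                             /* <modify>    */

    atomic_thread_fence(memory_order_release);                 /* fence rw,w  */
    atomic_store_explicit(cell, v, memory_order_relaxed);      /* sw          */
    return v;
}

/* Example modification used to exercise lars_update. */
static int add_one(int v)
{
    return v + 1;
}
```

Reading the function top to bottom gives the mnemonic directly: load, acquire, modify, release, store.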
# Spinlocks
I also created a RISC-V ticket spin-lock. I wished to do this in MIT
licensed code so that it is portable to Linux and elsewhere (this was
before I found that Linux had moved ticket locks into asm-generic, which
generated some other patches):
- https://github.com/michaeljclark/riscv-probe/blob/master/libfemto/include/arch/riscv/spinlock.h
- https://github.com/michaeljclark/riscv-probe/blob/master/libfemto/arch/riscv/spinlock.c
Here is the resulting ticket spinlock code for Linux, which doesn't use
any of my code, rather adds support for small word atomics (borrowed
from MIPS) and then adopts the asm-generic qspinlock:
- http://lists.infradead.org/pipermail/linux-riscv/2019-February/003070.html
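For readers who have not seen one, a ticket lock can be sketched in portable C11 as below. This is a minimal illustration, assuming 32-bit unsigned counters that are allowed to wrap; it is not the code in spinlock.h above nor the Linux patch, and the ticket_lock names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct {
    atomic_uint next;     /* next ticket to hand out          */
    atomic_uint serving;  /* ticket currently allowed to run  */
} ticket_lock;

static void ticket_lock_init(ticket_lock *l)
{
    atomic_init(&l->next, 0);
    atomic_init(&l->serving, 0);
}

static void ticket_lock_acquire(ticket_lock *l)
{
    /* Taking a ticket needs atomicity but no ordering. */
    unsigned me = atomic_fetch_add_explicit(&l->next, 1, memory_order_relaxed);

    /* The acquire load pairs with the release store of the previous owner. */
    while (atomic_load_explicit(&l->serving, memory_order_acquire) != me)
        ;  /* spin */
}

static void ticket_lock_release(ticket_lock *l)
{
    /* Only the owner writes `serving`, so a relaxed read is enough here. */
    unsigned me = atomic_load_explicit(&l->serving, memory_order_relaxed);

    /* The lock is held exactly when serving != next. */
    assert(me != atomic_load_explicit(&l->next, memory_order_relaxed));

    atomic_store_explicit(&l->serving, me + 1, memory_order_release);
}
```

The acquire side uses one AMO (the fetch-add) plus a load-acquire loop, and the release side is a plain store-release, which is the load/store-plus-fence shape argued for earlier.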
# Queues
I have some code for a multiple-producer multiple-consumer lock-free
queue that might be useful for memory model tests in a simulator. It
could be simplified for single-producer single-consumer by removing ABA
tolerance (it embeds a version number in the ring buffer head and tail
counters for ABA tolerance, allowing MPMC vs SPSC). The queue has
two-phase explicit memory ordering: it essentially acquires a queue ticket,
inserts into or removes from the queue, then synchronizes the offsets:
https://github.com/michaeljclark/queue_atomic/blob/master/queue_atomic.h
Head and tail offset and version number are on separate cache lines to
avoid false sharing. We could do fun things like put front_cache onto
the tail cache line and tail_cache onto the front cache line. Verifying
the correctness of this type of code is interesting, i.e. specialized
lock-free constructs that replace spinlocks with carefully ordered
memory reads and writes (although when I wrote it, I was ignorant of the
translation of the acquire-release semantics I was using). The
interesting thing to note is that load-reserved/store-conditional has
not made its way into the C standard, so we are still exposed to the
compare_and_swap primitive, which suffers from the ABA problem that LR/SC
is supposed to solve. The problem is that LR/SC cannot easily be included
in the standard because it can't easily be implemented on platforms that lack it.
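The version-number idea can be sketched as follows. This is an illustration only, assuming a 64-bit word with the version in the high 32 bits and the offset in the low 32 bits; the layout and the names pack, version_of, offset_of and versioned_update are hypothetical and may differ from what queue_atomic.h actually does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* The sketch relies on a lock-free 64-bit compare-and-swap. */
static_assert(ATOMIC_LLONG_LOCK_FREE == 2, "64-bit CAS must be lock-free");

/* Assumed layout: high 32 bits = version, low 32 bits = offset. */
static inline uint64_t pack(uint32_t version, uint32_t offset)
{
    return ((uint64_t)version << 32) | offset;
}

static inline uint32_t version_of(uint64_t word) { return (uint32_t)(word >> 32); }
static inline uint32_t offset_of(uint64_t word)  { return (uint32_t)word; }

/* Try to move the counter from the snapshot `seen` to `new_offset`.
 * Every successful update bumps the version, so a stale snapshot fails
 * even when the offset has cycled back to the value it saw. */
static bool versioned_update(_Atomic uint64_t *counter, uint64_t seen,
                             uint32_t new_offset)
{
    uint64_t desired = pack(version_of(seen) + 1, new_offset);
    return atomic_compare_exchange_strong_explicit(
        counter, &seen, desired,
        memory_order_acq_rel, memory_order_acquire);
}
```

A thread holding an old snapshot whose offset matches the current offset would succeed with a bare compare-and-swap on the offset alone; with the version folded into the compared word it fails and has to reload.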
Now I would like to find a RISC-V OoO simulator to analyze the safety of
these codes. I don't mean Verilog synthesis in FPGA; I mean something
more like spike but OoO. There are particular parallelism analysis
use-cases that I have in mind that dictate having dedicated tools, and
also it needs to be !QEMU (many reasons). RISCVEMU would even be a
better starting point than QEMU. Gem5 could be considered, but ideally
it would not be code that is controlled by a competitor to RISC-V, unless
Linaro joins the RISC-V Foundation, which would be the ultimate proof they
are not.
# Sabotage
Sabotage [1] is possibly a little bit of a strong word to use, but when
you witness what is clearly someone throwing spanners into the machine,
and doing so not as part of some coordinated or scheduled durability
testing (as throwing spanners into machines is in fact valid engineering
practice), then one finds it a little disheartening. Especially when
certain objectives were set and barriers were created to prevent their
execution. It's more than a little disheartening. That said, it is
sometimes impossible to distinguish sabotage from ignorance or human
error, however, methinks there is some sabotage going on.
Regards,
Michael.
P.S. I have not checked LLVM atomics.
[1]
https://www.psychologytoday.com/us/blog/in-one-lifespan/201808/sabotage-in-academia
--
Trust without verification is corruption