On Fri, Oct 7, 2016 at 8:28 PM, Andrew Lutomirski <
aml...@gmail.com> wrote:
> Continuing off another thread:
>
> I'm having trouble understanding what the RISC-V memory model is exactly,
> and I find the FENCE instruction to be quite vague. For example, if I do:
>
> STORE 1, address 0
> FENCE w,w
> STORE 1, address 1
>
> and another hart does:
>
> LOAD address 1
> LOAD address 0
>
> A literal reading of the docs would suggest that, if the first load sees a
> 1, the second load will also see a 1. I doubt that's the intent.
No, the two LOADs can be freely reordered. RISC-V's current model is
basically Alpha, except with less mature documentation, but ...
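Concretely (a sketch, my reading of the current spec, not normative):
to get the outcome Andy wanted, the reader needs its own fence:

    hart 0:                hart 1:
    STORE 1, address 0     LOAD address 1
    FENCE w,w              FENCE r,r
    STORE 1, address 1     LOAD address 0

With the FENCE r,r in place, if the first load returns 1 the second
must as well; without it the two loads can complete in either order.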
> But I think that the memory model, as I understand it, is weak enough that
> performance will suffer. The world seems to be standardizing on something
> like the C11 memory model, and RISC-V doesn't appear to have lightweight
> acquire and release operations. The only way to get a release with a fence
> seems to be (by my reading) FENCE w,rw, and, while I'm not a CPU designer,
> my understanding is that w,r fences are generally quite expensive because
> they force in-flight stores to complete before much else can happen, where
> all I really wanted was to enforce a limited form of causality. x86
> manages to make almost every store be a release without too much performance
> loss.
... They know it's too vague and too weak.
https://groups.google.com/a/groups.riscv.org/d/msg/isa-dev/Va2bE2kf7uw/SHG5Vj_vAQAJ
appears to indicate a current plan to replace it with something ARM-
or PPC-like.
Personally I'm more concerned about people thinking "Alpha is dead, so
I can forget about smp_read_barrier_depends". RISC-V as currently
specified needs it (as "fence r,rw"), but if the changes alluded to by
Andrew in the linked message are made, it won't.
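The case smp_read_barrier_depends covers looks like this (sketch, same
notation as above; under the current spec the reader's fence is needed
even though the second load's address depends on the first):

    hart 0:                  hart 1:
    STORE 42, (data)         LOAD (pointer) -> r1
    FENCE w,w                FENCE r,rw    ; smp_read_barrier_depends
    STORE data, (pointer)    LOAD (r1) -> r2

On an ARM- or PPC-like model the address dependency alone would order
the two loads, and the reader-side fence could be dropped.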
> Oddly, RISC-V seems to have acquire and release *atomics*, but plain load
> and store aren't in the list of atomics, so that doesn't really solve the
> problem. Also, there's no clear spec on how acquires and releases order
> with respect to FENCE. (And what does an atomic op without acquire or
> release set do? It would be neat if they guaranteed atomic execution with
> respect to other operations on the same CPU but were otherwise very
> lightweight. Then they'd be useful to synchronize against interrupts.)
I read the spec as saying that a naked atomic is
memory_order_relaxed. It needs clarification.
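If that reading is right, the aq/rl bits on the AMOs map roughly onto
the C11 orderings like so (a sketch, not normative):

    amoadd.w      t0, t1, (a0)   # no bits set: relaxed
    amoadd.w.aq   t0, t1, (a0)   # acquire
    amoadd.w.rl   t0, t1, (a0)   # release
    amoadd.w.aqrl t0, t1, (a0)   # both: roughly seq_cst

But how these compose with a standalone FENCE is exactly the part the
spec doesn't pin down.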
> As for page table walks, to me, the obvious semantics are that all fetches
> from paging structures are load-acquire. The supervisor code would make
> sense to use a release or stronger for page table writes, and all would be
> well.
Yes.
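I.e. something like this on the supervisor side (a sketch; the FENCE
placement is my assumption, SFENCE.VM as in the current privileged
spec):

    FENCE rw,w            # release: make the new page's contents visible first
    STORE new_pte, (ptep) # publish the mapping
    SFENCE.VM             # order subsequent page-table walks on this hart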
> Stores that set the A and D bits should be atomic with full,
> effective LL/SC semantics in the sense that the A and D bits shouldn't be
> set if the PTE is modified remotely and no longer matches what's in the TLB.
I don't see why this is needed. PTEs are pointer-sized and naturally
aligned, so reads and writes should be atomic. If two threads write a
PTE and a third thread reads it, it will get the flags and the PFN
from the same read, so they'll already match. (qemu MT-TCG might need
special code to ensure ACCESS_ONCE-style PTE accesses, though. Also, I
have never actually used the kernel atomics, so I might be goofing the
references.)
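I.e. the scenario reduces to (sketch):

    hart 0: STORE pteval_A, (ptep)
    hart 1: STORE pteval_B, (ptep)
    hart 2: LOAD (ptep) -> r1

Because each store and the load is a single naturally-aligned,
pointer-sized access, r1 is all of pteval_A or all of pteval_B; the
reader can never see A's flags paired with B's PFN.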
> Tangent: A feature I've long mused about is a form of heavyweight
> super-barrier. Specifically, an instruction (or maybe even a request that
> requires polling) that guarantees that all prior memory access on the
> invoking CPU/hart is visible *and observed* on all other CPUs before any
> subsequent memory access. Somewhat formally, it would be a barrier that
> synchronizes as heavily as a broadcast IPI that calls a function that
> executes a full fence on all other CPUs. In some sense, this should be free
> to implement -- merely waiting long enough should achieve it as long as all
> the other CPUs eventually flush their store buffers. If you search for
> "sys_membarrier", you'll find wild and crazy uses for this type of
> superfence. I am not aware of any prior art for this type of fence on any
> architecture. But who knows: maybe an ARM64 remote TLB flush has this as a
> side effect.
I've had the same idea, I don't understand TileLink nearly well enough
to know how it would be implemented.
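For reference, the usage pattern under the semantics Andy describes
would be roughly (sketch; the point is that the hot path carries no
fence at all):

    writer (rare):           readers (hot path):
    STORE data               LOAD flag
    superfence               ; no fence needed here
    STORE flag               LOAD data

The superfence guarantees every other hart has observed the data store
before the flag store becomes visible, so the readers need no barriers
of their own.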
-s