Memory model, acquire, and release


Andrew Lutomirski

Oct 7, 2016, 11:28:16 PM
to RISC-V ISA Dev
Continuing off another thread:

On Friday, October 7, 2016 at 6:21:12 PM UTC-7, Jacob Bachmeyer wrote:

Races between page tables and underlying memory are an issue I still
have questions about.  You are correct about software PTE walks being
inherently vulnerable to races, and this would also affect a
"verify user address" instruction.  I withdraw that suggestion.

I still have an unanswered question:  What combination of FENCE,
SFENCE.VM, remote fences, etc. is required to globally kill a page
mapping before reusing that physical page?

> I can't speak for other OS's, but on Linux this essentially doesn't
> happen.  Linux will allocate a physical page, fill it in using
> an alias in the kernel virtual address range, and only map it into the
> user address range once it's fully populated.  (And, on RISC-V, the
> kernel will have to put a FENCE w,w in there as well.)
>
> P.S. Am I understanding the memory model right?  One CPU doing FENCE
> w,w; store to address A will synchronize against another CPU doing
> load from address A; FENCE r,r, right?

Unless I also misunderstand the memory model, your FENCEs are
incorrect.  FENCE is defined informally as "no other RISC-V thread or
external device can observe any operation in the successor set following
a FENCE before any operation in the predecessor set preceding the FENCE"
(RISC-V user spec v2.1, sec. 2.7 "Memory Model").  Assuming that FENCE
is "FENCE <pred>, <succ>", paging in program text would seem to require
a "FENCE w,r; FENCE.I" sequence, but I am uncertain what must be done to
ensure that other harts will actually see the new page contents.  
Mapping it to user space will require SFENCE.VM, since the TLB may
already have cached the page-faulting (invalid) translation for that address.

I'm having trouble understanding what the RISC-V memory model is exactly, and I find the FENCE instruction to be quite vague.  For example, if I do:

STORE 1, address 0
FENCE w,w
STORE 1, address 1

and another hart does:

LOAD address 1
LOAD address 0

A literal reading of the docs would suggest that, if the first load sees a 1, the second load will also see a 1.  I doubt that's the intent.
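
For comparison, here is a rough C11-atomics sketch of the same message-passing pattern (an analogy under the C11 model, not RISC-V code): with a release fence on the writer and an acquire fence on the reader, seeing the second store guarantees seeing the first; drop the reader-side fence and no such guarantee exists.

```c
/* C11 analogue of the litmus test above.  The writer's FENCE w,w
 * corresponds to a release fence between two relaxed stores.  Crucially,
 * the reader needs its own acquire fence (a FENCE r,r) between the
 * loads, or they may be reordered. */
#include <stdatomic.h>

static atomic_int addr0, addr1;

/* First hart: STORE 1, address 0; FENCE w,w; STORE 1, address 1. */
void writer(void) {
    atomic_store_explicit(&addr0, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);   /* ~ FENCE w,w */
    atomic_store_explicit(&addr1, 1, memory_order_relaxed);
}

/* Second hart: without the fence below, the two loads may be reordered,
 * so seeing addr1 == 1 would not imply seeing addr0 == 1.  Returns -1
 * if addr1 was not yet visible, otherwise the value read from addr0. */
int reader(void) {
    int r1 = atomic_load_explicit(&addr1, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);   /* ~ FENCE r,r */
    int r0 = atomic_load_explicit(&addr0, memory_order_relaxed);
    return r1 ? r0 : -1;
}
```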

But I think that the memory model, as I understand it, is weak enough that performance will suffer.  The world seems to be standardizing on something like the C11 memory model, and RISC-V doesn't appear to have lightweight acquire and release operations.  The only way to get a release with a fence seems to be (by my reading) FENCE w,rw, and, while I'm not a CPU designer, my understanding is that w,r fences are generally quite expensive because they force in-flight stores to complete before much else can happen, where all I really wanted was to enforce a limited form of causality.  x86 manages to make almost every store a release without too much performance loss.

Oddly, RISC-V seems to have acquire and release *atomics*, but plain load and store aren't in the list of atomics, so that doesn't really solve the problem.  Also, there's no clear spec on how acquires and releases order with respect to FENCE.  (And what does an atomic op without acquire or release set do?  It would be neat if they guaranteed atomic execution with respect to other operations on the same CPU but were otherwise very lightweight.  Then they'd be useful to synchronize against interrupts.)

As for page table walks, to me, the obvious semantics are that all fetches from paging structures are load-acquire.  It would make sense for supervisor code to use a release or stronger for page-table writes, and all would be well.  Stores that set the A and D bits should be atomic with full, effective LL/SC semantics, in the sense that the A and D bits shouldn't be set if the PTE is modified remotely and no longer matches what's in the TLB.
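
To make the load-acquire/release pairing concrete, here is a minimal single-level sketch in C11 atomics; the PTE layout (valid bit 0, PFN at bit 10) is hypothetical for illustration, not the actual RISC-V Sv39/Sv48 encoding.

```c
/* Hypothetical PTE layout for illustration only: bit 0 = valid,
 * PFN starting at bit 10. */
#include <stdatomic.h>
#include <stdint.h>

#define PTE_V     1ull
#define PFN_SHIFT 10

typedef _Atomic uint64_t pte_t;

pte_t example_pte;  /* stands in for one entry of a paging structure */

/* Supervisor side: publish a mapping with a release store, so the page
 * contents written before this call are visible to any walker that
 * observes the valid entry. */
void set_pte(pte_t *pte, uint64_t pfn) {
    atomic_store_explicit(pte, (pfn << PFN_SHIFT) | PTE_V,
                          memory_order_release);
}

/* Walker side: each paging-structure fetch is a load-acquire.
 * Returns the PFN, or -1 if the entry is not present. */
int64_t walk_fetch(pte_t *pte) {
    uint64_t e = atomic_load_explicit(pte, memory_order_acquire);
    if (!(e & PTE_V))
        return -1;
    return (int64_t)(e >> PFN_SHIFT);
}
```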

Tangent: A feature I've long mused about is a form of heavyweight super-barrier.  Specifically, an instruction (or maybe even a request that requires polling) that guarantees that all prior memory access on the invoking CPU/hart is visible *and observed* on all other CPUs before any subsequent memory access.  Somewhat formally, it would be a barrier that synchronizes as heavily as a broadcast IPI that calls a function that executes a full fence on all other CPUs.  In some sense, this should be free to implement -- merely waiting long enough should achieve it as long as all the other CPUs eventually flush their store buffers.  If you search for "sys_membarrier", you'll find wild and crazy uses for this type of superfence.  I am not aware of any prior art for this type of fence on any architecture.  But who knows: maybe an ARM64 remote TLB flush has this as a side effect.
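
For reference, the sys_membarrier mentioned here landed in Linux 4.3: a syscall that returns only after every running thread has passed through a full memory barrier, implemented by waiting/IPI rather than a dedicated fence instruction.  A hedged sketch of calling it; the command values are the real Linux ones, but the raw-syscall route and the x86-64 syscall-number fallback are assumptions for systems without header support.

```c
/* Sketch of invoking sys_membarrier directly.  MEMBARRIER_CMD_QUERY and
 * MEMBARRIER_CMD_SHARED come from <linux/membarrier.h>. */
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_membarrier
#define __NR_membarrier 324          /* x86-64; other arches differ */
#endif

#define MEMBARRIER_CMD_QUERY  0
#define MEMBARRIER_CMD_SHARED 1      /* later kernels call this _GLOBAL */

/* Returns 0 after all running threads have executed a full memory
 * barrier, or -1 if the kernel does not support membarrier. */
int global_superfence(void) {
    long cmds = syscall(__NR_membarrier, MEMBARRIER_CMD_QUERY, 0);
    if (cmds < 0 || !(cmds & MEMBARRIER_CMD_SHARED))
        return -1;                   /* too old a kernel, or filtered */
    return (int)syscall(__NR_membarrier, MEMBARRIER_CMD_SHARED, 0);
}
```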


Stefan O'Rear

Oct 7, 2016, 11:44:30 PM
to Andrew Lutomirski, RISC-V ISA Dev
On Fri, Oct 7, 2016 at 8:28 PM, Andrew Lutomirski <aml...@gmail.com> wrote:
> Continuing off another thread:
>
> I'm having trouble understanding what the RISC-V memory model is exactly,
> and I find the FENCE instruction to be quite vague. For example, if I do:
>
> STORE 1, address 0
> FENCE w,w
> STORE 1, address 1
>
> and another hart does:
>
> LOAD address 1
> LOAD address 0
>
> A literal reading of the docs would suggest that, if the first load sees a
> 1, the second load will also see a 1. I doubt that's the intent.

No, the two LOADs can be freely reordered. RISC-V's current model is
basically Alpha, except with less mature documentation, but ...

> But I think that the memory model, as I understand it, is weak enough that
> performance will suffer. The world seems to be standardizing on something
> like the C11 memory model, and RISC-V doesn't appear to have lightweight
> acquire and release operations. The only way to get a release with a fence
> seems to be (by my reading) FENCE w,rw, and, while I'm not a CPU designer,
> my understanding is that w,r fences are generally quite expensive because
> they force in-flight stores to complete before much else can happen, where
> all I really wanted was to enforce a limited form of causality. x86
> manages to make almost every store be a release without too much performance
> loss.

... They know it's too vague and too weak.
https://groups.google.com/a/groups.riscv.org/d/msg/isa-dev/Va2bE2kf7uw/SHG5Vj_vAQAJ
appears to indicate a current plan to replace it with something ARM-
or PPC-like.

Personally I'm more concerned about people thinking "Alpha is dead, so
I can forget about smp_read_barrier_depends". RISC-V as currently
specified needs it (as "fence r,rw"), but if the changes alluded to by
Andrew in the linked message are made, it won't.

> Oddly, RISC-V seems to have acquire and release *atomics*, but plain load
> and store aren't in the list of atomics, so that doesn't really solve the
> problem. Also, there's no clear spec on how acquires and releases order
> with respect to FENCE. (And what does an atomic op without acquire or
> release set do? It would be neat if they guaranteed atomic execution with
> respect to other operations on the same CPU but were otherwise very
> lightweight. Then they'd be useful to synchronize against interrupts.)

I read the spec as saying that a naked atomic is
memory_order_relaxed. It needs clarification.

> As for page table walks, to me, the obvious semantics are that all fetches
> from paging structures are load-acquire. The supervisor code would make
> sense to use a release or stronger for page table writes, and all would be
> well.

Yes.

> Stores that set the A and D bits should be atomic with full,
> effective LL/SC semantics in the sense that the A and D bits shouldn't be
> set if the PTE is modified remotely and no longer matches what's in the TLB.

I don't see why this is needed. PTEs are pointer-sized and naturally
aligned, so reads and writes should be atomic. If two threads write a
PTE and a third thread reads it, it will get the flags and the PFN
from the same read so they'll already match. (qemu MT-TCG might need
special code to ensure ACCESS_ONCE on PTEs, though) (I have never
actually used the kernel atomics and I might be goofing the
references).

> Tangent: A feature I've long mused about is a form of heavyweight
> super-barrier. Specifically, an instruction (or maybe even a request that
> requires polling) that guarantees that all prior memory access on the
> invoking CPU/hart is visible *and observed* on all other CPUs before any
> subsequent memory access. Somewhat formally, it would be a barrier that
> synchronizes as heavily as a broadcast IPI that calls a function that
> executes a full fence on all other CPUs. In some sense, this should be free
> to implement -- merely waiting long enough should achieve it as long as all
> the other CPUs eventually flush their store buffers. If you search for
> "sys_membarrier", you'll find wild and crazy uses for this type of
> superfence. I am not aware of any prior art for this type of fence on any
> architecture. But who knows: maybe an ARM64 remote TLB flush has this as a
> side effect.

I've had the same idea, I don't understand TileLink nearly well enough
to know how it would be implemented.

-s

Andrew Lutomirski

Oct 7, 2016, 11:55:00 PM
to RISC-V ISA Dev, aml...@gmail.com
[OT: You're Stefan in emails but sorear2 in the Groups UI.  Go figure.]


On Friday, October 7, 2016 at 8:44:30 PM UTC-7, sorear2 wrote:

> Stores that set the A and D bits should be atomic with full,
> effective LL/SC semantics in the sense that the A and D bits shouldn't be
> set if the PTE is modified remotely and no longer matches what's in the TLB.

I don't see why this is needed.  PTEs are pointer-sized and naturally
aligned, so reads and writes should be atomic.  If two threads write a
PTE and a third thread reads it, it will get the flags and the PFN
from the same read so they'll already match.  (qemu MT-TCG might need
special code to ensure ACCESS_ONCE on PTEs, though) (I have never
actually used the kernel atomics and I might be goofing the
references).

The reason is fairly mundane.  Suppose that CPU A replaces a valid PTE with something totally different.  That is, the PTE starts out backed by page 1, W, and initially not dirty, and the PTE is changed so it's backed by page 2, W, !D.  For simplicity, let's suppose CPU A does this using a sequentially consistent atomic swap, although it doesn't really matter.  Then CPU A broadcasts an IPI that does SFENCE.VM.  Meanwhile, CPU B writes through the virtual address being modified.  Acceptable outcomes include:

  • CPU A observes that the old PTE is dirty, CPU B writes to page 1, and the PTE ends up with D clear.
  • CPU A observes that the old PTE is clean, CPU B writes to page 2, and the PTE ends up dirty.
As currently specified, it looks like CPU A could observe the old PTE being clean, the new PTE could be dirty, and CPU B writes to page 1.  This could easily cause data loss.  The only way that I see around it would be for CPU A to invalidate the PTE atomically, then broadcast out a flush, then swap the PTE.  (And then, unless the spec changes so that negative entries aren't allowed to be cached, CPU A has to broadcast a second flush.)  This is painful and doesn't fit into Linux's internal API, I think.
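
The swap side of this is easy to pin down in C11 atomics; here is a sketch with illustrative bit positions (not the real Sv encoding), assuming, as argued above, that the hardware's implicit A/D updates are themselves atomic RMWs on the same word.  Because the exchange is a single atomic, CPU A observes exactly one old value, D bit included, with no window for a concurrent dirty-bit update to be lost against the read.

```c
/* Illustrative PTE bits; not the real RISC-V encoding. */
#include <stdatomic.h>
#include <stdint.h>

#define PTE_W (1ull << 2)
#define PTE_D (1ull << 7)

typedef _Atomic uint64_t pte_t;

pte_t demo_pte;  /* stands in for the PTE CPU A is replacing */

/* Atomically install new_pte and return the complete old value, so the
 * caller can check the old D bit before doing the remote SFENCE.VM and
 * writing back / freeing the old page. */
uint64_t pte_swap(pte_t *pte, uint64_t new_pte) {
    return atomic_exchange_explicit(pte, new_pte, memory_order_seq_cst);
}
```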

Stefan O'Rear

Oct 8, 2016, 12:12:39 AM
to Andrew Lutomirski, RISC-V ISA Dev
On Fri, Oct 7, 2016 at 8:55 PM, Andrew Lutomirski <aml...@gmail.com> wrote:
> The reason is fairly mundane. Suppose that CPU A replaces a valid PTE with
> something totally different. That is, the PTE starts out being backed by
> page 1, W, and initially not dirty, and the PTE is changed so it's backed by
> page 2, W, !D. For simplicity, let's suppose CPU A does this using a
> sequentially consistent atomic swap, although it doesn't really matter.
> Then CPU A broadcasts an IPI that does SFENCE.VM. Meanwhile CPU B writes
> from the virtual address being modified. Acceptable outcomes include:
>
> CPU A observes that the old PTE is dirty, CPU B writes to page 1, and the
> PTE ends up with D clear.
> CPU A observes that the old PTE is clean, CPU B writes to page 2, and the PTE
> ends up dirty.
>
> As currently specified, it looks like CPU A could observe the old PTE being
> clean, the new PTE could be dirty, and CPU B writes to page 1. This could
> easily cause data loss. The only way that I see around it would be for CPU
> A to invalidate the PTE atomically, then broadcast out a flush, then swap
> the PTE. (And then, unless the spec changes so that negative entries aren't
> allowed to be cached, CPU A has to broadcast a second flush.) This is
> painful and doesn't fit into Linux's internal API, I think.

Ah! You mean that the _implicit_ operations to set A and D bits need
to be done as AMOOR. I thought you were talking about stores to PTEs
from the kernel needing to be atomic for some reason (yes, this makes
little sense now).

Yes, that would be good to specify. I think Rocket/BOOM already works
this way, judging by
https://github.com/ucb-bar/rocket-chip/blob/master/src/main/scala/rocket/ptw.scala#L135
.

-s

Michael Clark

Oct 8, 2016, 1:11:35 AM
to Stefan O'Rear, Andrew Lutomirski, RISC-V ISA Dev
Explicit fencing seems to confer both advantages and disadvantages.

I can imagine that the looser, more explicit coherency has some potential security advantages against FLUSH+RELOAD-style attacks, if combined with a mechanism to configure an n x n matrix of cache-coherency domains in a multi-tenant environment, assuming the PDID is part of the cache key.

It also allows one to implement atomic 128-bit aligned reads (not crossing cache-line boundaries) with less effort, i.e. cache-address snooping is not going to cause automatic invalidation of cache lines on other threads. It is potentially more scalable to a larger number of threads. Invalidation traffic would be limited to one's own threads in a multi-tenant environment?

rs1 is reserved on both FENCE and FENCE.I, so one imagines one could implement the C11 memory model using an address fence (ignoring the low 6 bits on a system with a 64-byte cache line) instead of a LOCK prefix. I don’t know what effect this has on the pipeline; I’m thinking from the perspective of a software developer.

I imagine one could implement token spinlocks with AMOADD (acquire a token) and AMOSWAP (take the lock) along with fence rs1 (cache line), potentially with the token counter and the lock on separate cache lines.
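
One way such a token (ticket) lock might be sketched with C11 atomics standing in for the AMOs: the fetch-add plays the AMOADD role, and the two counters sit on separate (assumed 64-byte) cache lines as suggested above.  The rs1-qualified fence has no C11 analogue, so ordinary acquire/release ordering is used here.

```c
#include <stdatomic.h>

/* Token counter and serving counter on separate cache lines, to keep
 * ticket-taking traffic off the line that waiters poll. */
struct ticket_lock {
    _Alignas(64) atomic_uint next;     /* AMOADD target: take a token  */
    _Alignas(64) atomic_uint serving;  /* spin until it equals our token */
};

struct ticket_lock demo_lock;  /* zero-initialized: unlocked */

void ticket_lock_acquire(struct ticket_lock *l) {
    unsigned me = atomic_fetch_add_explicit(&l->next, 1,
                                            memory_order_relaxed);
    while (atomic_load_explicit(&l->serving, memory_order_acquire) != me)
        ;  /* spin */
}

void ticket_lock_release(struct ticket_lock *l) {
    unsigned cur = atomic_load_explicit(&l->serving, memory_order_relaxed);
    atomic_store_explicit(&l->serving, cur + 1, memory_order_release);
}
```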

That said, even with looser coherency, one can stomp on cache ways to force flushes and leak timing information, and if a fence pred,succ,rs1 is provided, one has more control over cache-way invalidation traffic.

On x86, SSE has loose cache coherency, and it is only the legacy integer ISA that has stores magically become visible in order. I guess a lot of software depends on this behaviour… ???
> --
> You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
> To post to this group, send email to isa...@groups.riscv.org.
> Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.
> To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/CADJ6UvPO5U6zLQpEVd2R8ppJVWCs7WFNhYTWcwvL5D3giAkEsw%40mail.gmail.com.

Paolo Bonzini

Oct 8, 2016, 2:34:56 AM
to Andrew Lutomirski, RISC-V ISA Dev


On 08/10/2016 05:28, Andrew Lutomirski wrote:
> The only way to get a release with a fence seems to be (by my reading)
> FENCE w,rw,

Release would be FENCE rw,w (while acquire would be FENCE r,rw).
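
Spelled out with C11 fences (an analogy standing in for the RISC-V FENCEs, not generated code), that mapping looks like:

```c
/* Fence-based release store and acquire load, per the mapping above. */
#include <stdatomic.h>

static atomic_int flag;

void release_store(int v) {
    atomic_thread_fence(memory_order_release);  /* ~ FENCE rw,w */
    atomic_store_explicit(&flag, v, memory_order_relaxed);
}

int acquire_load(void) {
    int v = atomic_load_explicit(&flag, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);  /* ~ FENCE r,rw */
    return v;
}
```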

Paolo

Paolo Bonzini

Oct 8, 2016, 2:41:43 AM
to Michael Clark, Stefan O'Rear, Andrew Lutomirski, RISC-V ISA Dev


On 08/10/2016 07:11, Michael Clark wrote:
> On x86, SSE has loose cache coherency, and it is only the legacy
> integer ISA that has stores magically become visible in order. I guess a lot
> of software depends on this behaviour… ???

_All_ multithreaded software does. C compilers will not put an LFENCE
or SFENCE after acquire integer loads or before release integer stores.

In fact SSE doesn't have loose memory ordering by default; only if you
use the special MOVNT (non-temporal) instructions can writes be reordered.  (Also,
LFENCE is actually an acquire fence, while SFENCE is just a write/write
fence.)

Paolo

Reinoud Zandijk

Oct 9, 2016, 5:45:13 AM
to Andrew Lutomirski, RISC-V ISA Dev
Hi folks,

On Fri, Oct 07, 2016 at 08:55:00PM -0700, Andrew Lutomirski wrote:
> The reason is fairly mundane. Suppose that CPU A replaces a valid PTE with
> something totally different. That is, the PTE starts out being backed by
> page 1, W, and initially not dirty, and the PTE is changed so it's backed
> by page 2, W, !D. For simplicity, let's suppose CPU A does this using a
> sequentially consistent atomic swap, although it doesn't really matter.
> Then CPU A broadcasts an IPI that does SFENCE.VM. Meanwhile CPU B writes
> from the virtual address being modified. Acceptable outcomes include:

Since most of the IPIs executed are basically `please sfence' requests or
the like, it seems beneficial for the memory model if there were
instructions that performed such requests on the TileLink bus without the
hugely expensive remote interrupt: saving registers, checking that it is
an IPI, decoding and issuing an sfence variant, loading all the registers
back, and returning.

Yes, the interrupt code could be written so it saves only a small number
of registers until it knows it's an IPI, but for most system software it's
`just' an interrupt, so everything is saved in a trapframe etc. before
higher-level C code is called to handle it.

As for naming, it's up for grabs; maybe even `RSFENCE' or the like ;)

With regards,
Reinoud

Michael Clark

Oct 9, 2016, 6:18:14 AM
to Reinoud Zandijk, Andrew Lutomirski, RISC-V ISA Dev
I guess the SBI interface allows an implementation-defined mechanism which doesn’t necessarily imply an IPI:

void sbi_remote_sfence_vm(const uintptr_t* harts, size_t asid)

If the SBI is used, then an implementation is free to provide an optimised mechanism without specifying an instruction. Who knows, maybe it could be triggered via MMIO, in which case one would use STORE. Adding an instruction sets a higher bar. Well, I guess it could be trapped and emulated by M-mode, doing an IPI under the hood, but maybe that’s the rationale for the SBI interface: it doesn’t use opcode space for things that may not be implemented using instructions.

Things that are likely to have a variety of implementation defined mechanisms are perhaps good to put behind SBI?

Andrew Waterman

Oct 9, 2016, 5:16:52 PM
to Michael Clark, Reinoud Zandijk, Andrew Lutomirski, RISC-V ISA Dev
MMIO store seems most natural, since it should end up as a TileLink Put message.

>
> Things that are likely to have a variety of implementation defined
> mechanisms are perhaps good to put behind SBI?

That was our intent in making these SBI calls. It admits several
plausible implementations, including software shootdown, hardware
shootdown, and hardware coherence; and it avoids exposing the guts to
the OS.
