Proposal: Explicit cache-control instructions (draft 5 after feedback)

Jacob Bachmeyer

Mar 7, 2018, 11:19:42 PM
to RISC-V ISA Dev
Previous discussions suggested that explicit cache-control instructions
could be useful, but RISC-V has some constraints here that other
architectures lack, namely that caching must be transparent to the user ISA.

I propose a new minor opcode REGION := 3'b001 within the existing
MISC-MEM major opcode. Instructions in REGION are R-type and use rs1 to
indicate a base address, rs2 to indicate an upper bound address, and
produce a result in rd that is the first address after the highest
address affected by the operation. If rd is x0, the instruction has no
directly visible effects and can be executed entirely asynchronously as
an implementation option.
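
As a rough illustration of the encoding described above, the following sketch packs the fields of an R-type instruction word. The MISC-MEM major opcode value and the R-type field layout are taken from the base ISA; the helper name encode_region is hypothetical and not part of the proposal.

```python
MISC_MEM = 0b0001111   # existing MISC-MEM major opcode
REGION = 0b001         # proposed minor opcode (funct3), shared with FENCE.I


def encode_region(funct7, rd, rs1, rs2):
    """Pack an R-type instruction word in the proposed MISC-MEM/REGION space.

    Hypothetical helper for illustration only.
    """
    for reg in (rd, rs1, rs2):
        if not 0 <= reg < 32:
            raise ValueError("register number out of range")
    if not 0 <= funct7 < 128:
        raise ValueError("funct7 out of range")
    return ((funct7 << 25) | (rs2 << 20) | (rs1 << 15)
            | (REGION << 12) | (rd << 7) | MISC_MEM)


# Function 0 with all register fields zero is the existing FENCE.I encoding.
fence_i = encode_region(0b0000000, rd=0, rs1=0, rs2=0)
# MEM.PF0 with rd = x0 (may run asynchronously), rs1 = x10, rs2 = x11.
mem_pf0 = encode_region(0b0001000, rd=0, rs1=10, rs2=11)
print(hex(fence_i), hex(mem_pf0))
```

With seven function bits the space holds 128 encodings, which matches the count given later in this proposal.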

Non-destructive operations permit an implementation to expand a
requested region on both ends to meet hardware granularity for the
operation. An application can infer alignment from the produced value
if it is a concern. As a practical matter, cacheline lengths are
expected to be declared in the processor configuration ROM.
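
A minimal sketch of how an implementation might expand a region for a non-destructive operation. It assumes a power-of-two cacheline size and treats the upper bound as exclusive; the draft does not settle whether the bound is inclusive, so that choice and the function name are assumptions.

```python
def expand_region(base, bound, line_size):
    """Expand [base, bound) outward to cacheline granularity.

    Returns (first affected address, produced value), where the produced
    value is the first address after the highest address affected.
    Assumes line_size is a power of two and bound is exclusive.
    """
    if line_size <= 0 or line_size & (line_size - 1):
        raise ValueError("line_size must be a power of two")
    mask = line_size - 1
    low = base & ~mask              # round the base down
    high = (bound + mask) & ~mask   # round the bound up
    return low, high


low, produced = expand_region(0x1004, 0x1044, 64)
print(hex(low), hex(produced))
```

Because the produced value is always a multiple of the hardware granularity, software that cares about alignment can compare it against the bound it requested.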

Destructive operations are a thornier issue, and are resolved by
requiring any partial cachelines (at most 2 -- first and last) to be
flushed or prefetched instead of performing the requested operation on
those cachelines. Implementations may perform the destructive operation
on the parts of these cachelines included in the region, or may simply
flush or prefetch the entire cacheline.
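
The rule for destructive operations can be sketched as a classification of the cachelines a region touches. This is only a model under the same assumptions as before (exclusive upper bound, hypothetical function name); it shows that at most two lines, the first and the last, can be partial.

```python
def classify_lines(base, bound, line_size):
    """Split the cachelines touched by [base, bound) into full and partial.

    Fully covered lines may receive the destructive operation; partial
    lines must be flushed or prefetched instead. Bound is assumed exclusive.
    """
    full, partial = [], []
    start = base - (base % line_size)
    for line in range(start, bound, line_size):
        if line >= base and line + line_size <= bound:
            full.append(line)
        else:
            partial.append(line)
    return full, partial


full, partial = classify_lines(0x1004, 0x10C4, 64)
print([hex(a) for a in full], [hex(a) for a in partial])
```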

If the upper and lower bounds are specified by the same register, the
smallest region that can be affected that includes the lower bound is
affected if the operation is non-destructive; destructive operations are
no-ops. Otherwise, the upper bound must be greater than the lower bound
and the contrary case is reserved. (Issue for discussion: what happens
if the reserved case is executed?)

In general, this proposal uses "cacheline" to describe the hardware
granularity for an operation that affects multiple words of memory or
address space. Where these operations are implemented using traditional
caches, the use of the term "cacheline" is entirely accurate, but this
proposal does not prevent implementations from using other means to
implement these instructions.

Instructions in MISC-MEM/REGION may be implemented as no-ops if an
implementation lacks the corresponding hardware. The value produced in
this case is the base address.

The new MISC-MEM/REGION space will have room for 128 opcodes, one of
which is the existing FENCE.I. I propose:

[for draft 3, the function code assignments have changed to better group
prefetch operations]
[for draft 4, most of the mnemonics have been shortened and now indicate
that these instructions affect the memory subsystem]

===Fences===

[function 7'b0000000 is the existing FENCE.I instruction]

[function 7'b0000001 reserved]

FENCE.RD ("range data fence")
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000010}
Perform a conservative fence affecting only data accesses to the
chosen region. This instruction always has visible effects on memory
consistency and is therefore synchronous in all cases.

FENCE.RI ("range instruction fence")
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000011}
Perform equivalent of FENCE.I but affecting only instruction fetches
from the chosen region. This instruction always has visible effects on
memory consistency and is therefore synchronous in all cases.

===Non-destructive cache control===

====Prefetch====

All prefetch instructions ignore page faults and other access faults.
In general use, applications should use rd == x0 for prefetching,
although this is not required. If a fault occurs during a synchronous
prefetch (rd != x0), the operation must terminate and produce the
faulting address. A fault occurring during an asynchronous prefetch (rd
== x0) may cause the prefetching to stop or the implementation may
attempt to continue prefetching past the faulting location.
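
The synchronous case can be modeled roughly as follows. The is_mapped callback stands in for whatever translation and protection checks the hardware performs, and reporting the fault at cacheline granularity is an assumption of this sketch, not something the draft specifies.

```python
def sync_prefetch(base, bound, line_size, is_mapped):
    """Model a synchronous prefetch (rd != x0).

    Returns (list of prefetched line addresses, produced value). On a
    fault the operation stops without trapping and produces the faulting
    address. is_mapped is a hypothetical stand-in for the MMU check.
    """
    fetched = []
    addr = base - (base % line_size)
    while addr < bound:
        if not is_mapped(addr):
            return fetched, max(addr, base)
        fetched.append(addr)
        addr += line_size
    return fetched, addr


# Example: addresses at or above 0x2000 are unmapped.
lines, produced = sync_prefetch(0x1F80, 0x2080, 64, lambda a: a < 0x2000)
print([hex(a) for a in lines], hex(produced))
```

Software that sees a produced value below its requested bound can treat the remainder of the region as not prefetched and carry on.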

MEM.PF0 - MEM.PF3 ("prefetch, levels 0 - 3")
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001000}
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001001}
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001010}
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001011}
Load as much of the chosen region as possible into the data cache,
with varying levels of expected temporal access locality. The number in
the opcode is proportionate to the expected frequency of access to the
prefetched data: MEM.PF3 is for data that will be very heavily used.

MEM.PF.EXCL ("prefetch exclusive")
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001100}
Load as much of the chosen region as possible into the data cache,
with the expectation that future stores will soon occur to this region.
In a cache-coherent system, any locks required for writing the affected
cachelines should be acquired.

[function 7'b0001101 reserved for a future prefetch operation]

MEM.PF.ONCE ("prefetch once")
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001110}
Prefetch as much of the region as possible, but expect the prefetched
data to be used at most once. This operation may activate a prefetch
unit and prefetch the region incrementally if rd is x0. Software is
expected to access the region sequentially, starting at the base address.

MEM.PF.TEXT ("prefetch program text")
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001111}
Load as much of the chosen region as possible into the instruction cache.

====Cacheline pinning====

??? Issue for discussion: should a page fault while pinning cachelines
cause a trap to be taken?
??? Issue for discussion: what if another processor attempts to write
to an address in a cacheline pinned on this processor?

CACHE.PIN
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0010000}
Arrange for as much of the chosen region as possible to be accessible
with minimal delay and no traffic to main memory. Pinning a region is
idempotent and an implementation may pin a larger region than requested,
provided that an unpin operation with the same base and bound will also
unpin the larger region.
One possible implementation is to load as much of the chosen region as
possible into the data cache and keep it there until unpinned. Another
implementation is to configure a scratchpad RAM and map it over at least
the chosen region, preloading it with data from main memory.
Scratchpads may be processor-local, but writes to a scratchpad mapped
with CACHE.PIN must be visible to other nodes in a coherent system.
Implementations are expected to ensure that pinned cachelines will not
impede the efficacy of a cache. Implementations with fully-associative
caches may permit any number of pins, provided that at least one
cacheline remains available for normal use. Implementations with N-way
set associative caches may support pinning up to (N-1) ways within each
set, provided that at least one way in each set remains available for
normal use. Implementations with direct-mapped caches should not pin
cachelines, but may still use CACHE.PIN to configure an overlay
scratchpad, which may itself use storage shared with caches, such that
mapping the scratchpad decreases the size of the cache.

Implementations may support both cacheline pinning and scratchpads,
choosing which to use to perform a CACHE.PIN operation in an
implementation-defined manner.
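
One way to picture the pinning limits for a set-associative cache is the toy model below. It tracks only which tags are pinned in a single set and is not meant to describe any real cache; the class and method names are hypothetical.

```python
class CacheSet:
    """One set of an N-way set-associative cache, tracking pinned tags only."""

    def __init__(self, ways):
        self.ways = ways
        self.pinned = set()

    def pin(self, tag):
        """Try to pin a line; returns True if the line is pinned afterwards."""
        if tag in self.pinned:
            return True                        # pinning is idempotent
        if len(self.pinned) >= self.ways - 1:
            return False                       # keep one way for normal use
        self.pinned.add(tag)
        return True

    def unpin(self, tag):
        """Unpinning always succeeds, even if the line was never pinned."""
        self.pinned.discard(tag)


four_way = CacheSet(4)
results = [four_way.pin(t) for t in (0xA, 0xB, 0xC, 0xD)]
print(results)
```

With ways = 1 the limit is zero, which matches the recommendation that direct-mapped caches should not pin cachelines.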

CACHE.UNPIN
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0010001}
Explicitly release a pin set with CACHE.PIN. Pinned regions are also
implicitly released if the memory protection and virtual address mapping
is changed. (Specifically, a write to the current satp CSR or an
SFENCE.VM will unpin all cachelines as a side effect, unless the
implementation partitions its cache by ASID. Even with ASID-partitioned
caches, changing the root page number associated with an ASID unpins all
cachelines belonging to that ASID.) Unpinning a region does not
immediately remove it from the cache. Unpinning a region always
succeeds, even if parts of the region were not pinned. For an
implementation that implements CACHE.PIN using scratchpad RAM, unpinning
a region that uses a scratchpad causes the current contents of the
scratchpad to be written to main memory.

And two M-mode-only privileged instructions:

CACHE.PIN.I
{opcode, funct3, funct7, MODE} = {$MISC-MEM, $REGION, 7'b1010000, 2'b11}
Arrange for code to execute from as much of the chosen region as
possible without traffic to main memory. Pinning a region is idempotent.

CACHE.UNPIN.I
{opcode, funct3, funct7, MODE} = {$MISC-MEM, $REGION, 7'b1010001, 2'b11}
Release resources pinned with CACHE.PIN.I. Pins are idempotent. One
unpin instruction will unpin the chosen region completely, regardless of
how many times it was pinned. Unpinning always succeeds.

====Flush====

CACHE.WRITEBACK
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000100}
Write back any cachelines in the requested region.

CACHE.FLUSH
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000101}
Write any cachelines in the requested region (as by CACHE.WRITEBACK),
marking them invalid afterwards (as by MEM.DISCARD). Flushed cachelines
are automatically unpinned.

Rationale for including CACHE.FLUSH: small implementations may
significantly benefit from combining CACHE.WRITEBACK and MEM.DISCARD;
the implementations that most benefit lack the infrastructure to achieve
such combination by macro-op fusion.


===Declaring data obsolescence===

These operations declare data to be obsolete and unimportant. In
fully-coherent systems, they are two sides of the same coin:
MEM.DISCARD declares data not yet written to main memory to be obsolete,
while MEM.REWRITE declares data in main memory to be obsolete and
indicates that software on this hart will soon overwrite the region.
These operations are useful in general: a function prologue could use
MEM.REWRITE to allocate a stack frame, while a function epilogue could
use MEM.DISCARD to release a stack frame without requiring the
now-obsolete local variables ever be written back to main memory. In
non-coherent systems, MEM.DISCARD may also be an important tool for
software-enforced coherency, since its semantics provide an invalidate
operation on all caches on the path between a hart and main memory.

The declarations of obsolescence produced by these instructions are
global and affect all caches on the path between a hart and main memory
and all caches coherent with those caches, but are not required to
affect non-coherent caches not on the direct path between a hart and
main memory. Implementations depending on software to maintain
coherency in such situations must provide other means (MMIO control
registers, for example) to force invalidations in remote non-coherent
caches.

These instructions create regions with undefined contents and share a
requirement that foreign data never be introduced. Foreign data is,
simply, data that was not previously visible to the current hart at the
current privilege level at any address. Operating systems zero pages
before attaching them to user address spaces to prevent foreign data
from appearing in freshly-allocated pages. Implementations must ensure
that these instructions do not cause foreign data to leak through caches
or other structures.

MEM.DISCARD
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000110}
Declare the contents of the region obsolete, dropping any copies
present in the processor without performing writes to main memory. The
contents of the region are undefined after the operation completes, but
shall not include foreign data.
If the region does not align with cacheline boundaries, any partial
cachelines are written back. If hardware requires such, the full
contents of a cacheline partially included may be written back,
including data just declared obsolete. In a non-coherent system,
partial cachelines written back are also invalidated. In a system with
hardware cache coherency, partial cachelines must be written back, but
may remain valid.
Any cachelines fully affected by MEM.DISCARD are automatically unpinned.
NOTE WELL: MEM.DISCARD is *not* an "undo" operation for memory writes
-- an implementation is permitted to aggressively writeback dirty
cachelines, or even to omit caches entirely. *ANY* combination of "old"
and "new" data may appear in the region after executing MEM.DISCARD.

MEM.REWRITE
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000111}
Declare the contents of the region obsolete, indicating that the
current hart will soon overwrite the entire region. Reading any datum
from the region before the current hart has written that datum (or other
data fully overlapping that datum) is incorrect behavior and produces
undefined results, but shall not return foreign data. Note that
undefined results may include destruction of nearby data. For optimal
performance, software should write the entire region before reading any
part of the region and should do so sequentially, starting at the base
address and moving towards the address produced by MEM.REWRITE.
TLB fills occur normally as for writes to the region and must appear
to occur sequentially, starting at the base address. A page fault in
the middle of the region causes the operation to stop and produce the
faulting address. A page fault at the base address causes a page fault
trap to be taken.
Implementations with coherent caches should arrange for all cachelines
in the region to be in a state that permits the current hart to
immediately overwrite the region with no further delay. In common
cache-coherency protocols, this is an "exclusive" state.
An implementation may have a maximum size of a region that can have a
rewrite pending. If software declares intent to overwrite a larger
region than the implementation can prepare at once, the operation must
complete partially and return the first address beyond the region
immediately prepared for overwrite. Software is expected to overwrite
the region prepared, then iterate for the next part of the region that
software intends to overwrite until the entire larger region is overwritten.
If the region does not align with cacheline boundaries, any partial
cachelines are prefetched as by MEM.PF.EXCL. If hardware requires such,
the full contents of a cacheline partially included may be loaded from
memory, including data just declared obsolete.
NOTE WELL: MEM.REWRITE is *not* memset(3) -- any portion of the
region prepared for overwrite already present in cache will *retain* its
previously-visible contents.
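
The iteration expected of software when an implementation can only prepare part of a region can be sketched like this. The mem_rewrite function is a stand-in for the instruction, and max_prepare is a hypothetical implementation limit; cacheline alignment and page faults are ignored here.

```python
def mem_rewrite(base, bound, max_prepare):
    """Stand-in for MEM.REWRITE: returns the first address beyond the
    part of [base, bound) that was prepared for overwrite."""
    return min(bound, base + max_prepare)


def fill_region(memory, base, bound, value, max_prepare=256):
    """Overwrite [base, bound) sequentially, one prepared chunk at a time.

    Returns the number of MEM.REWRITE rounds that were needed.
    """
    addr = base
    rounds = 0
    while addr < bound:
        end = mem_rewrite(addr, bound, max_prepare)
        for i in range(addr, end):   # write the prepared part before reading it
            memory[i] = value
        addr = end
        rounds += 1
    return rounds


memory = bytearray(1024)
rounds = fill_region(memory, 100, 900, 0xAA)
print(rounds)
```

The loop only terminates if each round makes forward progress, so this sketch assumes max_prepare is greater than zero.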

MEM.REWRITE appears to be a relatively novel operation and previous
iterations of this proposal have produced considerable confusion. While
the above semantics are the required behavior, there are different ways
to implement them. One simple option is to temporarily mark the region
as "write-through" in internal configuration. Another option is to
allocate cachelines, but either retain their previous contents (provided
that the implementation can *prove* that those contents are *not*
foreign data) or load the allocated cachelines with zero instead of
fetching contents from memory. A third option is to track whether
cachelines have been overwritten and use a monitor trap to zero
cachelines that software attempts to invalidly read. A fourth option is
to provide dedicated write-combining buffers for MEM.REWRITE.
In systems that implement MEM.REWRITE using cache operations,
MEM.REWRITE allocates cachelines, marking them "valid, exclusive, dirty"
and filling them without reading from main memory while abiding by the
requirements to avoid introducing foreign data. Other cachelines may be
evicted to make room if needed but implementations should avoid evicting
data recently fetched with MEM.PF.ONCE, as software may intend to copy
that data into the region. Implementations are recommended to permit at
most half of a data cache to be allocated for MEM.REWRITE if data has
been recently prefetched into the cache to aid in optimizing memcpy(3),
but may permit the full data cache to be used to aid in optimizing
memset(3). In particular, an active asynchronous MEM.PF.ONCE ("active"
meaning that the data prefetched has not yet been read) can be taken as
a hint that MEM.REWRITE is preparing to copy data and should use at most
half or so of the data cache.



=== ===

Thoughts?

Thanks to:
[draft 1]
Bruce Hoult for citing a problem with the HiFive board that inspired
the I-cache pins.
[draft 2]
Stefan O'Rear for suggesting the produced value should point to the
first address after the affected region.
Alex Elsayed for pointing out serious problems with expanding the
region for a destructive operation and suggesting that "backwards"
bounds be left reserved.
Guy Lemieux for pointing out that pinning was insufficiently specified.
Andrew Waterman for suggesting that MISC-MEM/REGION could be encoded
around the existing FENCE.I instruction.
Allen Baum for pointing out the incomplete handling of page faults.
[draft 3]
Guy Lemieux for raising issues that inspired renaming PREZERO to RELAX.
Chuanhua Chang for suggesting that explicit invalidation should unpin
cachelines.
Guy Lemieux for being persistent asking for CACHE.FLUSH and giving
enough evidence to support that position.
Guy Lemieux and Andrew Waterman for discussion that led to rewriting a
more general description for pinning.
[draft 4]
Guy Lemieux for suggesting that CACHE.WRITE be renamed CACHE.WRITEBACK.
Allen J. Baum and Guy Lemieux for suggestions that led to rewriting the
destructive operations.
[draft 5]
Allen Baum for offering a clarification for the case of using the same
register for both bounds to select a minimal-length region.


-- Jacob

Aaron Severance

Mar 8, 2018, 6:07:09 PM
to jcb6...@gmail.com, RISC-V ISA Dev
Thanks Jacob.

As a general point of confusion, the memory model spec states that future extensions may include cache management instructions but that they should be treated as hints, not functional requirements.  For correctness it specifies that a (possibly range-limited) fence must be used; the example given is "fence rw[addr],w[addr]" for writeback.

I take this to mean that, with non-coherent caches/DMA, on a fence with W in the predecessor set cached data needs to be written out to memory, and on a fence with R in the successor set the cache needs to be flushed.  I'm not sure how useful the synchronous versions of WRITEBACK/FLUSH are then.

The WRITEBACK and FLUSH instructions then seem mostly useful in their asynchronous form to initiate a writeback/flush early because a fence is needed to ensure correctness.  As an example if working with a buffer shared with another non-coherent master then after you finish with a buffer: 1) do an asynchronous CACHE.FLUSH instruction on its addresses 2) do some other work 3) when the buffer is needed again by another hart or DMA device do a fence.

Anyway, the points that I think should be clarified are:
  1) Is a fence required for correctness when using the CACHE.WRITEBACK/CACHE.FLUSH operations?
  2) Can CACHE.WRITEBACK, CACHE.FLUSH, MEM.DISCARD, and MEM.REWRITE be implemented as no-ops even on hardware with non-coherent caches?  I assume the cache pinning and prefetching ops can.  WRITEBACK/FLUSH depend on whether fences are required for correctness.  DISCARD/REWRITE are more problematic but I would think they can be as long as fences are required for correctness.


More notes inline.
  Aaron

On Wed, Mar 7, 2018 at 8:19 PM Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> Previous discussions suggested that explicit cache-control instructions
> could be useful, but RISC-V has some constraints here that other
> architectures lack, namely that caching must be transparent to the user ISA.
>
> I propose a new minor opcode REGION := 3'b001 within the existing
> MISC-MEM major opcode.  Instructions in REGION are R-type and use rs1 to
> indicate a base address, rs2 to indicate an upper bound address, and
> produce a result in rd that is the first address after the highest
> address affected by the operation.  If rd is x0, the instruction has no
> directly visible effects and can be executed entirely asynchronously as
> an implementation option.

Is rs2's upper bound inclusive?

Regarding having rd write back the first address after the highest address affected by the operation:
  This wording is a bit confusing; even if there is no data in the cache in the specified range those addresses are 'affected'.  Not sure what better wording would be though...
  Is this always rs2 (or rs2+1) or can it be arbitrarily higher?
  I believe this is meant to allow partial completion, where the operation is restarted from the address returned by rd.

Assuming partial completion is allowed:
    Is forward progress required? i.e. must rd be >= rs1 + 1?
    0 must be a valid return value (if the region goes to the highest addressable value).
    I would suggest that rd must be >= the first address after the highest address affected by the operation.
      An implementation that always fully completes could then return 0.
      No-op implementations could also always return 0.
    Does this apply to FENCE.RD/FENCE.RI? It seems problematic to have FENCE.RD/FENCE.RI partially complete and return, and FENCE/FENCE.I must fully complete anyway.
Again I would suggest 0 here, assuming an implementation can return an arbitrarily high value (up to the end of memory wrapping back around to 0).  This would signal that the operation completed, unless there's some reason to signal to the software that the operation was skipped due to lack of hardware (I don't think there is).
 


> The new MISC-MEM/REGION space will have room for 128 opcodes, one of
> which is the existing FENCE.I.  I propose:
>
> [for draft 3, the function code assignments have changed to better group
> prefetch operations]
> [for draft 4, most of the mnemonics have been shortened and now indicate
> that these instructions affect the memory subsystem]
>
> ===Fences===
>
> [function 7'b0000000 is the existing FENCE.I instruction]
>
> [function 7'b0000001 reserved]
>
> FENCE.RD ("range data fence")
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000010}
>   Perform a conservative fence affecting only data accesses to the
> chosen region.  This instruction always has visible effects on memory
> consistency and is therefore synchronous in all cases.

Does "only data accesses" mean it has the same effects as a "fence rw, rw"?
CACHE.FLUSH is also non-destructive as a single op; CACHE.WRITEBACK+MEM.DISCARD need be executed as an atomic pair.  Not that that's a huge burden.
I think the wording should be changed to something like 'data in the region need not be written to main memory'.  Flushing the cache should be a valid implementation; discarding is just a performance optimization.
 

Albert Cahalan

Mar 9, 2018, 3:57:02 AM
to jcb6...@gmail.com, RISC-V ISA Dev
On 3/7/18, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

> ====Cacheline pinning====
>
> ??? Issue for discussion: should a page fault while pinning cachelines
> cause a trap to be taken?

This should not be an issue. Besides the fact that the MMU is
most likely disabled, the actual filling in of the data shouldn't
have to happen until an access occurs.

There are two use cases I have seen:

1. before DRAM has been set up
2. as scratch space for highly optimized FFT libraries

When setting up DRAM, the MMU won't yet have been set up.
Typically one might ask the cache to retain all writes to firmware,
making locations in flash or ROM seem writable. Stuff breaks if you
write more data than the size of the cache, and this is OK because
the code will not do that. After DRAM is running, all that data needs
to disappear. It obviously can't be written to ROM, and it isn't needed.

When running highly optimized FFT libraries, hardware-specific
knowledge is built into the libraries. Systems that are optimized
to this level tend to run without security distinctions, so mapping
the cache rwx at the same address in every task is likely acceptable.

> ??? Issue for discussion: what if another processor attempts to write
> to an address in a cacheline pinned on this processor?

If they share caches, they see the same data. If they don't share
caches, they see different data. This data is never flushed to RAM.

> CACHE.PIN
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0010000}
> Arrange for as much of the chosen region as possible to be accessible
> with minimal delay and no traffic to main memory. Pinning a region is
> idempotent and an implementation may pin a larger region than requested,
> provided that an unpin operation with the same base and bound will also
> unpin the larger region.
> One possible implementation is to load as much of the chosen region as
> possible into the data cache and keep it there until unpinned. Another
> implementation is to configure a scratchpad RAM and map it over at least
> the chosen region, preloading it with data from main memory.
> Scratchpads may be processor-local, but writes to a scratchpad mapped
> with CACHE.PIN must be visible to other nodes in a coherent system.

No, pinned cache should not be coherent.

> Implementations are expected to ensure that pinned cachelines will not
> impede the efficacy of a cache. Implementations with fully-associative
> caches may permit any number of pins, provided that at least one
> cacheline remains available for normal use. Implementations with N-way
> set associative caches may support pinning up to (N-1) ways within each
> set, provided that at least one way in each set remains available for
> normal use. Implementations with direct-mapped caches should not pin
> cachelines, but may still use CACHE.PIN to configure an overlay
> scratchpad, which may itself use storage shared with caches, such that
> mapping the scratchpad decreases the size of the cache.

These restrictions are not required. If the user ends up without
any normal cache, oh well... this is what they chose to do.

> CACHE.UNPIN
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0010001}
> Explicitly release a pin set with CACHE.PIN. Pinned regions are also
> implicitly released if the memory protection and virtual address mapping
> is changed. (Specifically, a write to the current satp CSR or an
> SFENCE.VM will unpin all cachelines as a side effect, unless the
> implementation partitions its cache by ASID. Even with ASID-partitioned
> caches, changing the root page number associated with an ASID unpins all
> cachelines belonging to that ASID.) Unpinning a region does not
> immediately remove it from the cache. Unpinning a region always
> succeeds, even if parts of the region were not pinned. For an

I don't believe the MMU should be a concern. It should be ignored,
assuming it is even enabled at all.

> implementation that implements CACHE.PIN using scratchpad RAM, unpinning
> a region that uses a scratchpad causes the current contents of the
> scratchpad to be written to main memory.

Under no condition should the contents ever be written to main memory.
In the case of early hardware initialization, the writeback would be going
to flash memory or even a ROM. This could cause unintended changes.
In the case of an optimized FFT library, the writeback would likely have
no place to go at all -- there is nothing to back it at all.

> ===Declaring data obsolescence===
>
> These operations declare data to be obsolete and unimportant.

I hope that a hypervisor can secretly disable these.

> These operations are useful in general: a function prologue could use
> MEM.REWRITE to allocate a stack frame, while a function epilogue could
> use MEM.DISCARD to release a stack frame without requiring the
> now-obsolete local variables ever be written back to main memory.

There is something to be said for having this happen automatically.

John Hauser

Mar 9, 2018, 11:32:45 PM
to RISC-V ISA Dev
Thanks, Jacob, for shepherding the discussion on this topic and
generating this draft.

I'd like to propose some renaming and reorganizing, although mostly
keeping the same spirit, I hope.

In this message, I'll leave the FENCE and CACHE.PIN/UNPIN instructions
untouched.  For the others, I first suggest a different set of names, as
follows:

    MEM.PFn     -> MEM.PREP.R     (reads)
    MEM.PF.EXCL -> MEM.PREP.RW    (reads/writes)
    MEM.PF.ONCE -> MEM.PREP.INCR  (reads at increasing addresses)
    MEM.PF.TEXT -> MEM.PREP.I     (instruction execution)
    MEM.REWRITE -> MEM.PREP.W (writes) or
                    MEM.PREP.INCW (writes at increasing addresses)

    CACHE.WRITEBACK -> CACHE.CLEAN
    CACHE.FLUSH     -> CACHE.FLUSH  (name unchanged)
    MEM.DISCARD     -> CACHE.INV

The "prep" instructions (MEM.PREP.*) are essentially hints for what
the software expects to do, giving the hardware the option to prepare
accordingly---"prep" being short for _prepare_.  Note that, when rd
isn't x0, these instructions aren't true RISC-V hints, because the
hardware at a minimum must still write rd, though it need not do
anything else.  I'm currently proposing collapsing the four MEM.PFn
instructions down to one, although I'm open to further discussion about
that.  I'm also proposing a major overhaul of MEM.REWRITE, reconceiving
it as two different MEM.PREP instructions that can optionally be used in
conjunction with CACHE.INV.

The explicit cache control instructions (CACHE.*) are fairly standard.
These cannot be trivialized in the same way as the first group, unless
the entire memory system is known to be coherent (including any device
DMA).

I've broken up Jacob's "data obsolescence" section that has MEM.REWRITE
and MEM.DISCARD together, with the consequence that some of the text
there would no longer be correct.

--------------------
Memory access hints

I'm proposing to collapse Jacob's four levels of MEM.PFn prefetch
instructions into a single MEM.PREP.R, because I don't see how software
will know which of the four levels to use in any given circumstance.  If
a consensus could be developed for heuristics to guide this decision for
programmers, tools, and hardware, then I could perhaps see the value of
having multiple levels.  In the absence of such guidance, I see at best
a chicken-and-egg problem between software and hardware, where nobody
agrees or understands exactly what the different levels should imply.
In my view, the Foundation shouldn't attempt to standardize multiple
prefetch levels unless it's prepared to better answer this question.

It's still always possible for individual implementations to have their
own custom instructions for different levels of prefetch, if they see a
value in having their own custom answer to the question.

I propose some minor tweaks to how the affected region is specified.  My
version is as follows:  Assume A and B are the unsigned integer values
of operands rs1 and rs2.  If A < B, then the region covers addresses A
through B - 1, inclusive.  If B <= A, the region is empty.  (However,
note that, because these MEM.PREP instructions act only as hints, an
implementation can adjust the bounds of the affected region to suit its
purposes without officially breaking any rules.)
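
The region rule above can be written out as a small sketch. It takes the operand values as already-unsigned integers; the function name is hypothetical, and a real implementation remains free to adjust the bounds since these instructions are hints.

```python
def prep_region(a, b):
    """Addresses covered by a MEM.PREP region for unsigned operand values
    a (from rs1) and b (from rs2): A through B - 1 if A < B, else empty."""
    if a < b:
        return range(a, b)
    return range(0)


covered = list(prep_region(4, 8))
print(covered)
```

Treating B <= A as an empty region means no operand combination is reserved, which differs from the "backwards bounds are reserved" rule in Jacob's draft.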

Except for MEM.PREP.INCR and MEM.PREP.INCW, I'm proposing to remove
the option to have anything other than x0 for rd.  It's not clear to me
that we can realistically expect software to make use of the result that
Jacob defines, and removing it is a simplification.  If everyone feels
I'm mistaken and the loss of this feature is unacceptable, it could be
restored.

If implemented at all, the MEM.PREP instructions do not trap for any
reason, so they cannot cause page faults or access faults.  (This is no
different from what Jacob already had.)

Here I've attempted to summarize the intention proposed for each
instruction:

  MEM.PREP.I

      Software expects to execute many instructions in the region.

      Hardware might respond by attempting to prefetch the code into the
      instruction cache.

  MEM.PREP.R

      Software expects to read many bytes in the region, but not to
      write many bytes in the region.

      Hardware might respond by attempting to acquire a shared copy of
      the data (prefetch).

  MEM.PREP.RW

      Software expects both to read many bytes and to write many bytes
      in the region.

      Hardware might respond by attempting to acquire a copy of the
      data along with (temporary) exclusive rights to write the data.
      For some implementations, the effect of MEM.PREP.RW might be no
      different than MEM.PREP.R.

  MEM.PREP.INCR

      Software expects to read many bytes in the region, starting first
      with lower addresses and progressing over time into increasingly
      higher addresses, though not necessarily in strictly sequential
      order.  If software writes to the region, it expects usually to
      read the same or nearby bytes before writing.

      Hardware might respond by applying a mechanism for sequential
      prefetch-ahead.  For some implementations, this mechanism might
      be ineffective unless the region is accessed solely by reads at
      sequential, contiguous locations.

  MEM.PREP.W

      Software expects to write many bytes in the region.  If software
      reads from the region, it expects usually to first write those
      same bytes before reading.

      Hardware might respond by adjusting whether a write to a
      previously missing cache line within the region will cause the
      remainder of the line to be brought in from main memory.

  MEM.PREP.INCW

      Software expects to write many bytes in the region, starting first
      with lower addresses and progressing over time into increasingly
      higher addresses, though not necessarily in strictly sequential
      order.  If software reads from the region, it expects usually to
      first write those same bytes before reading.

      Hardware might respond by applying a mechanism for sequential
      write-behind.  For some implementations, this mechanism might
      be ineffective unless the region is accessed solely by writes at
      sequential, contiguous locations.


For the non-INC instructions (MEM.PREP.I, MEM.PREP.R, MEM.PREP.RW, and
MEM.PREP.W), if the size of the region specified is comparable to or
larger than the entire cache size at some level of the memory hierarchy,
the implementation would probably do best to ignore the prep instruction
for that cache, though it might still profit from applying the hint to
larger caches at lower levels of the memory hierarchy.

In my proposal, MEM.PREP.INCR and MEM.PREP.INCW are unique in allowing
destination rd to be something other than x0.  If a MEM.PREP.INCR/INCW
has a non-x0 destination, the implementation writes register rd with an
address X subject to these rules, where A and B are the values of rs1
and rs2 defined earlier:  If B <= A (region is empty), then X = B.
Else, if A < B (region is not empty), the value X must be such that
A < X <= B.  If the value X written to rd is less than B (which can
happen only if the region wasn't empty), then software is encouraged to
execute MEM.PREP.INCR/INCW again after it believes it is done accessing
the subregion between addresses A and X - 1 inclusive.  For this new
MEM.PREP.INCR/INCW, the value of rs1 should be X and the value of rs2
should be B as before.  The process may then repeat with the hardware
supplying a new X.

Software is not required to participate in this iterative sequence of
MEM.PREP.INCR/INCW instructions, as it can always simply give rd as x0.
However, read-ahead or write-behind might not be as effective without
this iteration.
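
The iterative use of MEM.PREP.INCR/INCW might be sketched in Python as
follows.  This is a toy model under stated assumptions: mem_prep_incr
stands in for the instruction, and its `window` parameter (how far ahead
this imaginary hardware is willing to prepare) is hypothetical, as is the
`consume` callback representing the software's own accesses:

```python
def mem_prep_incr(a, b, window=256):
    """Toy model of the value X that MEM.PREP.INCR/INCW writes to rd."""
    if b <= a:
        return b                # empty region: X = B
    return min(a + window, b)   # non-empty region: A < X <= B


def process_region(a, b, consume, window=256):
    """Walk the region in hardware-chosen chunks, re-issuing the hint.

    consume(lo, hi) represents software accessing bytes lo .. hi - 1.
    Returns the number of hint instructions issued.
    """
    hints = 0
    while a < b:
        x = mem_prep_incr(a, b, window)
        hints += 1
        consume(a, x)   # access the subregion A .. X - 1
        a = x           # next hint uses rs1 = X and rs2 = B as before
    return hints
```

Software that passes x0 for rd would simply issue one hint for the whole
region and skip the loop.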

The minimal hardware implementation of the non-INC instructions
(MEM.PREP.I, MEM.PREP.R, MEM.PREP.RW, and MEM.PREP.W) would be simply to
ignore the instructions as no-ops.  For MEM.PREP.INCR and MEM.PREP.INCW,
the minimum is to copy rs2 (B) to rd (X) and do nothing else.  Since
the non-INC instructions require rd to be x0, these minimal behaviors
can be combined, so that the only thing done for all valid MEM.PREP.*
instructions is to copy rs2 to rd (which may be x0).
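
The combined minimal behavior can be modeled in a few lines of Python.
The register-file list and the helper names write_rd and mem_prep_minimal
are assumptions made for this sketch only:

```python
def write_rd(regs, rd, value):
    """Write an integer register; x0 is hardwired to zero and ignores writes."""
    if rd != 0:
        regs[rd] = value


def mem_prep_minimal(regs, rd, rs1, rs2):
    """Minimal behavior for every valid MEM.PREP.*: copy rs2 (B) to rd (X)."""
    # rs1 (A) is deliberately unused, since no hint is acted upon.
    write_rd(regs, rd, regs[rs2])
```

Returning X = B satisfies both rules for the INC forms: X = B when the
region is empty, and A < X <= B when it is not.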

--------------------
Explicit cache control

The three cache control instructions I propose are:

    CACHE.CLEAN  (was CACHE.WRITEBACK)
    CACHE.FLUSH
    CACHE.INV    (was MEM.DISCARD)

I expect these will be familiar to anyone who has used similar
instructions on other processors, except possibly for the name "clean"
instead of "writeback" or "wb".  I'm not proposing changing the
fundamental semantics of the instructions, although I think the
description of CACHE.INV can be simplified a bit.

It's important to remember when talking about caches that writebacks
of dirty data from the cache are allowed to occur automatically at
any time, for any reason, or even for no apparent reason whatsoever.
Likewise, a cache line that isn't dirty can be automatically invalidated
at any time, with or without any apparent reason.  Therefore, our
description of CACHE.INV doesn't need to give the implementation
explicit license to flush whole cache lines to handle partial lines at
the start and end of the specified region, as it would already have the
authority to perform such flushes at will.

Here's how I might rewrite the description for CACHE.INV:

    CACHE.INV  [was MEM.DISCARD]


    Declare the contents of the region obsolete, dropping any copies
    present in the processor, without necessarily writing dirty data to
    main memory first.  The contents of the region are unspecified after
    the operation completes, but shall not include foreign data.

    Any pinned cache lines that are entirely within the region are
    automatically unpinned.

    Comment:
    Be aware that, because writebacks of dirty data in the cache can
    occur at any time, software has no guarantees that CACHE.INV will
    cause previous memory writes to be discarded.  Any combination
    of "old" and "new" data may appear in the region after executing
    CACHE.INV.

    Comment:
    If the region does not align with cache line boundaries and the
    cache is incapable of invalidating or flushing only a partial cache
    line, the implementation may need to flush the whole cache lines
    overlapping the start and end of the region, including bytes next to
    but outside the region.


The region to be cleaned/flushed/invalidated is specified the same as
I wrote above for the MEM.PREP instructions:  Assuming A and B are the
unsigned integer values of operands rs1 and rs2, then if A < B, the
region covers addresses A through B - 1, inclusive, and, conversely, if
B <= A, the region is empty.  The hardware does not guarantee to perform
the operation for the entire region at once, but instead reports its
progress in destination rd.  The value written to rd is an address X
conforming to these rules:  If B <= A (region is empty), then X = B.
Else, if A < B (region is not empty), the value X must satisfy
A <= X <= B.  If the result value X = B, then the operation is complete
for the entire region.  If instead X < B (which can only happen if
the region wasn't empty), then software must execute the same CACHE
instruction again with rs1 set to X and with rs2 set to the same B as
before.

As long as no other memory instructions are executed between each CACHE
instruction, an implementation must guarantee that the original cache
operation will complete in a finite number of iterations of this
algorithm.  It is possible for an implementation to make progress on a
cache operation yet repeatedly return the original A as the result X,
until eventually returning B.
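
The software side of this algorithm could look like the following Python
sketch.  The function cache_op is a toy stand-in for any one of the CACHE
instructions; the line size and the per-instruction work limit are
assumptions chosen only to make the model concrete:

```python
LINE_SIZE = 64     # assumed cache line size in bytes
LINES_PER_OP = 4   # assumed amount of work this toy hardware does per instruction


def cache_op(a, b):
    """Toy model of the progress value X written to rd (A <= X <= B)."""
    if b <= a:
        return b                      # empty region: X = B
    line_start = a - (a % LINE_SIZE)  # start of the line containing A
    return min(line_start + LINES_PER_OP * LINE_SIZE, b)


def run_cache_op(a, b):
    """Re-issue the operation until the whole region is done.

    Returns the number of CACHE instructions issued.
    """
    issued = 0
    while True:
        x = cache_op(a, b)
        issued += 1
        if x == b:      # X = B: operation complete for the entire region
            return issued
        a = x           # X < B: repeat with rs1 = X and the same rs2 = B
```

This toy model always advances X.  A real implementation, as noted above,
might return the original A several times before finally returning B, and
the same loop would still terminate.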

A CACHE instruction may cause a page fault for any page overlapping the
specified region.

When an implementation's entire memory system is known to be coherent,
including any device DMA, then the CACHE instructions may be implemented
in a minimal way simply by copying operand rs2 to destination rd.  On
the other hand, if the memory system is not entirely coherent, the CACHE
instructions cannot be implemented trivially, because their full effects
are needed for software to compensate for a lack of hardware-enforced
coherence.

--------------------
Concerning instructions for data obsolescence

Jacob defines two instructions for declaring data obsolescence:
MEM.DISCARD and MEM.REWRITE.  I've renamed MEM.DISCARD to CACHE.INV,
and otherwise tweaked its behavior in only minor ways.  Concerning
MEM.REWRITE, it's my belief that the proposed uses for this instruction
can each be satisfied by one of the following:

    MEM.PREP.W
    MEM.PREP.INCW
    CACHE.INV followed by MEM.PREP.W
    CACHE.INV followed by MEM.PREP.INCW

For instance, I believe Jacob's stack frame example could be rewritten
as:

    A function prologue could use CACHE.INV and MEM.PREP.W to allocate
    a stack frame, while a function epilogue could use CACHE.INV to
    release a stack frame without requiring the now-obsolete local
    variables ever be written back to main memory.

(Although it probably should be added that this wouldn't ordinarily be
advantageous to do for small stack frames.)

--------------------

That's it for now.  I invite feedback on any of the above.  I know
there's already been a lot of discussion on this topic, and I hope I
haven't overlooked any contrary conclusions from earlier.

Regards,

    - John Hauser

John Hauser

Mar 9, 2018, 11:59:47 PM3/9/18
to RISC-V ISA Dev
By the way, the Google Groups interface seems still to be a bit buggy,
so if you see some strange formatting in my previous message, that's
the reason why.  The mistakes weren't there when I pushed the button
to approve the message, but Google's machines then inserted them, the
clever bastards.  Not under my control, nor can I edit E-mail to fix it once it's
been sent.

    - John Hauser

John Hauser

Mar 10, 2018, 12:12:48 AM3/10/18
to RISC-V ISA Dev
Oh yeah, I see now, in the Google Groups archive I'm viewing, Google's
machines are automatically seeking out lines of text that match ones
from earlier messages, and then setting those lines off as quotations,
without realizing that that's having the effect of breaking up what I
wrote.  Hopefully, the E-mail text that was sent out doesn't include the
same effect.

I hope we're all preparing for the day soon when our lives are run
entirely by machines with the sense of a pea.

    - John Hauser

John Hauser

Mar 11, 2018, 4:17:27 PM3/11/18
to RISC-V ISA Dev
On further reflection, I'd like to amend my proposal a bit.

It occurs to me that there could be value to having a MEM.DISCARD
instruction separate from the CACHE.INV I defined.  I propose
MEM.DISCARD be a kind of "destructive" hint, taking a region specified
by rs1 and rs2 the same as my MEM.PREP hints.  The meaning would be:

  MEM.DISCARD  [new version]

      Software asserts that no bytes in the region will be read again
      by anyone (including other processors and devices) until first
      written.

      Hardware might respond by invalidating any cached instances of the
      relevant data.

The differences between CACHE.INV and MEM.DISCARD would be:

  - While CACHE.INV is _required_ to cause cache invalidations,
    MEM.DISCARD only _might_ do so.  An implementation is free to ignore
    MEM.DISCARD.

  - Like my various MEM.PREP.* hint instructions, MEM.DISCARD would not
    be permitted to cause any traps.

  - There would be no mechanism for iterating MEM.DISCARD.  Destination
    rd would be required to be x0, the same as most (though not all) of
    my MEM.PREP instructions.

To review, the complete set of instructions covered by my proposal would
now be:

    MEM.PREP.I
    MEM.PREP.R
    MEM.PREP.RW
    MEM.PREP.INCR
    MEM.PREP.W
    MEM.PREP.INCW

    MEM.DISCARD

    CACHE.CLEAN
    CACHE.FLUSH
    CACHE.INV

At this point, the earlier MEM.DISCARD and MEM.REWRITE have been
split into four instructions with more specific semantics:  CACHE.INV,
MEM.DISCARD, MEM.PREP.W, and MEM.PREP.INCW.

Returning to Jacob's stack frame example, it might be sensible for a
subroutine to use MEM.DISCARD to "free" its stack frame before exiting,
especially if the frame is large.  On entry to the same subroutine,
one could also use MEM.DISCARD followed by MEM.PREP.W.  However, in
practice, if subroutines were routinely using MEM.DISCARD at exit,
having MEM.DISCARD + MEM.PREP.W at entry probably wouldn't buy much.

As always, feedback is appreciated.

    - John Hauser

Rogier Brussee

Mar 12, 2018, 12:45:04 PM3/12/18
to RISC-V ISA Dev


On Sunday, 11 March 2018 21:17:27 UTC+1, John Hauser wrote:
> On further reflection, I'd like to amend my proposal a bit.
>
> It occurs to me that there could be value to having a MEM.DISCARD
> instruction separate from the CACHE.INV I defined.  I propose
> MEM.DISCARD be a kind of "destructive" hint, taking a region specified
> by rs1 and rs2 the same as my MEM.PREP hints.  The meaning would be:
>
>   MEM.DISCARD  [new version]
>
>       Software asserts that no bytes in the region will be read again
>       by anyone (including other processors and devices) until first
>       written.

Is it necessary that nothing ever reads this region, or does a MEM.DISCARD
assert that, until it starts rewriting the region, software _executing in the current hart_:

  - no longer cares what the content of the memory region is,
  - gives hardware the freedom to no longer keep the cache coherent with the content of the memory region, and
  - gives hardware the freedom to drop any cache line for the memory region?

Thus, from the point of the hint on, the software treats the region in memory as undefined, and whatever is in 
cache for the region is considered useless garbage. Clearly that means that the region should be private to the 
hart (like a stack) and other harts or devices should not be reading or writing to the region (like a stack), but perhaps 
someone finds a use for having harts or devices reading and writing asynchronously to such a region and designs some 
protocol to deal with the resulting mess. 

In any case, what is the behaviour if the hart or other harts or devices read from the region anyway?

P.S. FWIW I do think your naming and split up do make things a lot clearer. I like the name discard,
but for symmetry perhaps there should be a mem.discard and cache.discard or a mem.inv and cache.inv.
Should the C.ADDI16SP imm instruction have an implicit  MEM.DISCARD min(sp, sp + imm << 4), max(sp, sp + imm<< 4) hint?  

Tommy Thorn

Mar 14, 2018, 11:49:33 AM3/14/18
to Rogier Brussee, RISC-V ISA Dev
I have no comments on the overall discussion at this point, but I must
comment on this:

> Should the C.ADDI16SP imm instruction have an implicit MEM.DISCARD min(sp, sp + imm << 4), max(sp, sp + imm<< 4) hint?

It is essential that we maintain the property of compressed instructions that they can be expanded directly into a single 32-bit instruction. This would break that.

Tommy

Christoph Hellwig

Mar 14, 2018, 1:29:40 PM3/14/18
to Jacob Bachmeyer, RISC-V ISA Dev
Hi Jacob,

thanks for doing this! I had started drafting text for a small subset
of your instructions. Comments are mostly related to those.

On Wed, Mar 07, 2018 at 10:19:39PM -0600, Jacob Bachmeyer wrote:
> Previous discussions suggested that explicit cache-control instructions
> could be useful, but RISC-V has some constraints here that other
> architectures lack, namely that caching must be transparent to the user
> ISA.

Note that this can't always be the case. For at least two use cases
we will need cache control instructions that are not just hints:

a) persistent memory
b) cache-incoherent DMA

For persistent memory it is absolutely essential that we have a way
to force specific cache lines out to the persistence boundary. I've
started looking into porting the Linux kernel persistent memory support
and pmdk to RISC-V (so far in qemu emulation, but also looking into
rocket support), and the absolute minimum required feature is
a cache line writeback instruction, which cannot be treated as a hint.

For cache-incoherent DMA we also need a cache line invalidation
instruction that must not be a hint but guaranteed to work. I've
heard from a few folks that they'd like to mandate cache coherent
DMA for usual RISC-V systems. This sounds like a noble goal to me,
but just about every CPU architecture used in SOCs seems to grow
devices with cache-incoherent DMA sooner rather than later (due to
the fact that most SOCs are random pieces of barely debugged IP
glued together with shoe-string and paperclips). This even includes
x86 now.

> In general, this proposal uses "cacheline" to describe the hardware
> granularity for an operation that affects multiple words of memory or
> address space. Where these operations are implemented using traditional
> caches, the use of the term "cacheline" is entirely accurate, but this
> proposal does not prevent implementations from using other means to
> implement these instructions.

One of the thorny issues here is that we will have to have a way to
find out the cache line size for a given CPU.

> ====Flush====
>
> CACHE.WRITEBACK
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000100}
> Writeback any cachelines in the requested region.
>
> CACHE.FLUSH
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000101}
> Write any cachelines in the requested region (as by CACHE.WRITE), marking
> them invalid afterwards (as by MEM.DISCARD). Flushed cachelines are
> automatically unpinned.
>
> Rationale for including CACHE.FLUSH: small implementations may
> significantly benefit from combining CACHE.WRITEBACK and MEM.DISCARD; the
> implementations that most benefit lack the infrastructure to achieve such
> combination by macro-op fusion.

Yes, this is something both x86 and arm provide so it will help porting.

In terms of naming I'd rather avoid flush as a name as it is a very
overloaded term. I'd rather name the operations purely based on
'writeback' and 'invalidate', e.g. CACHE.WB, CACHE.INV and CACHE.WBINV

> These instructions create regions with undefined contents and share a
> requirement that foreign data never be introduced. Foreign data is,
> simply, data that was not previously visible to the current hart at the
> current privilege level at any address. Operating systems zero pages
> before attaching them to user address spaces to prevent foreign data from
> appearing in freshly-allocated pages. Implementations must ensure that
> these instructions do not cause foreign data to leak through caches or
> other structures.

This sounds extremely scary. Other architectures generally just define
instructions to invalidate the caches directly accessible to the hart
equivalent up to a given coherency domain (arm is particularly complicated
there, x86 just has one domain).

> MEM.DISCARD
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000110}
> Declare the contents of the region obsolete, dropping any copies present
> in the processor without performing writes to main memory. The contents of
> the region are undefined after the operation completes, but shall not
> include foreign data.

What are copies present in the processor?

> If the region does not align with cacheline boundaries, any partial
> cachelines are written back. If hardware requires such, the full contents
> of a cacheline partially included may be written back, including data just
> declared obsolete. In a non-coherent system, partial cachelines written
> back are also invalidated. In a system with hardware cache coherency,
> partial cachelines must be written back, but may remain valid.

At least for data integrity operations like invalidate (or discard) and
writeback I would much, much prefer to require the operation to be
aligned to cache lines. Pretty much any partial behavior could be
doing the wrong thing for one case or another.

> MEM.REWRITE
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000111}
> Declare the contents of the region obsolete, indicating that the current
> hart will soon overwrite the entire region. Reading any datum from the
> region before the current hart has written that datum (or other data fully
> overlapping that datum) is incorrect behavior and produces undefined
> results, but shall not return foreign data. Note that undefined results
> may include destruction of nearby data. For optimal performance, software
> should write the entire region before reading any part of the region and
> should do so sequentially, starting at the base address and moving towards
> the address produced by MEM.REWRITE.

This sounds like a really scary undefined behavior trap. What is the
use case for this instruction? Are there equivalents in other architectures?

Christoph Hellwig

Mar 14, 2018, 1:47:38 PM3/14/18
to Aaron Severance, jcb6...@gmail.com, RISC-V ISA Dev
On Thu, Mar 08, 2018 at 11:06:56PM +0000, Aaron Severance wrote:
> As a general point of confusion, the memory model spec states that future
> extensions may include cache management instructions but that they should
> be treated as hints, not functional requirements. For correctness it specs
> that a (possibly range-limited) fence must be used; the example they give
> is "fence rw[addr],w[addr]" for writeback.

Do you have a pointer to that part of the memory model? I couldn't
find anything that looks like it in the last draft on the memory-model
list.

> I take this to mean that with non-coherent caches/DMA on a fence with W in
> the predecessor set cached data needs to be written out to memory and on a
> fence with R in the successor set the cache needs to be flushed. I'm not
> sure how useful the synchronous versions of WRITEBACK/FLUSH are then.

In general to me the RISC philosophy would imply keeping fence and
writeback instructions separate, although combining them would certainly
help with code density. Especially with the very weakly ordered memory
model cache writebacks would almost always require some sort of fence
before the writeback. That being said we'd really want a ranged fence
to not entirely kill performance.

> The WRITEBACK and FLUSH instructions then seem mostly useful in their
> asynchronous form to initiate a writeback/flush early because a fence is
> needed to ensure correctness. As an example if working with a buffer
> shared with another non-coherent master then after you finish with a
> buffer: 1) do an asynchronous CACHE.FLUSH instruction on its addresses 2)
> do some other work 3) when the buffer is needed again by another hart or
> DMA device do a fence.

At least for the typical persistent memory use case, and for the DMA API as
used in Linux, we will need synchronous execution of the cache writeback
and invalidation, as it is needed ASAP.

> Anyway, the points that I think should be clarified are:
> 1) Is a fence required for correctness when using the
> CACHE.WRITEBACK/CACHE.FLUSH operations?

As I understood Jacob, it is, but making this clearer would be
very useful.

> 2) Can CACHE.WRITEBACK, CACHE.FLUSH, MEM.DISCARD, and MEM.REWRITE be
> implemented as no-ops even on hardware with non-coherent caches?

It is not just non-coherent caches (which would be horrible) but also
non-cache-coherent DMA and persistent memory. In both cases they must
not be no-ops if we want a working system.

Michael Chapman

Mar 14, 2018, 5:23:09 PM3/14/18
to Tommy Thorn, Rogier Brussee, RISC-V ISA Dev

On 14-Mar-18 22:49, Tommy Thorn wrote:
> ...
> It is essential that we maintain the property of compressed instructions that they can be expanded directly into a single 32-bit instruction.

Why?


Daniel Lustig

Mar 14, 2018, 6:21:55 PM3/14/18
to Christoph Hellwig, Aaron Severance, jcb6...@gmail.com, RISC-V ISA Dev
On 3/14/2018 10:47 AM, Christoph Hellwig wrote:
> On Thu, Mar 08, 2018 at 11:06:56PM +0000, Aaron Severance wrote:
>> As a general point of confusion, the memory model spec states that future
>> extensions may include cache management instructions but that they should
>> be treated as hints, not functional requirements. For correctness it specs
>> that a (possibly range-limited) fence must be used; the example they give
>> is "fence rw[addr],w[addr]" for writeback.
>
> Do you have a pointer to that part of the memory model? I couldn't
> find anything that looks like it in the last draft on the memory-model
> list.

Right now we're focused on getting RVWMO settled, so for now we just
say the following in the explanatory material appendix, as one of the
"Possible Future Extensions" that we expect should be made compatible
with the memory consistency model:

> Cache writeback/flush/invalidate/etc. hints, but these should be
> considered hints, not functional requirements. Any cache management
> operations which are required for basic correctness should be
> described as (possibly address range-limited) fences to comply with
> the RISC-V philosophy (see also fence.i and sfence.vma). For example,
> a functional cache writeback instruction might instead be written as
> “fence rw[addr],w[addr]”.

Really, the intention of that text (which we can clarify) is that
it should apply to normal memory: people shouldn't use cache
writeback/flush/invalidate instead of following the rules of the
normal memory consistency model. As a performance hint, sure, but
not as a functional replacement. But I/O, non-coherent DMA,
persistent memory, etc. are a different question, and may well want
all those things for actual functional correctness. The memory model
task group is mostly punting on trying to formalize all that until
after RVWMO for normal memory is settled.

The bit about using fences is not necessarily meant as a strict
correctness claim either. It's just an observation that RISC-V
already uses FENCE.I and SFENCE.VMA to describe what other
architectures might describe as "invalidate the instruction cache"
and "invalidate the TLB", respectively. So in that spirit, maybe
the instruction for "invalidate this cache (line)" could be
described as some kind of fence too. And then likewise for flush/
writeback/etc.

Or, maybe in the end that doesn't work out, and separate flush/
writeback/invalidate instructions work better. I don't think
there would be anything inherently wrong with that approach either.
Or, people may simply insist that DMA is coherent, etc., and
sidestep the question. We in the memory model TG are not really
taking any stance at the moment.

Dan

Rogier Brussee

Mar 14, 2018, 6:51:24 PM3/14/18
to RISC-V ISA Dev, rogier....@gmail.com


On Wednesday, 14 March 2018 16:49:33 UTC+1, Tommy Thorn wrote:
That is why it is a question. 

I would not even have posed the question if MEM.DISCARD were more than just a hint that may or may not
be followed by the hardware. I must admit, however, that the hint gives the hardware a little extra leeway with respect to cache coherency
that _might_ have subtle interactions with the memory model, even if that _should_ not make a difference if the instruction is used for its
intended purpose of manipulating the stack. 


Probably the question should have been framed as: 

implementing MEM.DISCARD gives you (almost?) everything you need to implement a heuristic 
equivalent to MEM.DISCARD min(sp, sp + imm), max(sp, sp + imm) when manipulating the stack before exit, i.e. on 

addi sp, sp, imm
jr ra

is this allowed?
How does this interact with the MEM.DISCARD instruction?
If such a heuristic is allowed, is a MEM.DISCARD instruction still useful?


Rogier
  

Jacob Bachmeyer

Mar 14, 2018, 9:05:23 PM3/14/18
to Aaron Severance, RISC-V ISA Dev
Aaron Severance wrote:
> Thanks Jacob.
>
> As a general point of confusion, the memory model spec states that
> future extensions may include cache management instructions but that
> they should be treated as hints, not functional requirements. For
> correctness it specs that a (possibly range-limited) fence must be
> used; the example they give is "fence rw[addr],w[addr]" for writeback.

This proposal predates the new memory model. Unfortunately, the
original motivation for these instructions means that the memory model
is simply wrong on that point: in a system without hardware coherency,
cache management *must* have functional requirements. They can be
thought of as hint-like instructions, in that caches may immediately
continue their normal operations (for example, a just-prefetched
cacheline *can* be evicted if not pinned) but the synchronous forms
block execution (or create dependencies; an out-of-order processor is
not required to serialize on them, but must meet the implied fence)
until the operation is complete.

> I take this to mean that with non-coherent caches/DMA on a fence with
> W in the predecessor set cached data needs to be written out to
> memory and on a fence with R in the successor set the cache needs to
> be flushed. I'm not sure how useful the synchronous versions of
> WRITEBACK/FLUSH are then.
>
> The WRITEBACK and FLUSH instructions then seem mostly useful in their
> asynchronous form to initiate a writeback/flush early because a fence
> is needed to ensure correctness. As an example if working with a
> buffer shared with another non-coherent master then after you finish
> with a buffer: 1) do an asynchronous CACHE.FLUSH instruction on its
> addresses 2) do some other work 3) when the buffer is needed again by
> another hart or DMA device do a fence.

Or (1) write data to buffer (2) perform synchronous CACHE.FLUSH on
buffer (3) initiate DMA disk write (or similar I/O) from buffer. The
equivalent read uses MEM.DISCARD: (1) perform synchronous MEM.DISCARD
on buffer (2) wait for DMA disk read (or similar I/O) to complete (3)
read from buffer.
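
To make the two sequences concrete, here is a toy Python model of a
non-coherent system.  Everything in it is an assumption made for
illustration: the class name ToySystem, the byte-granular write-back cache
(real caches work on whole lines), and the half-open [lo, hi) ranges used
in place of the instruction operands.  It is a sketch of the idea, not of
any real implementation:

```python
class ToySystem:
    """Main memory plus one CPU write-back cache with no hardware coherency."""

    def __init__(self):
        self.memory = {}   # address -> byte value seen by DMA devices
        self.cache = {}    # address -> (value, dirty) seen by the CPU

    def cpu_write(self, addr, value):
        self.cache[addr] = (value, True)       # stays in cache, marked dirty

    def cpu_read(self, addr):
        if addr not in self.cache:             # miss: fill from main memory
            self.cache[addr] = (self.memory.get(addr, 0), False)
        return self.cache[addr][0]

    def cache_flush(self, lo, hi):
        """Model of CACHE.FLUSH: write back dirty data, then invalidate."""
        for addr in range(lo, hi):
            entry = self.cache.pop(addr, None)
            if entry is not None and entry[1]:
                self.memory[addr] = entry[0]

    def mem_discard(self, lo, hi):
        """Model of MEM.DISCARD: drop cached copies without writing back."""
        for addr in range(lo, hi):
            self.cache.pop(addr, None)

    def dma_from_memory(self, lo, hi):
        """Device reads main memory directly (for example, a disk write)."""
        return [self.memory.get(addr, 0) for addr in range(lo, hi)]

    def dma_to_memory(self, lo, data):
        """Device writes main memory directly (for example, a disk read)."""
        for offset, value in enumerate(data):
            self.memory[lo + offset] = value
```

In this model, a device calling dma_from_memory before cache_flush sees
stale memory, and a CPU that reads a buffer after dma_to_memory without a
prior mem_discard may see its old cached copy instead of the new data.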

> Anyway, the points that I think should be clarified are:
> 1) Is a fence required for correctness when using the
> CACHE.WRITEBACK/CACHE.FLUSH operations?

The REGION operations, if synchronous, imply all relevant fences. This
is a new requirement from the base specification and has been added to
draft 6.

> 2) Can CACHE.WRITEBACK, CACHE.FLUSH, MEM.DISCARD, and MEM.REWRITE be
> implemented as no-ops even on hardware with non-coherent caches? I
> assume the cache pinning and prefetching ops can. WRITEBACK/FLUSH
> depend on if fences are required for correctness. DISCARD/REWRITE are
> more problematic but I would think they can be as long as fences are
> required for correctness.

Prefetch is prefetch -- the program does not really care if it completes
or even if the prefetched address is valid. The cache pinning
operations must either actually have their defined effect or fail,
transferring the base address in rs1 to rd to indicate that nothing was
affected. WRITEBACK/FLUSH imply the appropriate fences, as does
DISCARD. REWRITE really is an (ISA-level) no-op -- it has no directly
visible effects, but permits a microarchitectural optimization that
*does* have very visible effects if REWRITE is used improperly and a
performance benefit (elide a useless memory load) if used properly.

> More notes inline.
> Aaron
>
> On Wed, Mar 7, 2018 at 8:19 PM Jacob Bachmeyer <jcb6...@gmail.com
> <mailto:jcb6...@gmail.com>> wrote:
>
> Previous discussions suggested that explicit cache-control
> instructions
> could be useful, but RISC-V has some constraints here that other
> architectures lack, namely that caching must be transparent to the
> user ISA.
>
> I propose a new minor opcode REGION := 3'b001 within the existing
> MISC-MEM major opcode. Instructions in REGION are R-type and use
> rs1 to
> indicate a base address, rs2 to indicate an upper bound address, and
> produce a result in rd that is the first address after the highest
> address affected by the operation. If rd is x0, the instruction
> has no
> directly visible effects and can be executed entirely
> asynchronously as
> an implementation option.
>
> Is rs2's upper bound inclusive?

Yes -- rs1 and rs2 may be the same value (even the same register) to
affect a single hardware granule.

> Regarding having rd write back the first address after the highest
> address affected by the operation:
> This wording is a bit confusing; even if there is no data in the
> cache in the specified range those addresses are 'affected'. Not sure
> what better wording would be though...
> Is this always rs2 (or rs2+1) or can it be arbitrarily higher?
> I believe this is meant to allow partial completion, where the
> operation is restarted from the address returned by rd.

The purpose is to allow easy looping over a larger region than the
hardware can handle with these operations: use the address produced as
the base address for the next loop iteration.

The proposal states that a requested region may be expanded on both ends
to meet hardware granularity requirements. The result is the first
address after the expanded region. Was this unclear and can you suggest
better wording?
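
A small Python model of the loop may help here. It is a sketch only:
region_op and apply_to_region are hypothetical names, the granule size
of 64 bytes and the limit of 4 granules per instruction are assumed
values, and address-space wraparound is ignored.

```python
LINE = 64       # assumed hardware granule in bytes (hypothetical value)
MAX_LINES = 4   # assumed number of granules one instruction can handle

def region_op(base, bound, line=LINE, max_lines=MAX_LINES):
    """Model the value produced in rd by a non-destructive REGION operation.

    base and bound are inclusive, like rs1 and rs2.  The region is expanded
    on both ends to granule boundaries, at most max_lines granules are
    processed, and the result is the first address after the last granule
    affected.  Address-space wraparound is ignored in this sketch.
    """
    start = base - base % line             # expand downwards to a boundary
    end = bound - bound % line + line      # first address after expanded region
    return min(start + max_lines * line, end)

def apply_to_region(base, bound):
    """Cover [base, bound] by feeding each result back in as the next base."""
    steps = []
    while base <= bound:
        produced = region_op(base, bound)
        steps.append((base, produced))
        base = produced
    return steps

# Each tuple is (base passed in rs1, address produced in rd).
steps = apply_to_region(100, 700)
```

With these assumed parameters the request for 100..700 is covered in
three steps, and the final produced address (704) shows the expansion
past the requested upper bound to the next granule boundary.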

> Assuming partial completion is allowed:
> Is forward progress required? i.e. must rd be >= rs1 + 1?
No, an operation can fail.
> 0 must be a valid return value (if the region goes to the highest
> addressable value).
> I would suggest that rd must be >= the first address after the
> highest address affected by the operation.
Permitting a series of operations to "skip" addresses prevents the easy
loop mentioned above.
> Then an implementation that always fully completes could then
> return 0.
> No-op implementations could also always return 0.
A no-op implementation must return the base address, indicating that
nothing was done.
> Does this apply to FENCE.RD/FENCE.RI? It seems problematic to have
> FENCE.RD/FENCE.RI partially complete and return, and FENCE/FENCE.I
> must fully complete anyway.

FENCE.I does not use its registers, all of which are required to be x0.

The ranged fences will be required to fully complete in draft 6.
Returning the base address is equally simple in hardware (copy rs1 vs
copy x0) and allows software to know that nothing has actually
happened. Generally, conflating failure (nothing done) with success
(complete!) *really* rubs me the wrong way. Software can always either
ignore the return value or issue the instruction asynchronously if it
truly does not care.

>
> The new MISC-MEM/REGION space will have room for 128 opcodes, one of
> which is the existing FENCE.I. I propose:
>
> [for draft 3, the function code assignments have changed to better
> group
> prefetch operations]
> [for draft 4, most of the mnemonics have been shortened and now
> indicate
> that these instructions affect the memory subsystem]
>
> ===Fences===
>
> [function 7'b0000000 is the existing FENCE.I instruction]
>
> [function 7'b0000001 reserved]
>
> FENCE.RD ("range data fence")
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000010}
> Perform a conservative fence affecting only data accesses to the
> chosen region. This instruction always has visible effects on memory
> consistency and is therefore synchronous in all cases.
>
>
> Does "only data accesses" mean it has the same effects as a "fence rw,
> rw"?

The ranged fences are conservative, which means that they are equivalent
to "FENCE rwio,rwio" for all addresses in the range. The instruction
fetch unit and load/store unit are permitted to have separate paths to
memory, so FENCE.RD affects the path from the load/store unit to main
memory, while FENCE.RI affects the path from the instruction fetch unit
to main memory.
Only atomic with respect to other accesses to the same region on the
same path to main memory as this hart. For the small implementations
that drove the inclusion of CACHE.FLUSH, there is probably only a single
hart.
Is "invalidate cache" never specifically required?

-- Jacob

Jacob Bachmeyer

Mar 14, 2018, 9:15:40 PM3/14/18
to Christoph Hellwig, RISC-V ISA Dev
The proposed instructions are not hints. This will be explicit in draft 6.

>> In general, this proposal uses "cacheline" to describe the hardware
>> granularity for an operation that affects multiple words of memory or
>> address space. Where these operations are implemented using traditional
>> caches, the use of the term "cacheline" is entirely accurate, but this
>> proposal does not prevent implementations from using other means to
>> implement these instructions.
>>
>
> One of the thorny issues here is that we will have to have a way to
> find out the cache line size for a given CPU.
>

The intent for REGION opcodes is that the instruction specifies an
extent of memory that is to be affected. The actual hardware
granularity is not relevant.

>> ====Flush====
>>
>> CACHE.WRITEBACK
>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000100}
>> Writeback any cachelines in the requested region.
>>
>> CACHE.FLUSH
>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000101}
>> Write any cachelines in the requested region (as by CACHE.WRITEBACK),
>> marking them invalid afterwards (as by MEM.DISCARD). Flushed cachelines
>> are automatically unpinned.
>>
>> Rationale for including CACHE.FLUSH: small implementations may
>> significantly benefit from combining CACHE.WRITEBACK and MEM.DISCARD; the
>> implementations that most benefit lack the infrastructure to achieve such
>> combination by macro-op fusion.
>>
>
> Yes, this is something both x86 and arm provide so it will help porting.
>
> In terms of naming I'd rather avoid flush as a name as it is a very
> overloaded term. I'd rather name the operations purely based on
> 'writeback' and 'invalidate', e.g. CACHE.WB, CACHE.INV and CACHE.WBINV
>

Can you explain the plausible meanings of "flush" that could create
confusion? I had believed it to be a good synonym for
"writeback-and-invalidate".

>> These instructions create regions with undefined contents and share a
>> requirement that foreign data never be introduced. Foreign data is,
>> simply, data that was not previously visible to the current hart at the
>> current privilege level at any address. Operating systems zero pages
>> before attaching them to user address spaces to prevent foreign data from
>> appearing in freshly-allocated pages. Implementations must ensure that
>> these instructions do not cause foreign data to leak through caches or
>> other structures.
>>
>
> This sounds extremely scary. Other architectures generally just define
> instructions to invalidate the caches directly accessible to the hart
> equivalent up to a given coherency domain (arm is particularly complicated
> there, x86 just has one domain).
>

All operations in the proposal affect a path (and all nodes on that
path) between a hart and main memory. The MEM.DISCARD and MEM.REWRITE
instructions are ways of saying that some (not-yet-coherent) current
contents of a region no longer matter and need never be made coherent.
The scary prohibition on introducing foreign data is there to ensure
that this is safe. MEM.REWRITE particularly needs it to prevent
possible abuse.

>> MEM.DISCARD
>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000110}
>> Declare the contents of the region obsolete, dropping any copies present
>> in the processor without performing writes to main memory. The contents of
>> the region are undefined after the operation completes, but shall not
>> include foreign data.
>>
>
> What are copies present in the processor?
>

An old bit of wording (now) changed for draft 6: "copies present
between the hart's load/store unit and main memory". I was originally
thinking of a common modern PC architecture, with caches internal to the
processor module and memory on its own modules.

>> If the region does not align with cacheline boundaries, any partial
>> cachelines are written back. If hardware requires such, the full contents
>> of a cacheline partially included may be written back, including data just
>> declared obsolete. In a non-coherent system, partial cachelines written
>> back are also invalidated. In a system with hardware cache coherency,
>> partial cachelines must be written back, but may remain valid.
>>
>
> At least for data integrity operations like invalidate (or discard) and
> writeback I would much, much prefer to require the operation to be
> aligned to cache lines. Pretty much any partial behavior could be
> doing the wrong thing for one case or another.
>

The problem is that the entire point of region operations is to insulate
software from cache details. Partial cachelines are simply included in
non-destructive operations, but destructive operations require some
non-destructive substitute operation on partial cachelines. MEM.DISCARD
performs writeback, while MEM.REWRITE performs an exclusive prefetch.
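
As a sketch of how a region divides into partial and whole cachelines,
consider the following Python fragment. The function name split_region
and the 64-byte cacheline are assumptions for illustration; the
proposal does not define such a helper.

```python
def split_region(base, bound, line=64):
    """Split the inclusive region [base, bound] into cachelines.

    Returns (partial, whole): two lists of cacheline start addresses.
    A cacheline only partly inside the region is 'partial'; there can be at
    most two of them, the first and the last.
    """
    first = base - base % line
    last = bound - bound % line
    partial, whole = [], []
    for start in range(first, last + line, line):
        inside = start >= base and start + line - 1 <= bound
        (whole if inside else partial).append(start)
    return partial, whole

# A destructive operation would act on `whole` and apply its
# non-destructive substitute (writeback or exclusive prefetch) to `partial`.
partial, whole = split_region(100, 300)
```

For the region 100..300 with 64-byte cachelines, the lines starting at
64 and 256 are partial and the lines starting at 128 and 192 are whole.
A region that is already aligned, such as 128..255, has no partial
cachelines at all.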

>> MEM.REWRITE
>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000111}
>> Declare the contents of the region obsolete, indicating that the current
>> hart will soon overwrite the entire region. Reading any datum from the
>> region before the current hart has written that datum (or other data fully
>> overlapping that datum) is incorrect behavior and produces undefined
>> results, but shall not return foreign data. Note that undefined results
>> may include destruction of nearby data. For optimal performance, software
>> should write the entire region before reading any part of the region and
>> should do so sequentially, starting at the base address and moving towards
>> the address produced by MEM.REWRITE.
>>
>
> This sounds like a really scary undefined behavior trap.

The destruction of nearby data has been clarified to "nearby data within
the region" for draft 6. The idea is that MEM.REWRITE produces
(temporarily) an inconsistent state, where the cacheline is valid (so
that writes will hit) but was never loaded from memory (executing
MEM.REWRITE *did* declare that the contents of main memory at those
addresses did not matter) so contains garbage. Other words within the
cacheline (or within a wider cacheline used at an outer level but still
entirely within the region) may be clobbered as a result, if the region
is not entirely overwritten before the cacheline is written back.

The warnings of undefined behavior are supposed to be scary -- the
programmer is expected to take heed and promptly overwrite the entire
region MEM.REWRITE affects, as is MEM.REWRITE's purpose.
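
The hazard can be illustrated with a one-cacheline Python model. This
is a sketch, not the specified behaviour: rewrite_then_writeback is a
hypothetical name, the cacheline is assumed to hold four words, and
GARBAGE is an arbitrary stand-in for the unloaded contents (which, per
the proposal, must still never be foreign data).

```python
GARBAGE = 0xDEAD   # stand-in for whatever the unloaded cacheline holds

def rewrite_then_writeback(memory, base, writes, line_words=4):
    """Model one cacheline that lies wholly inside a MEM.REWRITE region.

    memory is a list of words; writes maps word offsets within the cacheline
    to the values the program stores before the cacheline is written back.
    """
    line = [GARBAGE] * line_words            # valid, but never loaded from memory
    for offset, value in writes.items():     # stores hit in the cache
        line[offset] = value
    memory[base:base + line_words] = line    # eventual writeback of the whole line
    return memory

# Only one word written before writeback: the other three are clobbered.
careless = rewrite_then_writeback([1, 2, 3, 4], 0, {0: 9})
# Entire cacheline written before writeback: nothing unexpected reaches memory.
careful = rewrite_then_writeback([1, 2, 3, 4], 0, {0: 9, 1: 8, 2: 7, 3: 6})
```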

> What is the use case for this instruction?

Initializing or bulk copying data. One use case is to use MEM.REWRITE
when allocating and initializing a stack frame in a function prologue
(every word in that block will be written shortly) and MEM.DISCARD when
releasing a stack frame in a function epilogue (the locals are now dead,
why waste cycles writing them back?). Another use case (also the
inspiration) for MEM.REWRITE (for which the proposal gives some
heuristics) is memset(3) and memcpy(3). Both functions completely
overwrite a destination buffer without first reading anything from it.
Why waste cycles loading a destination from main memory just to write
the whole thing back?
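
A rough Python count of the loads saved, under simplifying assumptions:
a write-allocate cache that starts empty, 64-byte cachelines, a
non-empty buffer, and a hypothetical function name memset_line_fills.
It counts cacheline fills only and is in no way a performance
measurement.

```python
LINE = 64  # assumed cacheline size in bytes (hypothetical value)

def memset_line_fills(base, length, use_rewrite):
    """Count cacheline fills from main memory caused by writing every byte
    of [base, base + length) through an initially empty write-allocate cache.

    With use_rewrite, cachelines wholly inside the buffer are allocated
    without being loaded (modelling MEM.REWRITE); partial cachelines at the
    ends are still loaded (modelling the exclusive-prefetch substitute).
    Assumes length > 0.
    """
    end = base + length                  # exclusive upper bound
    first = base - base % LINE
    fills = 0
    for start in range(first, end, LINE):
        whole = start >= base and start + LINE <= end
        if not (use_rewrite and whole):
            fills += 1
    return fills

aligned_plain = memset_line_fills(0x1000, 4096, use_rewrite=False)
aligned_rewrite = memset_line_fills(0x1000, 4096, use_rewrite=True)
unaligned_rewrite = memset_line_fills(0x1010, 4096, use_rewrite=True)
```

In this model an aligned 4096-byte memset goes from 64 fills to none,
and an unaligned one still pays for its two partial cachelines.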

> Are there equivalents in other architectures?
As far as I know, MEM.REWRITE, exactly, is new. The concept of a block
of memory that must be written before reading from it is nothing new,
however -- C local variables and buffers returned from malloc(3) have
always been like this as far as I know.

There may be similar instructions on PowerPC that explicitly zero
cachelines and similar, but those would have been mentioned in the
earlier discussions on this topic on isa-dev and I do not have any
message-ids close at hand.


-- Jacob

Jacob Bachmeyer

Mar 14, 2018, 9:23:51 PM3/14/18
to Daniel Lustig, Christoph Hellwig, Aaron Severance, RISC-V ISA Dev
Daniel Lustig wrote:
> Really, the intention of that text (which we can clarify) is that
> it should apply to normal memory: people shouldn't use cache
> writeback/flush/invalidate instead of following the rules of the
> normal memory consistency model. As a performance hint, sure, but
> not as a functional replacement. But I/O, non-coherent DMA,
> persistent memory, etc. are a different question, and may well want
> all those things for actual functional correctness. The memory model
> task group is mostly punting on trying to formalize all that until
> after RVWMO for normal memory is settled.
>

This proposal predates the new memory model, when FENCE/FENCE.I was
really all we had. I would be particularly interested in feedback on
how best to arrange this and how to advise programmers who may see the
cache-control operations and miss the fences that they should be using
on RISC-V because other architectures have similar cache-control
operations instead of fences.

Should the cache-control instructions themselves be defined in terms of
fences? I am not yet entirely certain how to describe them that way.

> The bit about using fences is not necessarily meant as a strict
> correctness claim either. It's just an observation that RISC-V
> already uses FENCE.I and SFENCE.VMA to describe what other
> architectures might describe as "invalidate the instruction cache"
> and "invalidate the TLB", respectively. So in that spirit, maybe
> the instruction for "invalidate this cache (line)" could be
> described as some kind of fence too. And then likewise for flush
> writeback/etc.
>

This is the intent behind defining the cache-control operations on
regions instead of directly on cachelines. All of the cache-control
operations should be expressible as ranged fences with particular
(sometimes peculiar) semantics.


-- Jacob

Aaron Severance

Mar 14, 2018, 9:53:13 PM3/14/18
to jcb6...@gmail.com, RISC-V ISA Dev
Clarifying that there are implied FENCEs in the region instruction certainly helps my understanding.

Sure it works with the synchronous version.

For asynchronous DISCARD/REWRITE I assume it would be undefined to touch the memory before it completed.  I assume a FENCE would be the normal way to know it had completed?
I think I was just being pedantic about what affected means; don't worry about it.
 

> Assuming partial completion is allowed:
>     Is forward progress required? i.e. must rd be >= rs1 + 1?
No, an operation can fail.

Can you give an example?  Would failing mean just that nothing was done this iteration so you should keep trying, or that the operation cannot complete?
 
>     0 must be a valid return value (if the region goes to the highest
> addressable value).
>     I would suggest that rd must be >= the first address after the
> highest address affected by the operation.
Permitting a series of operations to "skip" addresses prevents the easy
loop mentioned above.

Sorry, I did not mean to imply that a partially complete operation should return an address that is higher than the first address it has not completed on.

What I meant was that if a region can be arbitrarily expanded a fully complete operation should be able to expand the region to all of memory and return 0 (highest addressable address + 1).
 
>       Then an implementation that always fully completes could then
> return 0.
>       No-op implementations could also always return 0.
A no-op implementation must return the base address, indicating that
nothing was done.

As an example take a system with no data cache (or a disabled data cache through some other mechanism).  If it runs into a WB instruction, after performing the implied FENCE should it return the base address indicating nothing was done or an address greater than the high address?
 
>     Does this apply to FENCE.RD/FENCE.RI? It seems problematic to have
> FENCE.RD/FENCE.RI partially complete and return, and FENCE/FENCE.I
> must fully complete anyway.

FENCE.I does not use its registers, all of which are required to be x0.

The ranged fences will be required to fully complete in draft 6.

Great.
Doing nothing does not necessarily mean failure.  Sometimes it means having nothing to do (again the example of a system with the cache disabled).  Writing software I want to use the return value to check for completion.  I need to see either partial completion and re-run the loop or full completion and stop.  If I see failure but the operation was not needed now I need to decode why that happened and if it's safe to proceed.
A cache can always, for no user-discernible reason, have written back (flushed) all of its data the cycle before you issue the DISCARD instruction.  So I don't see how using a FLUSH to do a DISCARD isn't valid.  The user can never guarantee that all the data they are trying to discard didn't get written back.
 
 
-- Jacob

Jacob Bachmeyer

Mar 14, 2018, 10:40:07 PM3/14/18
to Aaron Severance, RISC-V ISA Dev
Draft 6 will clarify that MEM.DISCARD/MEM.REWRITE can only be executed
synchronously. They are no-ops if rd is x0.
Generally, a complete failure indicates that the operation is not
implemented. Different regions of memory could have different sets of
supported operations, so a failure after a success suggests that you
have crossed from a region that can support that operation to one that
cannot. For CACHE.PIN, which allocates a finite resource, failure could
simply mean that there are no more cachelines available.

Looking at it another way, lack of forward progress *is* an error result.
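
A sketch of how software might detect this, using CACHE.PIN running
out of cachelines as the example. PinModel and pin_region are
hypothetical names, and the model assumes 64-byte cachelines and an
implementation that pins at most one cacheline per instruction.

```python
LINE = 64  # assumed cacheline size in bytes (hypothetical value)

class PinModel:
    """Toy stand-in for hardware with a finite number of pinnable cachelines."""

    def __init__(self, free_lines):
        self.free_lines = free_lines

    def pin(self, base, bound):
        """Pin at most one cacheline; return base unchanged on failure.

        bound (rs2) is accepted for symmetry but unused in this toy model.
        """
        if self.free_lines == 0:
            return base                   # nothing affected: rd = rs1
        self.free_lines -= 1
        return base - base % LINE + LINE  # first address after the pinned line

def pin_region(model, base, bound):
    """Return the first address that was NOT pinned."""
    while base <= bound:
        produced = model.pin(base, bound)
        if produced == base:              # no forward progress: error result
            break
        base = produced
    return base

stopped_at = pin_region(PinModel(free_lines=2), 0, 255)
finished_at = pin_region(PinModel(free_lines=8), 0, 255)
```

With only two free cachelines the loop stops at address 128, a failure
after a success; with enough cachelines it runs to 256, one past the
requested upper bound.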

> > 0 must be a valid return value (if the region goes to the
> highest
> > addressable value).
> > I would suggest that rd must be >= the first address after the
> > highest address affected by the operation.
> Permitting a series of operations to "skip" addresses prevents the
> easy
> loop mentioned above.
>
>
> Sorry, I did not mean to imply that a partially complete operation
> should return an address that is higher than the first address it has
> not completed on.
>
> What I meant was that if a region can be arbitrarily expanded a fully
> complete operation should be able to expand the region to all of
> memory and return 0 (highest addressable address + 1).

Only if the entire address space is effectively a single cacheline.
(This is a bad implementation, but you are correct that it is
technically permitted.) Note that such a return probably makes most of
these operations effectively useless.

Also, MEM.DISCARD and MEM.REWRITE can *not* be so expanded. Such an
implementation would be required to always fail those operations.

> > Then an implementation that always fully completes could then
> > return 0.
> > No-op implementations could also always return 0.
> A no-op implementation must return the base address, indicating that
> nothing was done.
>
>
> As an example take a system with no data cache (or a disabled data
> cache through some other mechanism). If it runs into a WB
> instruction, after performing the implied FENCE should it return the
> base address indicating nothing was done or an address greater than
> the high address?

If the system is capable of flushing caches, CACHE.WRITEBACK is not a
no-op and simply succeeds immediately if the cache is disabled. Think of
the no-op implementation as returning ENOSYS.
If the cache is present but disabled, then a writeback succeeds (all of
the zero lines in the cache have been written back, after all).

> Writing software I want to use the return value to check for
> completion. I need to see either partial completion and re-run the
> loop or full completion and stop. If I see failure but the operation
> was not needed now I need to decode why that happened and if it's safe
> to proceed.

If you see partial completion, you also need to actually do part of
whatever you are working on before trying again. Partial completion
indicates that the hardware is expecting to handle your request piecemeal.

> [...]
While you are correct and MEM.DISCARD could be implemented as
CACHE.FLUSH (with some performance penalty due to useless writebacks),
such an implementation would use exactly the loophole that you
describe. (And which I have no intention of trying to close -- the
cache *could* have been flushed by a preemptive context-switch.) Any
copies present in caches at the time MEM.DISCARD is executed should be
dropped with no further action, however.

The important part (and the reason that non-coherent systems must
write-back-and-invalidate partial cachelines) is that MEM.DISCARD
ensures that any subsequent reads from the region will go all the way to
main memory. This is needed for DMA input in systems without hardware
coherency.

The reason that the contents of the region are undefined is that
different harts may actually see different contents. MEM.DISCARD and
MEM.REWRITE relax coherency for performance.


-- Jacob

Aaron Severance

Mar 15, 2018, 3:10:51 PM3/15/18
to jcb6...@gmail.com, RISC-V ISA Dev
Of course.  I'm not sure if you're also implying that DISCARD/FLUSH must always return rs2+1 on successful completion though.

One other question, if rs1 is 0 and rs2 is 0xFFFF_FFFF (for RV32) is there a way to signal failure vs full completion?
Yes.  I think you should change the MEM.DISCARD description from:
"Declare the contents of the region obsolete, dropping any copies
present in the processor without performing writes to main memory." to 
"Declare the contents of the region obsolete; the processor may drop any copies present in the processor without performing writes to main memory."
 

-- Jacob

Jacob Bachmeyer

Mar 15, 2018, 9:30:55 PM3/15/18
to Aaron Severance, RISC-V ISA Dev
Aaron Severance wrote:
> On Wed, Mar 14, 2018 at 7:40 PM Jacob Bachmeyer <jcb6...@gmail.com
> <mailto:jcb6...@gmail.com>> wrote:
>
> Aaron Severance wrote:
> > Clarifying that there are implied FENCEs in the region instruction
> > certainly helps my understanding.
> >
> > On Wed, Mar 14, 2018 at 6:05 PM Jacob Bachmeyer
> <jcb6...@gmail.com <mailto:jcb6...@gmail.com>
> > <mailto:jcb6...@gmail.com <mailto:jcb6...@gmail.com>>> wrote:
> >
> > Aaron Severance wrote:
> [...]
Partial completion is permitted. Some implementations may choose to
always process at most one cacheline, rather than iterating in hardware.

> One other question, if rs1 is 0 and rs2 is 0xFFFF_FFFF (for RV32) is
> there a way to signal failure vs full completion?

Unfortunately not. But this (affecting the entire address space) is
very much an edge case, and the workaround is to not do that.
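
The ambiguity is plain modular arithmetic, as this short Python sketch
shows. The function names are hypothetical, and the sketch assumes only
what was said above: full completion produces the first address after
the upper bound, and failure produces the base address.

```python
XLEN = 32
MASK = (1 << XLEN) - 1

def result_on_full_completion(rs2):
    """First address after the inclusive upper bound, truncated to XLEN bits."""
    return (rs2 + 1) & MASK

def result_on_failure(rs1):
    """A failed operation produces the base address unchanged."""
    return rs1

# Whole RV32 address space: both outcomes produce 0.
whole_space_ambiguous = (
    result_on_full_completion(0xFFFF_FFFF) == result_on_failure(0)
)

# Any nonzero base avoids the collision, even if the region reaches the top.
nonzero_base_ambiguous = (
    result_on_full_completion(0xFFFF_FFFF) == result_on_failure(0x1000)
)
```

So the collision needs both a base address of 0 and a region that
reaches the highest address; splitting such a request in two is enough
to avoid it.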

> >[...]
I see no reason for this: either the now-obsolete data is dropped from
any intermediate caches or has already been written back. The important
effect (need for software-enforced coherency) is that no cachelines for
the region are valid after MEM.DISCARD completes.


-- Jacob

Albert Cahalan

Mar 16, 2018, 1:29:15 AM3/16/18
to jcb6...@gmail.com, Christoph Hellwig, RISC-V ISA Dev
On 3/14/18, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> Christoph Hellwig wrote:
>> On Wed, Mar 07, 2018 at 10:19:39PM -0600, Jacob Bachmeyer wrote:

>> For cache-incoherent DMA we also need a cache line invalidation
>> instruction that must not be a hint but guaranteed to work.

This must be privileged because it may expose old data.
Combined writeback+invalidate doesn't have the problem.

>> In terms of naming I'd rather avoid flush as a name as it is a very
>> overloaded term. I'd rather name the operations purely based on
>> 'writeback' and 'invalidate', e.g. CACHE.WB, CACHE.INV and
>> CACHE.WBINV
>
> Can you explain the plausible meanings of "flush" that could create
> confusion? I had believed it to be a good synonym for
> "writeback-and-invalidate".

PowerPC terminology would do writeback w/o invalidate.

> The problem is that the entire point of region operations is to insulate
> software from cache details.

I appreciate the thought, but this may be more trouble than it's worth.

Consider instead having two sizes, 4096 bytes and the full address space.
The 4096-byte one would require alignment, at least for operations
that can be destructive. These operations tend to be done on whole pages
or on all of the address space, so just going with those values is fine.

>> Are there equivalents in other architectures?
> As far as I know, MEM.REWRITE, exactly, is new.

No, it was part of POWER ("cli") and PowerPC ("dcbi").
Originally, those were privileged instructions that would
invalidate a cache line ("cli") or cache block ("dcbi") just
as you describe.

Later, the instructions were made to act the same as the
unprivileged ones ("dclz" and "dcbz") that zeroed.

Even later, the instructions were removed from the architecture.
Perhaps this ought to be taken as a hint regarding the desirability
of supporting such instructions.

Christoph Hellwig

Mar 16, 2018, 6:31:50 AM3/16/18
to Daniel Lustig, Christoph Hellwig, Aaron Severance, jcb6...@gmail.com, RISC-V ISA Dev
On Wed, Mar 14, 2018 at 03:21:52PM -0700, Daniel Lustig wrote:
> Really, the intention of that text (which we can clarify) is that
> it should apply to normal memory: people shouldn't use cache
> writeback/flush/invalidate instead of following the rules of the
> normal memory consistency model. As a performance hint, sure, but
> not as a functional replacement. But I/O, non-coherent DMA,
> persistent memory, etc. are a different question, and may well want
> all those things for actual functional correctness. The memory model
> task group is mostly punting on trying to formalize all that until
> after RVWMO for normal memory is settled.

The big issue is that a cache controller often can't really see
the difference. We could in theory force it through PMAs, but
that might turn very complicated really soon.

> The bit about using fences is not necessarily meant as a strict
> correctness claim either. It's just an observation that RISC-V
> already uses FENCE.I and SFENCE.VMA to describe what other
> architectures might describe as "invalidate the instruction cache"
> and "invalidate the TLB", respectively. So in that spirit, maybe
> the instruction for "invalidate this cache (line)" could be
> described as some kind of fence too. And then likewise for flush
> writeback/etc.

As long as it is just about naming I don't care too much. But both
for persistent memory and cache incoherent dma we do of course
require some amount of fencing as well, as we need to guarantee
any effects before the invalidation or writeback are actually
covered.

Christoph Hellwig

Mar 16, 2018, 6:46:18 AM3/16/18
to Jacob Bachmeyer, Christoph Hellwig, RISC-V ISA Dev
On Wed, Mar 14, 2018 at 08:15:37PM -0500, Jacob Bachmeyer wrote:
> The intent for REGION opcodes is that the instruction specifies an extent
> of memory that is to be affected. The actual hardware granularity is not
> relevant.

At least for cache invalidation it absolutely is relevant, as it changes
the data visible at a given address.

E.g. if I want to invalidate addresses 4096 to 4159 because I am going to
do a cache-incoherent DMA operation, but the implementation turns out to
have a cache line size of 128 (or, to take the extreme example allowed by
your definition, infinite), it will invalidate all kinds of data that the
driver might have written close to it, and we get data corruption.

That is why:

a) the supervisor needs to know the cache line size so that it can align
DMA-able structures based on it
b) any destructive operation must operate on said granularity.
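
The arithmetic of the example can be checked with a short Python
sketch. It models a plain invalidate that is silently widened to whole
cache lines, which is the behaviour the example worries about; the
function names are hypothetical and this is not the substitute
writeback that the proposal specifies for partial cachelines.

```python
def invalidated_range(base, bound, line):
    """Inclusive byte range actually invalidated when [base, bound] is
    widened to whole cache lines of `line` bytes."""
    first = base - base % line
    last = bound - bound % line + line - 1
    return first, last

def collateral_bytes(base, bound, line):
    """Number of bytes invalidated although they lie outside the request."""
    first, last = invalidated_range(base, bound, line)
    return (base - first) + (last - bound)

widened = invalidated_range(4096, 4159, 128)
lost = collateral_bytes(4096, 4159, 128)
lost_if_aligned = collateral_bytes(4096, 4159, 64)
```

With 128-byte lines the request for 4096..4159 also invalidates
4160..4223, i.e. 64 bytes belonging to someone else; with 64-byte lines
and a 64-byte aligned buffer nothing extra is touched.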

And as Albert already mentioned, destructive operations exposed to U-mode
are a non-starter unless we can come up with very specific, carefully
drafted circumstances that are probably too complex to implement.

>> Yes, this is something both x86 and arm provide so it will help porting.
>>
>> In terms of naming I'd rather avoid flush as a name as it is a very
>> overloaded term. I'd rather name the operations purely based on
>> 'writeback' and 'invalidate', e.g. CACHE.WB, CACHE.INV and CACHE.WBINV
>>
>
> Can you explain the plausible meanings of "flush" that could create
> confusion? I had believed it to be a good synonym for
> "writeback-and-invalidate".

In a lot of the world people use it just for writeback. Concrete examples
are the ATA and NVMe storage protocols, and large parts of the Linux
kernel.

>>> MEM.DISCARD
>>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000110}
>>> Declare the contents of the region obsolete, dropping any copies present
>>> in the processor without performing writes to main memory. The contents
>>> of the region are undefined after the operation completes, but shall not
>>> include foreign data.
>>>
>>
>> What are copies present in the processor?
>>
>
> An old bit of wording (now) changed for draft 6: "copies present between
> the hart's load/store unit and main memory". I was originally thinking of a
> common modern PC architecture, with caches internal to the processor module
> and memory on its own modules.

Much better. But it is important that we do not restrict the wording
to main memory. A lot of these cache operations are especially important
for I/O devices, or for as-yet-uncategorized types like persistent
(or persistent-ish) memory.

> The problem is that the entire point of region operations is to insulate
> software from cache details. Partial cachelines are simply included in
> non-destructive operations, but destructive operations require some
> non-destructive substitute operation on partial cachelines. MEM.DISCARD
> performs writeback, while MEM.REWRITE performs an exclusive prefetch.

Which is an excellent way to make performance unusable. If each of
my invalidations for DMA requires two writebacks, one at each end, it is
going to perform horribly. While at the same time the supervisor could
almost trivially align the data structures properly if it knows the
cache line size.

>> What is the use case for this instruction?
>
> Initializing or bulk copying data. One use case is to use MEM.REWRITE when
> allocating and initializing a stack frame in a function prologue (every
> word in that block will be written shortly) and MEM.DISCARD when releasing
> a stack frame in a function epilogue (the locals are now dead, why waste
> cycles writing them back?). Another use case (also the inspiration) for
> MEM.REWRITE (for which the proposal gives some heuristics) is memset(3) and
> memcpy(3). Both functions completely overwrite a destination buffer
> without first reading anything from it. Why waste cycles loading a
> destination from main memory just to write the whole thing back?

I'd really like to see a prototype of this and very careful measurement
of whether it is actually worth it. Adding new instructions just because
they might sound useful is a guarantee to arrive at a bloated spec.
Especially as RISC-V seems to bundle instructions in extensions instead of
allowing software to probe for individual instructions as with the x86
cpuid leaves.

>> Are there equivalents in other architectures?
> As far as I know, MEM.REWRITE, exactly, is new. The concept of a block of
> memory that must be written before reading from it is nothing new, however
> -- C local variables and buffers returned from malloc(3) have always been
> like this as far as I know.
>
> There may be similar instructions on PowerPC that explicitly zero
> cachelines and similar, but those would have been mentioned in the earlier
> discussions on this topic on isa-dev and I do not have any message-ids
> close at hand.

Explicit zeroing sounds like a much easier to understand, use and optimize
for concept to me.

Allen Baum

Mar 16, 2018, 11:59:42 AM
to Michael Chapman, Tommy Thorn, Rogier Brussee, RISC-V ISA Dev
If a compressed instruction expands to more than a single 32-bit instruction, then it is no longer RISC-V, it is CISC-V.


John Hauser

Mar 16, 2018, 5:01:41 PM
to RISC-V ISA Dev
I think there's an underlying question in this debate, and that is
whether every conforming RISC-V system will be required to have a
hardware-implemented coherent memory system, including for all device
DMA.

It's clear that some people want the answer to this question to
be "yes".  However, it's also true that many low-end systems have
traditionally not had memory systems of such complexity, presumably for
valid economic reasons.  If the hardware has caches and also supports
device DMA but doesn't automatically guarantee cache coherence for DMA'd
data, then some set of active cache control instructions are _required_
in the ISA, and not just what RISC-V calls "hints".

It's certainly within the rights of the Foundation, if it so chooses,
to make complete cache coherence an official requirement for conforming
RISC-V systems.  However, I'm not sure the Foundation's rectitude alone
will be sufficient to change the economics of small systems and convince
vendors en masse to accept the costs of complete memory coherence.  And
if the market doesn't bend, then will many low-end systems be denied the
official RISC-V mark for this heresy?

Efforts to develop standard cache control instructions are predicated
on the assumptions that they'll be both needed for some systems and
officially accepted for RISC-V.  Anyone who wants to argue against the
need ought to explain why they're certain low-end systems can absorb the
costs implied by their preferred memory model.

Regards,

    - John Hauser

Jacob Bachmeyer

Mar 16, 2018, 10:10:18 PM
to Albert Cahalan, Christoph Hellwig, RISC-V ISA Dev
Albert Cahalan wrote:
> On 3/14/18, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
>> Christoph Hellwig wrote:
>>
>>> For cache-incoherent DMA we also need a cache line invalidation
>>> instruction that must not be a hint but guaranteed to work.
>>>
>
> This must be privileged because it may expose old data.
> Combined writeback+invalidate doesn't have the problem.
>

This is why the proposal includes language prohibiting the exposure of
foreign data. A process may see its own old data, but cache
invalidation must not expose data from another process.

>>> In terms of naming I'd rather avoid flush as a name as it is a very
>>> overloaded term. I'd rather name the operations purely based on
>>> 'writeback' and 'invalidate', e.g. CACHE.WB, CACHE.INV and
>>> CACHE.WBINV
>>>
>> Can you explain the plausible meanings of "flush" that could create
>> confusion? I had believed it to be a good synonym for
>> "writeback-and-invalidate".
>>
>
> PowerPC terminology would do writeback w/o invalidate.
>

Is the presence of a CACHE.WRITEBACK instruction sufficient to resolve
this potential ambiguity?

>>> Are there equivalents in other architectures?
>>>
>> As far as I know, MEM.REWRITE, exactly, is new.
>>
>
> No, it was part of POWER ("cli") and PowerPC ("dcbi").
> Originally, those were privileged instructions that would
> invalidate a cache line ("cli") or cache block ("dcbi") just
> as you describe.
>
> Later, the instructions were made to act the same as the
> unprivileged ones ("dclz" and "dcbz") that zeroed.
>
> Even later, the instructions were removed from the architecture.
> Perhaps this ought to be taken as a hint regarding the desirability
> of supporting such instructions.

MEM.REWRITE does *not* invalidate cachelines -- it allocates cachelines
with undefined initial contents. MEM.DISCARD invalidates cachelines.



-- Jacob

Jacob Bachmeyer

Mar 16, 2018, 10:32:34 PM
to Christoph Hellwig, RISC-V ISA Dev
Christoph Hellwig wrote:
> On Wed, Mar 14, 2018 at 08:15:37PM -0500, Jacob Bachmeyer wrote:
>
>> The intent for REGION opcodes is that the instruction specifies an extent
>> of memory that is to be affected. The actual hardware granularity is not
>> relevant.
>>
>
> At least for cache invalidation it absolutely is relevant, as it changes
> the data visible at a given address.
>
> E.g. say I want to invalidate addresses 4096 to 4159 because I am going to
> do a cache-incoherent DMA operation. If it turns out the implementation
> has a cache line size of 128 (or, to take the extreme example allowed by
> your definition, infinite), it will invalidate all kinds of data that the
> driver might have written close to it, and we get data corruption.
>

No, destructive operations are required to instead perform their
non-destructive counterparts on any partially-included cachelines. On a
cacheline partially included in the region, MEM.DISCARD performs
CACHE.FLUSH (writing the entire cacheline back before invalidating it)
and MEM.REWRITE performs MEM.PF.EXCL (actually loading the data from
memory).
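
As a rough illustration of this rule, here is a Python sketch. It is only a model under stated assumptions: the function names `split_region` and `discard_plan` are made up for this example, and the bounds are treated as inclusive byte addresses to match the "4096 to 4159" example quoted above, which the proposal text may or may not intend.

```python
def split_region(lo, hi, line_size):
    """Classify the cachelines touched by the inclusive byte range [lo, hi].

    Returns (partial, full): base addresses of cachelines only partly
    covered by the region, and of cachelines covered entirely.
    """
    if hi < lo:
        raise ValueError("upper bound below lower bound")
    partial, full = [], []
    first = lo - lo % line_size   # base of the cacheline holding lo
    last = hi - hi % line_size    # base of the cacheline holding hi
    for base in range(first, last + line_size, line_size):
        covered = lo <= base and base + line_size - 1 <= hi
        (full if covered else partial).append(base)
    return partial, full


def discard_plan(lo, hi, line_size):
    """Map each touched cacheline to what MEM.DISCARD would do to it."""
    partial, full = split_region(lo, hi, line_size)
    # Partial lines get the non-destructive substitute (CACHE.FLUSH).
    plan = {base: "flush (writeback, then invalidate)" for base in partial}
    # Fully covered lines may be dropped without writeback.
    plan.update({base: "discard (invalidate only)" for base in full})
    return plan


if __name__ == "__main__":
    # 64-byte buffer on a machine with 128-byte cachelines.
    print(discard_plan(4096, 4159, 128))
    # Same buffer on a machine with 64-byte cachelines.
    print(discard_plan(4096, 4159, 64))
```

With 128-byte cachelines the single touched line is only partially covered, so it is flushed rather than discarded and neighbouring driver data survives; with 64-byte cachelines the line is fully covered and can be dropped.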

> That is why the supervisor needs to:
>
> a) know the cache line size so that it can align dma-able structures
> based on it
>

That is beyond the scope of this proposal and is expected to be included
in the platform configuration structures.

> b) any destructive operation must operate on said granularity.
>

The proposal explicitly states that destructive operations perform
non-destructive equivalents on cachelines that are only partially
included in the region. Is this unclear?

> And as Albert already mentioned, destructive operations exposed to U-mode
> are a non-starter unless we can come up with very specific, carefully
> drafted circumstances that are probably too complex to implement.
>

MEM.DISCARD drops pending writes that have not yet been committed to
main memory. Presumably, the supervisor zeroed the page (and forced
writeback) before attaching it to the user process.

MEM.REWRITE allocates cachelines without actually reading memory. If
the implementation can prove that the cachelines thus made valid contain
data belonging to the current process, then they may retain their
contents, otherwise, the hardware must fill those cachelines with some
constant, presumably zero.

Neither of these is dangerous to expose to U-mode.

>>> Yes, this is something both x86 and arm provide so it will help porting.
>>>
>>> In terms of naming I'd rather avoid flush as a name as it is a very
>>> overloaded term. I'd rather name the operations purely based on
>>> 'writeback' and 'invalidate', e.g. CACHE.WB, CACHE.INV and CACHE.WBINV
>>>
>>>
>> Can you explain the plausible meanings of "flush" that could create
>> confusion? I had believed it to be a good synonym for
>> "writeback-and-invalidate".
>>
>
> In a lot of the world people use it just for writeback. Concrete examples
> are the ATA and NVMe storage protocols, and large parts of the Linux
> kernel.
>

Does the presence of a (different) CACHE.WRITEBACK instruction resolve
the ambiguity?

>>>> MEM.DISCARD
>>>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000110}
>>>> Declare the contents of the region obsolete, dropping any copies present
>>>> in the processor without performing writes to main memory. The contents
>>>> of the region are undefined after the operation completes, but shall not
>>>> include foreign data.
>>>>
>>>>
>>> What are copies present in the processor?
>>>
>>>
>> An old bit of wording (now) changed for draft 6: "copies present between
>> the hart's load/store unit and main memory" I was originally thinking of a
>> common modern PC architecture, with caches internal to the processor module
>> and memory on its own modules.
>>
>
> Much better. But it is important that we do not restrict the wording
> to main memory. A lot of these cache operations are especially important
> for I/O devices, or for as-yet-uncategorized types like persistent
> (or persistent-ish) memory.
>

Draft 6 will clarify that "main memory" refers to any ultimate memory
bus target, including MMIO or other hardware.

>> The problem is that the entire point of region operations is to insulate
>> software from cache details. Partial cachelines are simply included in
>> non-destructive operations, but destructive operations require some
>> non-destructive substitute operation on partial cachelines. MEM.DISCARD
>> performs writeback, while MEM.REWRITE performs an exclusive prefetch.
>>
>
> Which is an excellent way to make performance unusable. If each of
> my invalidations for DMA requires two writebacks, one at either end, it
> is going to perform horribly. At the same time, the supervisor could
> almost trivially align the data structures properly if it knew the
> cache line size.
>

The cacheline size should be in the processor ID ROM, which means it
will be in the platform configuration given to the supervisor. The
instructions will work correctly in all cases, but will give optimal
performance if the regions are aligned.
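
For what it is worth, the alignment step on the supervisor side is small. A minimal sketch, assuming the cacheline size has already been read from the platform configuration (the helper names here are hypothetical and not part of the proposal):

```python
def align_down(addr, line_size):
    """Round addr down to the start of its cacheline."""
    return addr - addr % line_size


def align_up(addr, line_size):
    """Round addr up to the next cacheline boundary (no change if aligned)."""
    return align_down(addr + line_size - 1, line_size)


def padded_extent(addr, length, line_size):
    """Smallest cacheline-aligned (start, length) extent covering a buffer.

    A DMA buffer allocated with this extent contains no partial
    cachelines, so a region operation over it never needs the
    writeback/prefetch substitute at either end.
    """
    start = align_down(addr, line_size)
    end = align_up(addr + length, line_size)
    return start, end - start


if __name__ == "__main__":
    print(padded_extent(4100, 100, 64))
    print(padded_extent(4096, 64, 64))
```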

>>> What is the use case for this instruction?
>>>
>> Initializing or bulk copying data. One use case is to use MEM.REWRITE when
>> allocating and initializing a stack frame in a function prologue (every
>> word in that block will be written shortly) and MEM.DISCARD when releasing
>> a stack frame in a function epilogue (the locals are now dead, why waste
>> cycles writing them back?). Another use case (also the inspiration) for
>> MEM.REWRITE (for which the proposal gives some heuristics) is memset(3) and
>> memcpy(3). Both functions completely overwrite a destination buffer
>> without first reading anything from it. Why waste cycles loading a
>> destination from main memory just to write the whole thing back?
>>
>
> I'd really like to see a prototype of this and very careful measurement
> of whether it is actually worth it. Adding new instructions just because
> they might sound useful is a guaranteed way to arrive at a bloated spec,
> especially as RISC-V seems to bundle instructions in extensions instead of
> allowing software to probe for individual instructions as with the x86
> cpuid leaves.
>

MEM.REWRITE is also present because it is a dual to MEM.DISCARD in
fully-coherent systems. MEM.DISCARD is a performance optimization for
DMA input, while MEM.REWRITE is the corresponding optimization for DMA
output.

>>> Are there equivalents in other architectures?
>>>
>> As far as I know, MEM.REWRITE, exactly, is new. The concept of a block of
>> memory that must be written before reading from it is nothing new, however
>> -- C local variables and buffers returned from malloc(3) have always been
>> like this as far as I know.
>>
>> There may be similar instructions on PowerPC that explicitly zero
>> cachelines and similar, but those would have been mentioned in the earlier
>> discussions on this topic on isa-dev and I do not have any message-ids
>> close at hand.
>>
>
> Explicit zeroing sounds to me like a concept that is much easier to
> understand, use and optimize for.

Explicit zeroing is also wasteful if the program needs to write anything
other than zero. Currently, MEM.REWRITE has minimal effect --
cachelines already valid are not touched except to move them into
"exclusive" state if needed, so those cachelines would retain whatever
data was previously loaded.


-- Jacob

Albert Cahalan

Mar 17, 2018, 1:48:42 AM
to jcb6...@gmail.com, Christoph Hellwig, RISC-V ISA Dev
On 3/16/18, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> Christoph Hellwig wrote:

>> And as Albert already mentioned, destructive operations exposed to U-mode
>> are a non-starter unless we can come up with very specific, carefully
>> drafted circumstances that are probably too complex to implement.
>
> MEM.DISCARD drops pending writes that have not yet been committed to
> main memory. Presumably, the supervisor zeroed the page (and forced
> writeback) before attaching it to the user process.
>
> MEM.REWRITE allocates cachelines without actually reading memory. If
> the implementation can prove that the cachelines thus made valid contain
> data belonging to the current process, then they may retain their
> contents, otherwise, the hardware must fill those cachelines with some
> constant, presumably zero.
>
> Neither of these is dangerous to expose to U-mode.

They are dangerous: "Presumably, the supervisor zeroed the page (and
forced writeback) before attaching it to the user process."

That might not hold. I otherwise like the instructions. I think each privilege
level needs a way to control the access that lower levels have to the feature.
This should be opt-in for the software, so that a hazardous condition is not
created by a firmware/hypervisor/OS/sandbox developer who fails to notice
the potential problems.

I'd like to propose a few more cache instructions:

MEM.AGEALL causes the cache lines to be prime candidates for eviction.

MEM.AGEDIRTY does that only for dirty cache lines.

MEM.AGECLEAN does that only for clean cache lines.

MEM.CLEANOUT invalidates only clean cache lines.

These may be a bit less hazardous.

For all cache-related instructions, a bit of thought about side-channel attacks
would be prudent.

Albert Cahalan

Mar 17, 2018, 1:57:01 AM
to Allen Baum, Michael Chapman, Tommy Thorn, Rogier Brussee, RISC-V ISA Dev
On 3/16/18, Allen Baum <allen...@esperantotech.com> wrote:

> If a compressed instruction expands to more than a single 32-bit
> instruction, then it is no longer RISC-V, it is CISC-V

No, because internally the instructions can be an implementation-specific
size that is larger than 32 bits. It's still RISC with a 37-bit instruction.

This is not to say that a bit of CISC here and there wouldn't have some
good points. ARM and PowerPC both have multi-register load/store.

Jose Renau

Mar 17, 2018, 1:59:29 AM
to Albert Cahalan, Jacob Bachmeyer, Christoph Hellwig, RISC-V ISA Dev
They are a source of side-channel attacks.

Strong vote to not have any cache management in user mode.


Jacob Bachmeyer

Mar 17, 2018, 5:43:28 PM
to Albert Cahalan, Allen Baum, Michael Chapman, Tommy Thorn, Rogier Brussee, RISC-V ISA Dev
Multi-register load/store is convenient (and I have previously proposed
a special case in the form of fast context save/restore instructions),
but requires either hardware sequencing or more ports on the register
file, both of which RISC-V is very keen to avoid mandating.


-- Jacob

Jacob Bachmeyer

Mar 17, 2018, 5:56:11 PM
to Albert Cahalan, Christoph Hellwig, RISC-V ISA Dev
Albert Cahalan wrote:
> On 3/16/18, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
>> Christoph Hellwig wrote:
>>
>>> And as Albert already mentioned, destructive operations exposed to U-mode
>>> are a non-starter unless we can come up with very specific, carefully
>>> drafted circumstances that are probably too complex to implement.
>>>
>> MEM.DISCARD drops pending writes that have not yet been committed to
>> main memory. Presumably, the supervisor zeroed the page (and forced
>> writeback) before attaching it to the user process.
>>
>> MEM.REWRITE allocates cachelines without actually reading memory. If
>> the implementation can prove that the cachelines thus made valid contain
>> data belonging to the current process, then they may retain their
>> contents, otherwise, the hardware must fill those cachelines with some
>> constant, presumably zero.
>>
>> Neither of these is dangerous to expose to U-mode.
>>
>
> They are dangerous: "Presumably, the supervisor zeroed the page (and
> forced writeback) before attaching it to the user process."
>
> That might not hold.

Current supervisors must do this anyway to avoid the foreign data
problem. If the supervisor attaches pages to a user process without
ensuring that those pages have known contents that should be available
to the receiving process, then the supervisor does not actually
implement a security boundary between processes and the issue of foreign
data is moot. In this case, the data is not foreign at all: the
process has attached a shared memory segment.

> I otherwise like the instructions. I think each privilege
> level needs a way to control the access that lower levels have to the feature.
> This should be opt-in for the software, so that a hazardous condition is not
> created by a firmware/hypervisor/OS/sandbox developer who fails to notice
> the potential problems.
>
> I'd like to propose a few more cache instructions:
>
> MEM.AGEALL causes the cache lines to be prime candidates for eviction.
>
> MEM.AGEDIRTY does that only for dirty cache lines.
>
> MEM.AGECLEAN does that only for clean cache lines.
>
> MEM.CLEANOUT invalidates only clean cache lines.
>

The mnemonic prefix would be CACHE rather than MEM, since these
explicitly affect caches.

> These may be a bit less hazardous.
>
> For all cache-related instructions, a bit of thought about side-channel attacks
> would be prudent.

The general solution assumed for side-channel prevention is that caches
will be partitioned by ASID and/or aggressively flushed at context
switch. One strategy is to double the physical size of the cache, using
one half for the "current" ASID and one half for the "previous" ASID.
(Supervisor caches are a separate structure.) Aggressively write back
any dirty lines in the "previous" ASID and then invalidate the whole
thing if the next ASID switch introduces another ASID. This generalizes
to any number of ASID-bound caches and not only gives each process an
"independent" cache, but (if available space allows, and 3D IC
fabrication will allow this if it ever actually works) expands the
overall usable size of a VIPT cache by giving each ASID its own cache,
both improving performance and closing side-channels.
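
A toy software model of this scheme may make the bookkeeping clearer. It is a sketch only: the class and method names are invented for illustration, addresses are treated as plain dictionary keys, and writeback is modelled as happening at eviction time rather than eagerly in the background.

```python
class AsidCaches:
    """Toy model: one cache bank per recently used ASID (two by default)."""

    def __init__(self, banks=2):
        self.banks = banks
        self.order = []    # ASIDs that currently own a bank, most recent last
        self.lines = {}    # asid -> {address: (value, dirty)}
        self.memory = {}   # backing store shared by all banks

    def switch_to(self, asid):
        """Context switch: reuse the ASID's bank, or recycle the oldest one."""
        if asid in self.order:
            self.order.remove(asid)
        elif len(self.order) == self.banks:
            victim = self.order.pop(0)
            self._writeback_and_invalidate(victim)
        self.order.append(asid)
        self.lines.setdefault(asid, {})

    def _writeback_and_invalidate(self, asid):
        """Write dirty lines of the victim bank to memory, then drop the bank."""
        for addr, (value, dirty) in self.lines.pop(asid).items():
            if dirty:
                self.memory[addr] = value

    def store(self, addr, value):
        """Store from the currently running ASID into its own bank."""
        self.lines[self.order[-1]][addr] = (value, True)

    def cached(self, asid):
        """Addresses currently cached on behalf of the given ASID."""
        return sorted(self.lines.get(asid, {}))


if __name__ == "__main__":
    c = AsidCaches()
    c.switch_to(1)
    c.store(0x100, 7)
    c.switch_to(2)          # ASID 1 becomes the "previous" ASID, still cached
    print(c.cached(1), c.memory)
    c.switch_to(3)          # a third ASID forces ASID 1 out
    print(c.cached(1), c.memory)
```

In this model a process can never observe lines allocated under another ASID, because each lookup only consults the bank belonging to the running ASID.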


-- Jacob

Albert Cahalan

Mar 18, 2018, 3:59:21 AM
to jcb6...@gmail.com, Christoph Hellwig, RISC-V ISA Dev
On 3/17/18, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> Albert Cahalan wrote:

>> They are dangerous: "Presumably, the supervisor zeroed the page (and
>> forced writeback) before attaching it to the user process."
>>
>> That might not hold.
>
> Current supervisors must do this anyway to avoid the foreign data

They would zero the page. I've never heard of a supervisor that would
force writeback; this is not required for any typical CPU.

Guy Lemieux

Mar 18, 2018, 7:16:15 AM
to Albert Cahalan, Christoph Hellwig, RISC-V ISA Dev, jcb6...@gmail.com
systems without coherent dma should force a write back after zeroing, or else there is a security hole as other process data can leak through via dma.

guy

Albert Cahalan

Mar 18, 2018, 6:29:20 PM
to Guy Lemieux, Christoph Hellwig, RISC-V ISA Dev, jcb6...@gmail.com
I'm not seeing it.

Process X has the page. DMA occurs, creating incoherency. Suppose
there are a mix of out-of-cache parts, cached clean parts that match,
cached clean parts that have the incoherency issue, and cached dirty
parts. There are even DMA transfers active, both read and write.

Process X dies. The OS has to ensure that ongoing DMA comes to a
stop before the page can be repurposed. Once that happens, the OS
zeroes the page. Nothing is explicitly done about the cache.

All parts of the page are now out-of-cache, cached dirty with zeroes
in the cache, or cached clean with zeroes in the cache. There is no
place where a clean cache line fails to match RAM content.

Process Y is assigned the page. Since operations to invalidate without
writeback are prohibited, there is no way to simply discard dirty parts of
the cache to see the underlying RAM. All the rest is zero.

Process Y attempts to do IO involving DMA. This must be requested from
the OS. The OS will generally force writeback before starting the DMA,
then invalidate the cache after the DMA has completed. There are a few
situations where one of those cache operations may be skipped:

For an outgoing DMA, there is no need to invalidate the cache.

For an incoming DMA, there are certain cases where the forced writeback
can be skipped. One is if a writeback had been forced right after the OS had
zeroed the page, but this would be a low-performance choice because most
pages never get used for incoming DMA. (so this case doesn't apply) Another
case would be when an exact multiple of the cache line/block size will be
written by the DMA and the OS is able to stop all threads from viewing the
data until after the DMA has completed. The OS might do this by unmapping
the page or by preventing the threads from being scheduled.

I thus do not think the OS has any reason to force writeback of a freshly
zeroed page before giving the page to a process.
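
The cases above can be written down as a small decision function. This is only a sketch: the function name, its parameters and the string labels are hypothetical, and it merely transcribes the rules listed in this message rather than any real kernel API.

```python
def dma_cache_ops(direction, whole_lines=False, cpu_access_blocked=False):
    """Return (before, after) cache operations for one non-coherent DMA.

    direction: "out" when the device reads memory, "in" when it writes.
    whole_lines: the transfer covers an exact multiple of the cacheline size.
    cpu_access_blocked: the OS keeps all threads away from the buffer
        until the transfer has completed.
    """
    if direction == "out":
        # Memory must hold the current data; nothing goes stale afterwards.
        return ["writeback"], []
    if direction == "in":
        # The forced writeback may be skipped only in the special case
        # described above; the invalidate afterwards is always needed.
        skip = whole_lines and cpu_access_blocked
        before = [] if skip else ["writeback"]
        return before, ["invalidate"]
    raise ValueError("direction must be 'in' or 'out'")


if __name__ == "__main__":
    print(dma_cache_ops("out"))
    print(dma_cache_ops("in"))
    print(dma_cache_ops("in", whole_lines=True, cpu_access_blocked=True))
```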

Guy Lemieux

Mar 18, 2018, 6:37:41 PM
to Albert Cahalan, Christoph Hellwig, RISC-V ISA Dev, Jacob Bachmeyer
On Sun, Mar 18, 2018 at 3:29 PM, Albert Cahalan <acah...@gmail.com> wrote:
> I'm not seeing it.

you have to think like an attacker, not a well-behaved process.

a newly allocated page for attacker process X is written zeros (for
security) by the OS. however, due to write back caches, most of these
0s stay in the cache.

if the OS doesn't force a writeback, to ensure memory is really 0,
then attacker process X can start an outbound non coherent DMA to
transfer the page (old/stale contents) from DRAM to an external
device.
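
a tiny python model of that scenario, as a sketch only: the class and method names are made up, the cache is reduced to a dictionary of dirty lines, and it assumes (as this message does) that the process is able to trigger the outbound DMA at all.

```python
class WriteBackSystem:
    """Toy write-back cache in front of DRAM, with a non-coherent DMA port."""

    def __init__(self, dram):
        self.dram = dict(dram)   # address -> value actually held in DRAM
        self.cache = {}          # address -> value for dirty cached lines

    def cpu_store(self, addr, value):
        # Write-back policy: the store lands in the cache, not in DRAM.
        self.cache[addr] = value

    def cpu_load(self, addr):
        # The CPU sees the cached copy if there is one.
        if addr in self.cache:
            return self.cache[addr]
        return self.dram[addr]

    def writeback(self, addr):
        # Forced writeback: push the dirty line out to DRAM.
        if addr in self.cache:
            self.dram[addr] = self.cache[addr]

    def noncoherent_dma_read(self, addr):
        # A non-coherent device bypasses the cache and reads DRAM directly.
        return self.dram[addr]


if __name__ == "__main__":
    system = WriteBackSystem({0: "old secret"})
    system.cpu_store(0, 0)                    # OS zeroes the page
    print(system.cpu_load(0))                 # CPU sees the zero
    print(system.noncoherent_dma_read(0))     # device still sees stale data
    system.writeback(0)                       # OS forces the writeback
    print(system.noncoherent_dma_read(0))     # now the device sees the zero
```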

Guy

Albert Cahalan

Mar 18, 2018, 7:01:09 PM
to Guy Lemieux, Christoph Hellwig, RISC-V ISA Dev, Jacob Bachmeyer
On 3/18/18, Guy Lemieux <glem...@vectorblox.com> wrote:
> On Sun, Mar 18, 2018 at 3:29 PM, Albert Cahalan <acah...@gmail.com> wrote:
>> I'm not seeing it.
>
> you have to think like an attacker, not a well-behaved process.
>
> a newly allocated page for attacker process X is written zeros (for
> security) by the OS. however, due to write back caches, most of these
> 0s stay in the cache.
>
> if the OS doesn't force a writeback, to ensure memory is really 0,
> then attacker process X can start an outbound non coherent DMA to
> transfer the page (old/stale contents) from DRAM to an external
> device.

No they can't.

Or, rather, if the DMA controller is controllable by the user process
then the system has no security at all. Process X doesn't need to
mess around with cache coherency. It can go straight for the kernel.
Just DMA the kernel out, or DMA something in over top of the kernel.

If the DMA controller is properly secured, then the kernel will do the
required cache operations before and after the DMA occurs.

Cesar Eduardo Barros

Mar 18, 2018, 10:31:30 PM
to Albert Cahalan, Guy Lemieux, Christoph Hellwig, RISC-V ISA Dev, Jacob Bachmeyer
If the system has an IOMMU, a user process can be allowed to directly
request a bus master to initiate DMA without being able to "go straight
for the kernel". Consider for instance the scenario where a PCI device
(for instance, a graphics card) is "exported" to a virtual machine; this
is used nowadays to do things like running Windows on a VM with fully
accelerated graphics (for gaming). Clearly there's something preventing
the OS within the VM from simply asking the PCI card to DMA outside the
VM's boundaries.

This could also be useful for running device drivers in user space (for
instance, on a microkernel). Since the device driver is in user space,
the kernel can't know whether a write to a device register means "output
bytes 12345678" or "write to memory at address 12345678"; this needs an
IOMMU or similar to work safely.

--
Cesar Eduardo Barros
ces...@cesarb.eti.br

Jacob Bachmeyer

Mar 18, 2018, 11:08:48 PM
to Albert Cahalan, Christoph Hellwig, RISC-V ISA Dev
A hardware solution with two additional tag bits in a cache line:
record the privilege level that most recently wrote to the cacheline. If
that level is higher than the privilege level executing MEM.DISCARD,
perform writeback-and-invalidate; otherwise, simply invalidate. This is
equivalent to writing back the cache on a trap return, which is also
permitted.
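
The per-line decision is small enough to sketch in a few lines of Python. The function and table names are hypothetical; the numeric values are the standard RISC-V privilege level encodings (U=0, S=1, M=3), which fit in the two tag bits.

```python
# Standard RISC-V privilege level encodings; these fit in two tag bits.
PRIV = {"U": 0, "S": 1, "M": 3}


def discard_action(last_writer, executing):
    """Action MEM.DISCARD takes on one cacheline under the tag-bit scheme.

    last_writer: privilege level recorded in the line's tag bits.
    executing: privilege level of the hart executing MEM.DISCARD.
    """
    if PRIV[last_writer] > PRIV[executing]:
        # A more privileged level wrote this line (e.g. the supervisor
        # zeroing the page), so its write must reach memory first.
        return "writeback-and-invalidate"
    return "invalidate"


if __name__ == "__main__":
    print(discard_action("S", "U"))   # supervisor's zeroes are preserved
    print(discard_action("U", "U"))   # the process drops its own data
```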


-- Jacob

Richard Herveille

Mar 19, 2018, 7:25:00 AM
to jcb6...@gmail.com, Albert Cahalan, Allen Baum, Michael Chapman, Tommy Thorn, Rogier Brussee, RISC-V ISA Dev, Richard Herveille

Multi-port memories are very expensive. In area, a dual-port (2RW) memory is almost twice as big as a single-port memory.

It’s not the bit-cells that count (those are tiny), but the row/address lines, amplifiers, drivers, and address decoders.

Therefore, use single-port (1RW) memory wherever possible.

FPGAs made life easier. Most of them implement dual-port (2RW) memories in predefined blocks. That makes it relatively easy to implement a 2R1W memory. Adding read ports is also straightforward: you simply double the memory. For example, a 4R1W memory is built using two 2R1W memories, where the write data is shared between the two. In an FPGA it is not possible to go beyond 2 write ports without serious logic overhead and complexity.
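
The port-doubling trick can be shown with a behavioural model. This is a Python sketch rather than RTL, and the class names are made up for the example: every write goes to both copies, and each copy serves two of the four reads.

```python
class Mem2R1W:
    """Behavioural model of a memory with two read ports and one write port."""

    def __init__(self, depth):
        self.data = [0] * depth

    def write(self, addr, value):
        self.data[addr] = value

    def read(self, a, b):
        return self.data[a], self.data[b]


class Mem4R1W:
    """Four read ports built from two 2R1W copies that share the write data."""

    def __init__(self, depth):
        self.banks = (Mem2R1W(depth), Mem2R1W(depth))

    def write(self, addr, value):
        # The single write port drives both copies so they stay identical.
        for bank in self.banks:
            bank.write(addr, value)

    def read(self, a, b, c, d):
        # Reads a and b use the first copy, reads c and d use the second.
        return self.banks[0].read(a, b) + self.banks[1].read(c, d)


if __name__ == "__main__":
    regs = Mem4R1W(32)
    regs.write(5, 42)
    regs.write(7, 9)
    print(regs.read(5, 7, 7, 5))
```

The cost is visible in the model: storage doubles for every two extra read ports, while the number of write ports stays at one.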

 

Taking both reasons into account I’d implement multi-register load/store using a sequencer. It can reuse the existing architecture and load/store units and only needs a simple RF.

If this is handled at the ID-level, then even an OoO CPU would not be affected, at least for stores, as it can schedule new instructions around the stores.

 

Richard

 

 


Christoph Hellwig

Mar 19, 2018, 8:39:34 AM
to Albert Cahalan, jcb6...@gmail.com, Christoph Hellwig, RISC-V ISA Dev
On Sat, Mar 17, 2018 at 01:48:40AM -0400, Albert Cahalan wrote:
> They are dangerous: "Presumably, the supervisor zeroed the page (and
> forced writeback) before attaching it to the user process."

The second part generally does not hold, at least not in any
general-purpose OS.

Christoph Hellwig

Mar 19, 2018, 8:42:07 AM
to Guy Lemieux, Albert Cahalan, Christoph Hellwig, RISC-V ISA Dev, jcb6...@gmail.com
On Sun, Mar 18, 2018 at 11:16:02AM +0000, Guy Lemieux wrote:
> > They would zero the page. I've never heard of a supervisor that would
> > force writeback; this is not required for any typical CPU.
>
>
> systems without coherent dma should force a write back after zeroing, or
> else there is a security hole as other process data can leak through via
> dma.

Systems without coherent DMA in general can't offer user land DMA.
No general purpose supervisor is going to slow down the common case
(allocating memory to user processes) for a super special case like
user level DMA. Also currently all user level DMA in typical
architectures requires the memory to be pinned down, which would be
a place to force out the cache lines. With SVA this is going to be
changing, so the problem space will become more interesting.

Christoph Hellwig

Mar 19, 2018, 8:48:51 AM
to Cesar Eduardo Barros, Albert Cahalan, Guy Lemieux, Christoph Hellwig, RISC-V ISA Dev, Jacob Bachmeyer
On Sun, Mar 18, 2018 at 11:31:18PM -0300, Cesar Eduardo Barros wrote:
> If the system has an IOMMU, a user process can be allowed to directly
> request a bus master to initiate DMA without being able to "go straight for
> the kernel". Consider for instance the scenario where a PCI device (for
> instance, a graphics card) is "exported" to a virtual machine; this is used
> nowadays to do things like running Windows on a VM with fully accelerated
> graphics (for gaming). Clearly there's something preventing the OS within
> the VM from simply asking the PCI card to DMA outside the VM's boundaries.

Not so fast.

The current state of the art for iommus is that you have a static
pool premapped for the VM or user level driver case. That memory is
assigned ahead of time to the process or VM, and any cache flushing
is done at that time.

If you want a more dynamic setup where cache flushing could become
a problem, you need bleeding-edge iommu and platform features, most notably
an iommu that understands the CPU page table format, and support for
page faults from I/O devices. The latest x86 and arm systems can support
these, but generally only on PCIe or PCIe-like architectures and with
strong cache coherency.

Aaron Severance

Mar 19, 2018, 1:39:08 PM
to jcb6...@gmail.com, Albert Cahalan, Christoph Hellwig, RISC-V ISA Dev
1) Some implementations will not do this.
2) Some implementations will try to do something like this and still expose side-channel information due to a faulty implementation or overly aggressive optimizations.

I agree with Albert's suggestion of making cache-control optimizations controllable by higher privilege levels and disabled by default.  Granted, if there is side-channel information the attacker probably will not NEED cache control instructions to get it, but they may make attacks much more practical.
  Aaron

 


Allen Baum

Mar 20, 2018, 12:45:06 PM
to Aaron Severance, Jacob Bachmeyer, Albert Cahalan, Christoph Hellwig, RISC-V ISA Dev
There are programming models and implementations that will require user code to explicitly evict address ranges from the cache, so that, at least, must be available to user code.
Even if it weren't, malicious code can always force eviction by dummy loads, so preventing that isn't much of a security measure.


Jose Renau

Mar 20, 2018, 5:26:11 PM
to Allen Baum, Aaron Severance, Jacob Bachmeyer, Albert Cahalan, Christoph Hellwig, RISC-V ISA Dev

Blocking it may not be a security protection, because of dummy loads, but allowing it really speeds up the attack. I think
that by default the user application level should not have permission to do this, but I am OK with allowing a CSR
to expose this to the application for some "weird" apps.

lkcl .

Mar 27, 2018, 11:27:50 AM
to Jacob Bachmeyer, RISC-V ISA Dev
On Thu, Mar 8, 2018 at 4:19 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> Previous discussions suggested that explicit cache-control instructions
> could be useful, but RISC-V has some constraints here that other
> architectures lack, namely that caching must be transparent to the user ISA.

interesting. i've just encountered a design which contains a couple
of specific cache-management issues / instructions, see "Cache
Coherence" section:
https://github.com/jbush001/NyuziProcessor/wiki/Microarchitecture

> Thoughts?

as i'm catching up it's quite hard to understand the motivation (the
"why") of this clearly comprehensive and well-thought-through work,
which, from the references that you give jacob, clearly indicates that
people have been working on it for some time. lists however... are
extremely well-known for being utterly impossible to use as "document
management and structured informational ordering and retrieval
systems". is there a wiki somewhere (similar to jeff's document) or
any online documentation or a paper which outlines the motivation and
why these specific instructions are needed, and what their benefit is?
would that help people to understand them and thus motivate people to
implement them?

secondly, in looking at what jeff has done in Nyuzi, he's added a
couple of things which he's clearly thought through from first
principles, defining what a cache should be, defining an explicit
"membar" instruction and then proving that it meets the criteria of
cache coherency. the addition of this explicit instruction hugely
simplifies the entire cache design over alternative architectures.
would it be reasonable (as in, reason-able) to add something like this
as an option?

thirdly: further up the document he points to a specific optimisation
where, due to a huge vector write, there is *clearly* no need to issue
a cache read: the vector write is *going* to overwrite and fill the
entire cache line, so why even do an external memory read? and so,
that's exactly what he implements: vector writes *directly* overwrite
the entire cache line. normally, detecting that this is going to
occur would be... difficult. it would require read-ahead analysis of
multiple instructions to see if a loop was going to result in multiple
writes. question: would it be reasonable to add an instruction that
*explicitly* allocates/marks a region of memory [to be cached] as
"write-only"? and whilst we're at it, a "read-only" version as well?

or... on this latter, and i apologise for not being completely
familiar with "pinning" (some help on reading up on it greatly
appreciated), is cache pinning effectively equivalent to marking an
area (cache line or lines) as "write-only" anyway?

if so that would be fantastic as the proposed extensions could be used
to effectively "port" Nyuzi 3D engine to a parallel RISC-V paradigm
without losing too much of the specialist work that Jeff put into
Nyuzi.

l.

Guy Lemieux

Mar 27, 2018, 12:25:55 PM
to lkcl ., Jacob Bachmeyer, RISC-V ISA Dev
On Tue, Mar 27, 2018 at 8:27 AM, lkcl . <luke.l...@gmail.com> wrote:
> On Thu, Mar 8, 2018 at 4:19 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>> Previous discussions suggested that explicit cache-control instructions
>> could be useful, but RISC-V has some constraints here that other
>> architectures lack, namely that caching must be transparent to the user ISA.
>
> interesting. i've just encountered a design which contains a couple
> of specific cache-management issues / instructions, see "Cache
> Coherence" section:
> https://github.com/jbush001/NyuziProcessor/wiki/Microarchitecture
>
>> Thoughts?

this design is cache-coherent. the proposed cache-control instructions are
for systems without coherence.


> as i'm catching up it's quite hard to understand the motivation (the
> "why")

this was all stripped to make the spec as concise as possible.

> of this clearly comprehensive and well-thought-through work,
> which, from the references that you give jacob, clearly indicates that
> people have been working on it for some time. lists however... are
> extremely well-known for being utterly impossible to use as "document
> management and structured informational ordering and retrieval
> systems". is there a wiki somewhere (similar to jeff's document) or
> any online documentation or a paper which outlines the motivation and
> why these specific instructions are needed, and what their benefit is?
> would that help people to understand them and thus motivate people to
> implement them?

unfortunately not, just the collective wisdom of the contributors.

> secondly, in looking at what jeff has done in Nyuzi, he's added a
> couple of things which he's clearly thought through from first
> principles, defining what a cache should be, defining an explicit
> "membar" instruction and then proving that it meets the criteria of
> cache coherency.

similar to a FENCE in RISC-V

> the addition of this explicit instruction hugely
> simplifies the entire cache design over alternative architectures.
> would it be reasonable (as in, reason-able) to add something like this
> as an option?

use FENCE. warning: the RISC-V spec is overly conservative on FENCE
instructions. to stay in spec, a FENCE will flush non-coherent caches.
applications that depend upon particular memory ordering rely upon this
behaviour, so it can't be changed.

this has long bothered me... perhaps we need to add a lighter weight
version to the cache-control instructions that simply flushes write buffers.

> thirdly: further up the document he points to a specific optimisation
> where, due to a huge vector write, there is *clearly* no need to issue
> a cache read: the vector write is *going* to overwrite and fill the
> entire cache line, so why even do an external memory read? and so,
> that's exactly what he implements: vector writes *directly* overwrite
> the entire cache line.

from my reading of the pages you linked, it sounds like this only happens
when vector writes are aligned with the cache. this is easy to detect.

> normally, detecting that this is going to
> occur would be... difficult. it would require read-ahead analysis of
> multiple instructions to see if a loop was going to result in multiple
> writes.

correct, on a general CPU, this is hard to detect by analyzing
individual store-word instructions. (but not impossible -- it is a
stream buffer pattern that can be detected.)

> question: would it be reasonable to add an instruction that
> *explicitly* allocates/marks a region of memory [to be cached] as
> "write-only"?

See MEM.REWRITE

> and whilst we're at it, a "read-only" version as well?

See MEM.PFx, MEM.PF.ONCE and MEM.PF.EXCL

Neither of these enforce write-only or read-only behaviour. However, a
read after MEM.REWRITE returns undefined results (some implementations
will write zeros, but this cannot be relied upon.) And a write after a
MEM.PF is perfectly valid behaviour.

> or... on this latter, and i apologise for not being completely
> familiar with "pinning" (some help on reading up on it greatly
> appreciated), is cache pinning effectively equivalent to marking an
> area (cache line or lines) as "write-only" anyway?

Pinning forces the contents to remain in-cache and not be evicted
by other accesses that might otherwise attempt to use this cache line.

The intent is to use this for N-way set associative caches, where
up to N-1 ways (addresses that map to the same set) can be pinned.
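
As a rough illustration of the N-1 limit (a sketch only, not part of the proposal), the Python model below treats one set of an N-way cache with LRU replacement. The class name CacheSet, its methods, and the choice of LRU are all assumptions made up for this example; the point is only that pinned lines are skipped by eviction and that at least one way always stays unpinned.

```python
class CacheSet:
    """One set of an N-way set-associative cache with LRU replacement.

    Hypothetical model: names and policy are illustrative, not from the proposal.
    """

    def __init__(self, ways):
        self.ways = ways
        self.lines = []      # resident tags, least recently used first
        self.pinned = set()  # tags that must not be evicted

    def access(self, tag):
        """Touch a tag; return the evicted tag, or None if nothing was evicted."""
        if tag in self.lines:
            self.lines.remove(tag)
            self.lines.append(tag)   # move to most-recently-used position
            return None
        victim = None
        if len(self.lines) == self.ways:
            # Evict the least recently used line that is not pinned.
            victim = next(t for t in self.lines if t not in self.pinned)
            self.lines.remove(victim)
        self.lines.append(tag)
        return victim

    def pin(self, tag):
        """Pin a tag; refuse if that would leave no way free for other accesses."""
        if tag in self.pinned:
            return True
        if len(self.pinned) >= self.ways - 1:
            return False
        self.access(tag)             # bring the line in if needed
        self.pinned.add(tag)
        return True
```

With CacheSet(4), three pin() calls succeed and a fourth returns False; later accesses to other tags then compete for the single remaining way.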

> if so that would be fantastic as the proposed extensions could be used
> to effectively "port" Nyuzi 3D engine to a parallel RISC-V paradigm
> without losing too much of the specialist work that Jeff put into
> Nyuzi.

see also:
https://www.github.com/vectorblox/mxp
which already works with RISC-V. Compared to Nyuzi, MXP is fully
scalable (almost everything that controls size is parameterized), has
support for variable-length vectors, and is already FPGA-optimized so
it achieves about twice the clock frequency; not sure about the size
differential. MXP also includes a full software architectural
simulator, allowing code to be run on a desktop. However, MXP does not
use cache coherence.

Thanks,
Guy

lkcl .

Mar 27, 2018, 1:46:47 PM
to Guy Lemieux, Jacob Bachmeyer, RISC-V ISA Dev
On Tue, Mar 27, 2018 at 5:25 PM, Guy Lemieux <glem...@vectorblox.com> wrote:
> On Tue, Mar 27, 2018 at 8:27 AM, lkcl . <luke.l...@gmail.com> wrote:
>> On Thu, Mar 8, 2018 at 4:19 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>>> Previous discussions suggested that explicit cache-control instructions
>>> could be useful, but RISC-V has some constraints here that other
>>> architectures lack, namely that caching must be transparent to the user ISA.
>>
>> interesting. i've just encountered a design which contains a couple
>> of specific cache-management issues / instructions, see "Cache
>> Coherence" section:
>> https://github.com/jbush001/NyuziProcessor/wiki/Microarchitecture
>>
>>> Thoughts?
>
> this design is cache-coherent.

ah: surprisingly it's not. there's one case which is explicitly not
covered, for which jeff proposes (and created) the membar instruction.
let me find... ok:

"This processor uses a relaxed memory consistency model. Because the
pipeline issues memory instructions in order, it preserves
read-to-read, and write-to-write ordering. But it cannot guarantee
write-to-read ordering between threads because execution of a thread
can proceed while a write is in the store queue. The membar (memory
barrier) instruction enforces explicit ordering by suspending the
thread if a pending store is in the store queue. When the L2 cache
acknowledges it, it resumes the thread. This guarantees all other L1
caches have the stored data."

so it's only cache-coherent *if* you *explicitly* call the membar instruction.


> the proposed cache-control instructions are
> for systems without coherence.

oo goood. so really useful for something like high-performance
clustered multi-processing where the sheer number of processors (and
L1 caches) would make it otherwise really difficult to do cache
coherency, right? ok, so what you answered here would be something
that would be great to put on a summary / wiki page somewhere :) or,
maybe just... put that in right at the top of the spec.

btw the reason i'm interested is because i'm doing a feasibility
study of throwing multiple minimalist RV32 or RV64 cores at the
problem of doing a 3D engine, along-side something like e.g. Jeff's
ChiselGPU Rasteriser function.

>
>> as i'm catching up it's quite hard to understand the motivation (the
>> "why")
>
> this was all stripped to make the spec as concise as possible.

ok. understandable, as the spec outlines "what" and "how".

>> systems". is there a wiki somewhere (similar to jeff's document) or
>> any online documentation or a paper which outlines the motivation and
>> why these specific instructions are needed, and what their benefit is?
>> would that help people to understand them and thus motivate people to
>> implement them?
>
> unfortunately not, just the collective wisdom of the contributors.

ok. well, even that one first sentence you gave answers one of the
key critical questions.

>> secondly, in looking at what jeff has done in Nyuzi, he's added a
>> couple of things which he's clearly thought through from first
>> principles, defining what a cache should be, defining an explicit
>> "membar" instruction and then proving that it meets the criteria of
>> cache coherency.
>
> similar to a FENCE in RISC-V

oh! ok. just looking that up... so section A.3.4... which doesn't
explain what it is... urrr... ok, intuitively i get the general idea,
you can define "fences" that protect one set of instructions' reads
(or writes) from another set.

>> the addition of this explicit instruction hugely
>> simplifies the entire cache design over alternative architectures.
>> would it be reasonable (as in, reason-able) to add something like this
>> as an option?
>
> use FENCE. warning: the RISC-V spec is overly conservative on FENCE
> instructions. to stay in spec, a FENCE will flush non-coherent caches.
> applications that depend upon particular memory ordering rely upon this
> behaviour, so it can't be changed.
>
> this has long bothered me... perhaps we need to add a lighter weight
> version to the cache-control instructions that simply flushes write buffers.

you understand this stuff way better than i do. so whilst i cannot
say technically if what you propose would do the job, i can definitely
say that if what you propose works that would be *really* useful, as a
far simpler multi-core networking topology and cache arrangement could
be feasible, where right now it's exceedingly challenging.



>> thirdly: further up the document he points to a specific optimisation
>> where, due to a huge vector write, there is *clearly* no need to issue
>> a cache read: the vector write is *going* to overwrite and fill the
>> entire cache line, so why even do an external memory read? and so,
>> that's exactly what he implements: vector writes *directly* overwrite
>> the entire cache line.
>
> from my reading of the pages you linked, it sounds like this only happens
> when vector writes are aligned with the cache. this is easy to detect.

jeff says on his wiki, "If a store writes a full line to the L2 cache
as a block vector store and the line is not cache resident, it doesn't
load it from memory--which would be unnecessary--but instead puts the
new data into the line."

so i would interpret that to mean that the *start* of vector writes
would not necessarily need to be aligned with a cache, but that if any
part of a vector write happens to coincide with one or more cache
lines, that's when the load is avoided.

which brings us to an interesting issue: in the case of the proposed
lighter-weight "flush write buffers only", (in combination i assume,
perhaps incorrectly, with FENCE instructions?) or in fact even the
proposed instructions as-is, what _does_ happen to the rest of the
cache line if you start writing say part-way through a cache line?

so let's say you issue "flush write buffer" on cache line 1 and it's
16 bytes in length, and you then start writing to bytes 8-15, are
bytes 0-7 still valid? i guess what i'm saying is, are the bytes that
you *don't* write to "defined" or are they "undefined"?

and if they're undefined, is there a way to detect (programmatically)
what the cache line size / boundary is, so that a program can
*guarantee* to start e.g. writing vectors at the beginning of a cache
boundary?


>> normally, detecting that this is going to
>> occur would be... difficult. it would require read-ahead analysis of
>> multiple instructions to see if a loop was going to result in multiple
>> writes.
>
> correct, on a general CPU, this is hard to detect by analyzing
> individual store-word instructions. (but not impossible -- it is a
> stream buffer pattern that can be detected.)

honestly if faced with following some well-thought-out rules that can
be dealt with in the compiler, or making the hardware more complex,
i'd go every time for the software.

>> question: would it be reasonable to add an instruction that
>> *explicitly* allocates/marks a region of memory [to be cached] as
>> "write-only"?
>
> See MEM.REWRITE

okay! cool!

>> and whilst we're at it, a "read-only" version as well?
>
> See MEM.PFx, MEM.PF.ONCE and MEM.PF.EXCL

ha, even better.

> Neither of these enforce write-only or read-only behaviour. However, a
> read after MEM.REWRITE returns undefined results (some implementations
> will write zeros, but this cannot be relied upon.) And a write after a
> MEM.PF is perfectly valid behaviour.
>
>> or... on this latter, and i apologise for not being completely
>> familiar with "pinning" (some help on reading up on it greatly
>> appreciated), is cache pinning effectively equivalent to marking an
>> area (cache line or lines) as "write-only" anyway?
>
> Pinning forces the contents to remain in-cache and not be evicted
> by other accesses that might otherwise attempt to use this cache line.

but does it also (deliberately) break cache coherency? i.e. if
you've pinned a cache line as "write-only" (sorry i don't know the
precise terminology) and another core also tries to write to the same
address, would that cause:

* A: an exception
* B: the second core to be STALLED until the pinning is released?
  (this might be desirable)
* C: deliberate undefined behaviour

apologies if that's unclear.

>> if so that would be fantastic as the proposed extensions could be used
>> to effectively "port" Nyuzi 3D engine to a parallel RISC-V paradigm
>> without losing too much of the specialist work that Jeff put into
>> Nyuzi.
>
> see also:
> https://www.github.com/vectorblox/mxp

ah.. darn-it, i got really excited... and then noticed that there's
no libre license. the project i'm working on, it's mandatory that all
code be BSD/MIT licensed and publicly available for independent
auditing, for security reasons (we've had enough Intel MEs and
spectres and meltdowns). someone else (reading in the future) for
whom that constraint is not an issue may be particularly interested to
know that mxp exists.

l.

lkcl .

Mar 27, 2018, 1:59:10 PM
to Aaron Severance, Guy Lemieux, Jacob Bachmeyer, RISC-V ISA Dev
On Tue, Mar 27, 2018 at 6:33 PM, Aaron Severance
<aseve...@vectorblox.com> wrote:

> FENCE behaviour with non-coherent caches is something that I don't think is
> explicit and needs to be clarified.

so, jeff's example would be quite useful there, in defining a
real-world case where FENCE and the proposed cache extensions would
actually be used?

> I'm guessing the main RISC-V ISA and
> memory specs are going to punt on this (their specs seem to be only valid
> for coherent systems) but if this spec clarifies cache FLUSHING/WRITEBACK
> behaviour for non-coherent caches then it would be good to state what FENCE
> behaviour is.
>
> Initially I had thought that FENCEs should cause writebacks/flushes, but
> recently my opinion has shifted to think they should not. Two reasons:
>
> 1) Code written for coherent systems with appropriate FENCEs will not always
> work on non-coherent cached systems even if the cache is flushed at each
> FENCE. So since code will not be portable I think it's better to just
> define FENCEs as ordering coherent memory, and require the cache control
> instructions to be used for proper behaviour on non-coherent systems.

would it be reasonable to conclude that anyone giving serious
consideration to non-coherent systems would be looking to develop an
ultra-specialist customised hardware block, for which running
general-purpose software written for use on "standard" cache-coherent
RISC-V systems would be completely out of the question?

the reason i ask is because in discussions with the shakti team,
where we are considering an 8-core SMP design, we briefly touched on
the idea of splitting the L2 cache into 4 parts, so as to avoid having
massive-way set associativity and associated heavy routing (and to
solve inter-group coherency by some as-yet-to-be-discussed mechanism)

and that *is* intended as a general-purpose SMP design but, thinking
it through... it would have to run general-purpose software (standard
debian-riscv and fedora-riscv linux), no special recompilations
allowed, therefore the software could not possibly include the
proposed cache coherency instructions.... therefore it would have to
solve the 4-way split L2 cache coherency transparently... so is not a
valid counter-example.

ok so, sorry, yes, i think i am agreeing with you :)

l.

Guy Lemieux

Mar 27, 2018, 2:04:02 PM
to lkcl ., Jacob Bachmeyer, RISC-V ISA Dev
> ah: surprisingly it's not. there's one case which is explicitly not
> covered, for which jeff proposes (and created) the membar instruction.
> let me find... ok:
...
> so it's only cache-coherent *if* you *explicitly* call the membar instruction.

cache coherence is not the same as memory consistency.

membar is used to ensure consistency. it is similar to the FENCE
instruction in RISC-V.
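
A toy model may help separate the two ideas. The Python sketch below is an assumption-laden illustration, not Nyuzi's or RISC-V's actual behaviour: the Core class, its store queue, and the litmus function are invented for this example. There is only one copy of memory, so coherence is trivially satisfied, yet the store queue alone is enough to break write-to-read ordering between cores until a barrier drains it.

```python
class Core:
    """Hypothetical core with a store queue in front of shared memory."""

    def __init__(self, memory):
        self.memory = memory
        self.store_queue = []   # pending (address, value) pairs, oldest first

    def store(self, addr, value):
        self.store_queue.append((addr, value))   # not yet visible to other cores

    def load(self, addr):
        # A core sees its own pending stores (newest first), then shared memory.
        for a, v in reversed(self.store_queue):
            if a == addr:
                return v
        return self.memory.get(addr, 0)

    def membar(self):
        # Drain the store queue: every earlier store becomes globally visible.
        for a, v in self.store_queue:
            self.memory[a] = v
        self.store_queue.clear()


def litmus(use_membar):
    """Store-buffering test: each core writes its flag, then reads the other's."""
    memory = {}
    c0, c1 = Core(memory), Core(memory)
    c0.store("x", 1)
    c1.store("y", 1)
    if use_membar:
        c0.membar()
        c1.membar()
    return c0.load("y"), c1.load("x")
```

In this model litmus(False) returns (0, 0), the outcome a sequentially consistent machine would never produce, while litmus(True) returns (1, 1).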


> oo goood. so really useful for something like high-performance
> clustered multi-processing where the sheer number of processors (and
> L1 caches) would make it otherwise really difficult to do cache
> coherency, right?

with a large number of processors, cache coherence does start to
generate a lot of traffic. this is why directory-based coherence
schemes were developed, where a list of sharing processors is
maintained to cut down on broadcasts.
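
A minimal sketch of that idea, under stated assumptions: the Directory class and its methods below are hypothetical and ignore states, races, and limited sharer lists. It only shows that a write needs to notify the recorded sharers instead of every other core.

```python
class Directory:
    """Hypothetical directory: tracks which cores hold a copy of each line."""

    def __init__(self, num_cores):
        self.num_cores = num_cores
        self.sharers = {}   # line address -> set of core ids holding the line

    def read(self, core, line):
        self.sharers.setdefault(line, set()).add(core)

    def write(self, core, line):
        """Return the cores that must be sent an invalidation."""
        targets = self.sharers.get(line, set()) - {core}
        self.sharers[line] = {core}   # the writer becomes the only holder
        return targets

    def broadcast_targets(self, core):
        """What a snooping scheme without a directory would have to notify."""
        return set(range(self.num_cores)) - {core}
```

With 64 cores and a line shared by cores 3 and 7, a write by core 3 invalidates only core 7, where a broadcast would reach 63 cores.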

>> similar to a FENCE in RISC-V
>
> oh! ok. just looking that up... so section A.3.4... which doesn't
> explain what it is... urrr... ok, intuitively i get the general idea,
> you can define "fences" that protect one set of instructions' reads
> (or writes) from another set.

yes, that's the general idea.

in a non-coherent system, there is an implication you need to write back
dirty cache lines, allowing other agents to see all previous writes
before any subsequent reads or writes. it also implies you must flush
clean cache lines, so that subsequent reads observe other agents' writes as well.
the wording is ambiguous, and different people have slightly different
interpretations here.


>> from my reading of the pages you linked, it sounds like this only happens
>> when vector writes are aligned with the cache. this is easy to detect.
>
> jeff says on his wiki, "If a store writes a full line to the L2 cache
> as a block vector store and the line is not cache resident, it doesn't
> load it from memory--which would be unnecessary--but instead puts the
> new data into the line."

key part: "writes a full line".

I may have skimmed too much, but it looked to me that the vector
lengths were fixed and matched the cache line length, so I inferred
"aligned" from this statement. I could be wrong.


>> Pinning forces the contents to remain in-cache and not be evicted
>> by other accesses that might otherwise attempt to use this cache line.
>
> but does it also (deliberately) break cache coherency?

pinning is for non-coherent systems. there is no coherence to break.



> ah.. darn-it, i got really excited... and then noticed that there's
> no libre license. the project i'm working on, it's mandatory that all
> code be BSD/MIT licensed and publicly available for independent
> auditing, for security reasons (we've had enough Intel MEs and
> spectres and meltdowns). someone else (reading in the future) for
> whom that constraint is not an issue may be particularly interested to
> know that mxp exists.

source code licenses are available, but cannot be disclosed publicly.

for the right price, the entire MXP design could be released under
some type of open source agreement (perhaps non-commercial only).

Guy

Jacob Bachmeyer

Mar 27, 2018, 7:31:23 PM
to lkcl ., RISC-V ISA Dev
lkcl . wrote:
> On Thu, Mar 8, 2018 at 4:19 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
>> Previous discussions suggested that explicit cache-control instructions
>> could be useful, but RISC-V has some constraints here that other
>> architectures lack, namely that caching must be transparent to the user ISA.
>>
>
> interesting. i've just encountered a design which contains a couple
> of specific cache-management issues / instructions, see "Cache
> Coherence" section:
> https://github.com/jbush001/NyuziProcessor/wiki/Microarchitecture
>
>
>> Thoughts?
>>
>
> as i'm catching up it's quite hard to understand the motivation (the
> "why") of this clearly comprehensive and well-thought-through work,
> which, from the references that you give jacob, clearly indicates that
> people have been working on it for some time. lists however... are
> extremely well-known for being utterly impossible to use as "document
> management and structured informational ordering and retrieval
> systems". is there a wiki somewhere (similar to jeff's document) or
> any online documentation or a paper which outlines the motivation and
> why these specific instructions are needed, and what their benefit is?
> would that help people to understand them and thus motivate people to
> implement them?
>

Unfortunately not; although I have kept the same proposal title through
the drafts, so the past discussions should not be too hard to dig out of
the list archives.

> secondly, in looking at what jeff has done in Nyuzi, he's added a
> couple of things which he's clearly thought through from first
> principles, defining what a cache should be, defining an explicit
> "membar" instruction and then proving that it meets the criteria of
> cache coherency. the addition of this explicit instruction hugely
> simplifies the entire cache design over alternative architectures.
> would it be reasonable (as in, reason-able) to add something like this
> as an option?
>

I think that this is similar to the RISC-V baseline FENCE and the
proposed ranged FENCE instructions.

> thirdly: further up the document he points to a specific optimisation
> where, due to a huge vector write, there is *clearly* no need to issue
> a cache read: the vector write is *going* to overwrite and fill the
> entire cache line, so why even do an external memory read? and so,
> that's exactly what he implements: vector writes *directly* overwrite
> the entire cache line. normally, detecting that this is going to
> occur would be... difficult. it would require read-ahead analysis of
> multiple instructions to see if a loop was going to result in multiple
> writes. question: would it be reasonable to add an instruction that
> *explicitly* allocates/marks a region of memory [to be cached] as
> "write-only"? and whilst we're at it, a "read-only" version as well?
>

Those are MEM.REWRITE (allocate writable cachelines ignoring current
contents) and MEM.DISCARD (drop any pending writes that have not
committed). A "read-only" version sounds like an ordinary PREFETCH to me.

Detecting and optimizing vector writes that clobber entire cachelines is
something an RVV (RISC-V vector extension) implementation could do, but
probably out-of-scope for any reasonable scalar unit.

> or... on this latter, and i apologise for not being completely
> familiar with "pinning" (some help on reading up on it greatly
> appreciated), is cache pinning effectively equivalent to marking an
> area (cache line or lines) as "write-only" anyway?
>

It is effectively equivalent to dynamically-provisioned cache-as-RAM
(while the rest of the cache operates normally) up to some
implementation-defined limit. It was inspired by a problem with
implementing self-reflashing on the HiFive board that Bruce Hoult
complained about on the list.

> if so that would be fantastic as the proposed extensions could be used
> to effectively "port" Nyuzi 3D engine to a parallel RISC-V paradigm
> without losing too much of the specialist work that Jeff put into
> Nyuzi.

The proposed MEM.REWRITE instruction declares that an extent of memory
will be overwritten in its entirety. What to do with improper use of
that instruction has been the source of some recent complaints,
including demands to make it a privileged instruction, which would
defeat the purpose of having it in the first place.


-- Jacob


Jacob Bachmeyer

Mar 27, 2018, 7:52:28 PM
to Guy Lemieux, lkcl ., RISC-V ISA Dev
Guy Lemieux wrote:
> On Tue, Mar 27, 2018 at 8:27 AM, lkcl . <luke.l...@gmail.com> wrote:
>
>> On Thu, Mar 8, 2018 at 4:19 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>>
>>> Previous discussions suggested that explicit cache-control instructions
>>> could be useful, but RISC-V has some constraints here that other
>>> architectures lack, namely that caching must be transparent to the user ISA.
>>>
>> interesting. i've just encountered a design which contains a couple
>> of specific cache-management issues / instructions, see "Cache
>> Coherence" section:
>> https://github.com/jbush001/NyuziProcessor/wiki/Microarchitecture
>>
>>
>>> Thoughts?
>>>
>
> this design is cache-coherent. the proposed cache-control instructions are
> for systems without coherence.
>

A minor correction: the proposal is for systems of all types. Systems
without hardware cache coherence may need the proposed instructions, but
they should still be useful on systems with hardware cache coherency as
performance optimization hints.

>> as i'm catching up it's quite hard to understand the motivation (the
>> "why")
>>
>
> this was all stripped to make the spec as concise as possible.
>

Agreed. The proposal is a specification. It is possible that
commentary explaining some of the decisions may be added at a later
date, but for now, we are still finalizing the technical details.

>> the addition of this explicit instruction hugely
>> simplifies the entire cache design over alternative architectures.
>> would it be reasonable (as in, reason-able) to add something like this
>> as an option?
>>
>
> use FENCE. warning: the RISC-V spec is overly conservative on FENCE
> instructions. to stay in spec, a FENCE will flush non-coherent caches.
> applications that depend upon particular memory ordering rely upon this
> behaviour, so it can't be changed.
>
> this has long bothered me... perhaps we need to add a lighter weight
> version to the cache-control instructions that simply flushes write buffers.
>

Do the proposed ranged FENCE instructions fit this need? What about
CACHE.WRITEBACK? There is also a reserved slot near FENCE.I that could
provide a new fence instruction, if there is no form of the baseline
FENCE that simply flushes write buffers. ("FENCE po,pw,si,so,sr,sw"
perhaps?)

>> and whilst we're at it, a "read-only" version as well?
>>
>
> See MEM.PFx, MEM.PF.ONCE and MEM.PF.EXCL
>
> Neither of these enforce write-only or read-only behaviour. However, a
> read after MEM.REWRITE returns undefined results (some implementations
> will write zeros, but this cannot be relied upon.) And a write after a
> MEM.PF is perfectly valid behaviour.
>

MEM.REWRITE is defined to *not* modify previously valid cachelines: if
a cacheline was valid when MEM.REWRITE is executed, it may retain its
contents.

A cacheline that is already part of the region and valid is unchanged.
A clean cacheline that will be evicted to cache the region (or a
previously invalid cacheline that will be allocated) may simply have its
tag changed, if the hardware can prove that the cacheline previously
mapped memory that is accessible to the current thread, leaving the old
contents from somewhere else in the address space. The only time the
implementation is required to write to the cacheline allocated by
MEM.REWRITE is when the hardware cannot prove that the previous contents
of the cacheline were accessible to the current thread at some address.
("out of thin air" is permitted; introducing "foreign data" is not
permitted)
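
The three cases can be restated as a small decision function. This is only a sketch of the rules as worded here: the function name, its two boolean inputs, and the returned labels are hypothetical, and writing zeros is just one possible way to satisfy the last case.

```python
def rewrite_action(line_valid_in_region, old_contents_provably_accessible):
    """Pick what the cache does with one line allocated by MEM.REWRITE.

    Hypothetical helper restating the rules above; the return values are
    labels for this sketch, not terms from the proposal.
    """
    if line_valid_in_region:
        return "keep"         # already caches part of the region: unchanged
    if old_contents_provably_accessible:
        return "retag"        # change the tag only; old data may stay visible
    return "overwrite"        # must not expose foreign data, e.g. write zeros
```

Only the last branch costs a write to the data array; the first two touch at most the tag.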

One way to prove that the cache cannot introduce foreign data is to
simply flush and zero the entire (user) cache upon context switch, which
hardware can detect by observing writes to the satp CSR. Another option
is to partition the cache by ASID; this also allows a VIPT cache to have
independent arrays for each ASID, effectively providing a larger cache
while closing inter-process side channels.


-- Jacob

Guy Lemieux

Mar 27, 2018, 8:14:18 PM
to Jacob Bachmeyer, lkcl ., RISC-V ISA Dev
> MEM.REWRITE is defined to *not* modify previously valid cachelines: if a
> cacheline was valid when MEM.REWRITE is executed, it may retain its
> contents.

either I forgot about this, or didn't realize it.

I haven't fully thought this through... but it might take extra work
to implement with this behaviour. while iterating through cache lines,
it has to check if the cache line is anywhere within the valid range
from start to end. however, if the instruction only does one cache
line at a time and returns a progress indicator, then you lose the
original starting point, and when you get to the end you will evict
valid lines at the beginning. when iterating through the cache
entirely in hardware, this can probably be done correctly (comparisons
done only once for each line), but the software-looping version has
trouble. I don't see any immediate solution to this.
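
To make the concern concrete, here is a small Python model, offered as a sketch under simplifying assumptions: a direct-mapped cache indexed by line number, a region given as line numbers [start, end), and a hypothetical function rewrite_region invented for this example. It compares an iteration that remembers the original start with one that only knows the current progress point.

```python
def rewrite_region(start, end, num_lines, remember_start):
    """Model MEM.REWRITE over line numbers [start, end) in a direct-mapped cache.

    Hypothetical model. With remember_start=True the whole region is known at
    every step (hardware iteration). With remember_start=False each step only
    knows [current, end), as in a software loop driven by the progress value.
    """
    cache = [None] * num_lines          # cache[index] = resident line number
    evicted_from_region = []
    for line in range(start, end):
        known_start = start if remember_start else line
        index = line % num_lines
        resident = cache[index]
        if resident is not None and known_start <= resident < end:
            continue                    # resident line belongs to the region: keep it
        if resident is not None and start <= resident < end:
            evicted_from_region.append(resident)
        cache[index] = line
    return evicted_from_region
```

For a 6-line region and a 4-line cache, rewrite_region(0, 6, 4, True) evicts nothing from the region, while rewrite_region(0, 6, 4, False) evicts lines 0 and 1, which is the loss of valid lines at the beginning described above.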

> A cacheline that is already part of the region and valid is unchanged. A
> clean cacheline that will be evicted to cache the region (or a previously
> invalid cacheline that will be allocated) may simply have its tag changed,
> if the hardware can prove that the cacheline previously mapped memory that
> is accessible to the current thread, leaving the old contents from somewhere
> else in the address space. The only time the implementation is required to
> write to the cacheline allocated by MEM.REWRITE is when the hardware cannot
> prove that the previous contents of the cacheline were accessible to the
> current thread at some address. ("out of thin air" is permitted;
> introducing "foreign data" is not permitted)

for simplicity, I suspect most implementations will simply write 0s
into the data cache, and set the cache tag state to valid+clean.

> One way to prove that the cache cannot introduce foreign data is to simply
> flush and zero the entire (user) cache upon context switch, which hardware
> can detect by observing writes to the satp CSR. Another option is to
> partition the cache by ASID; this also allows a VIPT cache to have
> independent arrays for each ASID, effectively providing a larger cache while
> closing inter-process side channels.

when you say "partition the cache by ASID", I hope you mean add ASID
to the tag. having a hard cache partition is a waste.

g

Jacob Bachmeyer

Mar 27, 2018, 9:16:01 PM
to lkcl ., Guy Lemieux, RISC-V ISA Dev
lkcl . wrote:
> On Tue, Mar 27, 2018 at 5:25 PM, Guy Lemieux <glem...@vectorblox.com> wrote:
>
>> On Tue, Mar 27, 2018 at 8:27 AM, lkcl . <luke.l...@gmail.com> wrote:
>>
> [...]
> which brings us to an interesting issue: in the case of the proposed
> lighter-weight "flush write buffers only", (in combination i assume,
> perhaps incorrectly, with FENCE instructions?) or in fact even the
> proposed instructions as-is, what _does_ happen to the rest of the
> cache line if you start writing say part-way through a cache line?
>

The unchanged part must be loaded from memory.

> so let's say you issue "flush write buffer" on cache line 1 and it's
> 16 bytes in length, and you then start writing to bytes 8-15, are
> bytes 0-7 still valid? i guess what i'm saying is, are the bytes that
> you *don't* write to "defined" or are they "undefined"?
>

A "flush write buffer" operation only ensures that pending writes have
been flushed to memory. Any bytes not written retain their previous values.

> and if they're undefined, is there a way to detect (programmatically)
> what the cache line size / boundary is, so that a program can
> *guarantee* to start e.g. writing vectors at the beginning of a cache
> boundary?
>

MEM.REWRITE is the only proposed instruction that can create undefined
contents. The boundaries are transparent: an attempt to rewrite an
unaligned region causes the partial cachelines in the region to be
prefetched instead.

A program that declares that bytes 0-15 will be rewritten but only
updates bytes 8-15 is an invalid program, and bytes 0-7 will have
undefined contents afterwards, which *will* be written back to main
memory. This is how improper use of MEM.REWRITE can destroy nearby data
within the region.

> [...]
>
>
>>> or... on this latter, and i apologise for not being completely
>>> familiar with "pinning" (some help on reading up on it greatly
>>> appreciated), is cache pinning effectively equivalent to marking an
>>> area (cache line or lines) as "write-only" anyway?
>>>
>> Pinning forces the contents to remain in-cache and not be evicted
>> by other accesses that might otherwise attempt to use this cache line.
>>
>
> but does it also (deliberately) break cache coherency? i.e. if
> you've pinned a cache line as "write-only" (sorry i don't know the
> precise terminology) and another core also tries to write to the same
> address, would that cause:
>
> * A an exception
> * B the second core to be STALLED until the pinning is released?
> (this might be desirable)
> * C deliberate undefined behaviour
>
> apologies if that's unclear.
>

This is an issue that has not yet been determined. I lean towards
updating the pinned cacheline in a hardware-coherent system: if hart A
has pinned a cacheline for address X and hart B seeks to write to
address X, the write is carried out, but goes directly to hart A's
cacheline. This is expected to be quite slow compared to normal memory
access. Unpinning the cacheline upon remote intervention is another
alternative, since cachelines can already be unpinned without warning,
for example, due to a preemptive context switch. Guarantees that a
cacheline will remain pinned are implementation-specific and
environment-specific.



-- Jacob

Jacob Bachmeyer

Mar 27, 2018, 9:18:15 PM
to Guy Lemieux, lkcl ., RISC-V ISA Dev
Guy Lemieux wrote:
>>> Pinning forces the contents to remain in-cache and not be evicted
>>> by other accesses that might otherwise attempt to use this cache line.
>>>
>> but does it also (deliberately) break cache coherency?
>>
>
> pinning is for non-coherent systems. there is no coherence to break.
>

Cacheline pinning is intended for all systems. The interaction of
pinning and hardware-enforced coherency remains an open issue with the
proposal.

-- Jacob

Guy Lemieux

Mar 27, 2018, 9:22:09 PM
to Jacob Bachmeyer, lkcl ., RISC-V ISA Dev
> This is an issue that has not yet been determined. I lean towards updating
> the pinned cacheline in a hardware-coherent system: if hart A has pinned a
> cacheline for address X and hart B seeks to write to address X, the write is
> carried out, but goes directly to hart A's cacheline.

You cannot do this.

Most coherence and consistency protocols are invalidation-based.

You are now dictating that all protocols must support an update-based
mechanism. This will be extremely painful for implementers, and
extremely difficult to do proper verification of a system.

This seemingly "little" requirement has such a big impact that it must
be ruled out.

> This is expected to
> be quite slow compared to normal memory access. Unpinning the cacheline
> upon remote intervention is another alternative, since cachelines can
> already be unpinned without warning, for example, due to a preemptive
> context switch. Guarantees that a cacheline will remain pinned are
> implementation-specific and environment-specific.

Pinning will screw up most coherence protocols, which must be able to
reliably invalidate lines.

Pinning is not compatible with coherence, unless you want to write a
brand new coherence protocol that everyone must follow.

Guy

lkcl

Mar 27, 2018, 9:35:01 PM
to Guy Lemieux, Jacob Bachmeyer, RISC-V ISA Dev
On Wed, Mar 28, 2018 at 2:21 AM, Guy Lemieux <glem...@vectorblox.com> wrote:

> Pinning is not compatible with coherence, unless you want to write a
> brand new coherence protocol that everyone must follow.

can i just say... _dang_ this is complex stuff :)

could i make a suggestion / recommendation? write an actual simple
simulation, in python, with appropriate unit tests which demonstrate
various use-cases? multi-threaded or preferably a discrete
event-driven simulation with a bit of randomness deliberately thrown
in on the "clock", have an emulated L1/L2 cache, an "execution" engine
with the proposed instructions, and see how it goes?

err... err.... oh!

https://pypi.python.org/pypi/pycachesim
https://github.com/caleb531/cache-simulator
https://github.com/auxiliary/CacheSimulator
https://devhub.io/repos/vaskevich-CacheSim

so, ha, the task's not that daunting, if multiple people have already
written code, he said (without actually downloading any of the above
yet...)

just a thought?

l.

Jacob Bachmeyer

Mar 27, 2018, 9:38:33 PM
to Guy Lemieux, lkcl ., RISC-V ISA Dev
Guy Lemieux wrote:
>> MEM.REWRITE is defined to *not* modify previously valid cachelines: if a
>> cacheline was valid when MEM.REWRITE is executed, it may retain its
>> contents.
>>
>
> either I forgot about this, or didn't realize it.
>
> I haven't fully thought this through... but it might take extra work
> to implement with this behaviour. while iterating through cache lines,
> it has to check if the cache line is anywhere within the valid range
> from start to end. however, if the instruction only does one cache
> line at a time and returns a progress indicator, then you lose the
> original starting point, and when you get to the end you will evict
> valid lines at the beginning. when iterating through the cache
> entirely in hardware, this can probably be done correctly (comparisons
> done only once for each line), but the software-looping version has
> trouble. I don't see any immediate solution to this.
>

Software must not simply iterate REGION instructions without actually
doing some of the intended work. If used on a buffer larger than cache,
obviously the entire buffer cannot be brought into cache at once.
Hardware must return incremental success and software must make other
forward progress before executing that REGION instruction for the next
part of the region.

For example, memset(3) is one of the simplest uses for MEM.REWRITE.
Suppose that memset() is called to overwrite a buffer with zero that is
larger than the cache and memset() uses MEM.REWRITE. Obviously the
entire buffer cannot be brought into cache at once. The first use of
MEM.REWRITE returns incremental success after flushing the cache and
allocating all of the cachelines to the buffer. The inner loop then
writes zero to every word in the buffer up to the address produced from
MEM.REWRITE. This address is less than the ending address for the
operation, so memset() iterates its outer loop and executes MEM.REWRITE
again. This iteration continues until the entire buffer is written.
That the earliest cachelines have been evicted when the end of the
buffer is reached is not a concern, because the buffer has been
efficiently overwritten.
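
The outer/inner loop structure described above can be sketched as a toy
Python model. This is only a sketch under assumptions: mem_rewrite() is
a hypothetical stand-in for the instruction, CACHE_BYTES is a made-up
capacity, and partial-cacheline handling is left out entirely.

```python
CACHE_BYTES = 256  # assumed total cache capacity for this toy model


def mem_rewrite(base, bound):
    """Hypothetical model of MEM.REWRITE over the half-open range [base, bound).

    Returns the first address after the part actually covered, which falls
    short of `bound` when the region is larger than the cache.
    """
    return min(bound, base + CACHE_BYTES)


def memset_with_rewrite(memory, start, value, length):
    """Fill memory[start:start+length]; returns how many MEM.REWRITE calls ran."""
    end = start + length
    calls = 0
    cursor = start
    while cursor < end:                         # outer loop: one call per chunk
        covered_to = mem_rewrite(cursor, end)
        calls += 1
        for addr in range(cursor, covered_to):  # inner loop: the real work
            memory[addr] = value
        cursor = covered_to                     # progress made before next call
    return calls


buf = bytearray(b"\xff" * 1000)
calls = memset_with_rewrite(buf, 0, 0, 1000)
```

With these assumed sizes, a 1000-byte buffer takes four calls, and
software never issues the next MEM.REWRITE until it has written
everything the previous one covered.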

If an implementation only processes at most one cacheline for MEM.PF*
and MEM.REWRITE, then memset(3) and memcpy(3) will iterate by cachelines
instead of by larger blocks.

For some other uses of MEM.REWRITE, such as function prologues,
iteration may be unneeded anyway, since a one-shot "clear up to this
much of the stack" can only improve local variable initialization and
fully using MEM.REWRITE may not be worth the loop overhead.

>> A cacheline that is already part of the region and valid is unchanged. A
>> clean cacheline that will be evicted to cache the region (or a previously
>> invalid cacheline that will be allocated) may simply have its tag changed,
>> if the hardware can prove that the cacheline previously mapped memory that
>> is accessible to the current thread, leaving the old contents from somewhere
>> else in the address space. The only time the implementation is required to
>> write to the cacheline allocated by MEM.REWRITE is when the hardware cannot
>> prove that the previous contents of the cacheline were accessible to the
>> current thread at some address. ("out of thin air" is permitted;
>> introducing "foreign data" is not permitted)
>>
>
> for simplicity, I suspect most implementations will simply write 0s
> into the data cache, and set the cache tag state to valid+clean.
>

This is entirely acceptable, but portable software must not assume that
an implementation will do this.

>> One way to prove that the cache cannot introduce foreign data is to simply
>> flush and zero the entire (user) cache upon context switch, which hardware
>> can detect by observing writes to the satp CSR. Another option is to
>> partition the cache by ASID; this also allows a VIPT cache to have
>> independent arrays for each ASID, effectively providing a larger cache while
>> closing inter-process side channels.
>>
>
> when you say "partition the cache by ASID", I hope you mean add ASID
> to the tag. having a hard cache partition is a waste.
>

I believe that hard partitioning is needed to close side channels. Each
process needs its own cache. On the other hand, for a VIPT cache, hard
partitioning allows the cache to be larger overall, since the size of a
VIPT cache is limited by the page size and adding ASID effectively gives
additional untranslated address bits.
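
As rough arithmetic for the VIPT point, here is a small Python sketch.
It assumes the usual alias-free constraint (index and offset bits must
fit within the page offset, so capacity is at most page size times
associativity) and treats each ASID partition as an independent array of
that size. The function name and the example numbers are illustrative
only.

```python
def max_vipt_bytes(page_bytes, ways, asid_partitions=1):
    """Upper bound on total VIPT capacity under the assumptions above.

    Each partition is limited to page_bytes * ways; hard partitioning by
    ASID multiplies the total, not the share any one process can use.
    """
    return page_bytes * ways * asid_partitions


single = max_vipt_bytes(4096, 8)          # one shared array
partitioned = max_vipt_bytes(4096, 8, 4)  # four ASID partitions
```

Note that the per-process share stays at the single-array limit; only
the total grows.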


-- Jacob

Jacob Bachmeyer

Mar 27, 2018, 9:45:21 PM
to Guy Lemieux, lkcl ., RISC-V ISA Dev
Guy Lemieux wrote:
>> This is an issue that has not yet been determined. I lean towards updating
>> the pinned cacheline in a hardware-coherent system: if hart A has pinned a
>> cacheline for address X and hart B seeks to write to address X, the write is
>> carried out, but goes directly to hart A's cacheline.
>>
>
> You cannot do this.
>
> Most coherence and consistency protocols are invalidation-based.
>
> You are now dictating that all protocols must support an update-based
> mechanism. This will be extremely painful for implementers, and
> extremely difficult to do proper verification of a system.
>
> This seemingly "little" requirement has such a big impact that it must
> be ruled out.
>

Like so many things, convenient for software, infeasible to implement
for hardware.

>> This is expected to
>> be quite slow compared to normal memory access. Unpinning the cacheline
>> upon remote intervention is another alternative, since cachelines can
>> already be unpinned without warning, for example, due to a preemptive
>> context switch. Guarantees that a cacheline will remain pinned are
>> implementation-specific and environment-specific.
>>
>
> Pinning will screw up most coherence protocols, which must be able to
> reliably invalidate lines.
>
> Pinning is not compatible with coherence, unless you want to write a
> brand new coherence protocol that everyone must follow.
>

A solution already mentioned: remote invalidation unpins a cacheline.
Cacheline pinning would then be somewhat analogous to Linux file leases
(fcntl(2), section "Leases"; F_SETLEASE) which allow exclusive access
until another process wants to open the file.
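
A toy Python model of that lease-like behaviour, purely as a sketch: the
class and method names are hypothetical, and write-back of dirty data is
not modelled.

```python
class PinnedLine:
    """One cacheline that starts out valid and pinned (illustrative only)."""

    def __init__(self, address):
        self.address = address
        self.valid = True
        self.pinned = True

    def local_evict_request(self):
        """Local replacement may not take a pinned line; returns success."""
        if self.pinned:
            return False
        self.valid = False
        return True

    def remote_invalidate(self):
        """A coherence invalidation always succeeds and drops the pin."""
        self.pinned = False
        self.valid = False


line = PinnedLine(0x1000)
refused = line.local_evict_request()  # the pin holds against local pressure
line.remote_invalidate()              # another hart wants the line
```

The coherence protocol keeps its ability to invalidate reliably; the pin
only resists local replacement.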


-- Jacob

Jacob Bachmeyer

Mar 27, 2018, 9:53:24 PM
to lkcl, Guy Lemieux, RISC-V ISA Dev
If we are going that far, I would prefer a description in Lamport's
Temporal Logic of Actions with a correctness proof. Simulations always
leave "that one case" that was not checked. Murphy says the one case
not checked will crash production.


-- Jacob

lkcl

Mar 27, 2018, 9:54:54 PM
to Jacob Bachmeyer, Guy Lemieux, RISC-V ISA Dev
On Wed, Mar 28, 2018 at 2:15 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

> MEM.REWRITE is the only proposed instruction that can create undefined
> contents. The boundaries are transparent: an attempt to rewrite an
> unaligned region causes the partial cachelines in the region to be
> prefetched instead.
>
> A program that declares 0-15 will be rewritten but only updates 8-15 is an
> invalid program and 0-7 will have undefined contents afterwards, which
> *will* be written back to main memory. This is how improper use of
> MEM.REWRITE can destroy nearby data within the region.

ok so is there any scenario in which destroying nearby data is
actually acceptable? if 100% absolutely not, then by logical
inference it should not even be permitted, and if it's not even
permitted then like a "Capability" system rather than an "ACL" system
it should not even be made possible to *issue* a MEM.REWRITE on
anything other than a cache boundary. which in turn implies that
knowledge of the cache boundaries would need to be made available up
to userspace in order for the instruction *to* be issued only on a
cache boundary.

l.

lkcl

Mar 27, 2018, 10:05:26 PM
to Jacob Bachmeyer, Guy Lemieux, RISC-V ISA Dev
ah, rats. i forgot about formal proofs. which immediately leads me
to suggest haskell instead:

https://www.google.com/search?q=haskell+cache+simulator

ok this implementation even supports multi-processing:
https://github.com/frizensami/cs4223-as2

regarding TLA...
https://www.reddit.com/r/haskell/comments/5m2mzf/which_highlevel_specification_languages_mix_well/

l.

Guy Lemieux

Mar 27, 2018, 10:56:10 PM
to lkcl, Jacob Bachmeyer, RISC-V ISA Dev
> https://pypi.python.org/pypi/pycachesim
> https://github.com/caleb531/cache-simulator
> https://github.com/auxiliary/CacheSimulator
> https://devhub.io/repos/vaskevich-CacheSim
>
> so, ha, the task's not that daunting, if multiple people have already
> written code, he said (without actually downloading any of the above
> yet...)
>
> just a thought?

cache coherence != memory consistency.

a basic coherence protocol is easy peasy.

ensuring a memory consistency protocol is correct is incredibly
difficult. there can be thousands of corner cases that each have to be
handled correctly. each time you add a small optimization, like data
forwarding, things get multiplicatively more complex.

and I have written a multiprocessor simulator, and helped design a
cache coherence + memory consistency protocol for a
ring-interconnected multiprocessor. I know first-hand how difficult it
is.

http://www.ece.ubc.ca/~lemieux/publications/lemieux-icpp2000.pdf

with more here:
http://www.eecg.toronto.edu/parallel/parallel/numadocs.html

In particular, see Alex Grbic's PhD thesis:
http://www.eecg.toronto.edu/parallel/parallel/theses/grbic_phd.pdf

which has an evaluation of advanced protocol options like
invalidate-based vs update-based protocols, as well as hybrids that
can support both and switch between modes on the fly.

Guy

lkcl

Mar 27, 2018, 11:13:20 PM
to Guy Lemieux, Jacob Bachmeyer, RISC-V ISA Dev
On Wed, Mar 28, 2018 at 3:55 AM, Guy Lemieux <glem...@vectorblox.com> wrote:
>> https://pypi.python.org/pypi/pycachesim
>> https://github.com/caleb531/cache-simulator
>> https://github.com/auxiliary/CacheSimulator
>> https://devhub.io/repos/vaskevich-CacheSim
>>
>> so, ha, the task's not that daunting, if multiple people have already
>> written code, he said (without actually downloading any of the above
>> yet...)
>>
>> just a thought?
>
> cache coherence != memory consistency.

i remember you saying: i've got a draft message outstanding about
that, which needs more thought and research on my part before sending.

> a basic coherence protocol is easy peasy.
>
> ensuring a memory consistency protocol is correct is incredibly
> difficult. there can be thousands of corner cases that each have to be
> handled correctly. each time you add a small optimization, like data
> forwarding, things get multiplicatively more complex.
>
> and I have written a multiprocessor simulator, and helped design a
> cache coherence + memory consistency protocol for a
> ring-interconnected multiprocessor.

okay! very cool! so that in turn explains ("learning is doing") why
you're extremely knowledgeable in this area.

> I know first-hand how difficult it is
>
> http://www.ece.ubc.ca/~lemieux/publications/lemieux-icpp2000.pdf
>
> with more here:
> http://www.eecg.toronto.edu/parallel/parallel/numadocs.html
>
> In particular, see Alex Grbic's PhD thesis:
> http://www.eecg.toronto.edu/parallel/parallel/theses/grbic_phd.pdf
>
> which has an evaluation of advanced protocol options like
> invalidate-based vs update-based protocols, as well as hybrids that
> can support both and switch between modes on the fly.

ok so the immediate question that springs to mind is, in your opinion
would there be any benefit to (a) releasing the source code of that
simulator and (b) modifying it as a means to test and demonstrate the
proposed cache extensions... *at this phase*?

l.

Jacob Bachmeyer

Mar 27, 2018, 11:48:36 PM
to lkcl, Guy Lemieux, RISC-V ISA Dev
lkcl wrote:
> On Wed, Mar 28, 2018 at 2:15 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
>
>> MEM.REWRITE is the only proposed instruction that can create undefined
>> contents. The boundaries are transparent: an attempt to rewrite an
>> unaligned region causes the partial cachelines in the region to be
>> prefetched instead.
>>
>> A program that declares 0-15 will be rewritten but only updates 8-15 is an
>> invalid program and 0-7 will have undefined contents afterwards, which
>> *will* be written back to main memory. This is how improper use of
>> MEM.REWRITE can destroy nearby data within the region.
>>
>
> ok so is there any scenario in which destroying nearby data is
> actually acceptable?

Such effects are limited to the region specified to be overwritten. The
*entire* region *will* be overwritten, either with intended data or, if
the program is invalid, whatever the implementation had in that cacheline.

Note that the example was a program that executed MEM.REWRITE for bytes
0-15 but only actually wrote to bytes 8-15. That is invalid. The data
destroyed is "nearby" to what was actually written, but is entirely
*within* the region that the program declared would be written.

If the program instead (correctly) executes MEM.REWRITE for bytes 8-15,
then hardware must detect the inclusion of a partial cacheline and
prefetch that cacheline instead, in order to preserve bytes 0-7 which
must not be affected since they are *outside* of the region.

> if 100% absolutely not, then by logical
> inference it should not even be permitted, and if it's not even
> permitted then like a "Capability" system rather than an "ACL" system
> it should not even be made possible to *issue* a MEM.REWRITE on
> anything other than a cache boundary. which in turn implies that
> knowledge of the cache boundaries would need to be made available up
> to userspace in order for the instruction *to* be issued only on a
> cache boundary.

The proposal requires hardware to handle this by prefetching (as by
MEM.PF.EXCL) any partial cachelines (there can be at most two) affected
by MEM.REWRITE, instead of applying MEM.REWRITE to those cachelines.
Analogously, MEM.DISCARD must write back partial cachelines, but can
discard cachelines entirely within the region.
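
The partial/full distinction can be sketched in Python as follows. This
is an illustration, not proposal text: split_region() is a hypothetical
helper, the region is written as a half-open range [base, bound) with
base < bound, and the 16-byte line size is taken from the example above.

```python
def split_region(base, bound, line_bytes):
    """Classify the cachelines touched by [base, bound), assuming base < bound.

    Returns (partial, full): base addresses of lines only partly inside the
    region (to be prefetched or written back) and of lines entirely inside
    it (eligible for the destructive operation).
    """
    partial, full = [], []
    first = base - base % line_bytes  # start of the line containing base
    for line in range(first, bound, line_bytes):
        if line >= base and line + line_bytes <= bound:
            full.append(line)
        else:
            partial.append(line)
    return partial, full


bytes_8_to_15 = split_region(8, 16, 16)  # ([0], []): reduces to a prefetch
bytes_0_to_15 = split_region(0, 16, 16)  # ([], [0]): whole line is rewritten
unaligned = split_region(8, 72, 16)      # ([0, 64], [16, 32, 48])
```

However long the region, only the first and last lines can end up in the
partial list.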


-- Jacob

lkcl

Mar 27, 2018, 11:59:55 PM
to Jacob Bachmeyer, Guy Lemieux, RISC-V ISA Dev
On Wed, Mar 28, 2018 at 4:48 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> lkcl wrote:
>>
>> On Wed, Mar 28, 2018 at 2:15 AM, Jacob Bachmeyer <jcb6...@gmail.com>
>> wrote:
>>
>>
>>>
>>> MEM.REWRITE is the only proposed instruction that can create undefined
>>> contents. The boundaries are transparent: an attempt to rewrite an
>>> unaligned region causes the partial cachelines in the region to be
>>> prefetched instead.
>>>
>>> A program that declares 0-15 will be rewritten but only updates 8-15 is
>>> an
>>> invalid program and 0-7 will have undefined contents afterwards, which
>>> *will* be written back to main memory. This is how improper use of
>>> MEM.REWRITE can destroy nearby data within the region.
>>>
>>
>>
>> ok so is there any scenario in which destroying nearby data is
>> actually acceptable?
>
>
> Such effects are limited to the region specified to be overwritten. The
> *entire* region *will* be overwritten, either with intended data or, if the
> program is invalid, whatever the implementation had in that cacheline.

why would the program be invalid? put another way: is there a direct
100% correlation between "program being invalid" and "MEM.REWRITE not
being on a cache boundary"?

> Note that the example was a program that executed MEM.REWRITE for bytes 0-15
> but only actually wrote to bytes 8-15. That is invalid. The data destroyed
> is "nearby" to what was actually written, but is entirely *within* the
> region that the program declared would be written.
>
> If the program instead (correctly) executes MEM.REWRITE for bytes 8-15, then
> hardware must detect the inclusion of a partial cacheline and prefetch that
> cacheline instead, in order to preserve bytes 0-7 which must not be affected
> since they are *outside* of the region.

ok. so, correct me if i'm wrong here: the purpose of MEM.REWRITE is
to avoid prefetching, is that right? therefore having to do the
prefetch for bytes 0-7 entirely defeats that purpose. and, if
MEM.REWRITE is totally transparent (i.e. has no way of knowing if the
memory is on a cache boundary), then the probability of *actually*
avoiding the prefetch that MEM.REWRITE is intended to avoid is really
low!

i'm trying to explore making the case that the user application needs
to know the cache boundary (and cache line size) if that's not already
part of the specification. apologies i don't know if it is already or
not.

l.

Jacob Bachmeyer

Mar 28, 2018, 1:24:52 AM
to lkcl, Guy Lemieux, RISC-V ISA Dev
lkcl wrote:
> On Wed, Mar 28, 2018 at 4:48 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
>> lkcl wrote:
>>
>>> On Wed, Mar 28, 2018 at 2:15 AM, Jacob Bachmeyer <jcb6...@gmail.com>
>>> wrote:
>>>
>>>
>>>> MEM.REWRITE is the only proposed instruction that can create undefined
>>>> contents. The boundaries are transparent: an attempt to rewrite an
>>>> unaligned region causes the partial cachelines in the region to be
>>>> prefetched instead.
>>>>
>>>> A program that declares 0-15 will be rewritten but only updates 8-15 is
>>>> an
>>>> invalid program and 0-7 will have undefined contents afterwards, which
>>>> *will* be written back to main memory. This is how improper use of
>>>> MEM.REWRITE can destroy nearby data within the region.
>>>>
>>> ok so is there any scenario in which destroying nearby data is
>>> actually acceptable?
>>>
>> Such effects are limited to the region specified to be overwritten. The
>> *entire* region *will* be overwritten, either with intended data or, if the
>> program is invalid, whatever the implementation had in that cacheline.
>>
>
> why would the program be invalid? put another way: is there a direct
> 100% correlation between "program being invalid" and "MEM.REWRITE not
> being on a cache boundary"?
>

No, the program is invalid because it declared intent to write to bytes
0-7 (as part of the region 0-15) but did not do so.

>> Note that the example was a program that executed MEM.REWRITE for bytes 0-15
>> but only actually wrote to bytes 8-15. That is invalid. The data destroyed
>> is "nearby" to what was actually written, but is entirely *within* the
>> region that the program declared would be written.
>>
>> If the program instead (correctly) executes MEM.REWRITE for bytes 8-15, then
>> hardware must detect the inclusion of a partial cacheline and prefetch that
>> cacheline instead, in order to preserve bytes 0-7 which must not be affected
>> since they are *outside* of the region.
>>
>
> ok. so, correct me if i'm wrong here: the purpose of MEM.REWRITE is
> to avoid prefetching, is that right? therefore having to do the
> prefetch for bytes 0-7 entirely defeats that purpose. and, if
> MEM.REWRITE is totally transparent (i.e. has no way of knowing if the
> memory is on a cache boundary), then the probability of *actually*
> avoiding the prefetch that MEM.REWRITE is intended to avoid is really
> low!
>

For regions smaller than a cacheline, MEM.REWRITE reduces to a
prefetch. MEM.REWRITE (and MEM.DISCARD) are intended for larger buffers
that span many cachelines. While each can produce at most two
prefetches (for partial cachelines at the beginning and end of a
region), the full cachelines in the middle get the special treatment
that avoids access to memory.

> i'm trying to explore making the case that the user application needs
> to know the cache boundary (and cache line size) if that's not already
> part of the specification. apologies i don't know if it is already or
> not.

Avoiding dependency on the cacheline size is one of the goals of this
proposal; thus the requirement that hardware execute destructive
operations on less-than-entire-cachelines as prefetches. This allows,
for example, different harts (where migration is transparent to
software) or even different regions of the address space on a single
hart to have different cacheline sizes or other boundaries. The
proposal does not limit these instructions specifically to cache-based
implementations.

An application that really wants to know the location of cacheline
boundaries can examine the results of synchronous prefetches, which
produce results aligned to a boundary. A major goal of the proposal is
that most applications should not need to care and hardware will "do the
right thing" with respect to alignment. This will also greatly simplify
describing the cache management operations in terms of ranged fences to
fit the new memory model. (The new memory model is newer than the early
drafts of this proposal.)
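
As a sketch of how an application might infer a boundary from prefetch
results, here is a toy Python model. Everything in it is an assumption
for illustration: mem_prefetch() is a hypothetical stand-in that models
only the produced value, the upper bound is treated as inclusive, and
the hidden 64-byte granularity is made up.

```python
_HIDDEN_LINE_BYTES = 64  # known only to the "hardware" model below


def mem_prefetch(base, bound):
    """Hypothetical synchronous prefetch of addresses base..bound (inclusive).

    Only the result is modelled: the region is expanded to whole lines, and
    the value produced is the first address after the highest one affected.
    """
    last_line = bound - bound % _HIDDEN_LINE_BYTES
    return last_line + _HIDDEN_LINE_BYTES


def infer_line_bytes(addr):
    """Infer the granularity near addr from two single-address prefetches."""
    first_end = mem_prefetch(addr, addr)             # end of the line holding addr
    second_end = mem_prefetch(first_end, first_end)  # end of the following line
    return second_end - first_end


line_bytes = infer_line_bytes(0x1234)
```

Since granularity may differ between harts or regions of the address
space, a value obtained this way is at best a local hint.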


-- Jacob

Albert Cahalan

Mar 28, 2018, 4:29:03 AM
to lkcl ., Guy Lemieux, Jacob Bachmeyer, RISC-V ISA Dev
Normal terminology would be that breaking cache coherency is intended.
The CPU gains a small amount of fast memory that can be used as
scratch space for heavily-optimized computation or for running firmware
before DRAM has been configured.

I'm aware of a computation example using the MPC7410 hardware.
That allowed half or all of the cache to be mapped in at a user-chosen
physical address that was only seen by that CPU. In this case, there
is no underlying RAM.

I'm aware of a firmware example for Intel's Haswell in coreboot:
https://github.com/coreboot/coreboot/blob/master/src/cpu/intel/haswell/cache_as_ram.inc
https://github.com/coreboot/coreboot/blob/master/src/cpu/intel/haswell/romstage.c

I'm aware of a PowerPC case that is similar to the Intel Haswell one.
In such cases, flash or even a true mask ROM can be made to seem
as if it were writable. Variables get updated just fine, until the pinned
cache lines are purposely discarded by resetting the cache.

The idea of being able to pin only N-1 of N cache lines seems odd,
as does the idea that cache lines might suddenly come unpinned,
and even the idea that they would ever have a place to be written
back to.

lkcl

Mar 28, 2018, 5:20:21 AM
to Jacob Bachmeyer, Guy Lemieux, RISC-V ISA Dev
On Wed, Mar 28, 2018 at 6:24 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

>> why would the program be invalid? put another way: is there a direct
>> 100% correlation between "program being invalid" and "MEM.REWRITE not
>> being on a cache boundary"?
>>
>
>
> No, the program is invalid because it declared intent to write to bytes 0-7
> (as part of the region 0-15) but did not do so.

right. ok. got it. so in that circumstance... that would almost
certainly be Extremely Bad For Security if the contents of the
unwritten cache line are not set to a known (cleared) value. i would
strongly recommend that it be made mandatory on MEM.REWRITE that the
cache lines be cleared to zero immediately. of course this will be
"abused" to write memset(0...) but hey, that's not an actual
"problem".


> For regions smaller than a cacheline, MEM.REWRITE reduces to a prefetch

... which is in turn a reduction in performance.

> MEM.REWRITE (and MEM.DISCARD) are intended for larger buffers that span many
> cachelines.

can you be certain that there are no prominent use-cases where the
buffers are fragmented, or perhaps alternating in memory? i.e. at
regular offsets? i can think of an example: a c struct which contains
say an 8-byte... "thing" as part of a... 32-byte or possibly even a
20-byte "thing". lots of them in a huge array. or... if that's not a
totally suitable example, then make it a.... 18-byte "thing" in a
32-byte or a 34-byte "thing" as a statically-allocated array.

in the case of the 18-byte object that's part of a 34-byte in a
statically-allocated contiguous memory block, the number of times that
the MEM.REWRITE gets turned into prefetch(es) is disproportionately
high.
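
to put rough numbers on that, a small Python sketch. the assumptions
are all mine: the array starts at address 0, the 18-byte field sits at
offset 0 of each 34-byte object, and the line sizes are just examples.

```python
def full_lines_in(base, length, line_bytes):
    """Number of cachelines lying entirely inside [base, base + length)."""
    first_full = -(-base // line_bytes) * line_bytes        # round base up
    last_full = (base + length) // line_bytes * line_bytes  # round end down
    return max(0, (last_full - first_full) // line_bytes)


def count_prefetch_only(n_objects, stride, field_bytes, line_bytes):
    """Objects whose field contains no whole cacheline at all."""
    return sum(
        1
        for i in range(n_objects)
        if full_lines_in(i * stride, field_bytes, line_bytes) == 0
    )


with_16_byte_lines = count_prefetch_only(8, 34, 18, 16)
with_64_byte_lines = count_prefetch_only(8, 34, 18, 64)
```

with 16-byte lines, 6 of the first 8 objects contain no whole line, so
MEM.REWRITE on those is nothing but prefetches; with 64-byte lines it
is all 8.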

i think you would agree that it would be reasonable to have such a
data structure (without having to get into the full details, such as
whether it's a... B+ Tree or a... packed database array or a... video
decode algorithm).


>> i'm trying to explore making the case that the user application needs
>> to know the cache boundary (and cache line size) if that's not already
>> part of the specification. apologies i don't know if it is already or
>> not.
>
>
> Avoiding dependency on the cacheline size is one of the goals of this
> proposal;

ah ok. one of the advantages / disadvantages of someone coming in
new is that the parameters / scope gets accidentally tested due to
lack of information / assumptions. new people are good for black-box
testing of proposals, despite it appearing to be a pain :)


> thus the requirement that hardware execute destructive operations
> on less-than-entire-cachelines as prefetches. This allows, for example,
> different harts (where migration is transparent to software) or even
> different regions of the address space on a single hart to have different
> cacheline sizes or other boundaries.

a properly-written program, if given direct and explicit access to
the cache line size and boundary, should be capable of doing exactly
that.


> The proposal does not limit these
> instructions specifically to cache-based implementations.

*click*... interesting! hmm... so... that would be bizarre. i would
hazard a guess in cache-less implementation circumstances that
declaring a fake cache line size and boundary would result in at least
predictable behaviour without too significant down-sides.

l.

Jacob Bachmeyer

Mar 28, 2018, 7:36:17 PM3/28/18
to Albert Cahalan, lkcl ., Guy Lemieux, RISC-V ISA Dev
This is the option for implementations to use scratchpad memory.

> I'm aware of a firmware example for Intel's Haswell in coreboot:
> https://github.com/coreboot/coreboot/blob/master/src/cpu/intel/haswell/cache_as_ram.inc
> https://github.com/coreboot/coreboot/blob/master/src/cpu/intel/haswell/romstage.c
>
> I'm aware of a PowerPC case that is similar to the Intel Haswell one.
> In such cases, flash or even a true mask ROM can be made to seem
> as if it were writable. Variables get updated just fine, until the pinned
> cache lines are purposely discarded by resetting the cache.
>
> The idea of being able to pin only N-1 of N cache lines seems odd,
>

I had assumed that systems with caches might only be able to access
memory through the cache. In this case, pinning all N cachelines would
mean that no other memory addresses are accessible.

> as does the idea that cache lines might suddenly come unpinned,
>

This is required to support preemptive multitasking, if cachelines are
to be pinnable from U-mode. M-mode can pin cachelines that will stay
pinned because M-mode can ensure that the events that would cause a
cacheline to be unpinned do not occur. (This is also why instruction
cache pinning is proposed as M-mode-only.)

> and even the idea that they would ever have a place to be written
> back to.

The idea was that an AES implementation might pin its tables, for
example. For situations (such as boot firmware) where pinned cachelines
are actually making ROM "writable", the solution is to unpin with
MEM.DISCARD.


-- Jacob

Jacob Bachmeyer

Mar 28, 2018, 7:54:56 PM3/28/18
to lkcl, Guy Lemieux, RISC-V ISA Dev
lkcl wrote:
> On Wed, Mar 28, 2018 at 6:24 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
>
>>> why would the program be invalid? put another way: is there a direct
>>> 100% correlation between "program being invalid" and "MEM.REWRITE not
>>> being on a cache boundary"?
>>>
>> No, the program is invalid because it declared intent to write to bytes 0-7
>> (as part of the region 0-15) but did not do so.
>>
>
> right. ok. got it. so in that circumstance... that would almost
> certainly be Extremely Bad For Security if the contents of the
> unwritten cache line are not set to a known (cleared) value.

This is the reason that the proposal explicitly requires that
implementations ensure that MEM.DISCARD and MEM.REWRITE cannot introduce
foreign data, defined as data not otherwise accessible to the current
hart at the current privilege level at any address. Data previously
accessible at some other address may "appear" in the region, but that is
not a security concern.

> i would
> strongly recommend that it be made mandatory on MEM.REWRITE that the
> cache lines be cleared to zero immediately.

An earlier draft had MEM.REWRITE as "CACHE.PREZERO" with exactly that.

> of course this will be
> "abused" to write memset(0...) but hey, that's not an actual
> "problem".
>

It is a problem: cachelines already present and valid are *not*
affected. (Well, they might be, but portable software cannot assume that.)

>> For regions smaller than a cacheline, MEM.REWRITE reduces to a prefetch
>>
>
> ... which is in turn a reduction in performance.
>

Why? A prefetch simply initiates an operation (memory fetch) that the
processor will wait on later, when LOAD is executed. Without a prefetch,
the fetch only begins once the processor has already stalled.
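
A toy model of that argument, as a Python sketch. The cycle counts are hypothetical, the fetch latency is assumed constant, and bandwidth contention and cache pollution are ignored; the function name is invented for illustration.

```python
def stall_cycles(fetch_latency, work_before_load, prefetched):
    """Cycles the LOAD waits for its data in this toy model.

    fetch_latency: cycles a memory fetch takes (assumed constant).
    work_before_load: cycles of useful work between the point where a
        prefetch could be issued and the LOAD itself.
    """
    if not prefetched:
        # The fetch starts only when the LOAD executes: full latency is exposed.
        return fetch_latency
    # The fetch overlapped with useful work; only the remainder is exposed.
    return max(0, fetch_latency - work_before_load)
```

In this model the prefetched case never waits longer than the non-prefetched one; the only open question is how much of the latency gets hidden.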

>> MEM.REWRITE (and MEM.DISCARD) are intended for larger buffers that span many
>> cachelines.
>>
>
> can you be certain that there are no prominent use-cases where the
> buffers are fragmented, or perhaps alternating in memory? i.e. at
> regular offsets? i can think of an example: a c struct which contains
> say an 8-byte... "thing" as part of a... 32-byte or possibly even a
> 20-bit "thing". lots of them in a huge array. or... if that's not a
> totally suitable example, then make it a.... 18-byte "thing" in a
> 32-byte or a 34-byte "thing" as a statically-allocated array.
>
> in the case of the 18-byte object that's part of a 34-byte in a
> statically-allocated contiguous memory block, the number of times that
> the MEM.REWRITE gets turned into prefetch(es) is disproportionately
> high.
>
> i think you would agree that it would be reasonable to have such a
> data structure (without having to get into the full details, such as
> whether it's a... B+ Tree or a... packed database array or a... video
> decode algorithm).
>

If processing objects that are approximately on the scale of a
cacheline, the correct solution is to simply prefetch the larger array
(or segments thereof) and not worry about it. MEM.REWRITE is
appropriate if an entire several-KiB array (like a frame of video) is
being copied.

>> thus the requirement that hardware execute destructive operations
>> on less-than-entire-cachelines as prefetches. This allows, for example,
>> different harts (where migration is transparent to software) or even
>> different regions of the address space on a single hart to have different
>> cacheline sizes or other boundaries.
>>
>
> a properly-written program, if given direct and explicit access to
> the cache line size and boundary, should be capable of doing exactly
> that.
>

That does not work in the presence of migrations between harts with
different cache structures, since the migration can occur between any
two instructions, introducing a time-of-check-time-of-use condition.
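
A rough Python sketch of the alternative: software never reads a cacheline size and simply continues from whatever address the instruction returns. This is a model, not the proposed encoding; it assumes each execution handles the one cacheline containing the base address, and `mem_op` and `cover_region` are names invented for the sketch.

```python
def mem_op(base, bound, line_size):
    """Model one region instruction on a hart whose granule is line_size.

    Assumption: each execution handles the single cacheline containing
    base and returns the first address after it, which may lie past bound.
    """
    assert base < bound and line_size > 0
    return (base // line_size + 1) * line_size


def cover_region(base, bound, line_sizes):
    """Apply mem_op until the whole region is covered.

    line_sizes lists the granule in effect at each step, cycling; a list
    with several values models migration between harts mid-loop.
    Returns (final_address, number_of_steps).
    """
    addr, steps = base, 0
    while addr < bound:
        addr = mem_op(addr, bound, line_sizes[steps % len(line_sizes)])
        steps += 1
    return addr, steps
```

Because the loop only ever consumes the returned address, alternating between 64-byte and 16-byte granules changes the number of steps but not the fact that the region ends up covered.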

>> The proposal does not limit these
>> instructions specifically to cache-based implementations.
>>
>
> *click*... interesting! hmm... so... that would be bizarre. i would
> hazard a guess in cache-less implementation circumstances that
> declaring a fake cache line size and boundary would result in at least
> predictable behaviour without too significant down-sides.

I would prefer not to require that such an implementation emulate
caches. It would be silly, much like the "emulated geometries" of
larger IDE hard disks before the widespread use of LBA mode.


-- Jacob

lkcl .

Mar 29, 2018, 1:49:14 AM3/29/18
to Jacob Bachmeyer, Guy Lemieux, RISC-V ISA Dev
On Thu, Mar 29, 2018 at 12:54 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> lkcl wrote:
>>
>>> No, the program is invalid because it declared intent to write to bytes
>>> 0-7
>>> (as part of the region 0-15) but did not do so.
>>>
>>
>>
>> right. ok. got it. so in that circumstance... that would almost
>> certainly be Extremely Bad For Security if the contents of the
>> unwritten cache line are not set to a known (cleared) value.
>
>
> This is the reason that the proposal explicitly requires that
> implementations ensure that MEM.DISCARD and MEM.REWRITE cannot introduce
> foreign data, defined as data not otherwise accessible to the current hart
> at the current privilege level at any address.

ok so the proposal does define that they cannot introduce (or access)
foreign data, but that is a requirement not an implementation detail.

> Data previously accessible
> at some other address may "appear" in the region, but that is not a security
> concern.

it sounds to me as if that is precisely *the* definition of a
security concern! :)

is it made clear precisely how, through its implementation,
MEM.REWRITE meets the requirement to not access foreign data in this
[programmatically-invalid] way?

what has me concerned is the potential scenario similar to one of
those meltdown or spectre-esque things where you swap privilege levels
or even just swap thread/process contexts, and a given cache line
contains data that is from a former privilege level.

a security flaw is not necessarily just related to privilege levels,
it is also a security flaw for different processes to gain access to
another process' data, because you don't know if those processes are
supposed to be under different user-ids or not.... therefore you have
to assume the worst.

so is it *guaranteed* to be the case, in *all* implementations of
*all* RISC-V processors, that the cache will *ALWAYS* be cleared
between process context-switches and also cleared when switching
privilege levels?


>> i would
>> strongly recommend that it be made mandatory on MEM.REWRITE that the
>> cache lines be cleared to zero immediately.
>
>
> An earlier draft had MEM.REWRITE as "CACHE.PREZERO" with exactly that.

ok. interesting. CACHE.PREONES would do the job too :)


>> of course this will be
>> "abused" to write memset(0...) but hey, that's not an actual
>> "problem".
>>
>
>
> It is a problem: cachelines already present and valid are *not* affected.
> (Well, they might be, but portable software cannot assume that.)

ok. worth noting then, "don't try abusing this as memset(0)!" :)


>>> For regions smaller than a cacheline, MEM.REWRITE reduces to a prefetch
>>>
>>
>>
>> ... which is in turn a reduction in performance.
>>
>
>
> Why? A prefetch simply initiates an operation (memory fetch) that the
> processor will stall for later when LOAD is executed. If no prefetch the
> operation initiates as the processor stalls.

if the processor is guaranteed to stall as a result of the prefetch,
or there are other resources guaranteed to be requested, that would be
the definition of a reduction in performance, would it not?

if however i have misunderstood and the prefetch does *not*
necessarily result in a guaranteed stall or a guaranteed request for
resources (memory?) then there would not be correspondingly a
guaranteed reduction in performance, and i will stop asking about this
:)

>> 20-bit "thing". lots of them in a huge array. or... if that's not a
>> totally suitable example, then make it a.... 18-byte "thing" in a
>> 32-byte or a 34-byte "thing" as a statically-allocated array.
>
> If processing objects that are approximately on the scale of a cacheline,
> the correct solution is to simply prefetch the larger array (or segments
> thereof) and not worry about it. MEM.REWRITE is appropriate if an entire
> several-KiB array (like a frame of video) is being copied.

that sounds like a lost opportunity, to me.

>> a properly-written program, if given direct and explicit access to
>> the cache line size and boundary, should be capable of doing exactly
>> that.
>
> That does not work in the presence of migrations between harts with
> different cache structures,

oh! eek! :) yes, of course it doesn't. dang.

> since the migration can occur between any two
> instructions, introducing a time-of-check-time-of-use condition.

yehyeh. yuck. i'd assumed that the cache line sizes would be
uniform, such that a program could dynamically allocate matching data
structures. ok. good call, jacob.

>>> The proposal does not limit these
>>> instructions specifically to cache-based implementations.
>>>
>>
>>
>> *click*... interesting! hmm... so... that would be bizarre. i would
>> hazard a guess in cache-less implementation circumstances that
>> declaring a fake cache line size and boundary would result in at least
>> predictable behaviour without too significant down-sides.
>
>
> I would prefer not to require that such an implementation emulate caches.
> It would be silly, much like the "emulated geometries" of larger IDE hard
> disks before the widespread use of LBA mode.

well, the suggestion is moot now given that a case was made
(non-uniform cache sizes) which makes invalid the suggestion to make
programs aware of cache-line sizes and boundaries.

darn this stuff's so detailed, isn't it?

l.

Cesar Eduardo Barros

Mar 29, 2018, 7:39:48 AM3/29/18
to jcb6...@gmail.com, lkcl, Guy Lemieux, RISC-V ISA Dev
On 28-03-2018 20:54, Jacob Bachmeyer wrote:
> lkcl wrote:
>> On Wed, Mar 28, 2018 at 6:24 AM, Jacob Bachmeyer <jcb6...@gmail.com>
>> wrote:
>>
>>>>  why would the program be invalid?  put another way: is there a direct
>>>> 100% correlation between "program being invalid" and "MEM.REWRITE not
>>>> being on a cache boundary"?
>>> No, the program is invalid because it declared intent to write to
>>> bytes 0-7
>>> (as part of the region 0-15) but did not do so.
>>
>>  right.  ok.  got it.  so in that circumstance... that would almost
>> certainly be Extremely Bad For Security if the contents of the
>> unwritten cache line are not set to a known (cleared) value.
>
> This is the reason that the proposal explicitly requires that
> implementations ensure that MEM.DISCARD and MEM.REWRITE cannot introduce
> foreign data, defined as data not otherwise accessible to the current
> hart at the current privilege level at any address.  Data previously
> accessible at some other address may "appear" in the region, but that is
> not a security concern.

That's insufficiently strict. Consider a process doing a MEM.REWRITE to
a shared memory region, which is then preempted and switched to another
process which also has access to the same shared memory region. Data
accessible at some other address to the first process might not be
accessible at any address to the second process; if MEM.REWRITE makes it
appear in the cache, the second process will be able to read what it
shouldn't, since the cache is usually physically tagged and, because the
memory region is shared, the physical address is the same.

I think MEM.REWRITE must be made stricter. Each byte in the region
affected by MEM.REWRITE must either keep its current value, or be set to
a deterministic value. In a common implementation, the current value
would be used when the cache line was already allocated for that
address, and a deterministic value (probably zero) would be used when it
had to allocate a new cache line.
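
A minimal Python sketch of that stricter rule, as a toy model only. The 16-byte line size, the `ToyCache` class and its method names are all invented for illustration, and partial lines are modelled as ordinary prefetches.

```python
LINE = 16  # hypothetical cacheline size in bytes


class ToyCache:
    def __init__(self, memory):
        self.memory = memory   # bytearray standing in for main memory
        self.lines = {}        # line index -> bytearray holding that line

    def _fill(self, idx):
        # Ordinary fetch of one line from memory.
        start = idx * LINE
        self.lines[idx] = bytearray(self.memory[start:start + LINE])

    def load(self, addr):
        idx = addr // LINE
        if idx not in self.lines:
            self._fill(idx)
        return self.lines[idx][addr % LINE]

    def rewrite(self, base, bound):
        """Model of a stricter MEM.REWRITE over [base, bound)."""
        assert base < bound
        for idx in range(base // LINE, (bound - 1) // LINE + 1):
            start = idx * LINE
            whole = base <= start and start + LINE <= bound
            if idx in self.lines:
                continue                            # cached: bytes keep their value
            if whole:
                self.lines[idx] = bytearray(LINE)   # new line: deterministic zeros
            else:
                self._fill(idx)                     # partial line: plain prefetch
```

With memory holding the bytes 0..63 and line 0 already cached, `rewrite(0, 32)` leaves line 0 untouched and makes line 1 read back as zeros, so every byte a later reader sees is either its current value or a fixed constant.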

Also, this shows that MEM.REWRITE followed by a read instead of a write
can happen in a valid program: the program itself might have intended to
write immediately after the MEM.REWRITE, but it can be preempted and
something else sharing the same memory (or a debugger, or a
debugging-like tool) can read from that cache line before the program
gets its time slice back. Simply saying "don't do that" is not enough.
Even saying "flush the cache when switching processes" is not enough,
since what reads the memory can be a BPF program for a seccomp filter
running in the kernel (entering the kernel doesn't count as a process
switch).

And a "debugging-like tool" includes rr, which really doesn't like
non-determinism (it already has problems with lr/sc; in some other
thread we were discussing ways of trapping a failed sc so rr can record
it). For tools like that, it would be good to be able to switch
MEM.REWRITE into a deterministic mode (either turning into a prefetch,
or always zeroing, or even trapping).

--
Cesar Eduardo Barros
ces...@cesarb.eti.br

Bruce Hoult

Mar 29, 2018, 8:22:56 AM3/29/18
to Cesar Eduardo Barros, Jacob Bachmeyer, lkcl, Guy Lemieux, RISC-V ISA Dev
There's also the whole issue of supposedly sandboxed untrusted code (possibly interpreted or jitted from a source such as Javascript) running inside another process.


Albert Cahalan

Mar 29, 2018, 9:55:49 AM3/29/18
to jcb6...@gmail.com, lkcl ., Guy Lemieux, RISC-V ISA Dev
On 3/28/18, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> Albert Cahalan wrote:

>> as does the idea that cache lines might suddenly come unpinned,
>
> This is required to support preemptive multitasking, if cachelines are
> to be pinnable from U-mode.

I think that is the problem right there. U-mode normally shouldn't be
able to do this. It's fundamentally an M-mode feature.

More-privileged levels can of course provide handlers to deal with
exceptions from attempting the operation or from explicitly asking
via something like a system call. Intermediate privilege levels can
even translate the request either way between implicit (just try it)
and explicit (do a system call).

More-privileged levels ought to be able to control availability of all
these instructions independently, without revealing if any restrictions
are in use. I think the default ought to be that U-mode gets just
enough to do a JIT, and all other modes get everything enabled.
Hypervisor and M-mode code can disable things as needed for
protection against a hostile OS, if that protection is even desired.

Jacob Bachmeyer

Mar 29, 2018, 7:10:40 PM3/29/18
to Albert Cahalan, lkcl ., Guy Lemieux, RISC-V ISA Dev
Perhaps I am chasing the Golden Fleece, but I would prefer to define
these operations such that they are not abusable. There should be no
need for them to ever trap (excepting page faults under certain conditions).


-- Jacob

Jacob Bachmeyer

Mar 29, 2018, 7:42:10 PM3/29/18
to lkcl ., Guy Lemieux, RISC-V ISA Dev
lkcl . wrote:
> On Thu, Mar 29, 2018 at 12:54 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
>> lkcl wrote:
>>
>>>> No, the program is invalid because it declared intent to write to bytes
>>>> 0-7 (as part of the region 0-15) but did not do so.
>>>>
>>> right. ok. got it. so in that circumstance... that would almost
>>> certainly be Extremely Bad For Security if the contents of the
>>> unwritten cache line are not set to a known (cleared) value.
>>>
>> This is the reason that the proposal explicitly requires that
>> implementations ensure that MEM.DISCARD and MEM.REWRITE cannot introduce
>> foreign data, defined as data not otherwise accessible to the current hart
>> at the current privilege level at any address.
>>
>
> ok so the proposal does define that they cannot introduce (or access)
> foreign data, but that is a requirement not an implementation detail.
>

That is a requirement that implementations must meet, yes. How exactly
they meet that requirement is left to implementors.

>> Data previously accessible
>> at some other address may "appear" in the region, but that is not a security
>> concern.
>>
>
> it sounds to me as if that is precisely *the* definition of a
> security concern! :)
>

Data belonging to the same process appearing "out of thin air" does not
leak information, since that data was simply a LOAD away in any case.
Foreign data belonging to another process appearing "out of thin air"
*would* leak information, be a serious security problem, and is
explicitly prohibited in the proposed spec.

> is it made clear precisely how, through its implementation,
> MEM.REWRITE meets the requirement to not access foreign data in this
> [programmatically-invalid] way?
>

How exactly the implementation meets this requirement is left to
implementors.

> what has me concerned is the potential scenario similar to one of
> those meltdown or spectre-esque things where you swap privilege levels
> or even just swap thread/process contexts, and a given cache line
> contains data that is from a former privilege level.
>

The implementation is not permitted to make that data visible as a
result of MEM.REWRITE -- the cacheline must not be readable until the
foreign data is removed. Draft 5 gives a few options for this, ranging
from implementing MEM.REWRITE by temporarily making the region "write
through", to using dedicated write-combining buffers for MEM.REWRITE, to
using a monitor trap to zero cachelines invalidly read. (Such use of a
monitor trap could produce the (allowed) destruction of data within the
region near an invalid read: if byte 12 is written before an attempt to
read byte 4, and that read traps to the monitor to clear the cacheline,
then the entire 16-byte cacheline may be zeroed by the monitor,
destroying the data previously (validly) written at byte 12.)
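
The byte 12 / byte 4 case can be sketched as a toy Python model. This assumes per-byte tracking of which bytes have been written since MEM.REWRITE, which is a modelling convenience and not something draft 5 specifies; `PendingLine` is an invented name.

```python
LINE = 16  # the 16-byte cacheline from the example


class PendingLine:
    """A line claimed by MEM.REWRITE without fetching, in a toy model."""

    def __init__(self, stale):
        assert len(stale) == LINE
        self.data = bytearray(stale)     # leftover contents, possibly foreign
        self.written = [False] * LINE    # bytes stored since MEM.REWRITE
        self.pending = True              # stale bytes may still be present

    def store(self, off, value):
        self.data[off] = value
        self.written[off] = True

    def load(self, off):
        if self.pending and not self.written[off]:
            # Invalid read: the monitor trap zeroes the entire line,
            # including bytes that had been validly written.
            self.data = bytearray(LINE)
            self.pending = False
        return self.data[off]
```

Storing to byte 12 and then loading byte 4 returns zero, and a subsequent load of byte 12 also returns zero: the stale contents are never observed, at the cost of the earlier valid store.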

Intel's PR aside, Meltdown *is* the result of a defective design.
Period. If the x86 ISA allows that behavior, then the x86 ISA (which
Intel also controls, no excuse there) is defective. There is *no*
*excuse* for Meltdown.

> a security flaw is not necessarily just related to privilege levels,
> it is also a security flaw for different processes to gain access to
> another process' data, because you don't know if those processes are
> supposed to be under different user-ids or not.... therefore you have
> to assume the worst.
>

That is covered under the prohibition on introducing foreign data.
Perhaps this is unclear in the proposal?

> so is it *guaranteed* to be the case, in *all* implementations of
> *all* RISC-V processors, that the cache will *ALWAYS* be cleared
> between process context-switches and also cleared when switching
> privilege levels?
>

There is no requirement to clear the cache, although clearing the cache
is certainly sufficient. Partitioning the cache is the approach I
prefer for closing cache-based side channels, but this proposal does not
itself seek to close existing side channels, only to avoid introducing
new side channels.

An implementation of RISC-V that leaks information across hardware
security boundaries in this way is a defective implementation. The
proposed spec specifically requires that foreign data not be
introduced. An implementation that fails to meet that requirement is
defective and does not correctly implement the cache-control instructions.

>>> of course this will be
>>> "abused" to write memset(0...) but hey, that's not an actual
>>> "problem".
>>>
>> It is a problem: cachelines already present and valid are *not* affected.
>> (Well, they might be, but portable software cannot assume that.)
>>
>
> ok. worth noting then, "don't try abusing this as memset(0)!" :)
>

From draft 5: "NOTE WELL: MEM.REWRITE is *not* memset(3)" :-)

>>>> For regions smaller than a cacheline, MEM.REWRITE reduces to a prefetch
>>>>
>>> ... which is in turn a reduction in performance.
>>>
>> Why? A prefetch simply initiates an operation (memory fetch) that the
>> processor will stall for later when LOAD is executed. If no prefetch the
>> operation initiates as the processor stalls.
>>
>
> if the processor is guaranteed to stall as a result of the prefetch,
> or there are other resources guaranteed to be requested, that would be
> the definition of a reduction in performance, would it not?
>
> if however i have misunderstood and the prefetch does *not*
> necessarily result in a guaranteed stall or a guaranteed request for
> resources (memory?) then there would not be correspondingly a
> guaranteed reduction in performance, and i will stop asking about this
> :)
>

This is correct -- the prefetch could hit in cache, and for a region
smaller than a cacheline, need not block the processor even for
synchronous prefetch since the address to be returned (next cacheline
boundary) is known immediately. (MEM.REWRITE might return the upper
bound of the region instead of a cacheline boundary.)

>>> 20-bit "thing". lots of them in a huge array. or... if that's not a
>>> totally suitable example, then make it a.... 18-byte "thing" in a
>>> 32-byte or a 34-byte "thing" as a statically-allocated array.
>>>
>> If processing objects that are approximately on the scale of a cacheline,
>> the correct solution is to simply prefetch the larger array (or segments
>> thereof) and not worry about it. MEM.REWRITE is appropriate if an entire
>> several-KiB array (like a frame of video) is being copied.
>>
>
> that sounds like a lost opportunity, to me.
>

At some point, the cache's normal operation is what is needed. For data
smaller than a cacheline, this is the case.

>>> a properly-written program, if given direct and explicit access to
>>> the cache line size and boundary, should be capable of doing exactly
>>> that.
>>>
>> That does not work in the presence of migrations between harts with
>> different cache structures,
>>
>
> oh! eek! :) yes, of course it doesn't. dang.
>
>
>> since the migration can occur between any two
>> instructions, introducing a time-of-check-time-of-use condition.
>>
>
> yehyeh. yuck. i'd assumed that the cache line sizes would be
> uniform, such that a program could dynamically allocate matching data
> structures. ok. good call, jacob.
>

There was an issue with some ARM "big.LITTLE" SoCs that had non-uniform
cacheline sizes and the implementing SoC vendor (not ARM itself) screwed
it up. We will have even less control over vendors than ARM.


-- Jacob

Jacob Bachmeyer

Mar 29, 2018, 8:04:16 PM3/29/18
to Cesar Eduardo Barros, lkcl, Guy Lemieux, RISC-V ISA Dev
The process context switch has either changed the current ASID or
changed the page table root for the current ASID. As such, the shared
memory region (as seen by process A) is now foreign data to process B.
Since I expect VIPT caches to be ASID-partitioned (ASID == more
untranslated address bits), process B will effectively have its own
cache independent of process A.

This is the foreign data prohibition from draft 5:

"""
These instructions create regions with undefined contents and share a
requirement that foreign data never be introduced. Foreign data is,
simply, data that was not previously visible to the current hart at the
current privilege level at any address. Operating systems zero pages
before attaching them to user address spaces to prevent foreign data
from appearing in freshly-allocated pages. Implementations must ensure
that these instructions do not cause foreign data to leak through caches
or other structures.
"""

Advice on improving it is welcome; it is intended to cover the case
mentioned above.

> I think MEM.REWRITE must be made stricter. Each byte in the region
> affected by MEM.REWRITE must either keep its current value, or be set
> to a deterministic value. In a common implementation, the current
> value would be used when the cache line was already allocated for that
> address, and a deterministic value (probably zero) would be used when
> it had to allocate a new cache line.

That was CACHE.PREZERO in a previous draft. The main reason that I want
to leave the non-deterministic option is that some implementations might
be able to safely elide the cacheline load entirely and just set the tag
to "valid, exclusive" in at least some cases. (If hardware cannot
*prove* that the cacheline contains previously-accessible data, it
*must* be overwritten with some constant before the program is permitted
to load from that cacheline.)

> Also, this shows that MEM.REWRITE followed by a read instead of a
> write can happen on a valid program: the program itself might have
> intended to write immediately after the MEM.REWRITE, but it can be
> preempted and something else sharing the same memory (or a debugger,
> or a debugging-like tool) can read from that cache line before the
> program gets its time slice back. Simply saying "don't do that" is not
> enough. Even saying "flush the cache when switching processes" is not
> enough, since what reads the memory can be a BPF program for a seccomp
> filter running on the kernel (entering the kernel doesn't count as a
> process switch).

Entering the supervisor is a privilege level change, however, and no
data accessible to any user process is foreign to the supervisor.
(There might be a spectre haunting the supervisor here, though.)
Similarly for the debugger: no data accessible to the debugged program
is foreign to the debugger. This might cause a problem for the "monitor
trap" option, but could also be avoided by partitioning the cache: the
debugger will not hit the same cacheline as the user process itself,
since ptrace(2) goes through the supervisor.

This could complicate inter-process cache coherency, suggesting that
CACHE.WRITEBACK, when executed in S-mode, needs to be able to target an
ASID or that a new "SFENCE.CACHE" instruction for writing back
cachelines associated with a non-current ASID may be needed.

> And a "debugging-like tool" includes rr, which really doesn't like
> non-determinism (it already has problems with lr/sc; in some other
> thread we were discussing ways of trapping a failed sc so rr can
> record it). For tools like that, it would be good to be able to switch
> MEM.REWRITE into a deterministic mode (either turning into a prefetch,
> or always zeroing, or even trapping).

This is the first legitimate argument I have seen for allowing any of
the cache-control instructions to be explicitly trapped. As an
alternative (and to avoid a proliferation of control flags), could a
future "deterministic execution" extension add a single "deterministic
execution" flag that, when set, would (among other effects) make
MEM.REWRITE deterministic (or a no-op)?


-- Jacob

Jacob Bachmeyer

Mar 29, 2018, 8:05:38 PM3/29/18
to Bruce Hoult, Cesar Eduardo Barros, lkcl, Guy Lemieux, RISC-V ISA Dev
Bruce Hoult wrote:
> There's also the whole issue of supposedly sandboxed untrusted code
> (possibly interpreted or jitted from a source such as Javascript)
> running inside another process.

That does not cross a hardware security boundary, so is out-of-scope for
this proposal. Presumably, an interpreter or JIT will not produce
MEM.REWRITE.


-- Jacob

Cesar Eduardo Barros

Mar 29, 2018, 9:56:21 PM3/29/18
to jcb6...@gmail.com, Bruce Hoult, lkcl, Guy Lemieux, RISC-V ISA Dev
On 29-03-2018 21:05, Jacob Bachmeyer wrote:
> Bruce Hoult wrote:
>> There's also the whole issue of supposedly sandboxed untrusted code
>> (possibly interpreted or jitted from a source such as Javascript)
>> running inside another process.
>
> That does not cross a hardware security boundary, so is out-of-scope for
> this proposal.  Presumably, an interpreter or JIT will not produce
> MEM.REWRITE.

But code outside the interpreter/JIT (but within the same process) might
still use MEM.REWRITE. For instance, if there's a shared buffer between
the sandboxed code and normal code outside the sandbox, and the normal
code attempts to memcpy() something into that shared buffer, and the
memcpy() code uses MEM.REWRITE, you would have a leak.

Jacob Bachmeyer

Mar 29, 2018, 10:27:30 PM3/29/18
to Cesar Eduardo Barros, Bruce Hoult, lkcl, Guy Lemieux, RISC-V ISA Dev
The leak is only possible if the (trusted) host memcpy() implementation
uses MEM.REWRITE incorrectly and does not overwrite the entire region
that it declares as its destination.


-- Jacob

Cesar Eduardo Barros

Mar 29, 2018, 10:54:38 PM
to jcb6...@gmail.com, lkcl, Guy Lemieux, RISC-V ISA Dev
I don't expect that to be the case, for the simple reason that I don't
recall ever hearing about any real processor using the ASID to partition
the L1 cache. And I can guess why: for the still very common
single-threaded case, that would waste more than half of the cache.
(Although the ghost of Spectre might change things; we'll see.)

The ASID could be part of the tag instead, but I also don't expect that
to be the case, since it would make the tag bigger for no real gain (a
physical address is the same memory, no matter which ASID was used to
access it).

>
> This is the foreign data prohibition from draft 5:
>
> """
> These instructions create regions with undefined contents and share a
> requirement that foreign data never be introduced.  Foreign data is,
> simply, data that was not previously visible to the current hart at the
> current privilege level at any address.  Operating systems zero pages
> before attaching them to user address spaces to prevent foreign data
> from appearing in freshly-allocated pages.  Implementations must ensure
> that these instructions do not cause foreign data to leak through caches
> or other structures.
> """
>
> Advice on improving it is welcome; it is intended to cover the case
> mentioned above.

I see at least three issues with that wording:

1. It implies that the "foreign data is not introduced" rule applies at
the moment the undefined content is created. Instead, it should apply
when the undefined content is _read_. That is, if MEM.REWRITE is used on
a shared memory area, the data which magically appears at the undefined
region must not be "foreign data" for the process which is _reading_ it,
which is not necessarily the process which did the MEM.REWRITE.

This happens naturally on an ASID-split cache (in which MEM.REWRITE
affects only one ASID), since only one of the processes will see the
effects of the MEM.REWRITE (but then you have all the headaches of
cacheline aliasing, since the other ASID might also have it cached). It
also happens naturally if the undefined content can come only from the
same cacheline or from the corresponding memory location, or if the
undefined content is all zeros or all ones.

2. Even a completely separate process can be in the same hart, and will
be at the same privilege level. That is, even in the absence of shared
memory, this definition does not respect process isolation.

3. The "ghost from the past". In cryptography, there are situations in
which you *must* completely erase a value from existence (in particular,
forward secrecy requires that the key be ephemeral; also, some
algorithms break if even one bit of their intermediate values is
leaked). This definition uses "previously visible" which implies that
any value which ever existed in the memory of the process could come
back from the grave.

>> I think MEM.REWRITE must be made stricter. Each byte in the region
>> affected by MEM.REWRITE must either keep its current value, or be set
>> to a deterministic value. In a common implementation, the current
>> value would be used when the cache line was already allocated for that
>> address, and a deterministic value (probably zero) would be used when
>> it had to allocate a new cache line.
>
> That was CACHE.PREZERO in a previous draft.  The main reason that I want
> to leave the non-deterministic option is that some implementations might
> be able to safely elide the cacheline load entirely and just set the tag
> to "valid, exclusive" in at least some cases.  (If hardware cannot
> *prove* that the cacheline contains previously-accessible data, it
> *must* be overwritten with some constant before the program is permitted
> to load from that cacheline.)

You can still elide the cacheline load and still keep it
non-deterministic by allowing "either zero or valid", that is: if the
cacheline was already valid, make it exclusive (keeping the contents);
if the cacheline was not valid, make a new valid and exclusive cacheline
filled with zeros. The only bus traffic is to make it exclusive, and it
never has to go to memory (or to another hart) to fill the cacheline.
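
As a rough sketch of that "either zero or valid" rule (the function name,
the dict-based cache, and the 64-byte default line size are assumptions for
illustration only):

```python
def rewrite_zero_or_valid(cache, line_addr, line_size=64):
    """Return the line contents after a 'zero or valid' MEM.REWRITE.

    cache maps line-aligned addresses to bytearrays for lines that are valid.
    """
    if line_addr in cache:
        return cache[line_addr]              # already valid: keep the contents
    cache[line_addr] = bytearray(line_size)  # not valid: allocate zero-filled
    return cache[line_addr]
```

Either branch avoids a fill from memory; only the upgrade to exclusive
(not modelled here) would need bus traffic.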

Of course, if you already have the hardware to set the cacheline to
all-zeros, it's simpler to always zero it, and it becomes PREZERO again.
However, I can imagine a (slightly convoluted) implementation in which
this is not the case: instead of being able to zero the cacheline, it
does a cacheline fill from a dummy "always return zeros" memory bus. For
that implementation, zeroing a valid cacheline would require a discard
followed by a fill.

I have, however, a good argument for REWRITE instead of PREZERO: for
REWRITE, a "do nothing" implementation is acceptable.

>> Also, this shows that MEM.REWRITE followed by a read instead of a
>> write can happen on a valid program: the program itself might have
>> intended to write immediately after the MEM.REWRITE, but it can be
>> preempted and something else sharing the same memory (or a debugger,
>> or a debugging-like tool) can read from that cache line before the
>> program gets its time slice back. Simply saying "don't do that" is not
>> enough. Even saying "flush the cache when switching processes" is not
>> enough, since what reads the memory can be a BPF program for a seccomp
>> filter running on the kernel (entering the kernel doesn't count as a
>> process switch).
>
> Entering the supervisor is a privilege level change, however, and no
> data accessible to any user process is foreign to the supervisor. (There
> might be a spectre haunting the supervisor here, though.) Similarly for
> the debugger:  no data accessible to the debugged program is foreign to
> the debugger.  This might cause a problem for the "monitor trap" option,
> but could also be avoided by partitioning the cache:  the debugger will
> not hit the same cacheline as the user process itself, since ptrace(2)
> goes through the supervisor.

However, the debugger runs in S-mode (or at least asks S-mode to read
from the process), so if nothing is foreign to the supervisor, the
debugger could read anything, including the supervisor itself.

> This could complicate inter-process cache coherency, suggesting that
> CACHE.WRITEBACK, when executed in S-mode, needs to be able to target an
> ASID or that a new "SFENCE.CACHE" instruction for writing back
> cachelines associated with a non-current ASID may be needed.

That is only the case if you allow aliases in the cache. In the absence
of aliases, a given physical address can only be in a single cacheline,
and a writeback (or anything else) to any virtual address mapping to
that physical address will work as expected.

>> And a "debugging-like tool" includes rr, which really doesn't like
>> non-determinism (it already has problems with lr/sc; in some other
>> thread we were discussing ways of trapping a failed sc so rr can
>> record it). For tools like that, it would be good to be able to switch
>> MEM.REWRITE into a deterministic mode (either turning into a prefetch,
>> or always zeroing, or even trapping).
>
> This is the first legitimate argument I have seen for allowing any of
> the cache-control instructions to be explicitly trapped.  As an
> alternative (and to avoid a proliferation of control flags), could a
> future "deterministic execution" extension add a single "deterministic
> execution" flag, one effect of setting which would be making MEM.REWRITE
> deterministic (or a no-op)?

Most people will simply set the "deterministic" bit and leave it
enabled. There are fewer debugging headaches that way.

But keep in mind that there are two levels of "deterministic". There's
"will always do the same thing on this chip", and there's "will always
do the same thing on every chip which implements this standard". An
example of the latter is the DIV instruction, where the result of a
division by zero is precisely defined.
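
For reference, a small model of the signed DIV results in the M extension:
division by zero yields all ones (-1), signed overflow yields the dividend,
and no trap is raised in either case. The function name `riscv_div` and the
default `xlen` are my own choices for this sketch.

```python
def riscv_div(dividend, divisor, xlen=64):
    """Signed DIV result as defined by the RISC-V M extension (never traps)."""
    min_int = -(1 << (xlen - 1))
    if divisor == 0:
        return -1                      # all bits set
    if dividend == min_int and divisor == -1:
        return min_int                 # signed overflow returns the dividend
    q = abs(dividend) // abs(divisor)  # DIV rounds toward zero
    return -q if (dividend < 0) != (divisor < 0) else q
```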

Cesar Eduardo Barros

Mar 29, 2018, 11:02:12 PM
to jcb6...@gmail.com, Bruce Hoult, lkcl, Guy Lemieux, RISC-V ISA Dev
There's a time interval between preparing the cacheline with MEM.REWRITE
and writing to it. In that time interval, another thread (in the same
process, so same ASID) can read the undefined values. This thread might
also be running sandboxed code (for instance, a JS worker thread sharing
a SharedArrayBuffer with the other thread).

That is: it's not enough that all code which uses MEM.REWRITE be trusted
to overwrite all the undefined memory. It's also necessary that the
memory not be shared with untrusted code while it's undefined.

lkcl .

Mar 30, 2018, 12:14:07 AM
to Jacob Bachmeyer, Guy Lemieux, RISC-V ISA Dev
On Fri, Mar 30, 2018 at 12:42 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

>> ok so the proposal does define that they cannot introduce (or access)
>> foreign data, but that is a requirement not an implementation detail.
>
> That is a requirement that implementations must meet, yes. How exactly they
> meet that requirement is left to implementors.

so, my point being: this stuff is hellishly complicated, and there's
a break (a disconnect) between the requirement and how it actually
should be met. leaving something this complex up to implementors is
pretty much guaranteed to end up with someone getting it wrong. it
would be far better, i feel, for the specification to state something
like this [sorry for not knowing exactly the right phrases, you know
them better than i do]:

section N.N: meeting the requirement of not introducing or accessing
foreign data.

in order to meet this requirement, implementors MUST choose one and
ONLY one of the following schemes:

A MEM.REWRITE and all other cache instructions must do absolutely
nothing at all

B MEM.REWRITE must be implemented as zeroing all data

C MEM.REWRITE must be implemented as xyz abc

implementors are NOT permitted to create their own implementation.
an implementation which does NOT follow one and only one of the above
schemes is NOT conformant with this specification.


that would make it ABSOLUTELY clear to implementors how to go about
things in a way that does not require them to think or "get creative"
[and possibly get things wrong]. each section A B C (and D? and E?)
could have its own sub-section describing PRECISELY how the
recommended method satisfies, in its own way, the requirement. doing
so would give implementors the confidence that the method had been
properly reviewed.

i would also strongly suggest that there be *no* wiggle-room for
implementors to go about "rolling their own" method of meeting this
requirement. that would force implementors to approach the RISC-V
Foundation for a revision of the extension, thus in turn requiring a
full and public review of the implementation that they propose.

l.

Aaron Severance

Mar 30, 2018, 3:28:26 PM
to Cesar Eduardo Barros, jcb6...@gmail.com, lkcl, Guy Lemieux, RISC-V ISA Dev
Agreed; I can't see wanting to implement REWRITE as anything other than a prefetch exclusive or no-op.  The same issue applies to DISCARD, which I think (as opposed to REWRITE) is necessary for certain use-cases.  I think for both there are three ways you might want to implement them:
1) Non-deterministic (as spec'd).
2) Deterministic (REWRITE does prefetch/no-op, DISCARD does WRITEBACK+INVALIDATE).
3) Trap and let a higher privilege level handle it.

Ideally I'd want a per-privilege-level setting that restricts the destructive ops to deterministic or trapping behaviour.  The hardware cost of such a check seems minimal, but the benefit of being able to force deterministic behaviour (or trapping) at lower privilege levels seems great.
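
A minimal sketch of what such a per-privilege-level control could look like. Everything here (the `Policy` enum, `resolve`, and the example configuration) is hypothetical; the proposal defines no such control, and the deterministic substitutes are the ones from the list above.

```python
from enum import Enum


class Policy(Enum):
    NONDETERMINISTIC = "as spec'd"
    DETERMINISTIC = "deterministic"
    TRAP = "trap"


# Deterministic substitutes named in the list above.
DETERMINISTIC_FORM = {
    "REWRITE": "PREFETCH",               # or a no-op
    "DISCARD": "WRITEBACK+INVALIDATE",
}


def resolve(op, priv, policy_by_priv):
    """Return the action hardware would take for a destructive op at priv."""
    policy = policy_by_priv[priv]
    if policy is Policy.TRAP:
        return "TRAP"
    if policy is Policy.DETERMINISTIC:
        return DETERMINISTIC_FORM[op]
    return op  # non-deterministic behaviour is allowed at this level


# Hypothetical configuration: full behaviour in M-mode, restricted below it.
EXAMPLE = {
    "M": Policy.NONDETERMINISTIC,
    "S": Policy.DETERMINISTIC,
    "U": Policy.TRAP,
}
```

With `EXAMPLE`, a DISCARD in S-mode becomes WRITEBACK+INVALIDATE, and any destructive op in U-mode traps.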

Also a true non-deterministic DISCARD at user level may have some use, but in general I think it would cause a lot of pain.  At the very least, an OS initializing a page for a process would need to ensure that the zeroed data had been written out all the way to memory; if it zeroed the page only in the cache and the user process then discarded it, the user process could get foreign data.  Higher privilege levels should have the option to make DISCARD non-destructive; otherwise you're creating a performance hit for a common case (initializing a page, which now has to be written back to memory) to support a case that may not be desired at all (a user process destroying data in cache).
 

Jacob Bachmeyer

Mar 30, 2018, 9:58:12 PM
to lkcl ., Guy Lemieux, RISC-V ISA Dev
lkcl . wrote:
> On Fri, Mar 30, 2018 at 12:42 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
>>> ok so the proposal does define that they cannot introduce (or access)
>>> foreign data, but that is a requirement not an implementation detail.
>>>
>> That is a requirement that implementations must meet, yes. How exactly they
>> meet that requirement is left to implementors.
>>
>
> so, my point being: this stuff is hellishly complicated, and there's
> a break (a disconnect) between the requirement and how it actually
> should be met. leaving something this complex up to implementors is
> pretty much guaranteed to end up with someone getting it wrong. it
> would be far better, i feel, for the specification to state something
> like this [sorry for not knowing exactly the right phrases, you know
> them better than i do]:
>
> [...]
>
> that would make it ABSOLUTELY clear to implementors how to go about
> things in a way that does not require them to think or "get creative"
> [and possibly get things wrong]. each section A B C (and D? and E?)
> could have its own sub-section describing PRECISELY how the
> recommended method satisfies, in its own way, the requirement. doing
> so would give implementors the confidence that the method had been
> properly reviewed.
>

That directly conflicts with the general principle of "mechanism not
policy" that leads to enduring standards. Perhaps a set of "known good"
memory subsystems will be developed, possibly as part of Rocket, but
that really is out of scope for this proposal.

> i would also strongly suggest that there be *no* wiggle-room for
> implementors to go about "rolling their own" method of meeting this
> requirement. that would force implementors to approach the RISC-V
> Foundation for a revision of the extension, thus in turn requiring a
> full and public review of the implementation that they propose.
>

While I believe that only the insane would prefer a closed
implementation over an open implementation, I see no reason to
effectively force all implementations to be open. A vendor that
produces a defective product will acquire the reputation that they deserve.


-- Jacob

Jacob Bachmeyer

Mar 30, 2018, 10:06:44 PM
to Cesar Eduardo Barros, Bruce Hoult, lkcl, Guy Lemieux, RISC-V ISA Dev
Such a program is technically invalid, but that is an easy mistake to
make, much easier than the single-thread case where the code is
obviously wrong. In this case, the code is subtly wrong only in a
preemptive multithreaded environment. Would the interpreter otherwise
need to take steps to ensure that a reader does not see a
half-overwritten buffer or are there scenarios where synchronization
that would prevent this situation is not already needed for basic
correctness?

> That is: it's not enough that all code which uses MEM.REWRITE be
> trusted to overwrite all the undefined memory. It's also necessary
> that the memory not be shared with untrusted code while it's undefined.

In this context, "shared" gets complicated itself -- two harts may have
mappings to the same page but independent caches. In that case, hart
B's read would force hart A's cacheline to be written back, possibly
after only being partially updated, potentially causing an information
leak across that software boundary, but not across any hardware-enforced
boundary.

Is there a viable means for software to work around this or must
hardware address this potential issue?


-- Jacob

Jacob Bachmeyer

Mar 30, 2018, 10:19:22 PM
to Aaron Severance, Cesar Eduardo Barros, lkcl, Guy Lemieux, RISC-V ISA Dev
Aaron Severance wrote:
> On Thu, Mar 29, 2018 at 7:54 PM Cesar Eduardo Barros
> <ces...@cesarb.eti.br <mailto:ces...@cesarb.eti.br>> wrote:
>
> On 29-03-2018 21:04, Jacob Bachmeyer wrote:
> > Cesar Eduardo Barros wrote:
>
> [...]
All three of these should be permitted under the current spec. Case (2)
exploits the loophole that hardware prefetching is deliberately left
implementation-defined, so cachelines targeted by REWRITE can
"magically" appear to have already been cached and lines targeted by
DISCARD similarly have already been written back. There is another
variant of case (2), where cachelines allocated by REWRITE are loaded
with some hardware-defined constant. (Said constant can be any value,
but must in fact be constant for any particular hart. This satisfies
the "no foreign data" requirement.)

> Ideally I'd want the destructive ops to be set on a per-privilege
> level to only be allowed to do deterministic or trapping behaviour.
> The hardware cost to doing a check for this seems minimal, but the
> benefit of being able to force deterministic behaviour (or trapping)
> at lower privilege levels seems great.
>
> Also a true non-deterministic DISCARD at user level may have some use
> but in general I think it would cause so much pain. At the very least
> an OS initializing a page for a process would need to ensure that
> zeroed data had been written out all the way to memory; if it zeroed a
> page in the cache then the user process discarded it the user process
> could get foreign data. Higher privilege levels should have the
> option to make DISCARD non-destructive; otherwise you're creating a
> performance hit for a common case (initializing a page, which now has
> to be written back to memory) for a case that may not be desired at
> all (user process can destroy data in cache).

Easy enough to solve: DISCARD does not affect writes from a higher
privilege level than the level executing DISCARD. Alternately,
asynchronous WRITEBACK sets a barrier that DISCARD does not cross.
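
A sketch of the first rule, assuming (hypothetically) that each dirty
cacheline records the privilege level of the write that dirtied it; the
function name and the "preserve"/"drop" labels are illustrative only:

```python
PRIV_RANK = {"U": 0, "S": 1, "M": 3}  # RISC-V privilege level encodings


def discard_action(dirtied_by, executed_by):
    """Decide what DISCARD does to one dirty line.

    dirtied_by:  privilege level of the write that made the line dirty
    executed_by: privilege level executing DISCARD
    """
    if PRIV_RANK[dirtied_by] > PRIV_RANK[executed_by]:
        return "preserve"  # the higher-privilege write must still reach memory
    return "drop"          # same or lower privilege: the write may be lost
```

For the page-zeroing case, `discard_action("S", "U")` preserves the
supervisor's zeros even though they have not yet been written back.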

There is also the issue that practical systems (at least Linux, as I
understand) keep a "ready pool" of pre-zeroed pages to satisfy minor
page faults. (minor page fault == no I/O needed) For this, some way of
zeroing a page without polluting the cache would be useful.


-- Jacob

Jacob Bachmeyer

Mar 30, 2018, 11:26:03 PM
to Cesar Eduardo Barros, lkcl, Guy Lemieux, RISC-V ISA Dev
My expectation for partitioned caches directly stems from Spectre, yes.

> The ASID could be part of the tag instead, but I also don't expect
> that to be the case, since it would make the tag bigger for no real
> gain (a physical address is the same memory, no matter which ASID was
> used to access it).

For a VIPT cache, the maximum size of the cache is limited by the number
of untranslated address bits available, which is 12 in RISC-V. For a
larger cache, tricks such as allowing a cacheline to appear in any of
several locations must be used, essentially grouping several VIPT
caches. ASID partitioning provides additional untranslated address
bits. With power density becoming a limiting factor in modern design,
(which will be even worse with 3D microfabrication) additional cache
that is mostly inactive may have lower costs than in older processes. I
am unsure how this trade-off currently balances.
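
To put numbers on the VIPT limit: with 12 untranslated bits, index plus
line-offset bits must fit in 12 bits, so each way spans at most 4 KiB and
total capacity scales with associativity. The helper names below are
hypothetical, and ASID partitioning is not modelled.

```python
def max_vipt_bytes(ways, untranslated_bits=12):
    """Largest alias-free VIPT cache: each way spans at most one page."""
    return ways * (1 << untranslated_bits)


def sets_per_way(line_bytes, untranslated_bits=12):
    """Number of sets when index + offset use all untranslated bits."""
    return (1 << untranslated_bits) // line_bytes
```

For example, `max_vipt_bytes(8)` gives 32768 bytes (32 KiB), and 64-byte
lines give `sets_per_way(64)` = 64 sets.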

>> This is the foreign data prohibition from draft 5:
>>
>> """
>> These instructions create regions with undefined contents and share a
>> requirement that foreign data never be introduced. Foreign data is,
>> simply, data that was not previously visible to the current hart at
>> the current privilege level at any address. Operating systems zero
>> pages before attaching them to user address spaces to prevent foreign
>> data from appearing in freshly-allocated pages. Implementations must
>> ensure that these instructions do not cause foreign data to leak
>> through caches or other structures.
>> """
>>
>> Advice on improving it is welcome; it is intended to cover the case
>> mentioned above.
>
> I see at least three issues with that wording:
>
> 1. It implies that the "foreign data is not introduced" rule applies
> at the moment the undefined content is created.

That was the initial intent, using the reasoning that if the undefined
content can never be foreign data, there is no concern about reading
foreign data.

> Instead, it should apply when the undefined content is _read_. That
> is, if MEM.REWRITE is used on a shared memory area, the data which
> magically appears at the undefined region must not be "foreign data"
> for the process which is _reading_ it, which is not necessarily the
> process which did the MEM.REWRITE.

This is an interesting issue and I think I may be seeing it now: hart A
issues MEM.REWRITE on a shared buffer and the cachelines thus allocated
have previously mapped hart A's stack. As the stack is well-defined and
not foreign, the cachelines now map the shared buffer and are "valid,
exclusive", but still contain the words from hart A's stack. Hart A
begins to write data to the shared buffer; the cacheline at issue is now
"valid, exclusive, dirty" and its contents are (1) part of the intended
data and (2) leftover words from hart A's stack. Hart B is running a
different program with access to the same shared buffer. Due to either
a synchronization bug or malicious program running on hart B, hart B
attempts to read from the shared buffer and reads the cacheline that
contains, at that moment, 50% intended data and 50% stack leftovers.
Hardware coherency forces hart A to writeback the cacheline so hart B
can read it. Hart B now has words from hart A's stack that are foreign
data at hart B. Oops.

The best option I can see right now would be for hart A to hold off hart
B's read until the cacheline at issue is rewritten, but that is a new
problem if the program on hart A needs significant time to calculate the
next value to write to the shared buffer. Hardware does not know if any
given page is visible to other processes, so the solution must also work
in the single-thread case with good performance. Another option is to
require sequential writes to a rewritten region (permitting hardware to
simply "fill in" any gaps that software attempts to leave) and then
track the progress of rewriting. This would require additional tag
states "valid, exclusive, rewrite pending" that inhibits writeback of
that cacheline on a hardware coherency event and "valid, exclusive,
rewrite active" with an additional tracking mask that allows hardware to
replace not-yet-written bytes with some constant if a cacheline must be
written back to satisfy a request from another hart. The "valid,
exclusive, rewrite pending" state may be identical to "valid, exclusive,
clean" and the additional tracking mask need not be per-cacheline, due
to the (new) sequential rewrite requirement that ensures only one
cacheline can be in the "rewrite active" state at a time.
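
A sketch of the "rewrite active" tracking idea, under stated assumptions:
`RewriteLine`, the `FILL` constant, and the single progress counter standing
in for the tracking mask are all hypothetical, and stores are assumed to
stay within one line.

```python
FILL = 0x00  # hypothetical constant substituted for not-yet-written bytes


class RewriteLine:
    """One cacheline in the 'rewrite active' state."""

    def __init__(self, stale):
        self.data = bytearray(stale)  # leftovers from the previous mapping
        self.written = 0              # bytes [0, written) have been rewritten

    def store(self, offset, payload):
        """Sequential-write rule: hardware fills any gap software leaves."""
        if offset > self.written:
            gap = offset - self.written
            self.data[self.written:offset] = bytes([FILL]) * gap
        self.data[offset:offset + len(payload)] = payload
        self.written = max(self.written, offset + len(payload))

    def writeback(self):
        """Coherency-forced writeback: mask out bytes not yet rewritten."""
        tail = bytes([FILL]) * (len(self.data) - self.written)
        return bytes(self.data[:self.written]) + tail
```

A line that previously held stack words and has only been half rewritten
then writes back the new half followed by the constant, so the stale words
never leave hart A's cache.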

The above follows a strict memory model, that may be stricter than RVWMO
or even RVTSO. Could an implementation simply decide that the writes to
a partially-rewritten cacheline "have not happened yet" when a coherency
event forces that cacheline to be written back? In other words, instead
of leaking foreign data, pretend the cacheline was clean, so hart B will
fetch it from memory? (The next write by hart A of course would make it
"valid, exclusive, dirty, rewrite active" again since it was never
removed from hart A's cache.)

> This happens naturally on an ASID-split cache (in which MEM.REWRITE
> affects only one ASID), since only one of the processes will see the
> effects of the MEM.REWRITE (but then you have all the headaches of
> cacheline aliasing, since the other ASID might also have it cached).

I would expect the coherency protocol to handle this, since this
aliasing is not a new situation.

> It also happens naturally if the undefined content can come only from
> the same cacheline or from the corresponding memory location, or if
> the undefined content is all zeros or all ones.

The original idea was that the undefined content is the previous content
of the cacheline allocated by MEM.REWRITE, so it is the contents of
whatever memory that cacheline previously mapped.

> 2. Even a completely separate process can be in the same hart, and
> will be at the same privilege level. That is, even in the absence of
> shared memory, this definition does not respect process isolation.

I believe that this was covered, but this is a possible loophole with a
skewed interpretation. I agree that the definition needs to be
tightened here, as process isolation was implicit: all data accessible
immediately before MEM.REWRITE is part of the same address space in
which MEM.REWRITE is executed.

> 3. The "ghost from the past". In cryptography, there are situations in
> which you *must* completely erase a value from existence (in
> particular, forward secrecy requires that the key be ephemeral; also,
> some algorithms break if even one bit of their intermediate values is
> leaked). This definition uses "previously visible" which implies that
> any value which ever existed in the memory of the process could come
> back from the grave.

That is broader than intended. Generally, overwriting a value "buries
it" and it will not rise again. MEM.DISCARD can possibly raise
phantoms, but if you are intent on destroying a value, you will issue
WRITEBACK and FENCE after zeroing it to ensure that the
overwrite-with-zero is forced to memory before another hart can read the
old value. Unlike the case of a supervisor zeroing a page for reuse,
performance is not an issue here: the value must be destroyed, no
matter how long it takes.

A revised version of that definition: Foreign data is, simply, data
that, immediately prior to the execution of an instruction, could not be
the result of an ordinary LOAD issued for an arbitrary address in place
of that instruction on the current hart or a hypothetical hart operating
in the same address space with a different path to memory at the same
privilege level.

The intent is that MEM.REWRITE may produce arbitrary "aliased copies"
within an address space, but never across address spaces or privilege
levels. Similarly, MEM.DISCARD can cause writes that "have not happened
yet" to "never happen" instead of eventually becoming globally visible.
Eliding a cacheline load was meant to mean not altering the contents of
a cacheline, presumably to save energy or heat.

> I have, however, a good argument for REWRITE instead of PREZERO: for
> REWRITE, a "do nothing" implementation is acceptable.

That is one of the goals: a no-op should be acceptable in most systems
for most or all of the proposed instructions. MEM.REWRITE, at least, is
a performance hint.

>>> Also, this shows that MEM.REWRITE followed by a read instead of a
>>> write can happen on a valid program: the program itself might have
>>> intended to write immediately after the MEM.REWRITE, but it can be
>>> preempted and something else sharing the same memory (or a debugger,
>>> or a debugging-like tool) can read from that cache line before the
>>> program gets its time slice back. Simply saying "don't do that" is
>>> not enough. Even saying "flush the cache when switching processes"
>>> is not enough, since what reads the memory can be a BPF program for
>>> a seccomp filter running on the kernel (entering the kernel doesn't
>>> count as a process switch).
>>
>> Entering the supervisor is a privilege level change, however, and no
>> data accessible to any user process is foreign to the supervisor.
>> (There might be a spectre haunting the supervisor here, though.)
>> Similarly for the debugger: no data accessible to the debugged
>> program is foreign to the debugger. This might cause a problem for
>> the "monitor trap" option, but could also be avoided by partitioning
>> the cache: the debugger will not hit the same cacheline as the user
>> process itself, since ptrace(2) goes through the supervisor.
>
> However, the debugger runs in S-mode (or at least asks S-mode to read
> from the process), so if nothing is foreign to the supervisor, the
> debugger could read anything, including the supervisor itself.

If the debugger is actually running in S-mode, then that is correct.
Otherwise, since debugging access is mediated by the supervisor, the
supervisor must take care in what it presents to the debugger. That is
no different from current systems.

>> This could complicate inter-process cache coherency, suggesting that
>> CACHE.WRITEBACK, when executed in S-mode, needs to be able to target
>> an ASID or that a new "SFENCE.CACHE" instruction for writing back
>> cachelines associated with a non-current ASID may be needed.
>
> That is only the case if you allow aliases in the cache. In the
> absence of aliases, a given physical address can only be in a single
> cacheline, and a writeback (or anything else) to any virtual address
> mapping to that physical address will work as expected.

Am I mistaken that "clean" cachelines are normally allowed to be
shared? Aliases only need to be removed when a cacheline is to be made
"valid, exclusive"?

>>> And a "debugging-like tool" includes rr, which really doesn't like
>>> non-determinism (it already has problems with lr/sc; in some other
>>> thread we were discussing ways of trapping a failed sc so rr can
>>> record it). For tools like that, it would be good to be able to
>>> switch MEM.REWRITE into a deterministic mode (either turning into a
>>> prefetch, or always zeroing, or even trapping).
>>
>> This is the first legitimate argument I have seen for allowing any of
>> the cache-control instructions to be explicitly trapped. As an
>> alternative (and to avoid a proliferation of control flags), could a
>> future "deterministic execution" extension add a single
>> "deterministic execution" flag, one effect of setting which would be
>> making MEM.REWRITE deterministic (or a no-op)?
>
> Most people will simply set the "deterministic" bit and leave it
> enabled. There are fewer debugging headaches that way.

Yes, but also reduced performance (or there would be no reason to have
the option).

> But keep in mind that there are two levels of "deterministic". There's
> "will always do the same thing on this chip", and there's "will always
> do the same thing on every chip which implements this standard". An
> example of the latter is the DIV instruction, where the result of a
> division by zero is precisely defined.

That would be part of the hypothetical "deterministic execution" extension.


-- Jacob

lkcl .

Mar 31, 2018, 2:12:12 AM
to Jacob Bachmeyer, Guy Lemieux, RISC-V ISA Dev
On Sat, Mar 31, 2018 at 2:58 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> lkcl . wrote:
>>
>> On Fri, Mar 30, 2018 at 12:42 AM, Jacob Bachmeyer <jcb6...@gmail.com>
>> wrote:
>>
>>>>
>>>> ok so the proposal does define that they cannot introduce (or access)
>>>> foreign data, but that is a requirement not an implementation detail.
>>>>
>>>
>>> That is a requirement that implementations must meet, yes. How exactly
>>> they
>>> meet that requirement is left to implementors.

>> [...]

>> that would make it ABSOLUTELY clear to implementors how to go about
>> things in a way that does not require them to think or "get creative"
>> [and possibly get things wrong]. each section A B C (and D? and E?)
>> could have its own sub-section describing PRECISELY how the
>> recommended method satisfies, in its own way, the requirement. doing
>> so would give implementors the confidence that the method had been
>> properly reviewed.

> That directly conflicts with the general principle of "mechanism not policy"
> that leads to enduring standards.

let me think it through [out loud]. having developed a long-term
standard, i had to make a study of... how to develop standards :)
what i found was that enduring standards are ones that have absolutely
nothing that is ambiguous, where everything is absolutely clear,
*nothing* is "optional" and where if there is intended to be any
extensibility that there is a "lowest common denominator" mandatory
interoperability where every "level up" is also mandatory.

good examples include SATA, USB and PCIe. the only implementation of
USB that i know of which failed to respect USB2's rules is the OMAP
35xx series, which bizarrely fails to be USB 1.1 compliant (probably
due to a silicon erratum). SDIO *used* to be a good example except
they removed SPI backwards-interoperability for version 4.0 (everyone
ignores that, and has SPI implemented because it's too much work to
rip it out of their implementations). the best examples of
"aggregation" standards that i know of - they're both absolutely
awesome and leave nothing ambiguous at all - are PC-104 (and variants)
and COM-Express.

so against that test criteria, in my rush to point out i was deeply
concerned about security here, what did i not say? no upgradeability
/ extensibility was mentioned... because i don't have the technical
expertise here to suggest or describe any future extensions. i also
didn't mention that i feel that there should be deliberate mention of
how vendors should go about getting approval for an alternative
implementation that conforms to the requirement.

open questions:

* is there anything missing from that template / boilerplate of what
i believe a "good standard" is? i'd genuinely like to know so that i
can learn from that.

* if you were under the impression that i'd described a "policy" not
a "mechanism", then i will definitely have mis-communicated. how did
that happen? was it specifically the paragraph where i suggested
adding a "rationale" (a description of how each "mechanism" can be
shown to meet the "requirement")?

* anything else. i'm keenly aware that i don't have the deep
technical knowledge on this particular topic that you do.


> Perhaps a set of "known good" memory
> subsystems will be developed, possibly as part of Rocket, but that really is
> out of scope for this proposal.
>
>> i would also strongly suggest that there be *no* wiggle-room for
>> implementors to go about "rolling their own" method of meeting this
>> requirement. that would force implementors to approach the RISC-V
>> Foundation for a revision of the extension, thus in turn requiring a
>> full and public review of the implementation that they propose.
>>
>
>
> While I believe that only the insane would prefer a closed implementation
> over an open implementation, I see no reason to effectively force all
> implementations to be open.

i do understand and appreciate that. see below.

> A vendor that produces a defective product will
> acquire the reputation that they deserve.

unfortunately that would destroy the reputation of RISC-V - which
will have had to have put out a Conformance Statement endorsing the
product - in the process. so it's not just that the vendor would be
harmed, the RISC-V Foundation's reputation would be harmed as well.

in speaking with yunsup at FOSDEM2018 i remember distinctly how he
emphasised that the RISC-V effort is based on learning from the
mistakes of the past, not just from a technical perspective of
learning from 30 years of RISC processor core design but also from a
strategic one. ARM's dog's dinner mess resulted in Linus
Torvalds, at a linux kernel architecture meeting in Cambridge,
misunderstanding how the ARM eco-system works, and trying to tell 18
of the 36 people who turned up (all 18 *ARM* architecture
representatives) "piss off, sort yourselves out, and come back when
you only have one representative".

nobody dared explain to him that there were 18 representatives because
they were *all utterly different architectures*.

so the only way to solve the reputation problem would be for there to
be a backdoor / special "secret, closed and proprietary" specification
development process, where some extensions even to this extension are
entirely developed in secret, reviewed in secret, approved and
endorsed in secret and given a Conformance Certificate by the RISC-V
Foundation.

we know from long experience how those processes work out... but if
it's the way that it has to be done then it's the way that it has to
be done.

l.

Cesar Eduardo Barros

Mar 31, 2018, 8:07:05 AM3/31/18
to jcb6...@gmail.com, lkcl ., Guy Lemieux, RISC-V ISA Dev
Em 30-03-2018 22:58, Jacob Bachmeyer escreveu:
> lkcl . wrote:
>> i would also strongly suggest that there be *no* wiggle-room for
>> implementors to go about "rolling their own" method of meeting this
>> requirement.  that would force implementors to approach the RISC-V
>> Foundation for a revision of the extension, thus in turn requiring a
>> full and public review of the implementation that they propose.
>
> While I believe that only the insane would prefer a closed
> implementation over an open implementation, I see no reason to
> effectively force all implementations to be open.  A vendor that
> produces a defective product will acquire the reputation that they deserve.

By then, it's too late for the people who already bought and are using
the product. And it also means yet another workaround on the Linux
kernel, complicating the code for everyone else for decades until they
decide "nobody uses this anymore, let's rip this out".

Cesar Eduardo Barros

Mar 31, 2018, 8:33:39 AM3/31/18
to jcb6...@gmail.com, Bruce Hoult, lkcl, Guy Lemieux, RISC-V ISA Dev
In the case of two interpreted workers from the same origin sharing the
same SharedArrayBuffer or similar, there's no synchronization needed on
the interpreter side, since the worst that can happen is one worker
seeing the other worker's writes out of order. Also, adding
synchronization on the interpreter side for every read or write would
slow things down too much. The synchronization would be within the
interpreted code (for instance, one worker sends the other an "I'm
finished" message, and the message passing mechanism has a memory barrier).

Now add something which writes into the shared buffer; perhaps something
like a "read()" call which gets data from somewhere into the buffer,
perhaps even a copy from a part of the shared buffer into the other. The
copy will use memcpy() (or a copy loop, which the compiler optimizes
into memcpy() anyway). Finally, the memcpy function is modified to use
this MEM.REWRITE for greater performance (memcpy and memset tend to be
among the most optimized C library functions).

There's no need for synchronization for basic correctness in this
example, since malicious code which ignores synchronization would only
be hurting itself (the workers come from the same place and trust each
other). That changes completely if memcpy's (or memset's) use of
MEM.REWRITE allows for data from outside the interpreter's sandbox to
magically appear inside the sandbox.

>> That is: it's not enough that all code which uses MEM.REWRITE be
>> trusted to overwrite all the undefined memory. It's also necessary
>> that the memory not be shared with untrusted code while it's undefined.
>
> In this context, "shared" gets complicated itself -- two harts may have
> mappings to the same page but independent caches.  In that case, hart
> B's read would force hart A's cacheline to be written back, possibly
> after only being partially updated, potentially causing an information
> leak across that software boundary, but not across any hardware-enforced
> boundary.

The issue can also happen with a single hart; having two harts only
makes it much easier to exploit.

> Is there a viable means for software to work around this or must
> hardware address this potential issue?

Let's flip this around: is there a viable way for hardware to work
around this, or must software address this potential issue? Because
there's a very simple way for software to address this: never call
MEM.REWRITE anywhere.

(And hope that's enough, see
https://randomascii.wordpress.com/2018/01/07/finding-a-cpu-design-bug-in-the-xbox-360/
for a case where it wasn't.)

Cesar Eduardo Barros

Mar 31, 2018, 9:31:29 AM3/31/18
to jcb6...@gmail.com, lkcl, Guy Lemieux, RISC-V ISA Dev
Em 31-03-2018 00:25, Jacob Bachmeyer escreveu:
> Cesar Eduardo Barros wrote:
>> I see at least three issues with that wording:
>>
>> 1. It implies that the "foreign data is not introduced" rule applies
>> at the moment the undefined content is created.
>
> That was the initial intent, using the reasoning that if the undefined
> content can never be foreign data, there is no concern about reading
> foreign data.
>
>> Instead, it should apply when the undefined content is _read_. That
>> is, if MEM.REWRITE is used on a shared memory area, the data which
>> magically appears at the undefined region must not be "foreign data"
>> for the process which is _reading_ it, which is not necessarily the
>> process which did the MEM.REWRITE.
>
> [...]
>
> The best option I can see right now would be for hart A to hold off hart
> B's read until the cacheline at issue is rewritten, but that is a new
> problem if the program on hart A needs significant time to calculate the
> next value to write to the shared buffer.  Hardware does not know if any
> given page is visible to other processes, so the solution must also work
> in the single-thread case with good performance.  Another option is to
> require sequential writes to a rewritten region (permitting hardware to
> simply "fill in" any gaps that software attempts to leave) and then
> track the progress of rewriting.  This would require additional tag
> states "valid, exclusive, rewrite pending" that inhibits writeback of
> that cacheline on a hardware coherency event and "valid, exclusive,
> rewrite active" with an additional tracking mask that allows hardware to
> replace not-yet-written bytes with some constant if a cacheline must be
> written back to satisfy a request from another hart.  The "valid,
> exclusive, rewrite pending" state may be identical to "valid, exclusive,
> clean" and the additional tracking mask need not be per-cacheline, due
> to the (new) sequential rewrite requirement that ensures only one
> cacheline can be in the "rewrite active" state at a time.

That seems to me to be falling into the trap of adding too much extra
hardware for a single instruction. Now you have to track one extra state
and 64 mask bits per cacheline. Even if you share it (have a separate
"pending rewrites tracker" which records the cacheline index and the
mask bits for up to X cachelines), it's still extra hardware in a
performance-critical (and area-critical, since it takes time for signals
to propagate) subsystem.

If it requires too much extra hardware, there's a good chance that the
instruction will not be implemented in that way. Wasn't the whole point
of allowing unrelated data to appear in the cacheline to *reduce* the
amount of extra hardware? If it is getting that complicated, adding a
"cacheline zeroer" becomes much more attractive, with the bonus that
it's much easier to reason about (the overwritten bytes are either zero,
or what was there before).
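
To make the hardware cost concrete, here is a rough software model of the two options being compared: a line in the proposed "rewrite active" state carrying a per-byte written mask, and the simpler "cacheline zeroer". This is only a sketch under assumptions: the 64-byte line size, the class and method names, and the zero fill constant are all made up for illustration and do not describe any real implementation.

```python
LINE_BYTES = 64   # assumed cacheline size, hence 64 mask bits per tracked line

class RewriteActiveLine:
    """Line in the proposed 'rewrite active' state, with a tracking mask."""

    def __init__(self, stale):
        assert len(stale) == LINE_BYTES
        self.data = bytearray(stale)   # whatever the cache way held before
        self.written = 0               # bit i set => byte i has been rewritten

    def store(self, offset, value):
        self.data[offset] = value
        self.written |= 1 << offset

    def forced_writeback(self, fill=0):
        """Coherency-forced writeback: not-yet-written bytes become `fill`."""
        return bytes(
            self.data[i] if (self.written >> i) & 1 else fill
            for i in range(LINE_BYTES)
        )

def zeroed_line():
    """The 'cacheline zeroer' alternative: clear once, no mask to track."""
    return bytearray(LINE_BYTES)
```

In the model, the mask version needs the extra `written` state and a per-byte mux on every forced writeback, while the zeroer needs neither; both hide the stale bytes from another hart.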

>> 3. The "ghost from the past". In cryptography, there are situations in
>> which you *must* completely erase a value from existence (in
>> particular, forward secrecy requires that the key be ephemeral; also,
>> some algorithms break if even one bit of their intermediate values is
>> leaked). This definition uses "previously visible" which implies that
>> any value which ever existed in the memory of the process could come
>> back from the grave.
>
> That is broader than intended.  Generally, overwriting a value "buries
> it" and it will not rise again.  MEM.DISCARD can possibly raise
> phantoms, but if you are intent on destroying a value, you will issue
> WRITEBACK and FENCE after zeroing it to ensure that the
> overwrite-with-zero is forced to memory before another hart can read the
> old value.  Unlike the case of a supervisor zeroing a page for reuse,
> performance is not an issue here:  the value must be destroyed, no
> matter how long it takes.

There's one fundamental difference: MEM.DISCARD can only resurrect data
from the _same_ memory location. The way MEM.REWRITE was defined, it
could resurrect data from _anywhere in the same process_, even after a
cache flush!

Consider the following scenario:

- Two harts are running the same process;
- hart A writes a cryptographic key to an address mapped to its cache
index 1;
- hart B reads that key, so now it's also on its cache index 1;
- hart B zeroizes the key, so now the cacheline is exclusively on its
cache index 1;
- hart B flushes the write to memory;
- hart B reuses its cache index 1 for something else;
- hart A does MEM.REWRITE to an unrelated address, which happened to
also map to cache index 1.

Before the last step, the key had been destroyed; it was no longer on
hart B's cache (overwritten with zeros), no longer on memory
(overwritten with zeros when hart B flushed the write), and no longer on
hart A's cache (when hart B wrote the zeros, the copy on hart A's cache
was marked as invalid). But if the last step does nothing more than
marking the cache line as "valid, exclusive", the key suddenly
reappears, at an unrelated address!

If the last step were a MEM.DISCARD on hart A, however, even on the same
address, the key wouldn't reappear, since the value on memory is already
zero. Even if hart B had not flushed it yet, the correct value would
appear on any cache load.
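
The scenario above can be replayed in a toy model. This is a sketch under loud assumptions: a direct-mapped, one-word-per-line, write-back cache with write-invalidate coherence, where invalidation clears only the valid bit and leaves the data bits in place. All class, method and address names are hypothetical, and `mem_rewrite` models the naive "just mark it valid, exclusive" implementation, not what the proposal requires.

```python
NUM_SETS = 4       # assumed tiny direct-mapped cache
KEY = 0xC0FFEE     # stand-in for the cryptographic key

class Cache:
    def __init__(self):
        self.tag = [None] * NUM_SETS
        self.data = [0] * NUM_SETS        # data bits survive invalidation
        self.valid = [False] * NUM_SETS
        self.dirty = [False] * NUM_SETS

class System:
    def __init__(self, harts=2):
        self.mem = {}
        self.caches = [Cache() for _ in range(harts)]

    def _hit(self, c, addr):
        i = addr % NUM_SETS
        return c.valid[i] and c.tag[i] == addr

    def _writeback(self, c, i):
        if c.valid[i] and c.dirty[i]:
            self.mem[c.tag[i]] = c.data[i]
            c.dirty[i] = False

    def _claim(self, hart, addr, invalidate_others):
        """Snoop the other caches, then free this hart's slot for addr."""
        c, i = self.caches[hart], addr % NUM_SETS
        for other in self.caches:
            if other is not c and self._hit(other, addr):
                self._writeback(other, i)
                if invalidate_others:
                    other.valid[i] = False   # only the valid bit is cleared
        if not self._hit(c, addr):
            self._writeback(c, i)            # evict the previous occupant
            c.valid[i] = False
        return c, i

    def read(self, hart, addr):
        c, i = self._claim(hart, addr, invalidate_others=False)
        if not c.valid[i]:
            c.tag[i], c.data[i] = addr, self.mem.get(addr, 0)
            c.valid[i], c.dirty[i] = True, False
        return c.data[i]

    def write(self, hart, addr, value):
        c, i = self._claim(hart, addr, invalidate_others=True)
        c.tag[i], c.data[i] = addr, value
        c.valid[i], c.dirty[i] = True, True

    def flush(self, hart, addr):
        c = self.caches[hart]
        if self._hit(c, addr):
            self._writeback(c, addr % NUM_SETS)

    def mem_discard(self, hart, addr):
        c = self.caches[hart]
        if self._hit(c, addr):
            c.valid[addr % NUM_SETS] = False  # drop without writeback

    def mem_rewrite(self, hart, addr, zero=False):
        """Naive MEM.REWRITE: claim the line without loading it."""
        c, i = self._claim(hart, addr, invalidate_others=True)
        c.tag[i] = addr
        if zero:
            c.data[i] = 0                     # the 'cacheline zeroer' variant
        c.valid[i], c.dirty[i] = True, True

def scenario(last_step):
    A, B = 0, 1
    key_addr, other_addr, unrelated = 1, 5, 9   # all map to cache index 1
    s = System()
    s.write(A, key_addr, KEY)      # hart A writes the key
    s.read(B, key_addr)            # hart B reads it
    s.write(B, key_addr, 0)        # hart B zeroizes it
    s.flush(B, key_addr)           # hart B flushes the zeros to memory
    s.read(B, other_addr)          # hart B reuses its cache index 1
    if last_step == "discard":
        s.mem_discard(A, key_addr)
        return s.read(A, key_addr)
    s.mem_rewrite(A, unrelated, zero=(last_step == "rewrite_zero"))
    return s.read(A, unrelated)
```

In this model `scenario("rewrite")` reads the key back at the unrelated address, while `scenario("rewrite_zero")` and `scenario("discard")` both read zero, which matches the difference between MEM.REWRITE and MEM.DISCARD described above.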

As for performance: it can be an issue in some cases where most of the
work of a process is cryptography. However, in that case, the process
can make sure that the memory addresses used to store keys are not
reused for anything other than cryptographic keys, so there would be no
risk of them being reused for something which wants to do a MEM.DISCARD.

Michael Clark

Mar 31, 2018, 3:04:27 PM3/31/18
to Cesar Eduardo Barros, jcb6...@gmail.com, lkcl, Guy Lemieux, RISC-V ISA Dev

> On 29/03/2018, at 7:54 PM, Cesar Eduardo Barros <ces...@cesarb.eti.br> wrote:
>
> The ASID could be part of the tag instead, but I also don't expect that to be the case, since it would make the tag bigger for no real gain (a physical address is the same memory, no matter which ASID was used to access it).

For the L1, the ASID is relevant. A virtually indexed, physically tagged L1 needs to tag and compare ASID. After the L1, a lot of this information is lost; however, there are unused physical address bits on the memory bus, i.e. in a system with 44 bits physical there are 20 bits that can be used for tags.

Cache partitioning post Spectre is going to be very important. I can imagine a runtime configurable way selector that can permute and mask various pieces of privilege boundary information such as privilege level, ASID and page table protection key (IBM’s expired patent). i.e. a way selector that has shift and mask registers for the various privilege domain sources. Perhaps after the L1, these can be encoded in the unused upper bits of the physical address, assuming the bus interface between cache levels carries a full 64-bit address.

Security sensitive applications are sure to forego some cache efficiency for improved security properties. e.g. this mechanism could be used so that collocated virtual machines can’t cause cache evictions on other virtual machines. The oversubscription ratio is often no more than 2:1 so this requires 1 extra bit in the LLC.
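
The bit budget above can be sketched as follows. This is an illustration only: the 44-bit physical address and 64-bit bus width come from the text, but the function names, the single "domain" field, and how a domain value would be derived from privilege level, ASID or protection key are all assumptions.

```python
PA_BITS = 44                       # physical address width from the example
BUS_BITS = 64                      # assumed width of the inter-cache bus address
TAG_BITS = BUS_BITS - PA_BITS      # unused upper bits available for tags
PA_MASK = (1 << PA_BITS) - 1

def pack(pa, domain):
    """Place a hypothetical privilege-domain tag in the unused upper bits."""
    if pa >> PA_BITS:
        raise ValueError("physical address exceeds PA_BITS")
    if domain >> TAG_BITS:
        raise ValueError("domain tag exceeds TAG_BITS")
    return (domain << PA_BITS) | pa

def unpack(bus_addr):
    """Split a bus address back into (physical address, domain tag)."""
    return bus_addr & PA_MASK, bus_addr >> PA_BITS

def partition_bits(tenants):
    """Extra index bits needed to give each co-located tenant its own partition."""
    return (tenants - 1).bit_length()
```

With these numbers `TAG_BITS` is 20, and `partition_bits(2)` is 1, matching the one extra LLC bit quoted for a 2:1 oversubscription ratio.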

Blending elements of CPU (fixed purpose) and FPGA (application specific) might be something we will see in the future. I can imagine some strategically located pieces of runtime configurable logic will be extremely useful, especially for cache way selection, partitioning and solving cache eviction side channel related issues.

Allen J. Baum

Mar 31, 2018, 3:59:17 PM3/31/18
to Cesar Eduardo Barros, jcb6...@gmail.com, lkcl, Guy Lemieux, RISC-V ISA Dev
I think this slots into exactly my thinking.

This is far too much complexity for what is basically a corner case.
Either people will not implement it because it seems too complex, or they are (over)confident, try to implement it, and possibly even ship it before they discover it is screwed up.

I am in awe of teams that can ship products with coherence that works, much less high-performance systems (which are invariably complex). There are some large teams that can do that - but you would be hard pressed to find a small team that can do that competently. It's really hard - and if you think it isn't, you just don't know what you don't know.

Getting coherence right is hard. Don't make it harder.
>--
>You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
>To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
>To post to this group, send email to isa...@groups.riscv.org.
>Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.
>To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/54427d45-b3aa-d628-55bd-3b7b0e20738a%40cesarb.eti.br.


--
**************************************************
* Allen Baum tel. (908)BIT-BAUM *
* 248-2286 *
**************************************************

lk...@lkcl.net

Apr 3, 2018, 2:31:30 PM4/3/18
to RISC-V ISA Dev, ces...@cesarb.eti.br, jcb6...@gmail.com, luke.l...@gmail.com, glem...@vectorblox.com


On Saturday, March 31, 2018 at 8:59:17 PM UTC+1, Allen Baum wrote:

> Getting coherence right is hard. Don't make it harder.


so if we can unwind this down-the-rabbit-hole, what would be acceptable? what is simpler and clearer?

l.

Allen Baum

Apr 3, 2018, 3:20:06 PM4/3/18
to lk...@lkcl.net, RISC-V ISA Dev, Cesar Eduardo Barros, Jacob Bachmeyer, lkcl, Guy Lemieux
The ASID is only used to validate a TLB entry. It is never seen beyond it.


Jacob Bachmeyer

Apr 4, 2018, 6:40:23 PM4/4/18
to Cesar Eduardo Barros, lkcl ., Guy Lemieux, RISC-V ISA Dev
Depending on just how badly a vendor screws this up, the "workaround"
could very well be to refuse to boot on certain known-defective
hardware. If it is impossible to run safely, it is better to not run at
all.


-- Jacob
