Proposal: Explicit cache-control instructions (draft 5 after feedback)


Jacob Bachmeyer

Mar 7, 2018, 11:19:42 PM
to RISC-V ISA Dev
Previous discussions suggested that explicit cache-control instructions
could be useful, but RISC-V has some constraints here that other
architectures lack, namely that caching must be transparent to the user ISA.

I propose a new minor opcode REGION := 3'b001 within the existing
MISC-MEM major opcode. Instructions in REGION are R-type and use rs1 to
indicate a base address, rs2 to indicate an upper bound address, and
produce a result in rd that is the first address after the highest
address affected by the operation. If rd is x0, the instruction has no
directly visible effects and can be executed entirely asynchronously as
an implementation option.

Non-destructive operations permit an implementation to expand a
requested region on both ends to meet hardware granularity for the
operation. An application can infer alignment from the produced value
if it is a concern. As a practical matter, cacheline lengths are
expected to be declared in the processor configuration ROM.
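
As a rough illustration of this calling convention (a sketch only,
assuming an assembler that accepts the proposed mnemonics as ordinary
R-type rd, rs1, rs2 operands and using CACHE.WRITEBACK, defined below,
as the example; the buffer symbol and size are illustrative):

    la    a0, buf                  # rs1: base address of the region
    addi  a1, a0, 4096             # rs2: upper bound of the region
    cache.writeback a2, a0, a1     # synchronous: a2 receives the first
                                   #   address past the (possibly
                                   #   granularity-expanded) region
    cache.writeback x0, a0, a1     # rd = x0: may run entirely
                                   #   asynchronously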

Destructive operations are a thornier issue, and are resolved by
requiring any partial cachelines (at most 2 -- first and last) to be
flushed or prefetched instead of performing the requested operation on
those cachelines. Implementations may perform the destructive operation
on the parts of these cachelines included in the region, or may simply
flush or prefetch the entire cacheline.

If the same register specifies both the upper and lower bounds, a
non-destructive operation affects the smallest region the hardware can
affect that includes the lower bound, while a destructive operation is
a no-op.  Otherwise, the upper bound must be greater than the lower
bound; the contrary case is reserved.  (Issue for discussion: what
happens if the reserved case is executed?)

In general, this proposal uses "cacheline" to describe the hardware
granularity for an operation that affects multiple words of memory or
address space. Where these operations are implemented using traditional
caches, the use of the term "cacheline" is entirely accurate, but this
proposal does not prevent implementations from using other means to
implement these instructions.

Instructions in MISC-MEM/REGION may be implemented as no-ops if an
implementation lacks the corresponding hardware. The value produced in
this case is the base address.

The new MISC-MEM/REGION space will have room for 128 opcodes, one of
which is the existing FENCE.I. I propose:

[for draft 3, the function code assignments have changed to better group
prefetch operations]
[for draft 4, most of the mnemonics have been shortened and now indicate
that these instructions affect the memory subsystem]

===Fences===

[function 7'b0000000 is the existing FENCE.I instruction]

[function 7'b0000001 reserved]

FENCE.RD ("range data fence")
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000010}
Perform a conservative fence affecting only data accesses to the
chosen region. This instruction always has visible effects on memory
consistency and is therefore synchronous in all cases.

FENCE.RI ("range instruction fence")
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000011}
Perform equivalent of FENCE.I but affecting only instruction fetches
from the chosen region. This instruction always has visible effects on
memory consistency and is therefore synchronous in all cases.

===Non-destructive cache control===

====Prefetch====

Prefetch instructions never trap on page faults or other access faults.
In general use, applications should use rd == x0 for prefetching,
although this is not required.  If a fault occurs during a synchronous
prefetch (rd != x0), the operation must terminate and produce the
faulting address.  A fault occurring during an asynchronous prefetch
(rd == x0) may cause prefetching to stop, or the implementation may
attempt to continue prefetching past the faulting location.

MEM.PF0 - MEM.PF3 ("prefetch, levels 0 - 3")
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001000}
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001001}
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001010}
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001011}
Load as much of the chosen region as possible into the data cache,
with varying levels of expected temporal access locality. The number in
the opcode is proportional to the expected frequency of access to the
prefetched data: MEM.PF3 is for data that will be very heavily used.

MEM.PF.EXCL ("prefetch exclusive")
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001100}
Load as much of the chosen region as possible into the data cache,
with the expectation that future stores will soon occur to this region.
In a cache-coherent system, any locks required for writing the affected
cachelines should be acquired.

[function 7'b0001101 reserved for a future prefetch operation]

MEM.PF.ONCE ("prefetch once")
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001110}
Prefetch as much of the region as possible, but expect the prefetched
data to be used at most once. This operation may activate a prefetch
unit and prefetch the region incrementally if rd is x0. Software is
expected to access the region sequentially, starting at the base address.

MEM.PF.TEXT ("prefetch program text")
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001111}
Load as much of the chosen region as possible into the instruction cache.

====Cacheline pinning====

??? Issue for discussion: should a page fault while pinning cachelines
cause a trap to be taken?
??? Issue for discussion: what if another processor attempts to write
to an address in a cacheline pinned on this processor?

CACHE.PIN
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0010000}
Arrange for as much of the chosen region as possible to be accessible
with minimal delay and no traffic to main memory. Pinning a region is
idempotent and an implementation may pin a larger region than requested,
provided that an unpin operation with the same base and bound will also
unpin the larger region.
One possible implementation is to load as much of the chosen region as
possible into the data cache and keep it there until unpinned. Another
implementation is to configure a scratchpad RAM and map it over at least
the chosen region, preloading it with data from main memory.
Scratchpads may be processor-local, but writes to a scratchpad mapped
with CACHE.PIN must be visible to other nodes in a coherent system.
Implementations are expected to ensure that pinned cachelines will not
impede the efficacy of a cache. Implementations with fully-associative
caches may permit any number of pins, provided that at least one
cacheline remains available for normal use. Implementations with N-way
set associative caches may support pinning up to (N-1) ways within each
set, provided that at least one way in each set remains available for
normal use. Implementations with direct-mapped caches should not pin
cachelines, but may still use CACHE.PIN to configure an overlay
scratchpad, which may itself use storage shared with caches, such that
mapping the scratchpad decreases the size of the cache.

Implementations may support both cacheline pinning and scratchpads,
choosing which to use to perform a CACHE.PIN operation in an
implementation-defined manner.

CACHE.UNPIN
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0010001}
Explicitly release a pin set with CACHE.PIN. Pinned regions are also
implicitly released if the memory protection and virtual address mapping
is changed. (Specifically, a write to the current satp CSR or an
SFENCE.VM will unpin all cachelines as a side effect, unless the
implementation partitions its cache by ASID. Even with ASID-partitioned
caches, changing the root page number associated with an ASID unpins all
cachelines belonging to that ASID.) Unpinning a region does not
immediately remove it from the cache. Unpinning a region always
succeeds, even if parts of the region were not pinned. For an
implementation that implements CACHE.PIN using scratchpad RAM, unpinning
a region that uses a scratchpad causes the current contents of the
scratchpad to be written to main memory.
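
A hedged sketch of the intended pin/unpin pairing (proposed mnemonics;
the table symbol and size are illustrative only):

    la    a0, lut                # base of a latency-critical lookup table
    addi  a1, a0, 2048           # upper bound
    cache.pin   a2, a0, a1       # a2 = first address past what was pinned
    # ... latency-sensitive accesses to lut ...
    cache.unpin a2, a0, a1       # the same base and bound releases the
                                 #   pin, even if a larger region was pinned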

And two M-mode-only privileged instructions:

CACHE.PIN.I
{opcode, funct3, funct7, MODE} = {$MISC-MEM, $REGION, 7'b1010000, 2'b11}
Arrange for code to execute from as much of the chosen region as
possible without traffic to main memory. Pinning a region is idempotent.

CACHE.UNPIN.I
{opcode, funct3, funct7, MODE} = {$MISC-MEM, $REGION, 7'b1010001, 2'b11}
Release resources pinned with CACHE.PIN.I. Pins are idempotent. One
unpin instruction will unpin the chosen region completely, regardless of
how many times it was pinned. Unpinning always succeeds.

====Flush====

CACHE.WRITEBACK
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000100}
Write back any cachelines in the requested region.

CACHE.FLUSH
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000101}
Write back any cachelines in the requested region (as by
CACHE.WRITEBACK), marking them invalid afterwards (as by MEM.DISCARD).
Flushed cachelines are automatically unpinned.

Rationale for including CACHE.FLUSH: small implementations may
significantly benefit from combining CACHE.WRITEBACK and MEM.DISCARD;
the implementations that most benefit lack the infrastructure to achieve
such combination by macro-op fusion.


===Declaring data obsolescence===

These operations declare data to be obsolete and unimportant. In
fully-coherent systems, they are two sides of the same coin:
MEM.DISCARD declares data not yet written to main memory to be obsolete,
while MEM.REWRITE declares data in main memory to be obsolete and
indicates that software on this hart will soon overwrite the region.
These operations are useful in general: a function prologue could use
MEM.REWRITE to allocate a stack frame, while a function epilogue could
use MEM.DISCARD to release a stack frame without requiring that the
now-obsolete local variables ever be written back to main memory.  In
non-coherent systems, MEM.DISCARD may also be an important tool for
software-enforced coherency, since its semantics provide an invalidate
operation on all caches on the path between a hart and main memory.
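
As a hedged sketch of that stack-frame idea, for a leaf function on
RV64 (proposed mnemonics; the frame size is illustrative, and rd = x0
lets both operations run asynchronously):

    leaf_func:
        addi  sp, sp, -256
        addi  t0, sp, 256
        mem.rewrite x0, sp, t0   # prologue: frame will be written before read
        # ... body writes, then uses, locals in [sp, sp+256) ...
        mem.discard x0, sp, t0   # epilogue: locals need never reach main memory
        addi  sp, sp, 256
        ret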

The declarations of obsolescence produced by these instructions are
global and affect all caches on the path between a hart and main memory
and all caches coherent with those caches, but are not required to
affect non-coherent caches not on the direct path between a hart and
main memory. Implementations depending on software to maintain
coherency in such situations must provide other means (MMIO control
registers, for example) to force invalidations in remote non-coherent
caches.

These instructions create regions with undefined contents and share a
requirement that foreign data never be introduced. Foreign data is,
simply, data that was not previously visible to the current hart at the
current privilege level at any address. Operating systems zero pages
before attaching them to user address spaces to prevent foreign data
from appearing in freshly-allocated pages. Implementations must ensure
that these instructions do not cause foreign data to leak through caches
or other structures.

MEM.DISCARD
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000110}
Declare the contents of the region obsolete, dropping any copies
present in the processor without performing writes to main memory. The
contents of the region are undefined after the operation completes, but
shall not include foreign data.
If the region does not align with cacheline boundaries, any partial
cachelines are written back. If hardware requires such, the full
contents of a cacheline partially included may be written back,
including data just declared obsolete. In a non-coherent system,
partial cachelines written back are also invalidated. In a system with
hardware cache coherency, partial cachelines must be written back, but
may remain valid.
Any cachelines fully affected by MEM.DISCARD are automatically unpinned.
NOTE WELL: MEM.DISCARD is *not* an "undo" operation for memory writes
-- an implementation is permitted to aggressively writeback dirty
cachelines, or even to omit caches entirely. *ANY* combination of "old"
and "new" data may appear in the region after executing MEM.DISCARD.

MEM.REWRITE
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000111}
Declare the contents of the region obsolete, indicating that the
current hart will soon overwrite the entire region. Reading any datum
from the region before the current hart has written that datum (or other
data fully overlapping that datum) is incorrect behavior and produces
undefined results, but shall not return foreign data. Note that
undefined results may include destruction of nearby data. For optimal
performance, software should write the entire region before reading any
part of the region and should do so sequentially, starting at the base
address and moving towards the address produced by MEM.REWRITE.
TLB fills occur normally as for writes to the region and must appear
to occur sequentially, starting at the base address. A page fault in
the middle of the region causes the operation to stop and produce the
faulting address. A page fault at the base address causes a page fault
trap to be taken.
Implementations with coherent caches should arrange for all cachelines
in the region to be in a state that permits the current hart to
immediately overwrite the region with no further delay. In common
cache-coherency protocols, this is an "exclusive" state.
An implementation may limit the size of the region that can have a
rewrite pending.  If software declares intent to overwrite a larger
region than the implementation can prepare at once, the operation must
complete partially and produce the first address beyond the subregion
actually prepared for overwrite.  Software is expected to overwrite
that subregion, then repeat MEM.REWRITE on the remainder until the
entire larger region has been overwritten.
If the region does not align with cacheline boundaries, any partial
cachelines are prefetched as by MEM.PF.EXCL. If hardware requires such,
the full contents of a cacheline partially included may be loaded from
memory, including data just declared obsolete.
NOTE WELL: MEM.REWRITE is *not* memset(3) -- any portion of the
region prepared for overwrite already present in cache will *retain* its
previously-visible contents.
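
A hedged sketch of the partial-completion loop described above,
zero-filling a buffer on RV64 (proposed mnemonic; assumes the buffer
length is a multiple of 8 bytes, with symbols illustrative only):

        la    a0, buf            # running base of the region to zero
        la    a1, buf_end        # upper bound of the whole region
    fill:
        mem.rewrite a2, a0, a1   # prepare as much as possible; a2 = first
                                 #   address past the prepared subregion
        bgeu  a1, a2, zero       # clamp in case the produced address was
        mv    a2, a1             #   expanded past the requested bound
    zero:
        sd    x0, 0(a0)          # overwrite the prepared subregion
        addi  a0, a0, 8
        bltu  a0, a2, zero
        bltu  a0, a1, fill       # more remains: issue MEM.REWRITE again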

MEM.REWRITE appears to be a relatively novel operation and previous
iterations of this proposal have produced considerable confusion. While
the above semantics are the required behavior, there are different ways
to implement them. One simple option is to temporarily mark the region
as "write-through" in internal configuration. Another option is to
allocate cachelines, but either retain their previous contents (provided
that the implementation can *prove* that those contents are *not*
foreign data) or load the allocated cachelines with zero instead of
fetching contents from memory. A third option is to track whether
cachelines have been overwritten and use a monitor trap to zero
cachelines that software attempts to invalidly read. A fourth option is
to provide dedicated write-combining buffers for MEM.REWRITE.
In systems that implement MEM.REWRITE using cache operations,
MEM.REWRITE allocates cachelines, marking them "valid, exclusive, dirty"
and filling them without reading from main memory while abiding by the
requirements to avoid introducing foreign data. Other cachelines may be
evicted to make room if needed but implementations should avoid evicting
data recently fetched with MEM.PF.ONCE, as software may intend to copy
that data into the region. Implementations are recommended to permit at
most half of a data cache to be allocated for MEM.REWRITE if data has
been recently prefetched into the cache to aid in optimizing memcpy(3),
but may permit the full data cache to be used to aid in optimizing
memset(3). In particular, an active asynchronous MEM.PF.ONCE ("active"
meaning that the data prefetched has not yet been read) can be taken as
a hint that MEM.REWRITE is preparing to copy data and should use at most
half or so of the data cache.
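
A hedged sketch of the copy pattern alluded to above (proposed
mnemonics, RV64; assumes the length is a nonzero multiple of 8 bytes
and issues both hints asynchronously with rd = x0):

        # a0 = dst, a1 = src, a2 = len
        add   a3, a1, a2         # source upper bound
        mem.pf.once x0, a1, a3   # source will be read once, sequentially
        add   a4, a0, a2         # destination upper bound
        mem.rewrite x0, a0, a4   # destination will be entirely overwritten
    copy:
        ld    t0, 0(a1)
        sd    t0, 0(a0)
        addi  a1, a1, 8
        addi  a0, a0, 8
        bltu  a0, a4, copy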



=== ===

Thoughts?

Thanks to:
[draft 1]
Bruce Hoult for citing a problem with the HiFive board that inspired
the I-cache pins.
[draft 2]
Stefan O'Rear for suggesting the produced value should point to the
first address after the affected region.
Alex Elsayed for pointing out serious problems with expanding the
region for a destructive operation and suggesting that "backwards"
bounds be left reserved.
Guy Lemieux for pointing out that pinning was insufficiently specified.
Andrew Waterman for suggesting that MISC-MEM/REGION could be encoded
around the existing FENCE.I instruction.
Allen Baum for pointing out the incomplete handling of page faults.
[draft 3]
Guy Lemieux for raising issues that inspired renaming PREZERO to RELAX.
Chuanhua Chang for suggesting that explicit invalidation should unpin
cachelines.
Guy Lemieux for being persistent asking for CACHE.FLUSH and giving
enough evidence to support that position.
Guy Lemieux and Andrew Waterman for discussion that led to rewriting a
more general description for pinning.
[draft 4]
Guy Lemieux for suggesting that CACHE.WRITE be renamed CACHE.WRITEBACK.
Allen J. Baum and Guy Lemieux for suggestions that led to rewriting the
destructive operations.
[draft 5]
Allen Baum for offering a clarification for the case of using the same
register for both bounds to select a minimal-length region.


-- Jacob

Aaron Severance

Mar 8, 2018, 6:07:09 PM
to jcb6...@gmail.com, RISC-V ISA Dev
Thanks Jacob.

As a general point of confusion, the memory model spec states that future extensions may include cache management instructions but that they should be treated as hints, not functional requirements.  For correctness it specifies that a (possibly range-limited) fence must be used; the example it gives is "fence rw[addr],w[addr]" for writeback.

I take this to mean that, with non-coherent caches/DMA, a fence with W in the predecessor set requires cached data to be written out to memory, and a fence with R in the successor set requires the cache to be flushed.  I'm not sure how useful the synchronous versions of WRITEBACK/FLUSH are then.

Because a fence is needed anyway to ensure correctness, the WRITEBACK and FLUSH instructions then seem mostly useful in their asynchronous form, to initiate a writeback/flush early.  As an example, when working with a buffer shared with another non-coherent master, after you finish with the buffer: 1) issue an asynchronous CACHE.FLUSH instruction on its addresses, 2) do some other work, 3) when the buffer is needed again by another hart or DMA device, execute a fence.
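
A hedged sketch of that flow, assuming the proposed mnemonics and that a range data fence (FENCE.RD) is what ends up being required (do_other_work is a placeholder):

    cache.flush x0, a0, a1     # 1) start writeback+invalidate of [a0, a1) early
    call  do_other_work        # 2) overlap unrelated work with the flush
    fence.rd x0, a0, a1        # 3) range fence before handing the buffer to
                               #    the other master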

Anyway, the points that I think should be clarified are:
  1) Whether a fence is required for correctness when using the CACHE.WRITEBACK/CACHE.FLUSH operations.
  2) Whether CACHE.WRITEBACK, CACHE.FLUSH, MEM.DISCARD, and MEM.REWRITE can be implemented as no-ops even on hardware with non-coherent caches.  I assume the cache pinning and prefetching ops can.  WRITEBACK/FLUSH depend on whether fences are required for correctness.  DISCARD/REWRITE are more problematic, but I would think they can be as long as fences are required for correctness.


More notes inline.
  Aaron

On Wed, Mar 7, 2018 at 8:19 PM Jacob Bachmeyer <jcb6...@gmail.com> wrote:
Previous discussions suggested that explicit cache-control instructions
could be useful, but RISC-V has some constraints here that other
architectures lack, namely that caching must be transparent to the user ISA.

I propose a new minor opcode REGION := 3'b001 within the existing
MISC-MEM major opcode.  Instructions in REGION are R-type and use rs1 to
indicate a base address, rs2 to indicate an upper bound address, and
produce a result in rd that is the first address after the highest
address affected by the operation.  If rd is x0, the instruction has no
directly visible effects and can be executed entirely asynchronously as
an implementation option.
Is rs2's upper bound inclusive?

Regarding having rd write back the first address after the highest address affected by the operation:
  This wording is a bit confusing; even if there is no data in the cache in the specified range those addresses are 'affected'.  Not sure what better wording would be though...
  Is this always rs2 (or rs2+1) or can it be arbitrarily higher?
  I believe this is meant to allow partial completion, where the operation is restarted from the address returned by rd.

Assuming partial completion is allowed:
    Is forward progress required? i.e. must rd be >= rs1 + 1?
    0 must be a valid return value (if the region goes to the highest addressable value).
    I would suggest that rd must be >= the first address after the highest address affected by the operation.
      Then an implementation that always fully completes could then return 0.
      No-op implementations could also always return 0.
    Does this apply to FENCE.RD/FENCE.RI? It seems problematic to have FENCE.RD/FENCE.RI partially complete and return, and FENCE/FENCE.I must fully complete anyway.
Again I would suggest 0 here, assuming an implementation can return an arbitrarily high value (up to the end of memory wrapping back around to 0).  This would signal that the operation completed, unless there's some reason to signal to the software that the operation was skipped due to lack of hardware (I don't think there is).
 


The new MISC-MEM/REGION space will have room for 128 opcodes, one of
which is the existing FENCE.I.  I propose:

[for draft 3, the function code assignments have changed to better group
prefetch operations]
[for draft 4, most of the mnemonics have been shortened and now indicate
that these instructions affect the memory subsystem]

===Fences===

[function 7'b0000000 is the existing FENCE.I instruction]

[function 7'b0000001 reserved]

FENCE.RD ("range data fence")
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000010}
  Perform a conservative fence affecting only data accesses to the
chosen region.  This instruction always has visible effects on memory
consistency and is therefore synchronous in all cases.

Does "only data accesses" mean it has the same effects as a "fence rw, rw"?
CACHE.FLUSH is also non-destructive as a single op; CACHE.WRITEBACK+MEM.DISCARD would need to be executed as an atomic pair.  Not that that's a huge burden.
I think the wording should be changed to something like 'data in the region need not be written to main memory'.  Flushing the cache should be a valid implementation; discarding is just a performance optimization.
 

Albert Cahalan

Mar 9, 2018, 3:57:02 AM
to jcb6...@gmail.com, RISC-V ISA Dev
On 3/7/18, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

> ====Cacheline pinning====
>
> ??? Issue for discussion: should a page fault while pinning cachelines
> cause a trap to be taken?

This should not be an issue. Besides the fact that the MMU is
most likely disabled, the actual filling in of the data shouldn't
have to happen until an access occurs.

There are two use cases I have seen:

1. before DRAM has been set up
2. as scratch space for highly optimized FFT libraries

When setting up DRAM, the MMU won't yet have been set up.
Typically one might ask the cache to retain all writes to firmware,
making locations in flash or ROM seem writable. Stuff breaks if you
write more data than the size of the cache, and this is OK because
the code will not do that. After DRAM is running, all that data needs
to disappear. It obviously can't be written to ROM, and it isn't needed.

When running highly optimized FFT libraries, hardware-specific
knowledge is built into the libraries. Systems that are optimized
to this level tend to run without security distinctions, so mapping
the cache rwx at the same address in every task is likely acceptable.

> ??? Issue for discussion: what if another processor attempts to write
> to an address in a cacheline pinned on this processor?

If they share caches, they see the same data. If they don't share
caches, they see different data. This data is never flushed to RAM.

> CACHE.PIN
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0010000}
> Arrange for as much of the chosen region as possible to be accessible
> with minimal delay and no traffic to main memory. Pinning a region is
> idempotent and an implementation may pin a larger region than requested,
> provided that an unpin operation with the same base and bound will also
> unpin the larger region.
> One possible implementation is to load as much of the chosen region as
> possible into the data cache and keep it there until unpinned. Another
> implementation is to configure a scratchpad RAM and map it over at least
> the chosen region, preloading it with data from main memory.
> Scratchpads may be processor-local, but writes to a scratchpad mapped
> with CACHE.PIN must be visible to other nodes in a coherent system.

No, pinned cache should not be coherent.

> Implementations are expected to ensure that pinned cachelines will not
> impede the efficacy of a cache. Implementations with fully-associative
> caches may permit any number of pins, provided that at least one
> cacheline remains available for normal use. Implementations with N-way
> set associative caches may support pinning up to (N-1) ways within each
> set, provided that at least one way in each set remains available for
> normal use. Implementations with direct-mapped caches should not pin
> cachelines, but may still use CACHE.PIN to configure an overlay
> scratchpad, which may itself use storage shared with caches, such that
> mapping the scratchpad decreases the size of the cache.

These restrictions are not required. If the user ends up without
any normal cache, oh well... this is what they chose to do.

> CACHE.UNPIN
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0010001}
> Explicitly release a pin set with CACHE.PIN. Pinned regions are also
> implicitly released if the memory protection and virtual address mapping
> is changed. (Specifically, a write to the current satp CSR or an
> SFENCE.VM will unpin all cachelines as a side effect, unless the
> implementation partitions its cache by ASID. Even with ASID-partitioned
> caches, changing the root page number associated with an ASID unpins all
> cachelines belonging to that ASID.) Unpinning a region does not
> immediately remove it from the cache. Unpinning a region always
> succeeds, even if parts of the region were not pinned. For an

I don't believe the MMU should be a concern. It should be ignored,
assuming it is even enabled at all.

> implementation that implements CACHE.PIN using scratchpad RAM, unpinning
> a region that uses a scratchpad causes the current contents of the
> scratchpad to be written to main memory.

Under no condition should the contents ever be written to main memory.
In the case of early hardware initialization, the writeback would be going
to flash memory or even a ROM. This could cause unintended changes.
In the case of an optimized FFT library, the writeback would likely have
no place to go at all -- there is nothing to back it at all.

> ===Declaring data obsolescence===
>
> These operations declare data to be obsolete and unimportant.

I hope that a hypervisor can secretly disable these.

> These operations are useful in general: a function prologue could use
> MEM.REWRITE to allocate a stack frame, while a function epilogue could
> use MEM.DISCARD to release a stack frame without requiring the
> now-obsolete local variables ever be written back to main memory.

There is something to be said for having this happen automatically.

John Hauser

Mar 9, 2018, 11:32:45 PM
to RISC-V ISA Dev
Thanks, Jacob, for shepherding the discussion on this topic and
generating this draft.

I'd like to propose some renaming and reorganizing, although mostly
keeping the same spirit, I hope.

In this message, I'll leave the FENCE and CACHE.PIN/UNPIN instructions
untouched.  For the others, I first suggest a different set of names, as
follows:

    MEM.PFn     -> MEM.PREP.R     (reads)
    MEM.PF.EXCL -> MEM.PREP.RW    (reads/writes)
    MEM.PF.ONCE -> MEM.PREP.INCR  (reads at increasing addresses)
    MEM.PF.TEXT -> MEM.PREP.I     (instruction execution)
    MEM.REWRITE -> MEM.PREP.W (writes) or
                    MEM.PREP.INCW (writes at increasing addresses)

    CACHE.WRITEBACK -> CACHE.CLEAN
    CACHE.FLUSH     -> CACHE.FLUSH  (name unchanged)
    MEM.DISCARD     -> CACHE.INV

The "prep" instructions (MEM.PREP.*) are essentially hints for what
the software expects to do, giving the hardware the option to prepare
accordingly---"prep" being short for _prepare_.  Note that, when rd
isn't x0, these instructions aren't true RISC-V hints, because the
hardware at a minimum must still write rd, though it need not do
anything else.  I'm currently proposing collapsing the four MEM.PFn
instructions down to one, although I'm open to further discussion about
that.  I'm also proposing a major overhaul of MEM.REWRITE, reconceiving
it as two different MEM.PREP instructions that can optionally be used in
conjunction with CACHE.INV.

The explicit cache control instructions (CACHE.*) are fairly standard.
These cannot be trivialized in the same way as the first group, unless
the entire memory system is known to be coherent (including any device
DMA).

I've broken up Jacob's "data obsolescence" section that has MEM.REWRITE
and MEM.DISCARD together, with the consequence that some of the text
there would no longer be correct.

--------------------
Memory access hints

I'm proposing to collapse Jacob's four levels of MEM.PFn prefetch
instructions into a single MEM.PREP.R, because I don't see how software
will know which of the four levels to use in any given circumstance.  If
a consensus could be developed for heuristics to guide this decision for
programmers, tools, and hardware, then I could perhaps see the value of
having multiple levels.  In the absence of such guidance, I see at best
a chicken-and-egg problem between software and hardware, where nobody
agrees or understands exactly what the different levels should imply.
In my view, the Foundation shouldn't attempt to standardize multiple
prefetch levels unless it's prepared to better answer this question.

It's still always possible for individual implementations to have their
own custom instructions for different levels of prefetch, if they see a
value in having their own custom answer to the question.

I propose some minor tweaks to how the affected region is specified.  My
version is as follows:  Assume A and B are the unsigned integer values
of operands rs1 and rs2.  If A < B, then the region covers addresses A
through B - 1, inclusive.  If B <= A, the region is empty.  (However,
note that, because these MEM.PREP instructions act only as hints, an
implementation can adjust the bounds of the affected region to suit its
purposes without officially breaking any rules.)

Except for MEM.PREP.INCR and MEM.PREP.INCW, I'm proposing to remove
the option to have anything other than x0 for rd.  It's not clear to me
that we can realistically expect software to make use of the result that
Jacob defines, and removing it is a simplification.  If everyone feels
I'm mistaken and the loss of this feature is unacceptable, it could be
restored.

If implemented at all, the MEM.PREP instructions do not trap for any
reason, so they cannot cause page faults or access faults.  (This is no
different than Jacob already had.)

Here I've attempted to summarize the intention proposed for each
instruction:

  MEM.PREP.I

      Software expects to execute many instructions in the region.

      Hardware might respond by attempting to prefetch the code into the
      instruction cache.

  MEM.PREP.R

      Software expects to read many bytes in the region, but not to
      write many bytes in the region.

      Hardware might respond by attempting to acquire a shared copy of
      the data (prefetch).

  MEM.PREP.RW

      Software expects both to read many bytes and to write many bytes
      in the region.

      Hardware might respond by attempting to acquire a copy of the
      data along with (temporary) exclusive rights to write the data.
      For some implementations, the effect of MEM.PREP.RW might be no
      different than MEM.PREP.R.

  MEM.PREP.INCR

      Software expects to read many bytes in the region, starting first
      with lower addresses and progressing over time into increasingly
      higher addresses, though not necessarily in strictly sequential
      order.  If software writes to the region, it expects usually to
      read the same or nearby bytes before writing.

      Hardware might respond by applying a mechanism for sequential
      prefetch-ahead.  For some implementations, this mechanism might
      be ineffective unless the region is accessed solely by reads at
      sequential, contiguous locations.

  MEM.PREP.W

      Software expects to write many bytes in the region.  If software
      reads from the region, it expects usually to first write those
      same bytes before reading.

      Hardware might respond by adjusting whether a write to a
      previously missing cache line within the region will cause the
      remainder of the line to be brought in from main memory.

  MEM.PREP.INCW

      Software expects to write many bytes in the region, starting first
      with lower addresses and progressing over time into increasingly
      higher addresses, though not necessarily in strictly sequential
      order.  If software reads from the region, it expects usually to
      first write those same bytes before reading.

      Hardware might respond by applying a mechanism for sequential
      write-behind.  For some implementations, this mechanism might
      be ineffective unless the region is accessed solely by writes at
      sequential, contiguous locations.


For the non-INC instructions (MEM.PREP.I, MEM.PREP.R, MEM.PREP.RW, and
MEM.PREP.W), if the size of the region specified is comparable to or
larger than the entire cache size at some level of the memory hierarchy,
the implementation would probably do best to ignore the prep instruction
for that cache, though it might still profit from applying the hint to
larger caches at lower levels of the memory hierarchy.

In my proposal, MEM.PREP.INCR and MEM.PREP.INCW are unique in allowing
destination rd to be something other than x0.  If a MEM.PREP.INCR/INCW
has a non-x0 destination, the implementation writes register rd with an
address X subject to these rules, where A and B are the values of rs1
and rs2 defined earlier:  If B <= A (region is empty), then X = B.
Else, if A < B (region is not empty), the value X must be such that
A < X <= B.  If the value X written to rd is less than B (which can
happen only if the region wasn't empty), then software is encouraged to
execute MEM.PREP.INCR/INCW again after it believes it is done accessing
the subregion between addresses A and X - 1 inclusive.  For this new
MEM.PREP.INCR/INCW, the value of rs1 should be X and the value of rs2
should be B as before.  The process may then repeat with the hardware
supplying a new X.

Software is not required to participate in this iterative sequence of
MEM.PREP.INCR/INCW instructions, as it can always simply give rd as x0.
However, read-ahead or write-behind might not be as effective without
this iteration.
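
A hedged sketch of that iteration for a streaming read on RV64
(proposed mnemonics in the assumed rd, rs1, rs2 form; the 8-byte stride
is illustrative, and both bounds are assumed 8-byte aligned):

        # a0 = A (base), a1 = B (bound)
        bgeu  a0, a1, done       # empty region: nothing to do
    prep:
        mem.prep.incr a2, a0, a1 # hint read-ahead starting at a0; a2 = X
    read:
        ld    t0, 0(a0)          # consume the subregion up to X
        addi  a0, a0, 8
        bltu  a0, a2, read
        bltu  a0, a1, prep       # X < B: re-issue with rs1 = X, rs2 = B
    done: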

The minimal hardware implementation of the non-INC instructions
(MEM.PREP.I, MEM.PREP.R, MEM.PREP.RW, and MEM.PREP.W) would be simply to
ignore the instructions as no-ops.  For MEM.PREP.INCR and MEM.PREP.INCW,
the minimum is to copy rs2 (B) to rd (X) and do nothing else.  Since
the non-INC instructions require rd to be x0, these minimal behaviors
can be combined, so that the only thing done for all valid MEM.PREP.*
instructions is to copy rs2 to rd (which may be x0).

--------------------
Explicit cache control

The three cache control instructions I propose are:

    CACHE.CLEAN  (was CACHE.WRITEBACK)
    CACHE.FLUSH
    CACHE.INV    (was MEM.DISCARD)

I expect these will be familiar to anyone who has used similar
instructions on other processors, except possibly for the name "clean"
instead of "writeback" or "wb".  I'm not proposing changing the
fundamental semantics of the instructions, although I think the
description of CACHE.INV can be simplified a bit.

It's important to remember when talking about caches that writebacks
of dirty data from the cache are allowed to occur automatically at
any time, for any reason, or even for no apparent reason whatsoever.
Likewise, a cache line that isn't dirty can be automatically invalidated
at any time, with or without any apparent reason.  Therefore, our
description of CACHE.INV doesn't need to give the implementation
explicit license to flush whole cache lines to handle partial lines at
the start and end of the specified region, as it would already have the
authority to perform such flushes at will.

Here's how I might rewrite the description for CACHE.INV:

    CACHE.INV  [was MEM.DISCARD]

    Declare the contents of the region obsolete, dropping any copies
    present in the processor, without necessarily writing dirty data to
    main memory first.  The contents of the region are unspecified after
    the operation completes, but shall not include foreign data.

    Any pinned cache lines that are entirely within the region are
    automatically unpinned.

    Comment:
    Be aware that, because writebacks of dirty data in the cache can
    occur at any time, software has no guarantees that CACHE.INV will
    cause previous memory writes to be discarded.  Any combination
    of "old" and "new" data may appear in the region after executing
    CACHE.INV.

    Comment:
    If the region does not align with cache line boundaries and the
    cache is incapable of invalidating or flushing only a partial cache
    line, the implementation may need to flush the whole cache lines
    overlapping the start and end of the region, including bytes next to
    but outside the region.


The region to be cleaned/flushed/invalidated is specified the same as
I wrote above for the MEM.PREP instructions:  Assuming A and B are the
unsigned integer values of operands rs1 and rs2, then if A < B, the
region covers addresses A through B - 1, inclusive, and, conversely, if
B <= A, the region is empty.  The hardware does not guarantee to perform
the operation for the entire region at once, but instead returns a
progress in the destination rd.  The value written to rd is an address X
conforming to these rules:  If B <= A (region is empty), then X = B.
Else, if A < B (region is not empty), the value X must satisfy
A <= X <= B.  If the result value X = B, then the operation is complete
for the entire region.  If instead X < B (which can only happen if
the region wasn't empty), then software must execute the same CACHE
instruction again with rs1 set to X and with rs2 set to the same B as
before.

As long as no other memory instructions are executed between each CACHE
instruction, an implementation must guarantee that the original cache
operation will complete in a finite number of iterations of this
algorithm.  It is possible for an implementation to make progress on a
cache operation yet repeatedly return the original A as the result X,
until eventually returning B.
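
A hedged sketch of the resulting software loop, using CACHE.CLEAN as
the example (and assuming the rd, rs1, rs2 operand order):

        # a0 = A (base), a1 = B (bound)
    clean:
        cache.clean a0, a0, a1   # returns progress address X, A <= X <= B
        bltu  a0, a1, clean      # X < B: not yet complete; retry with rs1 = X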

A CACHE instruction may cause a page fault for any page overlapping the
specified region.

When an implementation's entire memory system is known to be coherent,
including any device DMA, then the CACHE instructions may be implemented
in a minimal way simply by copying operand rs2 to destination rd.  On
the other hand, if the memory system is not entirely coherent, the CACHE
instructions cannot be implemented trivially, because their full effects
are needed for software to compensate for a lack of hardware-enforced
coherence.

--------------------
Concerning instructions for data obsolescence

Jacob defines two instructions for declaring data obsolescence:
MEM.DISCARD and MEM.REWRITE.  I've renamed MEM.DISCARD to CACHE.INV,
and otherwise tweaked its behavior in only minor ways.  Concerning
MEM.REWRITE, it's my belief that the proposed uses for this instruction
can each be satisfied by one of the following:

    MEM.PREP.W
    MEM.PREP.INCW
    CACHE.INV followed by MEM.PREP.W
    CACHE.INV followed by MEM.PREP.INCW

For instance, I believe Jacob's stack frame example could be rewritten
as:

    A function prologue could use CACHE.INV and MEM.PREP.W to allocate
    a stack frame, while a function epilogue could use CACHE.INV to
    release a stack frame without requiring that the now-obsolete local
    variables ever be written back to main memory.

(Although it probably should be added that this wouldn't ordinarily be
advantageous to do for small stack frames.)

--------------------

That's it for now.  I invite feedback on any of the above.  I know
there's already been a lot of discussion on this topic, and I hope I
haven't overlooked any contrary conclusions from earlier.

Regards,

    - John Hauser

John Hauser

Mar 9, 2018, 11:59:47 PM
to RISC-V ISA Dev
By the way, the Google Groups interface seems still to be a bit buggy,
so if you see some strange formatting in my previous message, that's
the reason why.  The mistakes weren't there when I pushed the button
to approve the message, but Google's machines then inserted them, the
clever bastards.  Not under my control, nor can I edit E-mail to fix it once it's
been sent.

    - John Hauser

John Hauser

Mar 10, 2018, 12:12:48 AM
to RISC-V ISA Dev
Oh yeah, I see now, in the Google Groups archive I'm viewing, Google's
machines are automatically seeking out lines of text that match ones
from earlier messages, and then setting those lines off as quotations,
without realizing that that's having the effect of breaking up what I
wrote.  Hopefully, the E-mail text that was sent out doesn't include the
same effect.

I hope we're all preparing for the day soon when our lives are run
entirely by machines with the sense of a pea.

    - John Hauser

John Hauser

Mar 11, 2018, 4:17:27 PM
to RISC-V ISA Dev
On further reflection, I'd like to amend my proposal a bit.

It occurs to me that there could be value to having a MEM.DISCARD
instruction separate from the CACHE.INV I defined.  I propose
MEM.DISCARD be a kind of "destructive" hint, taking a region specified
by rs1 and rs2 the same as my MEM.PREP hints.  The meaning would be:

  MEM.DISCARD  [new version]

      Software asserts that no bytes in the region will be read again
      by anyone (including other processors and devices) until first
      written.

      Hardware might respond by invalidating any cached instances of the
      relevant data.

The differences between CACHE.INV and MEM.DISCARD would be:

  - While CACHE.INV is _required_ to cause cache invalidations,
    MEM.DISCARD only _might_ do so.  An implementation is free to ignore
    MEM.DISCARD.

  - Like my various MEM.PREP.* hint instructions, MEM.DISCARD would not
    be permitted to cause any traps.

  - There would be no mechanism for iterating MEM.DISCARD.  Destination
    rd would be required to be x0, the same as most (though not all) of
    my MEM.PREP instructions.

To review, the complete set of instructions covered by my proposal would
now be:

    MEM.PREP.I
    MEM.PREP.R
    MEM.PREP.RW
    MEM.PREP.INCR
    MEM.PREP.W
    MEM.PREP.INCW

    MEM.DISCARD

    CACHE.CLEAN
    CACHE.FLUSH
    CACHE.INV

At this point, the earlier MEM.DISCARD and MEM.REWRITE have been
split into four instructions with more specific semantics:  CACHE.INV,
MEM.DISCARD, MEM.PREP.W, and MEM.PREP.INCW.

Returning to Jacob's stack frame example, it might be sensible for a
subroutine to use MEM.DISCARD to "free" its stack frame before exiting,
especially if the frame is large.  On entry to the same subroutine,
one could also use MEM.DISCARD followed by MEM.PREP.W.  However, in
practice, if subroutines were routinely using MEM.DISCARD at exit,
having MEM.DISCARD + MEM.PREP.W at entry probably wouldn't buy much.

As always, feedback is appreciated.

    - John Hauser

Rogier Brussee

Mar 12, 2018, 12:45:04 PM
to RISC-V ISA Dev


On Sunday, March 11, 2018 at 21:17:27 UTC+1, John Hauser wrote:
On further reflection, I'd like to amend my proposal a bit.

It occurs to me that there could be value to having a MEM.DISCARD
instruction separate from the CACHE.INV I defined.  I propose
MEM.DISCARD be a kind of "destructive" hint, taking a region specified
by rs1 and rs2 the same as my MEM.PREP hints.  The meaning would be:

  MEM.DISCARD  [new version]

      Software asserts that no bytes in the region will be read again
      by anyone (including other processors and devices) until first
      written.

Is it necessary that nothing ever reads this region, or does a MEM.DISCARD
assert that, until it starts rewriting the region, software _executing in the current hart_:
- no longer cares what the content of the memory region is,
- gives hardware the freedom to no longer keep caches coherent with the content of the memory region, and
- gives hardware the freedom to drop any cache line for the memory region?

Thus, from the point of the hint on, the software treats the region in memory as undefined, and whatever is in
cache for the region is considered useless garbage. Clearly that means that the region should be private to the
hart (like a stack) and other harts or devices should not be reading or writing to it, but perhaps
someone finds a use for having harts or devices read and write asynchronously to such a region and designs some
protocol to deal with the resulting mess.

In any case, what is the behaviour if the hart or other harts or devices read from the region anyway?

P.S. FWIW I do think your naming and split-up make things a lot clearer. I like the name discard,
but for symmetry perhaps there should be a mem.discard and cache.discard, or a mem.inv and cache.inv.
Should the C.ADDI16SP imm instruction have an implicit MEM.DISCARD min(sp, sp + imm << 4), max(sp, sp + imm << 4) hint?

Tommy Thorn

Mar 14, 2018, 11:49:33 AM
to Rogier Brussee, RISC-V ISA Dev
I have no comments on the overall discussion at this point, but I must
comment on this:

> Should the C.ADDI16SP imm instruction have an implicit MEM.DISCARD min(sp, sp + imm << 4), max(sp, sp + imm<< 4) hint?

It is essential that we maintain the property of compressed instructions that they can be expanded directly into a single 32-bit instruction. This would break that.

Tommy

Christoph Hellwig

Mar 14, 2018, 1:29:40 PM
to Jacob Bachmeyer, RISC-V ISA Dev
Hi Jacob,

thanks for doing this! I had started drafting text for a small subset
of your instructions. Comments are mostly related to those.

On Wed, Mar 07, 2018 at 10:19:39PM -0600, Jacob Bachmeyer wrote:
> Previous discussions suggested that explicit cache-control instructions
> could be useful, but RISC-V has some constraints here that other
> architectures lack, namely that caching must be transparent to the user
> ISA.

Note that this can't always be the case. For at least two use cases
we will need cache control instructions that are not just hints:

a) persistent memory
b) cache-incoherent DMA

For persistent memory it is absolutely essential that we have a way
to force specific cache lines out to the persistence boundary. I've
started looking into porting the Linux kernel persistent memory support
and pmdk to RISC-V (so far in qemu emulation, but also looking into
rocket support), and the absolutely minimum required feature is
a cache line writeback instruction, which can not be treated as a hint.
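
For example, a minimal persist sequence with the proposed instructions
might look like the following sketch (whether a range fence such as
FENCE.RD is the right way to order the writeback with respect to the
persistence domain is exactly the kind of guarantee that needs to be
nailed down):

    sd    t0, 0(a0)              # update a record destined for pmem
    addi  a1, a0, 64             # end of the affected cacheline (the line
                                 #   size must be discoverable)
    cache.writeback a2, a0, a1   # not a hint: push the dirty line toward
                                 #   the persistence boundary
    fence.rd x0, a0, a1          # order the writeback before any commit
                                 #   store that follows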

For cache-incoherent DMA we also need a cache line invalidation
instruction that must not be a hint but guaranteed to work. I've
heard from a few folks that they'd like to mandate cache coherent
DMA for usual RISC-V systems. This sounds like a noble goal to me,
but just about every CPU architecture used in SOCs seems to grow
device with cache incoherent DMA rather sooner than later (due to
the fact that most SOCs are random pieces of barely debugged IP
glued together with shoe-string and paperclips). This even includes
x86 now.

> In general, this proposal uses "cacheline" to describe the hardware
> granularity for an operation that affects multiple words of memory or
> address space. Where these operations are implemented using traditional
> caches, the use of the term "cacheline" is entirely accurate, but this
> proposal does not prevent implementations from using other means to
> implement these instructions.

One of the thorny issues here is that we will have to have a way to
find out the cache line size for a given CPU.

> ====Flush====
>
> CACHE.WRITEBACK
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000100}
> Writeback any cachelines in the requested region.
>
> CACHE.FLUSH
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000101}
> Write any cachelines in the requested region (as by CACHE.WRITE), marking
> them invalid afterwards (as by MEM.DISCARD). Flushed cachelines are
> automatically unpinned.
>
> Rationale for including CACHE.FLUSH: small implementations may
> significantly benefit from combining CACHE.WRITEBACK and MEM.DISCARD; the
> implementations that most benefit lack the infrastructure to achieve such
> combination by macro-op fusion.

Yes, this is something both x86 and arm provide so it will help porting.

In terms of naming I'd rather avoid flush as a name as it is a very
overloaded term. I'd rather name the operations purely based on
'writeback' and 'invalidate', e.g. CACHE.WB, CACHE.INV and CACHE.WBINV

> These instructions create regions with undefined contents and share a
> requirement that foreign data never be introduced. Foreign data is,
> simply, data that was not previously visible to the current hart at the
> current privilege level at any address. Operating systems zero pages
> before attaching them to user address spaces to prevent foreign data from
> appearing in freshly-allocated pages. Implementations must ensure that
> these instructions do not cause foreign data to leak through caches or
> other structures.

This sounds extremely scary. Other architectures generally just define
instructions to invalidate the caches directly accessible to the hart,
up to a given coherency domain (arm is particularly complicated there,
x86 just has one domain).

> MEM.DISCARD
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000110}
> Declare the contents of the region obsolete, dropping any copies present
> in the processor without performing writes to main memory. The contents of
> the region are undefined after the operation completes, but shall not
> include foreign data.

What are copies present in the processor?

> If the region does not align with cacheline boundaries, any partial
> cachelines are written back. If hardware requires such, the full contents
> of a cacheline partially included may be written back, including data just
> declared obsolete. In a non-coherent system, partial cachelines written
> back are also invalidated. In a system with hardware cache coherency,
> partial cachelines must be written back, but may remain valid.

At least for data integrity operations like invalidate (or discard) and
writeback I would much, much prefer to require the operation to be
aligned to cache lines. Pretty much any partial behavior could be
doing the wrong thing for one case or another.

> MEM.REWRITE
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000111}
> Declare the contents of the region obsolete, indicating that the current
> hart will soon overwrite the entire region. Reading any datum from the
> region before the current hart has written that datum (or other data fully
> overlapping that datum) is incorrect behavior and produces undefined
> results, but shall not return foreign data. Note that undefined results
> may include destruction of nearby data. For optimal performance, software
> should write the entire region before reading any part of the region and
> should do so sequentially, starting at the base address and moving towards
> the address produced by MEM.REWRITE.

This sounds like a really scary undefined behavior trap. What is the
use case for this instruction? Are there equivalents in other architectures?

Christoph Hellwig

Mar 14, 2018, 1:47:38 PM
to Aaron Severance, jcb6...@gmail.com, RISC-V ISA Dev
On Thu, Mar 08, 2018 at 11:06:56PM +0000, Aaron Severance wrote:
> As a general point of confusion, the memory model spec states that future
> extensions may include cache management instructions but that they should
> be treated as hints, not functional requirements. For correctness it specs
> that a (possibly range-limited) fence must be used; the example they give
> is "fence rw[addr],w[addr]" for writeback.

Do you have a pointer to that part of the memory model? I couldn't
find anything that looks like it in the last draft on the memory-model
list.

> I take this to mean that with non-coherent caches/DMA on a fence with W in
> the predecessor set cached data needs to be written out to memomry and on a
> fence with R in the successor set the cache needs to be flushed. I'm not
> sure how useful the synchronous versions of WRITEBACK/FLUSH are then.

In general to me the RISC philosophy would imply keeping fence and
writeback instructions separate, although combining them would certainly
help with code density. Especially with the very weakly ordered memory
model, cache writebacks would almost always require some sort of fence
before the writeback. That being said we'd really want a ranged fence
to not entirely kill performance.

> The WRITEBACK and FLUSH instructions then seem mostly useful in their
> asynchronous form to initiate a writeback/flush early because a fence is
> needed to ensure correctness. As an example if working with a buffer
> shared with another non-coherent master then after you finish with a
> buffer: 1) do an asynchronous CACHE.FLUSH instruction on its addresses 2)
> do some other work 3) when the buffer is needed again by another hart or
> DMA device do a fence.

At least for the typical persistent memory use case, and the DMA API as
used in Linux we will need synchronous execution of the cache writeback
and invalidation as it is needed ASAP.

> Anyway, the points that I think should be clarified are:
> 1) Is if a fence is required for correctness when using the
> CACHE.WRITEBACK/CACHE.FLUSH operations?

The way I understood Jacob it is, but making this more clear would be
very useful.

> 2) Can CACHE.WRITEBACK, CACHE.FLUSH, MEM.DISCARD, and MEM.REWRITE be
> implemented as no-ops even on hardware with non-coherent caches?

It is not just non-coherent caches (which would be horrible) but also
non-cache-coherent DMA and persistent memory. In both cases they must
not be no-ops if we want a working system.

Michael Chapman

Mar 14, 2018, 5:23:09 PM
to Tommy Thorn, Rogier Brussee, RISC-V ISA Dev

On 14-Mar-18 22:49, Tommy Thorn wrote:
> ...
> It is essential that we maintain the property of compressed instructions that they can be expanded directly into a single 32-bit instruction.

Why?


Daniel Lustig

Mar 14, 2018, 6:21:55 PM
to Christoph Hellwig, Aaron Severance, jcb6...@gmail.com, RISC-V ISA Dev
On 3/14/2018 10:47 AM, Christoph Hellwig wrote:
> On Thu, Mar 08, 2018 at 11:06:56PM +0000, Aaron Severance wrote:
>> As a general point of confusion, the memory model spec states that future
>> extensions may include cache management instructions but that they should
>> be treated as hints, not functional requirements. For correctness it specs
>> that a (possibly range-limited) fence must be used; the example they give
>> is "fence rw[addr],w[addr]" for writeback.
>
> Do you have a pointer to that part of the memory model? I couldn't
> find anything that looks like it in the last draft on the memory-model
> list.

Right now we're focused on getting RVWMO settled, so for now we just
say the following in the explanatory material appendix, as one of the
"Possible Future Extensions" that we expect should be made compatible
with the memory consistency model:

> Cache writeback/flush/invalidate/etc. hints, but these should be
> considered hints, not functional requirements. Any cache management
> operations which are required for basic correctness should be
> described as (possibly address range-limited) fences to comply with
> the RISC-V philosophy (see also fence.i and sfence.vma). For example,
> a functional cache writeback instruction might instead be written as
> “fence rw[addr],w[addr]”.

Really, the intention of that text (which we can clarify) is that
it should apply to normal memory: people shouldn't use cache
writeback/flush/invalidate instead of following the rules of the
normal memory consistency model. As a performance hint, sure, but
not as a functional replacement. But I/O, non-coherent DMA,
persistent memory, etc. are a different question, and may well want
all those things for actual functional correctness. The memory model
task group is mostly punting on trying to formalize all that until
after RVWMO for normal memory is settled.

The bit about using fences is not necessarily meant as a strict
correctness claim either. It's just an observation that RISC-V
already uses FENCE.I and SFENCE.VMA to describe what other
architectures might describe as "invalidate the instruction cache"
and "invalidate the TLB", respectively. So in that spirit, maybe
the instruction for "invalidate this cache (line)" could be
described as some kind of fence too. And then likewise for flush
writeback/etc.

Or, maybe in the end that doesn't work out, and separate flush/
writeback/invalidate instructions work better. I don't think
there would be anything inherently wrong with that approach either.
Or, people may simply insist that DMA is coherent, etc., and
sidestep the question. We in the memory model TG are not really
taking any stance at the moment.

Dan

Rogier Brussee

unread,
Mar 14, 2018, 6:51:24 PM3/14/18
to RISC-V ISA Dev, rogier....@gmail.com


Op woensdag 14 maart 2018 16:49:33 UTC+1 schreef Tommy Thorn:
That is why it is a question. 

I would not even have posed the question if MEM.DISCARD were more than just a hint that may or may not
be followed by the hardware. I must admit, however, that the hint gives the hardware a little extra leeway with respect to cache coherency
that _might_ have subtle interactions with the memory model, even if that _should_ not make a difference if the instruction is used for its
intended purpose of manipulating the stack.


Probably the question should have been framed as: 

implementing MEM.DISCARD gives you (almost?) everything you need to implement a heuristic
equivalent to MEM.DISCARD min(sp, sp + imm), max(sp, sp + imm) when manipulating the stack before exit, i.e. on

addi sp, sp, imm
jr ra

is this allowed?
How does this interact with the MEM.DISCARD instruction?
If such a heuristic is allowed, is a MEM.DISCARD instruction still useful?


Rogier
  

Jacob Bachmeyer

unread,
Mar 14, 2018, 9:05:23 PM3/14/18
to Aaron Severance, RISC-V ISA Dev
Aaron Severance wrote:
> Thanks Jacob.
>
> As a general point of confusion, the memory model spec states that
> future extensions may include cache management instructions but that
> they should be treated as hints, not functional requirements. For
> correctness it specs that a (possibly range-limited) fence must be
> used; the example they give is "fence rw[addr],w[addr]" for writeback.

This proposal predates the new memory model. Unfortunately, the
original motivation for these instructions means that the memory model
is simply wrong on that point: in a system without hardware coherency,
cache management *must* have functional requirements. They can be
thought of as hint-like instructions, in that caches may immediately
continue their normal operations (for example, a just-prefetched
cacheline *can* be evicted if not pinned) but the synchronous forms
block execution (or create dependencies; an out-of-order processor is
not required to serialize on them, but must meet the implied fence)
until the operation is complete.

> I take this to mean that with non-coherent caches/DMA on a fence with
> W in the predecessor set cached data needs to be written out to
> memory and on a fence with R in the successor set the cache needs to
> be flushed. I'm not sure how useful the synchronous versions of
> WRITEBACK/FLUSH are then.
>
> The WRITEBACK and FLUSH instructions then seem mostly useful in their
> asynchronous form to initiate a writeback/flush early because a fence
> is needed to ensure correctness. As an example if working with a
> buffer shared with another non-coherent master then after you finish
> with a buffer: 1) do an asynchronous CACHE.FLUSH instruction on its
> addresses 2) do some other work 3) when the buffer is needed again by
> another hart or DMA device do a fence.

Or (1) write data to buffer (2) perform synchronous CACHE.FLUSH on
buffer (3) initiate DMA disk write (or similar I/O) from buffer. The
equivalent read uses MEM.DISCARD: (1) perform synchronous MEM.DISCARD
on buffer (2) wait for DMA disk read (or similar I/O) to complete (3)
read from buffer.
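
In (hypothetical) assembly, assuming the proposed mnemonics assemble as
written and that each operation completes in a single pass (a0 holds the
buffer base, a1 the inclusive upper bound; the DMA engine is programmed
elsewhere):

    # DMA output
    # ... stores that fill the buffer [a0, a1] ...
    cache.flush t0, a0, a1     # synchronous (rd != x0): writeback+invalidate,
                               # implies the needed data fence
    # now start the DMA write from the buffer

    # DMA input
    mem.discard t0, a0, a1     # drop any stale cached copies of the buffer
    # start the DMA read, wait for it to complete, then load from the buffer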

> Anyway, the points that I think should be clarified are:
> 1) Whether a fence is required for correctness when using the
> CACHE.WRITEBACK/CACHE.FLUSH operations?

The REGION operations, if synchronous, imply all relevant fences. This
is a new requirement from the base specification and has been added to
draft 6.

> 2) Can CACHE.WRITEBACK, CACHE.FLUSH, MEM.DISCARD, and MEM.REWRITE be
> implemented as no-ops even on hardware with non-coherent caches? I
> assume the cache pinning and prefetching ops can. WRITEBACK/FLUSH
> depend on if fences are required for correctness. DISCARD/REWRITE are
> more problematic but I would think they can be as long as fences are
> required for correctness.

Prefetch is prefetch -- the program does not really care if it completes
or even if the prefetched address is valid. The cache pinning
operations must either actually have their defined effect or fail,
transferring the base address in rs1 to rd to indicate that nothing was
affected. WRITEBACK/FLUSH imply the appropriate fences, as does
DISCARD. REWRITE really is an (ISA-level) no-op -- it has no directly
visible effects, but permits a microarchitectural optimization that
*does* have very visible effects if REWRITE is used improperly and a
performance benefit (elide a useless memory load) if used properly.

> More notes inline.
> Aaron
>
> On Wed, Mar 7, 2018 at 8:19 PM Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
> Previous discussions suggested that explicit cache-control
> instructions
> could be useful, but RISC-V has some constraints here that other
> architectures lack, namely that caching must be transparent to the
> user ISA.
>
> I propose a new minor opcode REGION := 3'b001 within the existing
> MISC-MEM major opcode. Instructions in REGION are R-type and use
> rs1 to
> indicate a base address, rs2 to indicate an upper bound address, and
> produce a result in rd that is the first address after the highest
> address affected by the operation. If rd is x0, the instruction
> has no
> directly visible effects and can be executed entirely
> asynchronously as
> an implementation option.
>
> Is rs2's upper bound inclusive?

Yes -- rs1 and rs2 may be the same value (even the same register) to
affect a single hardware granule.

> Regarding having rd write back the first address after the highest
> address affected by the operation:
> This wording is a bit confusing; even if there is no data in the
> cache in the specified range those addresses are 'affected'. Not sure
> what better wording would be though...
> Is this always rs2 (or rs2+1) or can it be arbitrarily higher?
> I believe this is meant to allow partial completion, where the
> operation is restarted from the address returned by rd.

The purpose is to allow easy looping over a larger region than the
hardware can handle with these operations: use the address produced as
the base address for the next loop iteration.
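
For example (a sketch only; it assumes the proposed mnemonic assembles as
written, uses a hypothetical handle_failure label, and ignores the
wrap-to-zero corner case at the very top of the address space):

    # writeback of [a0, a1], retrying from wherever the hardware stopped
    loop:   cache.writeback t0, a0, a1  # t0 = first address past the affected region
            beq   t0, a0, fail          # base address came back: nothing was done
            bltu  a1, t0, done          # a1 < t0: everything through a1 was covered
            mv    a0, t0                # partial completion: continue from t0
            j     loop
    fail:   j     handle_failure        # hypothetical handler elsewhere
    done: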

The proposal states that a requested region may be expanded on both ends
to meet hardware granularity requirements. The result is the first
address after the expanded region. Was this unclear and can you suggest
better wording?

> Assuming partial completion is allowed:
> Is forward progress required? i.e. must rd be >= rs1 + 1?
No, an operation can fail.
> 0 must be a valid return value (if the region goes to the highest
> addressable value).
> I would suggest that rd must be >= the first address after the
> highest address affected by the operation.
Permitting a series of operations to "skip" addresses prevents the easy
loop mentioned above.
> Then an implementation that always fully completes could then
> return 0.
> No-op implementations could also always return 0.
A no-op implementation must return the base address, indicating that
nothing was done.
> Does this apply to FENCE.RD/FENCE.RI? It seems problematic to have
> FENCE.RD/FENCE.RI partially complete and return, and FENCE/FENCE.I
> must fully complete anyway.

FENCE.I does not use its registers, all of which are required to be x0.

The ranged fences will be required to fully complete in draft 6.
Returning the base address is equally simple in hardware (copy rs1 vs
copy x0) and allows software to know that nothing has actually
happened. Generally, conflating failure (nothing done) with success
(complete!) *really* rubs me the wrong way. Software can always either
ignore the return value or issue the instruction asynchronously if it
truly does not care.

> The new MISC-MEM/REGION space will have room for 128 opcodes, one of
> which is the existing FENCE.I. I propose:
>
> [for draft 3, the function code assignments have changed to better
> group
> prefetch operations]
> [for draft 4, most of the mnemonics have been shortened and now
> indicate
> that these instructions affect the memory subsystem]
>
> ===Fences===
>
> [function 7'b0000000 is the existing FENCE.I instruction]
>
> [function 7'b0000001 reserved]
>
> FENCE.RD ("range data fence")
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000010}
> Perform a conservative fence affecting only data accesses to the
> chosen region. This instruction always has visible effects on memory
> consistency and is therefore synchronous in all cases.
>
>
> Does "only data accesses" mean it has the same effects as a "fence rw,
> rw"?

The ranged fences are conservative, which means that they are equivalent
to "FENCE rwio,rwio" for all addresses in the range. The instruction
fetch unit and load/store unit are permitted to have separate paths to
memory, so FENCE.RD affects the path from the load/store unit to main
memory, while FENCE.RI affects the path from the instruction fetch unit
to main memory.

Only atomic with respect to other accesses to the same region on the
same path to main memory as this hart.  For the small implementations
that drove the inclusion of CACHE.FLUSH, there is probably only a single
hart.

Is "invalidate cache" never specifically required?

-- Jacob

Jacob Bachmeyer

unread,
Mar 14, 2018, 9:15:40 PM3/14/18
to Christoph Hellwig, RISC-V ISA Dev
The proposed instructions are not hints. This will be explicit in draft 6.

>> In general, this proposal uses "cacheline" to describe the hardware
>> granularity for an operation that affects multiple words of memory or
>> address space. Where these operations are implemented using traditional
>> caches, the use of the term "cacheline" is entirely accurate, but this
>> proposal does not prevent implementations from using other means to
>> implement these instructions.
>>
>
> One of the thorny issues here is that we will have to have a way to
> find out the cache line size for a given CPU.
>

The intent for REGION opcodes is that the instruction specifies an
extent of memory that is to be affected. The actual hardware
granularity is not relevant.

>> ====Flush====
>>
>> CACHE.WRITEBACK
>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000100}
>> Writeback any cachelines in the requested region.
>>
>> CACHE.FLUSH
>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000101}
>> Write any cachelines in the requested region (as by CACHE.WRITEBACK), marking
>> them invalid afterwards (as by MEM.DISCARD). Flushed cachelines are
>> automatically unpinned.
>>
>> Rationale for including CACHE.FLUSH: small implementations may
>> significantly benefit from combining CACHE.WRITEBACK and MEM.DISCARD; the
>> implementations that most benefit lack the infrastructure to achieve such
>> combination by macro-op fusion.
>>
>
> Yes, this is something both x86 and arm provide so it will help porting.
>
> In terms of naming I'd rather avoid flush as a name as it is a very
> overloaded term. I'd rather name the operations purely based on
> 'writeback' and 'invalidate', e.g. CACHE.WB, CACHE.INV and CACHE.WBINV
>

Can you explain the plausible meanings of "flush" that could create
confusion? I had believed it to be a good synonym for
"writeback-and-invalidate".

>> These instructions create regions with undefined contents and share a
>> requirement that foreign data never be introduced. Foreign data is,
>> simply, data that was not previously visible to the current hart at the
>> current privilege level at any address. Operating systems zero pages
>> before attaching them to user address spaces to prevent foreign data from
>> appearing in freshly-allocated pages. Implementations must ensure that
>> these instructions do not cause foreign data to leak through caches or
>> other structures.
>>
>
> This sounds extremely scary. Other architectures generally just define
> instructions to invalidate the caches directly assesible to the hart
> equivalent up to a given coherency domain (arm is particularly complicated
> there, x86 just has one domain).
>

All operations in the proposal affect a path (and all nodes on that
path) between a hart and main memory. The MEM.DISCARD and MEM.REWRITE
instructions are ways of saying that some (not-yet-coherent) current
contents of a region no longer matter and need never be made coherent.
The scary prohibition on introducing foreign data is there to ensure
that this is safe. MEM.REWRITE particularly needs it to prevent
possible abuse.

>> MEM.DISCARD
>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000110}
>> Declare the contents of the region obsolete, dropping any copies present
>> in the processor without performing writes to main memory. The contents of
>> the region are undefined after the operation completes, but shall not
>> include foreign data.
>>
>
> What are copies present in the processor?
>

An old bit of wording (now) changed for draft 6: "copies present
between the hart's load/store unit and main memory" I was originally
thinking of a common modern PC architecture, with caches internal to the
processor module and memory on its own modules.

>> If the region does not align with cacheline boundaries, any partial
>> cachelines are written back. If hardware requires such, the full contents
>> of a cacheline partially included may be written back, including data just
>> declared obsolete. In a non-coherent system, partial cachelines written
>> back are also invalidated. In a system with hardware cache coherency,
>> partial cachelines must be written back, but may remain valid.
>>
>
> At least for data integrity operations like invalidate (or discard) and
> writeback I would much, much prefer to require the operation to be
> aligned to cache lines. Pretty much any partial behavior could be
> doing the wrong thing for one case or another.
>

The problem is that the entire point of region operations is to insulate
software from cache details. Partial cachelines are simply included in
non-destructive operations, but destructive operations require some
non-destructive substitute operation on partial cachelines. MEM.DISCARD
performs writeback, while MEM.REWRITE performs an exclusive prefetch.
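
A worked illustration (the 64-byte line size and the addresses are
invented for the example):

    # MEM.DISCARD over [0x1010, 0x10EF] on a machine with 64-byte cachelines:
    #   line 0x1000-0x103F  partially covered -> written back (and, in a
    #                       non-coherent system, invalidated)
    #   line 0x1040-0x107F  fully covered     -> dropped, no writeback
    #   line 0x1080-0x10BF  fully covered     -> dropped, no writeback
    #   line 0x10C0-0x10FF  partially covered -> written back (and, in a
    #                       non-coherent system, invalidated)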

>> MEM.REWRITE
>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000111}
>> Declare the contents of the region obsolete, indicating that the current
>> hart will soon overwrite the entire region. Reading any datum from the
>> region before the current hart has written that datum (or other data fully
>> overlapping that datum) is incorrect behavior and produces undefined
>> results, but shall not return foreign data. Note that undefined results
>> may include destruction of nearby data. For optimal performance, software
>> should write the entire region before reading any part of the region and
>> should do so sequentially, starting at the base address and moving towards
>> the address produced by MEM.REWRITE.
>>
>
> This sounds like a really scary undefined behavior trap.

The destruction of nearby data has been clarified to "nearby data within
the region" for draft 6. The idea is that MEM.REWRITE produces
(temporarily) an inconsistent state, where the cacheline is valid (so
that writes will hit) but was never loaded from memory (executing
MEM.REWRITE *did* declare that the contents of main memory at those
addresses did not matter) so contains garbage. Other words within the
cacheline (or within a wider cacheline used at an outer level but still
entirely within the region) may be clobbered as a result, if the region
is not entirely overwritten before the cacheline is written back.

The warnings of undefined behavior are supposed to be scary -- the
programmer is expected to take heed and promptly overwrite the entire
region MEM.REWRITE affects, as is MEM.REWRITE's purpose.

> What is the use case for this instruction?

Initializing or bulk copying data. One use case is to use MEM.REWRITE
when allocating and initializing a stack frame in a function prologue
(every word in that block will be written shortly) and MEM.DISCARD when
releasing a stack frame in a function epilogue (the locals are now dead,
why waste cycles writing them back?). Another use case (also the
inspiration) for MEM.REWRITE (for which the proposal gives some
heuristics) is memset(3) and memcpy(3). Both functions completely
overwrite a destination buffer without first reading anything from it.
Why waste cycles loading a destination from main memory just to write
the whole thing back?
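
A prologue/epilogue sketch for a leaf function with a 64-byte frame of
locals (this assumes the proposed mnemonics assemble as written and the
draft-6 rule that rd must not be x0 for the destructive operations; t0
and t1 are scratch):

        # prologue
        addi    sp, sp, -64
        addi    t1, sp, 63          # inclusive upper bound of the frame
        mem.rewrite t0, sp, t1      # frame will be entirely overwritten
        ...                         # body: writes every local before reading it

        # epilogue
        addi    t1, sp, 63
        mem.discard t0, sp, t1      # locals are dead; never write them back
        addi    sp, sp, 64
        jr      ra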

> Are there equivalents in other architectures?
As far as I know, MEM.REWRITE, exactly, is new. The concept of a block
of memory that must be written before reading from it is nothing new,
however -- C local variables and buffers returned from malloc(3) have
always been like this as far as I know.

There may be similar instructions on PowerPC that explicitly zero
cachelines and similar, but those would have been mentioned in the
earlier discussions on this topic on isa-dev and I do not have any
message-ids close at hand.


-- Jacob

Jacob Bachmeyer

unread,
Mar 14, 2018, 9:23:51 PM3/14/18
to Daniel Lustig, Christoph Hellwig, Aaron Severance, RISC-V ISA Dev
Daniel Lustig wrote:
> Really, the intention of that text (which we can clarify) is that
> it should apply to normal memory: people shouldn't use cache
> writeback/flush/invalidate instead of following the rules of the
> normal memory consistency model. As a performance hint, sure, but
> not as a functional replacement. But I/O, non-coherent DMA,
> persistent memory, etc. are a different question, and may well want
> all those things for actual functional correctness. The memory model
> task group is mostly punting on trying to formalize all that until
> after RVWMO for normal memory is settled.
>

This proposal predates the new memory model, when FENCE/FENCE.I was
really all we had. I would be particularly interested in feedback on
how best to arrange this and how to advise programmers who may see the
cache-control operations and miss the fences that they should be using
on RISC-V because other architectures have similar cache-control
operations instead of fences.

Should the cache-control instructions themselves be defined in terms of
fences? I am not yet entirely certain how to describe them that way.

> The bit about using fences is not necessarily meant as a strict
> correctness claim either. It's just an observation that RISC-V
> already uses FENCE.I and SFENCE.VMA to describe what other
> architectures might describe as "invalidate the instruction cache"
> and "invalidate the TLB", respectively. So in that spirit, maybe
> the instruction for "invalidate this cache (line)" could be
> described as some kind of fence too. And then likewise for flush
> writeback/etc.
>

This is the intent behind defining the cache-control operations on
regions instead of directly on cachelines. All of the cache-control
operations should be expressible as ranged fences with particular
(sometimes peculiar) semantics.


-- Jacob

Aaron Severance

unread,
Mar 14, 2018, 9:53:13 PM3/14/18
to jcb6...@gmail.com, RISC-V ISA Dev
Clarifying that there are implied FENCEs in the region instruction certainly helps my understanding.

Sure it works with the synchronous version.

For asynchronous DISCARD/REWRITE I assume it would be undefined to touch the memory before it completed.  I assume a FENCE would be the normal way to know it had completed?
I think I was just being pedantic about what affected means; don't worry about it.
 

> Assuming partial completion is allowed:
>     Is forward progress required? i.e. must rd be >= rs1 + 1?
No, an operation can fail.

Can you give an example?  Would failing mean just that nothing was done this iteration so you should keep trying, or that the operation cannot complete?
 
>     0 must be a valid return value (if the region goes to the highest
> addressable value).
>     I would suggest that rd must be >= the first address after the
> highest address affected by the operation.
Permitting a series of operations to "skip" addresses prevents the easy
loop mentioned above.

Sorry, I did not mean to imply that a partially complete operation should return an address that is higher than the first address it has not completed on.

What I meant was that if a region can be arbitrarily expanded a fully complete operation should be able to expand the region to all of memory and return 0 (highest addressable address + 1).
 
>       Then an implementation that always fully completes could then
> return 0.
>       No-op implementations could also always return 0.
A no-op implementation must return the base address, indicating that
nothing was done.

As an example take a system with no data cache (or a disabled data cache through some other mechanism).  If it runs into a WB instruction, after performing the implied FENCE should it return the base address indicating nothing was done or an address greater than the high address?
 
>     Does this apply to FENCE.RD/FENCE.RI? It seems problematic to have
> FENCE.RD/FENCE.RI partially complete and return, and FENCE/FENCE.I
> must fully complete anyway.

FENCE.I does not use its registers, all of which are required to be x0.

The ranged fences will be required to fully complete in draft 6.

Great.
Doing nothing does not necessarily mean failure.  Sometimes it means having nothing to do (again the example of a system with the cache disabled).  Writing software, I want to use the return value to check for completion: I need to see either partial completion and re-run the loop, or full completion and stop.  If I see failure but the operation was not needed, now I need to decode why that happened and whether it's safe to proceed.
A cache can always, for no user-discernible reason, have written back and flushed its data the cycle before you issue the DISCARD instruction.  So I don't see how using a FLUSH to do a DISCARD isn't valid.  The user can never guarantee that all the data they are trying to discard didn't get written back.
 
 
-- Jacob

Jacob Bachmeyer

unread,
Mar 14, 2018, 10:40:07 PM3/14/18
to Aaron Severance, RISC-V ISA Dev
Draft 6 will clarify that MEM.DISCARD/MEM.REWRITE can only be executed
synchronously. They are no-ops if rd is x0.
Generally, a complete failure indicates that the operation is not
implemented. Different regions of memory could have different sets of
supported operations, so a failure after a success suggests that you
have crossed from a region that can support that operation to one that
cannot. For CACHE.PIN, which allocates a finite resource, failure could
simply mean that there are no more cachelines available.

Looking at it another way, lack of forward progress *is* an error result.

> > 0 must be a valid return value (if the region goes to the
> highest
> > addressable value).
> > I would suggest that rd must be >= the first address after the
> > highest address affected by the operation.
> Permitting a series of operations to "skip" addresses prevents the
> easy
> loop mentioned above.
>
>
> Sorry, I did not mean to imply that a partially complete operation
> should return an address that is higher than the first address it has
> not completed on.
>
> What I meant was that if a region can be arbitrarily expanded a fully
> complete operation should be able to expand the region to all of
> memory and return 0 (highest addressable address + 1).

Only if the entire address space is effectively a single cacheline.
(This is a bad implementation, but you are correct that it is
technically permitted.) Note that such a return probably makes most of
these operations effectively useless.

Also, MEM.DISCARD and MEM.REWRITE can *not* be so expanded. Such an
implementation would be required to always fail those operations.

> > Then an implementation that always fully completes could then
> > return 0.
> > No-op implementations could also always return 0.
> A no-op implementation must return the base address, indicating that
> nothing was done.
>
>
> As an example take a system with no data cache (or a disabled data
> cache through some other mechanism). If it runs into a WB
> instruction, after performing the implied FENCE should it return the
> base address indicating nothing was done or an address greater than
> the high address?

If the system is capable of flushing caches, CACHE.WRITEBACK is not a
no-op and simply succeeds immediately if the cache is disabled.  Think
of the no-op implementation as returning ENOSYS.

If the cache is present but disabled, then a writeback succeeds (all of
the zero lines in the cache have been written back, after all).

> Writing software I want to use the return value to check for
> completion. I need to see either partial completion and re-run the
> loop or full completion and stop. If I see failure but the operation
> was not needed now I need to decode why that happened and if it's safe
> to proceed.

If you see partial completion, you also need to actually do part of
whatever you are working on before trying again. Partial completion
indicates that the hardware is expecting to handle your request piecemeal.

> [...]
While you are correct that MEM.DISCARD could be implemented as
CACHE.FLUSH (with some performance penalty due to useless writebacks),
such an implementation would use exactly the loophole that you
describe. (And which I have no intention of trying to close -- the
cache *could* have been flushed by a preemptive context-switch.) Any
copies present in caches at the time MEM.DISCARD is executed should be
dropped with no further action, however.

The important part (and the reason that non-coherent systems must
write-back-and-invalidate partial cachelines) is that MEM.DISCARD
ensures that any subsequent reads from the region will go all the way to
main memory. This is needed for DMA input in systems without hardware
coherency.

The reason that the contents of the region are undefined is that
different harts may actually see different contents. MEM.DISCARD and
MEM.REWRITE relax coherency for performance.


-- Jacob

Aaron Severance

unread,
Mar 15, 2018, 3:10:51 PM3/15/18
to jcb6...@gmail.com, RISC-V ISA Dev
Of course.  I'm not sure if you're also implying that DISCARD/FLUSH must always return rs2+1 on successful completion though.

One other question, if rs1 is 0 and rs2 is 0xFFFF_FFFF (for RV32) is there a way to signal failure vs full completion?
Yes.  I think you should change the MEM.DISCARD description from:
"Declare the contents of the region obsolete, dropping any copies
present in the processor without performing writes to main memory." to 
"Declare the contents of the region obsolete; the processor may drop any copies present in the processor without performing writes to main memory."
 

-- Jacob

Jacob Bachmeyer

unread,
Mar 15, 2018, 9:30:55 PM3/15/18
to Aaron Severance, RISC-V ISA Dev
Aaron Severance wrote:
> On Wed, Mar 14, 2018 at 7:40 PM Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
> Aaron Severance wrote:
> > Clarifying that there are implied FENCEs in the region instruction
> > certainly helps my understanding.
> >
> > On Wed, Mar 14, 2018 at 6:05 PM Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> >
> > Aaron Severance wrote:
> [...]
Partial completion is permitted. Some implementations may choose to
always process at most one cacheline, rather than iterating in hardware.

> One other question, if rs1 is 0 and rs2 is 0xFFFF_FFFF (for RV32) is
> there a way to signal failure vs full completion?

Unfortunately not. But this (affecting the entire address space) is
very much an edge case, and the workaround is to not do that.

> >[...]
I see no reason for this: either the now-obsolete data is dropped from
any intermediate caches or has already been written back. The important
effect (need for software-enforced coherency) is that no cachelines for
the region are valid after MEM.DISCARD completes.


-- Jacob

Albert Cahalan

unread,
Mar 16, 2018, 1:29:15 AM3/16/18
to jcb6...@gmail.com, Christoph Hellwig, RISC-V ISA Dev
On 3/14/18, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> Christoph Hellwig wrote:
>> On Wed, Mar 07, 2018 at 10:19:39PM -0600, Jacob Bachmeyer wrote:

>> For cache-incoherent DMA we also need a cache line invalidation
>> instruction that must not be a hint but guaranteed to work.

This must be privileged because it may expose old data.
Combined writeback+invalidate doesn't have the problem.

>> In terms of naming I'd rather avoid flush as a name as it is a very
>> overloaded term. I'd rather name the operations purely based on
>> 'writeback' and 'invalidate', e.g. CACHE.WB, CACHE.INV and
>> CACHE.WBINV
>
> Can you explain the plausible meanings of "flush" that could create
> confusion? I had believed it to be a good synonym for
> "writeback-and-invalidate".

PowerPC terminology would do writeback w/o invalidate.

> The problem is that the entire point of region operations is to insulate
> software from cache details.

I appreciate the thought, but this may be more trouble than it's worth.

Consider instead having two sizes, 4096 bytes and the full address space.
The 4096-byte one would require alignment, at least for operations
that can be destructive. These operations tend to be done on whole pages
or on all of the address space, so just going with those values is fine.

>> Are there equivalents in other architectures?
> As far as I know, MEM.REWRITE, exactly, is new.

No, it was part of POWER ("cli") and PowerPC ("dcbi").
Originally, those were privileged instructions that would
invalidate a cache line ("cli") or cache block ("dcbi") just
as you describe.

Later, the instructions were made to act the same as the
unprivileged ones ("dclz" and "dcbz") that zeroed.

Even later, the instructions were removed from the architecture.
Perhaps this ought to be taken as a hint regarding the desirability
of supporting such instructions.

Christoph Hellwig

unread,
Mar 16, 2018, 6:31:50 AM3/16/18
to Daniel Lustig, Christoph Hellwig, Aaron Severance, jcb6...@gmail.com, RISC-V ISA Dev
On Wed, Mar 14, 2018 at 03:21:52PM -0700, Daniel Lustig wrote:
> Really, the intention of that text (which we can clarify) is that
> it should apply to normal memory: people shouldn't use cache
> writeback/flush/invalidate instead of following the rules of the
> normal memory consistency model. As a performance hint, sure, but
> not as a functional replacement. But I/O, non-coherent DMA,
> persistent memory, etc. are a different question, and may well want
> all those things for actual functional correctness. The memory model
> task group is mostly punting on trying to formalize all that until
> after RVWMO for normal memory is settled.

The big issue is that a cache controller often can't really see
the difference. We could in theory force it through PMAs, but
that might turn very complicated really soon.

> The bit about using fences is not necessarily meant as a strict
> correctness claim either. It's just an observation that RISC-V
> already uses FENCE.I and SFENCE.VMA to describe what other
> architectures might describe as "invalidate the instruction cache"
> and "invalidate the TLB", respectively. So in that spirit, maybe
> the instruction for "invalidate this cache (line)" could be
> described as some kind of fence too. And then likewise for flush
> writeback/etc.

As long as it is just about naming I don't care too much. But both
for persistent memory and cache-incoherent DMA we do of course
require some amount of fencing as well, as we need to guarantee
that any effects before the invalidation or writeback are actually
covered.

Christoph Hellwig

unread,
Mar 16, 2018, 6:46:18 AM3/16/18
to Jacob Bachmeyer, Christoph Hellwig, RISC-V ISA Dev
On Wed, Mar 14, 2018 at 08:15:37PM -0500, Jacob Bachmeyer wrote:
> The intent for REGION opcodes is that the instruction specifies an extent
> of memory that is to be affected. The actual hardware granularity is not
> relevant.

At least for cache invalidation it absolutely is relevant, as it changes
the data visible at a given address.

E.g. I want to invalidate addresses 4096 to 4159 because I am going to do
a cache-incoherent DMA operation, but it turns out the implementation
has a cache line size of 128 (or, to take the extreme example allowed by
your definition, infinite); it will then invalidate all kinds of data that
the driver might have written close to it, and we get data corruption.

That is why:

a) the supervisor needs to know the cache line size so that it can align
dma-able structures based on it, and
b) any destructive operation must operate on said granularity.

And as Albert already mentioned, destructive operations exposed to U-mode
are a non-starter unless we can come up with very specific, carefully
drafted circumstances that are probably too complex to implement.

>> Yes, this is something both x86 and arm provide so it will help porting.
>>
>> In terms of naming I'd rather avoid flush as a name as it is a very
>> overloaded term. I'd rather name the operations purely based on
>> 'writeback' and 'invalidate', e.g. CACHE.WB, CACHE.INV and CACHE.WBINV
>>
>
> Can you explain the plausible meanings of "flush" that could create
> confusion? I had believed it to be a good synonym for
> "writeback-and-invalidate".

In a lot of the world people use it just for writeback. Concrete examples
are the ATA and NVMe storage protocols, and large parts of the Linux
kernel.

>>> MEM.DISCARD
>>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000110}
>>> Declare the contents of the region obsolete, dropping any copies present
>>> in the processor without performing writes to main memory. The contents
>>> of the region are undefined after the operation completes, but shall not
>>> include foreign data.
>>>
>>
>> What are copies present in the processor?
>>
>
> An old bit of wording (now) changed for draft 6: "copies present between
> the hart's load/store unit and main memory" I was originally thinking of a
> common modern PC architecture, with caches internal to the processor module
> and memory on its own modules.

Much better. But it is important that we do not restrict the wording
to main memory. A lot of these cache operations are especially important
for I/O devices, or for as-yet-uncategorized types like persistent
(or persistent-ish) memory.

> The problem is that the entire point of region operations is to insulate
> software from cache details. Partial cachelines are simply included in
> non-destructive operations, but destructive operations require some
> non-destructive substitute operation on partial cachelines. MEM.DISCARD
> performs writeback, while MEM.REWRITE performs an exclusive prefetch.

Which is an excellent way to make performance unusable. If each of
my invalidations for DMA requires two writebacks, one at either end, it
is going to perform horribly, while at the same time the supervisor could
almost trivially align the data structures properly if it knows the
cache line size.

>> What is the use case for this instruction?
>
> Initializing or bulk copying data. One use case is to use MEM.REWRITE when
> allocating and initializing a stack frame in a function prologue (every
> word in that block will be written shortly) and MEM.DISCARD when releasing
> a stack frame in a function epilogue (the locals are now dead, why waste
> cycles writing them back?). Another use case (also the inspiration) for
> MEM.REWRITE (for which the proposal gives some heuristics) is memset(3) and
> memcpy(3). Both functions completely overwrite a destination buffer
> without first reading anything from it. Why waste cycles loading a
> destination from main memory just to write the whole thing back?

I'd really like to see a prototype of this and very careful measurement
of whether it is actually worth it. Adding new instructions just because
they might sound useful is a guaranteed way to arrive at a bloated spec,
especially as RISC-V seems to bundle instructions into extensions instead
of allowing software to probe for individual instructions as with the x86
CPUID leaves.

>> Are there equivalents in other architectures?
> As far as I know, MEM.REWRITE, exactly, is new. The concept of a block of
> memory that must be written before reading from it is nothing new, however
> -- C local variables and buffers returned from malloc(3) have always been
> like this as far as I know.
>
> There may be similar instructions on PowerPC that explicitly zero
> cachelines and similar, but those would have been mentioned in the earlier
> discussions on this topic on isa-dev and I do not have any message-ids
> close at hand.

Explicit zeroing sounds to me like a much easier concept to understand,
use and optimize for.

Allen Baum

unread,
Mar 16, 2018, 11:59:42 AM3/16/18
to Michael Chapman, Tommy Thorn, Rogier Brussee, RISC-V ISA Dev
If a compressed instruction expands to more than a single 32-bit instruction, then it is no longer RISC-V, it is CISC-V.


John Hauser

unread,
Mar 16, 2018, 5:01:41 PM3/16/18
to RISC-V ISA Dev
I think there's an underlying question in this debate, and that is
whether every conforming RISC-V system will be required to have a
hardware-implemented coherent memory system, including for all device
DMA.

It's clear that some people want the answer to this question to
be "yes".  However, it's also true that many low-end systems have
traditionally not had memory systems of such complexity, presumably for
valid economic reasons.  If the hardware has caches and also supports
device DMA but doesn't automatically guarantee cache coherence for DMA'd
data, then some set of active cache control instructions are _required_
in the ISA, and not just what RISC-V calls "hints".

It's certainly within the rights of the Foundation, if it so chooses,
to make complete cache coherence an official requirement for conforming
RISC-V systems.  However, I'm not sure the Foundation's rectitude alone
will be sufficient to change the economics of small systems and convince
vendors en masse to accept the costs of complete memory coherence.  And
if the market doesn't bend, then will many low-end systems be denied the
official RISC-V mark for this heresy?

Efforts to develop standard cache control instructions are predicated
on the assumptions that they'll be both needed for some systems and
officially accepted for RISC-V.  Anyone who wants to argue against the
need ought to explain why they're certain low-end systems can absorb the
costs implied by their preferred memory model.

Regards,

    - John Hauser

Jacob Bachmeyer

unread,
Mar 16, 2018, 10:10:18 PM3/16/18
to Albert Cahalan, Christoph Hellwig, RISC-V ISA Dev
Albert Cahalan wrote:
> On 3/14/18, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
>> Christoph Hellwig wrote:
>>
>>> For cache-incoherent DMA we also need a cache line invalidation
>>> instruction that must not be a hint but guaranteed to work.
>>>
>
> This must be privileged because it may expose old data.
> Combined writeback+invalidate doesn't have the problem.
>

This is why the proposal includes language prohibiting the exposure of
foreign data. A process may see its own old data, but cache
invalidation must not expose data from another process.

>>> In terms of naming I'd rather avoid flush as a name as it is a very
>>> overloaded term. I'd rather name the operations purely based on
>>> 'writeback' and 'invalidate', e.g. CACHE.WB, CACHE.INV and
>>> CACHE.WBINV
>>>
>> Can you explain the plausible meanings of "flush" that could create
>> confusion? I had believed it to be a good synonym for
>> "writeback-and-invalidate".
>>
>
> PowerPC terminology would do writeback w/o invalidate.
>

Is the presence of a CACHE.WRITEBACK instruction sufficient to resolve
this potential ambiguity?

>>> Are there equivalents in other architectures?
>>>
>> As far as I know, MEM.REWRITE, exactly, is new.
>>
>
> No, it was part of POWER ("cli") and PowerPC ("dcbi").
> Originally, those were privileged instructions that would
> invalidate a cache line ("cli") or cache block ("dcbi") just
> as you describe.
>
> Later, the instructions were made to act the same as the
> unprivileged ones ("dclz" and "dcbz") that zeroed.
>
> Even later, the instructions were removed from the architecture.
> Perhaps this ought to be taken as a hint regarding the desirability
> of supporting such instructions.

MEM.REWRITE does *not* invalidate cachelines -- it allocates cachelines
with undefined initial contents. MEM.DISCARD invalidates cachelines.



-- Jacob

Jacob Bachmeyer

unread,
Mar 16, 2018, 10:32:34 PM3/16/18
to Christoph Hellwig, RISC-V ISA Dev
Christoph Hellwig wrote:
> On Wed, Mar 14, 2018 at 08:15:37PM -0500, Jacob Bachmeyer wrote:
>
>> The intent for REGION opcodes is that the instruction specifies an extent
>> of memory that is to be affected. The actual hardware granularity is not
>> relevant.
>>
>
> At least for cache invalidation it absolutely is relevant, as it changes
> the data visible at a given address.
>
> E.g. I want to invalidate address 4096 to 4159 because I am going to do
> a cache incoherent dma operation, but it turns out the implementation
> has a cache line size of 128 (or to take the extreme example allowed by
> your defintion infinite) it will invalidate all kinds of data that the
> driver might have written close to it, and we get data corruption.
>

No, destructive operations are required to instead perform their
non-destructive counterparts on any partially-included cachelines. On a
cacheline partially included in the region, MEM.DISCARD performs
CACHE.FLUSH (writing the entire cacheline back before invalidating it)
and MEM.REWRITE performs MEM.PF.EXCL (actually loading the data from
memory).
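
Applied to your example (a sketch, assuming 128-byte lines):

    # MEM.DISCARD over [4096, 4159] with 128-byte cachelines:
    #   the region covers only the first half of the line 4096-4223, so the
    #   line is partially included -> the implementation performs CACHE.FLUSH
    #   on it (full writeback, then invalidate) instead of discarding it, and
    #   the driver's nearby data is written back rather than lost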

> That is why the supervisor needs to:
>
> a) know the cache line size so that it can align dma-able structures
> based on it
>

That is beyond the scope of this proposal and is expected to be included
in the platform configuration structures.

> b) any destructive operation must operate on said granularity.
>

The proposal explicitly states that destructive operations perform
non-destructive equivalents on cachelines that are only partially
included in the requested region.