Proposal: Explicit cache-control instructions (draft 6 after feedback)


Jacob Bachmeyer

Jun 22, 2018, 9:18:42 PM6/22/18
to RISC-V ISA Dev
Previous discussions suggested that explicit cache-control instructions
could be useful, but RISC-V has some constraints here that other
architectures lack, namely that caching must be transparent to the user ISA.

I propose a new minor opcode REGION := 3'b001 within the existing
MISC-MEM major opcode. Instructions in REGION are R-type and use rs1 to
indicate a base address, use rs2 to indicate an (inclusive) upper bound
address, and produce a result in rd that is the first address after the
highest address affected by the operation. If an operation cannot be
applied to the entire requested region at once, an implementation must
reduce the upper bound to encompass a region to which the operation can
be applied at once, and the produced value must reflect this reduction.
If rd is x0, the instruction has no directly visible effects and can be
executed entirely asynchronously as an implementation option.
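
(Non-normative illustration: the sketch below assumes a hypothetical C
wrapper, cache_writeback(), that issues one REGION instruction with rs1 =
base, rs2 = the inclusive bound, and returns the value produced in rd; a
real toolchain might expose such wrappers as intrinsics or inline
assembly. It shows how software drives an operation to completion using
the produced value.)

    #include <stdint.h>

    /* Hypothetical wrapper: one REGION instruction, rs1 = base, rs2 =
     * inclusive bound; returns the first address after the highest
     * address affected. */
    extern uintptr_t cache_writeback(uintptr_t base, uintptr_t bound);

    /* Apply the operation to all of [base, bound], even if the
     * implementation only handles part of the region per instruction. */
    static void writeback_region(uintptr_t base, uintptr_t bound)
    {
        while (base <= bound) {
            uintptr_t next = cache_writeback(base, bound);
            if (next <= base)      /* no-op implementation returns base */
                break;
            base = next;
        }
    }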

Non-destructive operations permit an implementation to expand a
requested region on both ends to meet hardware granularity for the
operation. An application can infer alignment from the produced value
if it is a concern. As a practical matter, cacheline lengths are
expected to be declared in the processor configuration ROM.

Destructive operations are a thornier issue, and are resolved by
requiring any partial cachelines (at most 2 -- first and last) to be
flushed or prefetched instead of performing the requested operation on
those cachelines. Implementations may perform the destructive operation
on the parts of these cachelines included in the region, or may simply
flush or prefetch the entire cacheline.

If the upper and lower bounds are specified by the same register, the
smallest region that can be affected that includes the lower bound is
affected if the operation is non-destructive; destructive operations are
no-ops. Otherwise, the upper bound must be greater than the lower bound
and the contrary case is reserved.

??? Issue for discussion: what happens if this reserved case is executed?

In general, this proposal uses "cacheline" to describe the hardware
granularity for an operation that affects multiple words of memory or
address space. Where these operations are implemented using traditional
caches, the use of the term "cacheline" is entirely accurate, but this
proposal does not prevent implementations from using other means to
implement these instructions.

Similarly, this proposal uses "main memory" to refer to any ultimate
memory bus target, including MMIO or other hardware.

Instructions in MISC-MEM/REGION may be implemented as no-ops if an
implementation lacks the corresponding hardware. The value produced in
this case is the base address.

The instructions defined in this proposal are *not* hints, however: if
caches exist, the CACHE.* instructions *must* have their defined
effects. Similarly, the prefetch instructions can be no-ops only if the
implementation has neither caches nor prefetch buffers. Likewise for
MEM.DISCARD and MEM.REWRITE: they *must* actually have the stated
effects if hardware such as caches or write buffers is present.

Synchronous operations imply all relevant fences: the effect of a
synchronous CACHE.FLUSH instruction must be globally visible before any
subsequent memory accesses begin, for example.

??? Issue for discussion: exactly what fences are implied and can any
of these instructions be defined purely in terms of fences or other
ordering constraints applied to the memory model? Note that this
proposal predates the new memory model and has not yet been aligned with
the new model.

The new MISC-MEM/REGION space will have room for 128 opcodes, one of
which is the existing FENCE.I. I propose:

[note as of draft 6: preliminary implementations have been announced
that support CACHE.WRITEBACK, CACHE.FLUSH, and MEM.DISCARD from draft 5.]


===Fences===

[function 7'b0000000 is the existing FENCE.I instruction]

[function 7'b0000001 reserved]

FENCE.RD ("range data fence")
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000010}
Perform a conservative fence affecting only data accesses to the
chosen region. This instruction always has visible effects on memory
consistency and is therefore synchronous in all cases. Fences must
fully complete and are not permitted to fail.

FENCE.RI ("range instruction fence")
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000011}
Perform the equivalent of FENCE.I, but affecting only instruction fetches
from the chosen region. This instruction always has visible effects on
memory consistency and is therefore synchronous in all cases. Fences
must fully complete and are not permitted to fail.
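
(Non-normative illustration: one intended use of FENCE.RI is to
synchronize instruction fetch for freshly written code without a global
FENCE.I. The sketch assumes a hypothetical C wrapper, fence_ri(), with
the same base/inclusive-bound/produced-value convention as above.)

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical wrapper around FENCE.RI: rs1 = base, rs2 = inclusive
     * bound; returns the first address after the affected region. */
    extern uintptr_t fence_ri(uintptr_t base, uintptr_t bound);

    /* Install freshly written code, then fence only that range before
     * jumping to it. */
    static void install_code(void *dst, const void *src, size_t len)
    {
        memcpy(dst, src, len);
        uintptr_t base  = (uintptr_t)dst;
        uintptr_t bound = base + len - 1;
        while (base <= bound) {
            uintptr_t next = fence_ri(base, bound);
            if (next <= base)      /* guard against a no-op implementation */
                break;
            base = next;
        }
    }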

===Non-destructive cache control===

====Prefetch====

All prefetch instructions ignore page faults and other access faults.
In general use, applications should use rd == x0 for prefetching,
although this is not required. If a fault occurs during a synchronous
prefetch (rd != x0), the operation must terminate and produce the
faulting address. A fault occurring during an asynchronous prefetch (rd
== x0) may cause the prefetching to stop or the implementation may
attempt to continue prefetching past the faulting location.

MEM.PF0 - MEM.PF3 ("prefetch, levels 0 - 3")
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001000}
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001001}
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001010}
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001011}
Load as much of the chosen region as possible into the data cache,
with varying levels of expected temporal access locality. The number in
the opcode is proportionate to the expected frequency of access to the
prefetched data: MEM.PF3 is for data that will be very heavily used.

MEM.PF.EXCL ("prefetch exclusive")
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001100}
Load as much of the chosen region as possible into the data cache,
with the expectation that future stores will soon occur to this region.
In a cache-coherent system, any locks required for writing the affected
cachelines should be acquired.

MEM.PF.ONCE ("prefetch once")
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001101}
Prefetch as much of the region as possible, but expect the prefetched
data to be used at most once in any order.

MEM.PF.STREAM ("prefetch stream")
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001110}
Initiate streaming prefetch of the region, expecting the prefetched
data to be used at most once and in sequential order, while minimizing
cache pollution. This operation may activate a prefetch unit and
prefetch the region incrementally if rd is x0. Software is expected to
access the region sequentially, starting at the base address.
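
(Non-normative illustration of the intended usage pattern, assuming a
hypothetical wrapper mem_pf_stream() that issues MEM.PF.STREAM with rd =
x0, so the prefetch may proceed asynchronously alongside the loop.)

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical wrapper: MEM.PF.STREAM with rd = x0, inclusive bound. */
    extern void mem_pf_stream(const void *base, const void *bound);

    static uint64_t sum_stream(const uint64_t *buf, size_t n)
    {
        if (n == 0)
            return 0;
        mem_pf_stream(buf, (const uint8_t *)(buf + n) - 1);
        uint64_t sum = 0;
        for (size_t i = 0; i < n; i++)   /* sequential, from the base address */
            sum += buf[i];
        return sum;
    }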

MEM.PF.TEXT ("prefetch program text")
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001111}
Load as much of the chosen region as possible into the instruction cache.

====Cacheline pinning====

??? Issue for discussion: should a page fault while pinning cachelines
cause a trap to be taken or simply cause the operation to stop or fail?
Should CACHE.PIN use the same approach to TLB fills as MEM.REWRITE uses?
??? Issue for discussion: what if another processor attempts to write
to an address in a cacheline pinned on this processor? [partially addressed]

CACHE.PIN
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0010000}
Arrange for as much of the chosen region as possible to be accessible
with minimal delay and no traffic to main memory. Pinning a region is
idempotent and an implementation may pin a larger region than requested,
provided that an unpin operation with the same base and bound will also
unpin the larger region.
One possible implementation is to load as much of the chosen region as
possible into the data cache and keep it there until unpinned. Another
implementation is to configure a scratchpad RAM and map it over at least
the chosen region, preloading it with data from main memory.
Scratchpads may be processor-local, but writes to a scratchpad mapped
with CACHE.PIN must be visible to other nodes in a coherent system.
Implementations are expected to ensure that pinned cachelines will not
impede the efficacy of a cache. Implementations with fully-associative
caches may permit any number of pins, provided that at least one
cacheline remains available for normal use. Implementations with N-way
set associative caches may support pinning up to (N-1) ways within each
set, provided that at least one way in each set remains available for
normal use. Implementations with direct-mapped caches should not pin
cachelines, but may still use CACHE.PIN to configure an overlay
scratchpad, which may itself use storage shared with caches, such that
mapping the scratchpad decreases the size of the cache.

Implementations may support both cacheline pinning and scratchpads,
choosing which to use to perform a CACHE.PIN operation in an
implementation-defined manner.

Pinned dirty cachelines may be written back at any time, after which
they are clean but remain valid. Pinned cachelines may be used as
writable scratchpad storage overlaying ROM; any errors writing back a
pinned cacheline are ignored.

CACHE.UNPIN
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0010001}
Explicitly release a pin set with CACHE.PIN. Pinned regions are also
implicitly released if the memory protection and virtual address mapping
is changed. (Specifically, a write to the current satp CSR or an
SFENCE.VM will unpin all cachelines as a side effect, unless the
implementation partitions its cache by ASID. Even with ASID-partitioned
caches, changing the root page number associated with an ASID unpins all
cachelines belonging to that ASID.) Unpinning a region does not
immediately remove it from the cache. Unpinning a region always
succeeds, even if parts of the region were not pinned. For an
implementation that implements CACHE.PIN using scratchpad RAM, unpinning
a region that uses a scratchpad causes the current contents of the
scratchpad to be written to main memory if possible.

An implementation with hardware-enforced cache coherency using an
invalidation-based coherency protocol may force pinned cachelines to be
written back and unpinned if another processor attempts to write to a
cacheline pinned locally. Implementations using an update-based
coherency protocol may update pinned cachelines "in-place" when another
processor attempts to write to a cacheline pinned locally. Either
solution adversely impacts performance and software should avoid writing
to pinned cachelines on remote harts.
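
(Non-normative illustration: because pinning is idempotent, software can
simply re-pin at the top of an outer loop, which also restores any pins
lost to a context switch. The wrapper names below are hypothetical and
follow the same base/inclusive-bound convention as above.)

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical wrappers around CACHE.PIN / CACHE.UNPIN. */
    extern uintptr_t cache_pin(uintptr_t base, uintptr_t bound);
    extern uintptr_t cache_unpin(uintptr_t base, uintptr_t bound);

    /* Keep a hot lookup table resident while a batch of items is
     * processed; re-pinning each iteration is cheap (idempotent). */
    static void process_batch(uint8_t *table, size_t table_len,
                              uint8_t *items, size_t n_items,
                              void (*work)(const uint8_t *table, uint8_t *item))
    {
        uintptr_t base  = (uintptr_t)table;
        uintptr_t bound = base + table_len - 1;

        for (size_t i = 0; i < n_items; i++) {
            cache_pin(base, bound);      /* idempotent */
            work(table, &items[i]);
        }
        cache_unpin(base, bound);        /* always succeeds */
    }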

And two M-mode-only privileged instructions:

CACHE.PIN.I
{opcode, funct3, funct7, MODE} = {$MISC-MEM, $REGION, 7'b1010000, 2'b11}
Arrange for code to execute from as much of the chosen region as
possible without traffic to main memory. Pinning a region is idempotent.

CACHE.UNPIN.I
{opcode, funct3, funct7, MODE} = {$MISC-MEM, $REGION, 7'b1010001, 2'b11}
Release resources pinned with CACHE.PIN.I. Pins are idempotent. One
unpin instruction will unpin the chosen region completely, regardless of
how many times it was pinned. Unpinning always succeeds.

====Flush====

CACHE.WRITEBACK
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000100}
Write back any cachelines in the requested region.

CACHE.FLUSH
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000101}
Write back any cachelines in the requested region (as by CACHE.WRITEBACK),
marking them invalid afterwards (as by MEM.DISCARD). Flushed cachelines
are automatically unpinned.

Rationale for including CACHE.FLUSH: small implementations may
significantly benefit from combining CACHE.WRITEBACK and MEM.DISCARD;
the implementations that most benefit lack the infrastructure to achieve
such combination by macro-op fusion.


===Declaring data obsolescence===

These operations declare data to be obsolete and unimportant. In
fully-coherent systems, they are two sides of the same coin:
MEM.DISCARD declares data not yet written to main memory to be obsolete,
while MEM.REWRITE declares data in main memory to be obsolete and
indicates that software on this hart will soon overwrite the region.
These operations are useful in general: a function prologue could use
MEM.REWRITE to allocate a stack frame, while a function epilogue could
use MEM.DISCARD to release a stack frame without requiring that the
now-obsolete local variables ever be written back to main memory. In
non-coherent systems, MEM.DISCARD is also an important tool for
software-enforced coherency, since its semantics provide an invalidate
operation on all caches on the path between a hart and main memory.
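
(Non-normative illustration of the stack-frame pattern above, shown for a
scratch buffer with hypothetical wrappers mem_rewrite() and mem_discard();
a full implementation would loop on the produced values as described
earlier, omitted here for brevity.)

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical wrappers; inclusive bounds, produced value returned. */
    extern uintptr_t mem_rewrite(uintptr_t base, uintptr_t bound);
    extern uintptr_t mem_discard(uintptr_t base, uintptr_t bound);

    static void use_scratch(uint8_t *buf, size_t len)
    {
        uintptr_t base  = (uintptr_t)buf;
        uintptr_t bound = base + len - 1;

        mem_rewrite(base, bound);         /* "prologue": old contents obsolete */
        for (size_t i = 0; i < len; i++)  /* overwrite sequentially */
            buf[i] = (uint8_t)i;
        /* ... compute with buf ... */
        mem_discard(base, bound);         /* "epilogue": never write back */
    }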

The declarations of obsolescence produced by these instructions are
global and affect all caches on the path between a hart and main memory
and all caches coherent with those caches, but are not required to
affect non-coherent caches not on the direct path between a hart and
main memory. Implementations depending on software to maintain
coherency in such situations must provide other means (MMIO control
registers, for example) to force invalidations in remote non-coherent
caches.

These instructions create regions with undefined contents and share a
requirement that foreign data never be introduced. Foreign data is,
simply, data that was not previously visible to the current hart at the
current privilege level at any address. Operating systems zero pages
before attaching them to user address spaces to prevent foreign data
from appearing in freshly-allocated pages. Implementations must ensure
that these instructions do not cause foreign data to leak through caches
or other structures.

These instructions must be executed synchronously. If executed with x0
as rd, they are no-ops.

MEM.DISCARD
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000110}
Declare the contents of the region obsolete, dropping any copies
present between the hart's load/store unit and main memory without
performing writes to main memory. The contents of the region are
undefined after the operation completes, but shall be data that was
previously written to the region and shall not include foreign data.
If the region does not align with cacheline boundaries, any partial
cachelines are written back. If hardware requires such, the full
contents of a cacheline partially included may be written back,
including data just declared obsolete. In a non-coherent system,
partial cachelines written back are also invalidated. In a system with
hardware cache coherency, partial cachelines must be written back, but
may remain valid.
Any cachelines fully affected by MEM.DISCARD are automatically unpinned.
NOTE WELL: MEM.DISCARD is *not* an "undo" operation for memory writes
-- an implementation is permitted to aggressively writeback dirty
cachelines, or even to omit caches entirely. *ANY* combination of "old"
and "new" data may appear in the region after executing MEM.DISCARD.

MEM.REWRITE
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000111}
Declare the contents of the region obsolete, indicating that the
current hart will soon overwrite the entire region. Reading any datum
from the region before the current hart has written that datum (or other
data fully overlapping that datum) is incorrect behavior and produces
undefined results, but shall return data as described below for security
reasons. Note that undefined results may include destruction of nearby
data within the region. For optimal performance, software should write
the entire region before reading any part of the region and should do so
sequentially, starting at the base address and moving towards the
address produced by MEM.REWRITE.
For security reasons, implementations must ensure that cachelines
allocated by MEM.REWRITE appear to contain either all-zeros or all-ones
if invalidly read. The choice of all-zeros or all-ones is left to
implementation convenience, but must be consistent and fixed for any
particular hart.
TLB fills occur normally as for writes to the region and must appear
to occur sequentially, starting at the base address. A page fault in
the middle of the region causes the operation to stop and produce the
faulting address. A page fault at the base address causes a page fault
trap to be taken.
Implementations with coherent caches should arrange for all cachelines
in the region to be in a state that permits the current hart to
immediately overwrite the region with no further delay. In common
cache-coherency protocols, this is an "exclusive" state.
An implementation may have a maximum size of a region that can have a
rewrite pending. If software declares intent to overwrite a larger
region than the implementation can prepare at once, the operation must
complete partially and return the first address beyond the region
immediately prepared for overwrite. Software is expected to overwrite
the region prepared, then iterate for the next part of the region that
software intends to overwrite until the entire larger region is overwritten.
If the region does not align with cacheline boundaries, any partial
cachelines are prefetched as by MEM.PF.EXCL. If hardware requires such,
the full contents of a cacheline partially included may be loaded from
memory, including data just declared obsolete.
NOTE WELL: MEM.REWRITE is *not* memset(3) -- any portion of the
region prepared for overwrite already present in cache will *retain* its
previously-visible contents.
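
(Non-normative illustration of the partial-completion rule above, as a
memset-style fill: prepare as much of the region as the implementation
allows, overwrite that chunk, then iterate over the remainder. The
wrapper name is hypothetical.)

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* Hypothetical wrapper around MEM.REWRITE (inclusive bounds). */
    extern uintptr_t mem_rewrite(uintptr_t base, uintptr_t bound);

    static void fill(uint8_t *dst, int c, size_t len)
    {
        uintptr_t base  = (uintptr_t)dst;
        uintptr_t bound = base + len - 1;
        while (base <= bound) {
            uintptr_t next = mem_rewrite(base, bound);
            if (next <= base)             /* no-op implementation */
                next = bound + 1;
            memset((void *)base, c, (size_t)(next - base));
            base = next;
        }
    }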

MEM.REWRITE appears to be a relatively novel operation and previous
iterations of this proposal have produced considerable confusion. While
the above semantics are the required behavior, there are different ways
to implement them. One simple option is to temporarily mark the region
as "write-through" in internal configuration. Another option is to
allocate cachelines, but load the allocated cachelines with all-zeros or
all-ones instead of fetching contents from memory. A third option is to
track whether cachelines have been overwritten and use a monitor trap to
zero cachelines that software attempts to invalidly read. A fourth
option is to provide dedicated write-combining buffers for MEM.REWRITE.
In systems that implement MEM.REWRITE using cache operations,
MEM.REWRITE allocates cachelines, marking them "valid, exclusive, dirty"
and filling them with a constant without reading from main memory.
Other cachelines may be evicted to make room if needed but
implementations should avoid evicting data recently fetched with
MEM.PF.ONCE or MEM.PF.STREAM, as software may intend to copy that data
into the region. Implementations are recommended to permit at most half
of a data cache to be allocated for MEM.REWRITE if data has been
recently prefetched into the cache to aid in optimizing memcpy(3), but
may permit the full data cache to be used to aid in optimizing
memset(3). In particular, an active asynchronous MEM.PF.ONCE or
MEM.PF.STREAM ("active" meaning that the data prefetched has not yet
been read) can be taken as a hint that MEM.REWRITE is preparing to copy
data and should use at most half or so of the data cache.
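
(Non-normative illustration of the memcpy(3) case: stream the source in
once, prepare the destination for overwrite, and copy chunk by chunk as
MEM.REWRITE permits. Wrapper names are hypothetical, as in the earlier
sketches.)

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    extern void mem_pf_stream(const void *base, const void *bound);
    extern uintptr_t mem_rewrite(uintptr_t base, uintptr_t bound);

    static void copy_region(uint8_t *dst, const uint8_t *src, size_t len)
    {
        if (len == 0)
            return;
        mem_pf_stream(src, src + len - 1);
        uintptr_t base  = (uintptr_t)dst;
        uintptr_t bound = base + len - 1;
        while (base <= bound) {
            uintptr_t next = mem_rewrite(base, bound);
            if (next <= base)             /* no-op implementation */
                next = bound + 1;
            memcpy((void *)base, src + (base - (uintptr_t)dst),
                   (size_t)(next - base));
            base = next;
        }
    }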


??? Issue for discussion: the requirements necessary for MEM.REWRITE to
be safe come very close to implementing memset(3) for the special case
of a zero value. Is there significant incremental cost to going the
rest of the way and changing MEM.REWRITE to MEM.CLEAR?


=== ===

Thoughts?

Thanks to:
[draft 1]
Bruce Hoult for citing a problem with the HiFive board that inspired
the I-cache pins.
[draft 2]
Stefan O'Rear for suggesting the produced value should point to the
first address after the affected region.
Alex Elsayed for pointing out serious problems with expanding the
region for a destructive operation and suggesting that "backwards"
bounds be left reserved.
Guy Lemieux for pointing out that pinning was insufficiently specified.
Andrew Waterman for suggesting that MISC-MEM/REGION could be encoded
around the existing FENCE.I instruction.
Allen Baum for pointing out the incomplete handling of page faults.
[draft 3]
Guy Lemieux for raising issues that inspired renaming PREZERO to RELAX.
Chuanhua Chang for suggesting that explicit invalidation should unpin
cachelines.
Guy Lemieux for being persistent asking for CACHE.FLUSH and giving
enough evidence to support that position.
Guy Lemieux and Andrew Waterman for discussion that led to rewriting a
more general description for pinning.
[draft 4]
Guy Lemieux for suggesting that CACHE.WRITE be renamed CACHE.WRITEBACK.
Allen J. Baum and Guy Lemieux for suggestions that led to rewriting the
destructive operations.
[draft 5]
Allen Baum for offering a clarification for the case of using the same
register for both bounds to select a minimal-length region.
[draft 6]
Aaron Severance for highlighting possible issues with the new memory
model and other minor issues.
Albert Cahalan for suggesting a use case for pinned cachelines making
ROM appear writable.
Christoph Hellwig for pointing out poor wording in various parts of the
proposal.
Guy Lemieux for pointing out serious potential issues with cacheline
pinning and coherency.
Cesar Eduardo Barros for providing the example that proved that
MEM.REWRITE must appear to load a constant when allocating cachelines.
Others had raised concerns, but Cesar Eduardo Barros provided the
counter-example that proved the previous semantics to be unsafe.
Richard Herveille for raising issues related to streaming data access
patterns.


-- Jacob

Paul A. Clayton

Jun 24, 2018, 11:51:15 PM6/24/18
to jcb6...@gmail.com, RISC-V ISA Dev
On 6/22/18, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
[snip]
> ====Prefetch====
>
> All prefetch instructions ignore page faults and other access faults.
> In general use, applications should use rd == x0 for prefetching,
> although this is not required. If a fault occurs during a synchronous
> prefetch (rd != x0), the operation must terminate and produce the
> faulting address. A fault occurring during an asynchronous prefetch (rd
> == x0) may cause the prefetching to stop or the implementation may
> attempt to continue prefetching past the faulting location.

This seems to be hint-like semantics (even though it is only triggered
on a fault).

> MEM.PF0 - MEM.PF3 ("prefetch, levels 0 - 3")
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001000}
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001001}
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001010}
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001011}
> Load as much of the chosen region as possible into the data cache,
> with varying levels of expected temporal access locality. The number in
> the opcode is proportionate to the expected frequency of access to the
> prefetched data: MEM.PF3 is for data that will be very heavily used.

"heavily used" seems an improper term with respect to temporal
locality. A memory region can have the same number or frequency
of accesses ("heaviness" can refer to "weight" or "density") but
different use lifetimes.

If this information is intended to communicate temporal locality
(and not some benefit measure of caching), then prefetch once
might be merged as the lowest temporal locality.

(Utility, persistence, and criticality are different measures that
software may wish to communicate to the memory system.)

> MEM.PF.EXCL ("prefetch exclusive")
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001100}
> Load as much of the chosen region as possible into the data cache,
> with the expectation that future stores will soon occur to this region.
> In a cache-coherent system, any locks required for writing the affected
> cachelines should be acquired.

It may be useful to make a distinction between prefetch for write
where reads are not expected but the region is not guaranteed to
be overwritten. An implementation might support general avoidance
of read-for-ownership (e.g., finer-grained validity indicators) but
still benefit from a write prefetch to establish ownership.

> MEM.PF.ONCE ("prefetch once")
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001101}
> Prefetch as much of the region as possible, but expect the prefetched
> data to be used at most once in any order.

"used once" may be defined at byte level or at cache block level.
If the use does not have considerable spatial locality at cache block
granularity, use once at byte level would have a different intent
than use once at a coarser granularity. With cache-block granular
access, the cache can evict after the first access; with byte-granular
access and lower spatio-temporal locality, another byte within a
cache block may be accessed relatively long after the first access
to the cache block (so software would not want the block to be
marked for eviction after the first block access).

> MEM.PF.STREAM ("prefetch stream")
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001110}
> Initiate streaming prefetch of the region, expecting the prefetched
> data to be used at most once and in sequential order, while minimizing
> cache pollution. This operation may activate a prefetch unit and
> prefetch the region incrementally if rd is x0. Software is expected to
> access the region sequentially, starting at the base address.

It may be useful to include a stride for stream prefetching.

> MEM.PF.TEXT ("prefetch program text")
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001111}
> Load as much of the chosen region as possible into the instruction cache.

With a unified Ln cache, would the instruction prefetching "overflow"
into the Ln cache? (Similar questions apply to data.)

> ====Cacheline pinning====
>
> ??? Issue for discussion: should a page fault while pinning cachelines
> cause a trap to be taken or simply cause the operation to stop or fail?
> Should CACHE.PIN use the same approach to TLB fills as MEM.REWRITE uses?
> ??? Issue for discussion: what if another processor attempts to write
> to an address in a cacheline pinned on this processor? [partially
> addressed]
>
> CACHE.PIN
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0010000}
> Arrange for as much of the chosen region as possible to be accessible
> with minimal delay and no traffic to main memory. Pinning a region is
> idempotent and an implementation may pin a larger region than requested,
> provided that an unpin operation with the same base and bound will also
> unpin the larger region.

"as much as possible" and "minimal delay" interact in a more complex
memory system. Would software prefer more of the region be cached
(at some latency or bandwidth penalty) or just have the excess ignored?
(While these cache operations are presumably intended for more
microarchitecture-specific tuning, some software may be developed
quickly to work well on one microarchitecture with an expectation
that the software would work decently for a general class of
implementations.)

Pinning also seems to be a specification of temporal locality:
"perpetual" locality.

> One possible implementation is to load as much of the chosen region as
> possible into the data cache and keep it there until unpinned. Another
> implementation is to configure a scratchpad RAM and map it over at least
> the chosen region, preloading it with data from main memory.
> Scratchpads may be processor-local, but writes to a scratchpad mapped
> with CACHE.PIN must be visible to other nodes in a coherent system.
> Implementations are expected to ensure that pinned cachelines will not
> impede the efficacy of a cache. Implementations with fully-associative
> caches may permit any number of pins, provided that at least one
> cacheline remains available for normal use. Implementations with N-way
> set associative caches may support pinning up to (N-1) ways within each
> set, provided that at least one way in each set remains available for
> normal use. Implementations with direct-mapped caches should not pin
> cachelines, but may still use CACHE.PIN to configure an overlay
> scratchpad, which may itself use storage shared with caches, such that
> mapping the scratchpad decreases the size of the cache.

What about overlaid skewed associative caches (rf. "Concurrent Support of
Multiple Page Sizes On a Skewed Associative TLB")? In such a design
the capacity is not isolated by ways and enforcing a guarantee that any
cache block could be allocated might be somewhat expensive. (The
easiest method might be to provide an equal-latency side cache which
might otherwise be used as a victim cache, prefetch buffer, stream
cache, or provide other functionality.)

It might also be noted that with overlaid skewed associativity, different
block alignments would be practical with alignments associated with
ways (like different page sizes in the Seznec paper).

(Another consideration is that cache block sizes may be diverse and
non-constant. E.g., an implementation that allowed half of L1 cache
to be mapped as a scratchpad might use the extra tags to support
twice as many smaller cache blocks (requiring only an extra bit per
tag).)
Even an invalidation-based coherency protocol could provide relatively
low delay writeback (at the cost of modest hardware complexity, e.g., an
additional block state similar to "owned" in MOESI and a mechanism
to determine the home of the pinned memory, and of poor performance
under significant reuse, i.e., converting "don't write to pinned memory"
to "don't repeatedly write to the same pinned block"). Support for
limited scope update coherence can be useful for relatively static
producer-consumer relationships.

> And two M-mode-only privileged instructions:
>
> CACHE.PIN.I
> {opcode, funct3, funct7, MODE} = {$MISC-MEM, $REGION, 7'b1010000, 2'b11}
> Arrange for code to execute from as much of the chosen region as
> possible without traffic to main memory. Pinning a region is idempotent.

Why M-mode-only?

If one is supporting such ranged fetches, it seems that support for
memory-copy could be trivially provided.

TLB prefetching also seems worth considering.

In addition, the above does not seem to consider cache
hierarchies and cache sharing (even temporal sharing through
software context switches). While most of the uses of such
operations would have tighter control over thread allocation,
some uses might expect graceful/gradual decay.

Jacob Bachmeyer

Jun 25, 2018, 1:09:52 AM6/25/18
to Paul A. Clayton, RISC-V ISA Dev
Paul A. Clayton wrote:
> On 6/22/18, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> [snip]
>
>> ====Prefetch====
>>
>> All prefetch instructions ignore page faults and other access faults.
>> In general use, applications should use rd == x0 for prefetching,
>> although this is not required. If a fault occurs during a synchronous
>> prefetch (rd != x0), the operation must terminate and produce the
>> faulting address. A fault occurring during an asynchronous prefetch (rd
>> == x0) may cause the prefetching to stop or the implementation may
>> attempt to continue prefetching past the faulting location.
>>
>
> This seems to be hint-like semantics (even though it is only triggered
> on a fault).
>

Please elaborate, although asynchronous prefetches do seem rather
hint-like now that you mention it.

>> MEM.PF0 - MEM.PF3 ("prefetch, levels 0 - 3")
>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001000}
>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001001}
>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001010}
>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001011}
>> Load as much of the chosen region as possible into the data cache,
>> with varying levels of expected temporal access locality. The number in
>> the opcode is proportionate to the expected frequency of access to the
>> prefetched data: MEM.PF3 is for data that will be very heavily used.
>>
>
> "heavily used" seems an improper term with respect to temporal
> locality.

That part is intended to define which end of the scale is which in a
generic manner. The region instructions are intended to be independent
of actual cache structure, which is why address-space-extents are used
instead of cacheline addresses.

> A memory region can have the same number or frequency
> of accesses ("heaviness" can refer to "weight" or "density") but
> different use lifetimes.
>
> If this information is intended to communicate temporal locality
> (and not some benefit measure of caching), then prefetch once
> might be merged as the lowest temporal locality.
>
> (Utility, persistence, and criticality are different measures that
> software may wish to communicate to the memory system.)
>

Please elaborate on this. I would like to be sure that we are using the
same terms here.

>> MEM.PF.EXCL ("prefetch exclusive")
>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001100}
>> Load as much of the chosen region as possible into the data cache,
>> with the expectation that future stores will soon occur to this region.
>> In a cache-coherent system, any locks required for writing the affected
>> cachelines should be acquired.
>>
>
> It may be useful to make a distinction between prefetch for write
> where reads are not expected but the region is not guaranteed to
> be overwritten. An implementation might support general avoidance
> of read-for-ownership (e.g., finer-grained validity indicators) but
> still benefit from a write prefetch to establish ownership.
>

In other words, something in-between MEM.PF.EXCL (which prefetches the
current data in main memory) and MEM.REWRITE (which destroys the current
data in main memory)?

>> MEM.PF.ONCE ("prefetch once")
>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001101}
>> Prefetch as much of the region as possible, but expect the prefetched
>> data to be used at most once in any order.
>>
>
> "used once" may be defined at byte level or at cache block level.
>

It was intended to be at the level of "something", initially defined as
the width of the first access made to the prefetched region.

> If the use does not have considerable spatial locality at cache block
> granularity, use once at byte level would have a different intent
> than use once at a coarser granularity. With cache-block granular
> access, the cache can evict after the first access; with byte-granular
> access and lower spatio-temporal locality, another byte within a
> cache block may be accessed relatively long after the first access
> to the cache block (so software would not want the block to be
> marked for eviction after the first block access).
>

I now wonder if MEM.PF.ONCE and MEM.PF.PF[0123] might be most useful in
combination, where MEM.PF.ONCE specifies a region and the second
prefetch specifies a granularity within that region, although this would
make the overall interface less regular.

>> MEM.PF.STREAM ("prefetch stream")
>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001110}
>> Initiate streaming prefetch of the region, expecting the prefetched
>> data to be used at most once and in sequential order, while minimizing
>> cache pollution. This operation may activate a prefetch unit and
>> prefetch the region incrementally if rd is x0. Software is expected to
>> access the region sequentially, starting at the base address.
>>
>
> It may be useful to include a stride for stream prefetching.
>

Could the stride be inferred from the subsequent access pattern? If
words at X, X+24, and X+48 are subsequently read, skipping the
intermediate locations, the prefetcher could infer a stride of 24 and
simply remember (1) the last location actually accessed and (2) the next
expected access location. The reason to remember the last actual access
is to still meet the "minimize cache pollution" goal when a stride is
mispredicted.
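
(A toy software model of that inference, purely illustrative: remember
the last location actually accessed and the next expected location, and
relearn the stride whenever an access misses the expectation. State is
assumed to be set up when MEM.PF.STREAM executes, with last_access and
next_expected both equal to the base address and stride zero.)

    #include <stdint.h>

    struct stream_state {
        uintptr_t last_access;    /* (1) last location actually accessed */
        uintptr_t next_expected;  /* (2) next expected access location   */
        intptr_t  stride;         /* 0 until inferred                    */
    };

    static void observe_access(struct stream_state *s, uintptr_t addr)
    {
        if (addr != s->next_expected)                 /* mispredicted */
            s->stride = (intptr_t)(addr - s->last_access);  /* X, X+24 -> 24 */
        s->last_access   = addr;
        s->next_expected = addr + (uintptr_t)s->stride;
        /* a real prefetch unit would now fetch near next_expected,
         * stopping at the region's upper bound */
    }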

>> MEM.PF.TEXT ("prefetch program text")
>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001111}
>> Load as much of the chosen region as possible into the instruction cache.
>>
>
> With a unified Ln cache, would the instruction prefetching "overflow"
> into the Ln cache? (Similar questions apply to data.)
>

By "overflow" do you mean that once L1 cache is full, additional
cachelines would be prefetched into L2 only? I had expected to leave
that implementation-defined, since the proposal tries to avoid tying
itself to any particular cache structure (or even to actually using
caches at all, although I am not aware of any other useful implementations).

>> ====Cacheline pinning====
>>
>> ??? Issue for discussion: should a page fault while pinning cachelines
>> cause a trap to be taken or simply cause the operation to stop or fail?
>> Should CACHE.PIN use the same approach to TLB fills as MEM.REWRITE uses?
>> ??? Issue for discussion: what if another processor attempts to write
>> to an address in a cacheline pinned on this processor? [partially
>> addressed]
>>
>> CACHE.PIN
>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0010000}
>> Arrange for as much of the chosen region as possible to be accessible
>> with minimal delay and no traffic to main memory. Pinning a region is
>> idempotent and an implementation may pin a larger region than requested,
>> provided that an unpin operation with the same base and bound will also
>> unpin the larger region.
>>
>
> "as much as possible" and "minimal delay" interact in a more complex
> memory system. Would software prefer more of the region be cached
> (at some latency or bandwidth penalty) or just have the excess ignored?
>

The key requirement is "no traffic to main memory". Is the "minimal
delay" confusing in this context?

> (While these cache operations are presumably intended for more
> microarchitecture-specific tuning, some software may be developed
> quickly to work well on one microarchitecture with an expectation
> that the software would work decently for a general class of
> implementations.)
>
> Pinning also seems to be a specification of temporal locality:
> "perpetual" locality.
>

More like malloc(3)/free(3), but yes, pinning is "do not evict this
until I say so". The catch is that multi-tasking systems might not
actually be able to maintain that, so pins can be lost on
context-switch. For user mode, this should not be a significant
concern, since the pin can be redone on each iteration of an outer loop,
which is why pinning/unpinning is idempotent.

>> One possible implementation is to load as much of the chosen region as
>> possible into the data cache and keep it there until unpinned. Another
>> implementation is to configure a scratchpad RAM and map it over at least
>> the chosen region, preloading it with data from main memory.
>> Scratchpads may be processor-local, but writes to a scratchpad mapped
>> with CACHE.PIN must be visible to other nodes in a coherent system.
>> Implementations are expected to ensure that pinned cachelines will not
>> impede the efficacy of a cache. Implementations with fully-associative
>> caches may permit any number of pins, provided that at least one
>> cacheline remains available for normal use. Implementations with N-way
>> set associative caches may support pinning up to (N-1) ways within each
>> set, provided that at least one way in each set remains available for
>> normal use. Implementations with direct-mapped caches should not pin
>> cachelines, but may still use CACHE.PIN to configure an overlay
>> scratchpad, which may itself use storage shared with caches, such that
>> mapping the scratchpad decreases the size of the cache.
>>
>
> What about overlaid skewed associative caches (rf. "Concurrent Support of
> Multiple Page Sizes On a Skewed Associative TLB")? In such a design
> the capacity is not isolated by ways and enforcing a guarantee that any
> cache block could be allocated might be somewhat expensive. (The
> easiest method might be to provide an equal-latency side cache which
> might otherwise be used as a victim cache, prefetch buffer, stream
> cache, or provide other functionality.)
>

The concern here is that implementations where memory accesses *must* be
cached can avoid pinning so many cachelines that non-pinned memory
becomes inaccessible. There is no requirement that such a safety
interlock be provided, and implementations are permitted (but
discouraged) to allow cache pinning to be used as a footgun.

> It might also be noted that with overlaid skewed associativity, different
> block alignments would be practical with alignments associated with
> ways (like different page sizes in the Seznec paper).
>
> (Another consideration is that cache block sizes may be diverse and
> non-constant. E.g., an implementation that allowed half of L1 cache
> to be mapped as a scratchpad might use the extra tags to support
> twice as many smaller cache blocks (requiring only an extra bit per
> tag).)
>

This is a good reason to keep cacheline size out of the actual
instructions and was one of the motivating factors for the "region"
model, although the original assumption was that cacheline sizes would
only "vary" due to migrations in heterogeneous multiprocessor systems.
This shows that even uniprocessor systems can benefit from this
abstraction.
Breaking pins on remote write was added to address complaints that
expecting such memory to remain pinned effectively made
invalidation-based coherency protocols unusable. The behaviors
described are both "may" options precisely because, while such writes
must work and must maintain coherency, exactly how writes to cachelines
pinned elsewhere are handled is intended to be implementation-defined.
Writes to pinned memory by the same hart that pinned it are expected;
the problems occur when other harts write to memory pinned on "this" hart.

>> And two M-mode-only privileged instructions:
>>
>> CACHE.PIN.I
>> {opcode, funct3, funct7, MODE} = {$MISC-MEM, $REGION, 7'b1010000, 2'b11}
>> Arrange for code to execute from as much of the chosen region as
>> possible without traffic to main memory. Pinning a region is idempotent.
>>
>
> Why M-mode-only?
>

The I-cache pins are M-mode-only because that is the only mode where
context-switch can be guaranteed to not occur. These were added to
allow using the I-cache as temporary program memory on implementations
that normally execute from flash but cannot read from flash while a
flash write is in progress. The issue was seen on one of the HiFive
boards that was also an M-mode-only implementation.

> If one is supporting such ranged fetches, it seems that support for
> memory-copy could be trivially provided.
>

Maybe. There are scope-creep concerns and there are currently no ranged
writes in the proposal. (MEM.REWRITE is not itself a write.)

> TLB prefetching also seems worth considering.
>

Any suggestions?

> In addition, the above does not seem to consider cache
> hierarchies and cache sharing (even temporal sharing through
> software context switches). While most of the uses of such
> operations would have tighter control over thread allocation,
> some uses might expect graceful/gradual decay.
>

I am making an effort to keep these operations relatively abstract even
though that limits the amount of detail that can be specified.
Generally, embedded systems are expected to have that sort of tight
control and large systems (such as a RISC-V PC) are expected to use
ASID-partitioned caches (effectively an independent cache for each ASID)
for reasons of performance and security, since Spectre-like attacks
enable cache side-channels without the "sender's" cooperation.


-- Jacob

Paul Miranda

Jun 25, 2018, 11:19:30 AM6/25/18
to RISC-V ISA Dev, paaron...@gmail.com, jcb6...@gmail.com


On Monday, June 25, 2018 at 12:09:52 AM UTC-5, Jacob Bachmeyer wrote:
Paul A. Clayton wrote:
> It may be useful to make a distinction between prefetch for write
> where reads are not expected but the region is not guaranteed to
> be overwritten. An implementation might support general avoidance
> of read-for-ownership (e.g., finer-grained validity indicators) but
> still benefit from a write prefetch to establish ownership.
>  

In other words, something in-between MEM.PF.EXCL (which prefetches the
current data in main memory) and MEM.REWRITE (which destroys the current
data in main memory)?

Yes. I believe it assumes the ability to hold partially dirty data in a cacheline, or cachelines small enough to be fully written in the coming instructions, or a memory system that can merge partially dirty data from one cache with clean data from the next level down. Or something else I haven't thought of that isn't supported in simple protocols like ACE.
 
>> MEM.PF.STREAM ("prefetch stream")
>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001110}
>>   Initiate streaming prefetch of the region, expecting the prefetched
>> data to be used at most once and in sequential order, while minimizing
>> cache pollution.  This operation may activate a prefetch unit and
>> prefetch the region incrementally if rd is x0.  Software is expected to
>> access the region sequentially, starting at the base address.
>>    
>
> It may be useful to include a stride for stream prefetching.
>  

Could the stride be inferred from the subsequent access pattern?  If
words at X, X+24, and X+48 are subsequently read, skipping the
intermediate locations, the prefetcher could infer a stride of 24 and
simply remember (1) the last location actually accessed and (2) the next
expected access location.  The reason to remember the last actual access
is to still meet the "minimize cache pollution" goal when a stride is
mispredicted.

If hardware has to wait to see the access pattern then the utility of the MEM.PF is reduced, although I suppose there isn't a good way of indicating stride in this format?


>> And two M-mode-only privileged instructions: 
>
> Why M-mode-only?
>  

The I-cache pins are M-mode-only because that is the only mode where
context-switch can be guaranteed to not occur.  These were added to
allow using the I-cache as temporary program memory on implementations
that normally execute from flash but cannot read from flash while a
flash write is in progress.  The issue was seen on one of the HiFive
boards that was also an M-mode-only implementation.

I can see a use for S-mode pinning. I might only use it for ASID=0 (global) lines, but I can see a use for multiple ASIDs if the implementation can handle it efficiently.
Bottom-line, I'd like to see S-mode pinning, although it might not be useful in all systems.

 
> TLB prefetching also seems worth considering.
>  

Any suggestions?

Definitely seems useful... just another flavor of MEM.PF, I think, but with no worries about getting ownership, I think.
TLB pinning could also be useful, IMO.
 
Thanks for putting all of this together. I was thinking about trying to do it all with nonstandard CSR registers but having instructions should be broadly useful.


One thing still missing from RISC-V (unless I myself missed something along the way) is a streaming-store hint. Right now I believe the only way to avoid cache pollution is through PMA's but that's not a very fine-grained tool (if it exists at all) in all systems.

Jacob Bachmeyer

Jun 25, 2018, 7:22:31 PM6/25/18
to Paul Miranda, RISC-V ISA Dev, paaron...@gmail.com
Paul Miranda wrote:
> On Monday, June 25, 2018 at 12:09:52 AM UTC-5, Jacob Bachmeyer wrote:
>
> Paul A. Clayton wrote:
> > It may be useful to make a distinction between prefetch for write
> > where reads are not expected but the region is not guaranteed to
> > be overwritten. An implementation might support general avoidance
> > of read-for-ownership (e.g., finer-grained validity indicators) but
> > still benefit from a write prefetch to establish ownership.
> >
>
> In other words, something in-between MEM.PF.EXCL (which prefetches
> the
> current data in main memory) and MEM.REWRITE (which destroys the
> current
> data in main memory)?
>
>
> Yes. I believe it assumes the ability to hold partially dirty data in
> a cacheline, or cachelines small enough to be fully written in the
> coming instructions, or a memory system that can merge partially dirty
> data from one cache with clean data from the next level down. Or
> something else I haven't thought of that isn't supported in simple
> protocols like ACE.

For now, I will call it MEM.WRHINT and think about how to actually
define such an operation.

> >> MEM.PF.STREAM ("prefetch stream")
> >> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001110}
> >> Initiate streaming prefetch of the region, expecting the
> prefetched
> >> data to be used at most once and in sequential order, while
> minimizing
> >> cache pollution. This operation may activate a prefetch unit and
> >> prefetch the region incrementally if rd is x0. Software is
> expected to
> >> access the region sequentially, starting at the base address.
> >>
> >
> > It may be useful to include a stride for stream prefetching.
> >
>
> Could the stride be inferred from the subsequent access pattern? If
> words at X, X+24, and X+48 are subsequently read, skipping the
> intermediate locations, the prefetcher could infer a stride of 24 and
> simply remember (1) the last location actually accessed and (2)
> the next
> expected access location. The reason to remember the last actual
> access
> is to still meet the "mimimize cache pollution" goal when a stride is
> mispredicted.
>
>
> If hardware has to wait to see the access pattern then the utility of
> the MEM.PF is reduced, although I suppose there isn't a good way of
> indicating stride in this format?

We only have 2 source operands in the instruction, so there is no good
way to add an explicit stride parameter. However, for streaming
prefetch, that should be less of a concern: the hardware can load some
initial block into a prefetch buffer and use the inferred stride to make
more efficient use of later prefetches. In other words, initially
assume that streaming data is packed and revise that assumption as the
access pattern indicates.

> >> And two M-mode-only privileged instructions:
> >
> > Why M-mode-only?
>
> The I-cache pins are M-mode-only because that is the only mode where
> context-switch can be guaranteed to not occur. These were added to
> allow using the I-cache as temporary program memory on
> implementations
> that normally execute from flash but cannot read from flash while a
> flash write is in progress. The issue was seen on one of the HiFive
> boards that was also an M-mode-only implementation.
>
>
> I can see a use for S-mode pinning. I might only use it for ASID=0
> (global) lines, but I can see a use for multiple ASIDs if the
> implementation can handle it efficiently.
> Bottom-line, I'd like to see S-mode pinning, although it might not be
> useful in all systems.

I can see uses for U-mode pinning, but the problem goes back to the
original motivation for I-cache pins: holding some (small) amount of
code in the I-cache to handle a known period where the main program
store is not accessible. Interrupts must be disabled for this to work
and that means that I-cache pinning must be restricted to M-mode, since
no other mode can truly disable interrupts. Frequent use of MEM.PF.TEXT
can allow less-privileged modes to "quasi-pin" certain code but cannot
provide the same guarantee.

The difference stems from what happens when a cache pin is broken: for
a data cache pin, the relevant information will be reloaded when needed
and re-pinned on the next iteration of some loop; for an instruction
cache pin, an interrupt may result in a branch to program text that is
temporarily inaccessible until the (interrupted) pinned code completes,
which will not happen because an interrupt occurred, leading to a
temporal contradiction (interrupt handler cannot be fetched until pinned
code completes; pinned code cannot continue until interrupt is handled)
that deadlocks (reads block) or crashes (reads return garbage) the system.

> > TLB prefetching also seems worth considering.
>
> Any suggestions?
>
>
> Definitely seems useful... just another flavor of MEM.PF, I think, but
> with no worries about getting ownership, I think.

Perhaps MEM.PF.MAP?

> TLB pinning could also be useful, IMO.

Cache pins provide an otherwise-unavailable ability to use (part of) the
cache as a scratchpad. What new ability do we get from pinning TLB entries?

> Thanks for putting all of this together. I was thinking about trying
> to do it all with nonstandard CSR registers but having instructions
> should be broadly useful.

The goal is for this proposal to have enough broad community support to
be adopted as a standard extension and/or rolled into a future baseline.

> One thing still missing from RISC-V (unless I myself missed something
> along the way) is a streaming-store hint. Right now I believe the only
> way to avoid cache pollution is through PMA's but that's not a very
> fine-grained tool (if it exists at all) in all systems.

For the case of a packed streaming-store (every octet will be
overwritten), there is MEM.REWRITE, but that is also a "prefetch
constant" and allocates cachelines. Could a combination of
MEM.PF.STREAM and MEM.WRHINT address this generally? Or is a
MEM.SPARSEWRITE a better option? How to (conceptually) distinguish
MEM.SPARSEWRITE and MEM.WRHINT?

For that matter, is a general rule that "reads are prefetched while
writes are hinted" a good dividing line?

Should MEM.PF.STREAM be more of a modifier to another prefetch or hint
instead of its own prefetch instruction?


-- Jacob

Luke Kenneth Casson Leighton

Jun 25, 2018, 7:52:32 PM6/25/18
to Jacob Bachmeyer, Paul Miranda, RISC-V ISA Dev, Paul A. Clayton
On Tue, Jun 26, 2018 at 12:22 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

>> TLB pinning could also be useful, IMO.
>
>
> Cache pins provide an otherwise-unavailable ability to use (part of) the
> cache as a scratchpad.

oo now that's *very* interesting, particularly given that a GPU has
such a strong requirement to process relatively large amounts of data
(4x4 blocks of 32-bit pixels) *without* that going back to L2 and
definitely not to main memory before the work's completely done.

scratchpads are a bit of a pain as they need to be context-switched
(or an entire core hard-dedicated to one task). if L1 cache can
double-duty as a scratchpad that would be *great*, as all the things
that an L1 cache has to take care of for context-switching and so on
are already well-known and solved.

> What new ability do we get from pinning TLB entries?

working sets. reduced thrashing. batch processing.

(if there is anyone who believes that the scenario below is not
realistic please feel free to alter it and contribute an improvement
or alternative that is).

it should be fairly easy to envisage a case where a long-running
process that needs regular but short-duration assistance of a process
that requires some (but not a lot) of memory could have its
performance adversely affected by the short-duration task due to the
short-duration task pushing out significant numbers of TLB entries for the
long-running process.

network packets coming in on database servers easily fits that scenario.

if the short-duration task instead used a small subset of the TLB,
then despite the short-duration's task being a bit slower, it's quite
likely that the longer-duration task would be even worse-affected,
particularly if there are significant numbers of smaller TLB entries
(heavily-fragmented memory workloads) rather than fewer huge ones [for
whatever reason].

monero's mining algorithm in particular i understand to have been
*deliberately* designed to hit [really large amounts of] memory
particularly hard with random accesses, as a deliberate way to ensure
that people don't design custom ASICs for it.

l.

Paul Miranda

Jun 25, 2018, 9:15:58 PM6/25/18
to jcb6...@gmail.com, RISC-V ISA Dev, paaron...@gmail.com


On Mon, Jun 25, 2018 at 6:22 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
...

I can see uses for U-mode pinning, but the problem goes back to the original motivation for I-cache pins:  holding some (small) amount of code in the I-cache to handle a known period where the main program store is not accessible.  Interrupts must be disabled for this to work and that means that I-cache pinning must be restricted to M-mode, since no other mode can truly disable interrupts.  Frequent use of MEM.PF.TEXT can allow less-privileged modes to "quasi-pin" certain code but cannot provide the same guarantee.

The difference stems from what happens when a cache pin is broken:  for a data cache pin, the relevant information will be reloaded when needed and re-pinned on the next iteration of some loop; for an instruction cache pin, an interrupt may result in a branch to program text that is temporarily inaccessible until the (interrupted) pinned code completes, which will not happen because an interrupt occurred, leading to a temporal contradiction (interrupt handler cannot be fetched until pinned code completes; pinned code cannot continue until interrupt is handled) that deadlocks (reads block) or crashes (reads return garbage) the system.

    > TLB prefetching also seems worth considering.

    Any suggestions?


Definitely seems useful... just another flavor of MEM.PF, I think, but with no worries about getting ownership, I think.

Perhaps MEM.PF.MAP?

TLB pinning could also be useful, IMO.

Cache pins provide an otherwise-unavailable ability to use (part of) the cache as a scratchpad.  What new ability do we get from pinning TLB entries?

My desire for both of these is for providing reliably low latency interrupt handling for certain vectors despite the presence of caching and virtual memory. I want to pin I cache and I TLB (Data side would be useful too) to allow an S-mode handler to operate even if it isn't the most recently used code. I am assuming >N+1 associativity so that N pins could never lock out other threads or modes from caching completely, which I think would address your concern but I might not have understood it completely.


 
One thing still missing from RISC-V (unless I myself missed something along the way) is a streaming-store hint. Right now I believe the only way to avoid cache pollution is through PMAs, but that's not a very fine-grained tool (if it exists at all) in all systems.

For the case of a packed streaming-store (every octet will be overwritten), there is MEM.REWRITE, but that is also a "prefetch constant" and allocates cachelines.  Could a combination of MEM.PF.STREAM and MEM.WRHINT address this generally?  Or is a MEM.SPARSEWRITE a better option?  How to (conceptually) distinguish MEM.SPARSEWRITE and MEM.WRHINT?

For that matter, is a general rule that "reads are prefetched while writes are hinted" a good dividing line?

Should MEM.PF.STREAM be more of a modifier to another prefetch or hint instead of its own prefetch instruction?

MEM.WRHINT is similar to what I'm thinking... get ownership of a line but not data... However, I didn't even want to burn a tag entry for a streaming store. I think all of the hints talked about before hold the line in some way.
MEM.REWRITE is also similar, but is allowed to drop data and again gets ownership.
It may be that some systems couldn't accommodate what I'm thinking, although it's not really any different than when a noncoherent device wants to write to coherent memory.

 

-- Jacob

Paul A. Clayton

unread,
Jun 25, 2018, 10:36:23 PM6/25/18
to jcb6...@gmail.com, RISC-V ISA Dev
On 6/25/18, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> Paul A. Clayton wrote:
>> On 6/22/18, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
[snip for async prefetch fault stops or skips faulting addresses]

>> This seems to be hint-like semantics (even though it is only triggered
>> on a fault).
>
> Please elaborate, although asynchronous prefetches do seem rather
> hint-like now that you mention it.

By not faulting the asynchronous form is not strictly a directive (unless
the range exceeded capacity in such a way that either skipped addresses
would have been overwritten regardless of the skipping or the prefetch
would stop when capacity was reached).

[snip]
>> A memory region can have the same number or frequency
>> of accesses ("heaviness" can refer to "weight" or "density") but
>> different use lifetimes.
>>
>> If this information is intended to communicate temporal locality
>> (and not some benefit measure of caching), then prefetch once
>> might be merged as the lowest temporal locality.
>>
>> (Utility, persistence, and criticality are different measures that
>> software may wish to communicate to the memory system.)
>
> Please elaborate on this. I would like to be sure that we are using the
> same terms here.

Utility is roughly the number of accesses serviced/latency penalty
avoided due to the prefetch. Persistence refers to the useful lifetime
of the storage granule/prefetched region. Criticality refers to the
urgency, e.g., a prefetch operation might be ready early enough
that hardware can prefer bandwidth or energy efficiency over
latency without hurting performance.

[snip]
>> It may be useful to make a distinction between prefetch for write
>> where reads are not expected but the region is not guaranteed to
>> be overwritten. An implementation might support general avoidance
>> of read-for-ownership (e.g., finer-grained validity indicators) but
>> still benefit from a write prefetch to establish ownership.
>>
>
> In other words, something in-between MEM.PF.EXCL (which prefetches the
> current data in main memory) and MEM.REWRITE (which destroys the current
> data in main memory)?

Yes.

>>> MEM.PF.ONCE ("prefetch once")
>>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001101}
>>> Prefetch as much of the region as possible, but expect the prefetched
>>> data to be used at most once in any order.
>>>
>>
>> "used once" may be defined at byte level or at cache block level.
>>
>
> It was intended to be at the level of "something", initially defined as
> the width of the first access made to the prefetched region.

Defining the granularity of expiration of usefulness by the first
access seems somewhat complex. Tracking "has been used"
would seem to require a bit for the smallest possible granule,
which seems unlikely to be provided (given that caches rarely
track present or dirty at even 32-bit granularity).

[snip]
> I now wonder if MEM.PF.ONCE and MEM.PF.PF[0123] might be most useful in
> combination, where MEM.PF.ONCE specifies a region and the second
> prefetch specifies a granularity within that region, although this would
> make the overall interface less regular.

This could also be an argument for not limiting such instructions to
32-bit encodings.

[snip]
>> It may be useful to include a stride for stream prefetching.
>>
>
> Could the stride be inferred from the subsequent access pattern? If
> words at X, X+24, and X+48 are subsequently read, skipping the
> intermediate locations, the prefetcher could infer a stride of 24 and
> simply remember (1) the last location actually accessed and (2) the next
> expected access location. The reason to remember the last actual access
> is to still meet the "minimize cache pollution" goal when a stride is
> mispredicted.

As others noted, this would delay prefetching or waste bandwidth
by assuming unit stride.

[snip]
>> With a unified Ln cache, would the instruction prefetching "overflow"
>> into the Ln cache? (Similar questions apply to data.)
>>
>
> By "overflow" do you mean that once L1 cache is full, additional
> cachelines would be prefetched into L2 only? I had expected to leave
> that implementation-defined, since the proposal tries to avoid tying
> itself to any particular cache structure (or even to actually using
> caches at all, although I am not aware of any other useful
> implementations).

This concerns the hardware fulfilling the intent of the software.
If accesses to the prefetched region are "random", software might
prefer a large prefetch region to be fetched into L2 rather than
evicting most of L1.

[snip]
>> "as much as possible" and "minimal delay" interact in a more complex
>> memory system. Would software prefer more of the region be cached
>> (at some latency or bandwidth penalty) or just have the excess ignored?
>>
>
> The key requirement is "no traffic to main memory". Is the "minimal
> delay" confusing in this context?

Pinning to an off-chip L4 cache would avoid memory traffic and
provide substantial capacity (which may be what is desired) but
the latency (and bandwidth) would be worse than L1 latency (and
bandwidth).

>> (While these cache operations are presumably intended for more
>> microarchitecture-specific tuning, some software may be developed
>> quickly to work well on one microarchitecture with an expectation
>> that the software would work decently for a general class of
>> implementations.)
>>
>> Pinning also seems to be a specification of temporal locality:
>> "perpetual" locality.
>>
>
> More like malloc(3)/free(3), but yes, pinning is "do not evict this
> until I say so". The catch is that multi-tasking systems might not
> actually be able to maintain that, so pins can be lost on
> context-switch. For user mode, this should not be a significant
> concern, since the pin can be redone on each iteration of an outer loop,
> which is why pinning/unpinning is idempotent.

Having to repin even with only moderate frequency would remove
some of the performance advantage. (This could be optimized with
an approximate conservative filter; even a simple filter could
quickly convert repinning to a nop if the pinning remained entirely
intact. However, that adds significant hardware complexity.)

[snip]
> The concern here is that implementations where memory accesses *must* be
> cached can avoid pinning so many cachelines that non-pinned memory
> becomes inaccessible. There is no requirement that such a safety
> interlock be provided, and implementations are permitted (but
> discouraged) to allow cache pinning to be used as a footgun.

It seems that sometimes software would want to pin more cache
than is "safe". (By the way, presumably uncacheable memory is
still accessible. Memory for which no space is available for caching
could be treated as uncached/uncacheable memory.)

[snip]

> Breaking pins on remote write was added to address complaints that
> expecting such memory to remain pinned effectively made
> invalidation-based coherency protocols unusable. The behaviors
> described are both "may" options precisely because, while such writes
> must work and must maintain coherency, exactly how writes to cachelines
> pinned elsewhere are handled is intended to be implementation-defined.
> Writes to pinned memory by the same hart that pinned it are expected;
> the problems occur when other harts write to memory pinned on "this" hart.

(Actually the problem occurs when the pinned cache blocks are not
shared between the pinning hart and the writing hart.)

Implementation defined behavior requires a method of discovery. It
also seems desirable to have guidelines on families of implementation
to facilitate portability within a family.

[snip Icache pinning]
>> Why M-mode-only?
>>
>
> The I-cache pins are M-mode-only because that is the only mode where
> context-switch can be guaranteed to not occur. These were added to
> allow using the I-cache as temporary program memory on implementations
> that normally execute from flash but cannot read from flash while a
> flash write is in progress. The issue was seen on one of the HiFive
> boards that was also an M-mode-only implementation.

How is this different from data pinning?

>> If one is supporting such ranged fetches, it seems that support for
>> memory-copy could be trivially provided.
>>
>
> Maybe. There are scope-creep concerns and there are currently no ranged
> writes in the proposal. (MEM.REWRITE is not itself a write.)

On the other hand, recognizing that features are closely related
in implementation seems important.

>> TLB prefetching also seems worth considering.
>>
>
> Any suggestions?

Nothing specific. Locking TLB entries is not uncommonly
supported (original MIPS TLB could set a maximum on
the random number generated for replacement to lock
entries; Itanium defined "Translation Registers" which
could be taken from the translation cache). PTE and
paging structure prefetching is a natural extension of
data prefetching.

Jacob Bachmeyer

unread,
Jun 25, 2018, 10:42:49 PM6/25/18
to Luke Kenneth Casson Leighton, Paul Miranda, RISC-V ISA Dev, Paul A. Clayton
Luke Kenneth Casson Leighton wrote:
> On Tue, Jun 26, 2018 at 12:22 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
>>> TLB pinning could also be useful, IMO.
>>>
>> Cache pins provide an otherwise-unavailable ability to use (part of) the
>> cache as a scratchpad.
>>
>
> oo now that's *very* interesting, particularly given that a GPU has
> such a strong requirement to process relatively large amounts of data
> (4x4 blocks of 32-bit pixels) *without* that going back to L2 and
> definitely not to main memory before the work's completely done.
>
> scratchpads are a bit of a pain as they need to be context-switched
> (or an entire core hard-dedicated to one task). if L1 cache can
> double-duty as a scratchpad that would be *great*, as all the things
> that an L1 cache has to take care of for context-switching and so on
> are already well-known and solved.
>

Note that cache pins are broken upon context switch unless the cache is
ASID-partitioned -- each task must appear to have the entire cache
available. User tasks that use D-cache pins are expected to re-pin
their working buffer frequently. Of course, if the working buffer is
"marching" through the address space, the problem solves itself as the
scratchpad advances (unpin old, pin new) through the address space.
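
To make that concrete, a minimal sketch of the "marching scratchpad"
pattern in C -- cache_pin()/cache_unpin() are hypothetical wrappers for
the proposed CACHE.PIN/CACHE.UNPIN region operations, stubbed here with
the no-op behavior the proposal permits (returning the base address);
real hardware would emit the MISC-MEM/REGION encodings and return the
first address past the affected region:

#include <stdint.h>
#include <stddef.h>

/* Hypothetical wrappers; the stubs use the permitted no-op behavior. */
static uintptr_t cache_pin(uintptr_t base, uintptr_t bound)   { (void)bound; return base; }
static uintptr_t cache_unpin(uintptr_t base, uintptr_t bound) { (void)bound; return base; }

/* March a pinned scratchpad window through a large buffer:  unpin the
 * old tile, pin the new one, then work on the tile entirely in-cache.
 * Bounds are inclusive, matching the proposed rs2 semantics. */
void process_tiles(uint8_t *buf, size_t len, size_t tile)
{
    if (tile == 0 || len == 0)
        return;
    for (size_t off = 0; off < len; off += tile) {
        size_t n = (len - off < tile) ? len - off : tile;
        uintptr_t lo = (uintptr_t)(buf + off);

        if (off >= tile)
            cache_unpin(lo - tile, lo - 1);   /* unpin the previous tile */
        cache_pin(lo, lo + n - 1);            /* pin (or re-pin) this tile */

        /* ... process buf[off .. off+n-1] using the pinned lines ... */
    }
    cache_unpin((uintptr_t)buf, (uintptr_t)buf + len - 1);
}

Since pinning and unpinning are idempotent, a user task can also simply
re-issue cache_pin() at the top of the loop after it may have lost its
pins to a context switch.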

>> What new ability do we get from pinning TLB entries?
>>
>
> working sets. reduced thrashing. batch processing.
>
> (if there is anyone who believes that the scenario below is not
> realistic please feel free to alter it and contribute an improvement
> or alternative that is).
>
> it should be fairly easy to envisage a case where a long-running
> process that needs regular but short-duration assistance of a process
> that requires some (but not a lot) of memory could have its
> performance adversely affected by the short-duration task due to the
> short-duration task pushing out significant numbers of TLBs for the
> long-running process.
>
> network packets coming in on database servers easily fits that scenario.
>
> if the short-duration task instead used a small subset of the TLB,
> then despite the short-duration's task being a bit slower, it's quite
> likely that the longer-duration task would be even worse-affected,
> particularly if there are significant numbers of smaller TLB entries
> (heavily-fragmented memory workloads) rather than fewer huge ones [for
> whatever reason].
>

Either the TLB is also partitioned by ASID, in which case this does not
matter because the two tasks will have effectively independent TLBs, or
TLB pinning will not help because pins are broken on context switch to
prevent one task from denying service to another by tying up a
significant part of the cache or TLB.

Failing to break pins on context switch also creates some nasty side
channels, so pins must be limited by ASID. ASID-partitioning is the
only way to close cache side-channels generally. Spectre-like attacks
can cause even innocent processes to "send" on a cache side-channel.
The only solution is to close the side-channels, and ASID-partitioning
has the added advantage of being able to pack more cache into the same
area by interleaving the partitions: since only one partition will be
active at a time, the power dissipation is that of a single partition
but spread over the area of multiple partitions, reducing overall
thermal power density.

> monero's mining algorithm in particular i understand to have been
> *deliberately* designed to hit [really large amounts of] memory
> particularly hard with random accesses, as a deliberate way to ensure
> that people don't design custom ASICs for it.
>

Does monero use scrypt? That was designed for hashing passwords to
resist GPU-based cracking.


-- Jacob

Jacob Bachmeyer

unread,
Jun 25, 2018, 11:11:19 PM6/25/18
to Paul Miranda, RISC-V ISA Dev, paaron...@gmail.com
> Definitely seems useful... just another flavor of MEM.PF, I think,
> but with no worries about getting ownership, I think.
>
>
> Perhaps MEM.PF.MAP?
>
> TLB pinning could also be useful, IMO.
>
>
> Cache pins provide an otherwise-unavailable ability to use (part
> of) the cache as a scratchpad. What new ability do we get from
> pinning TLB entries?
>
> My desire for both of these is for providing reliably low latency
> interrupt handling for certain vectors despite the presence of caching
> and virtual memory. I want to pin I cache and I TLB (Data side would
> be useful too) to allow an S-mode handler to operate even if it isn't
> the most recently used code. I am assuming >N+1 associativity so that
> N pins could never lock out other threads or modes from caching
> completely, which I think would address your concern but I might not
> have understood it completely.

I think that pinning a cacheline needs to implicitly pin the associated
TLB entry, since invalidation of the latter must also writeback and
invalidate the former. Is there any use for pinning TLB entries
separately from cachelines?

The associativity assumption should not be a problem: hardware is
expected to reject requests to pin so many cachelines as to preclude
other data from being cached at all.

A better option here would probably be to partition the cache, providing
some number of cachelines exclusively for S-mode. With a partitioned
I-cache, the supervisor could simply include MEM.PF.TEXT in its
trap-exit code to prefetch the trap entry code back into the supervisor
I-cache. The supervisor could also use MEM.PF.EXCL to prefetch the
saved context area into the data cache, but that would be redundant as
restoring the user context will ensure that the saved context area is in
the supervisor D-cache. The supervisor caches simply hold their
contents while executing in user mode (and get blown away anyway on a
hypervisor VM switch unless appropriately partitioned) so MEM.PF.TEXT
will ensure that the supervisor trap entry is cached when the next trap
is taken. (The S-mode prefetch can continue and complete after user
execution has resumed.)
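
Concretely, the trap-exit prefetch could look something like the sketch
below; mem_pf_text() is a hypothetical wrapper for the proposed
MEM.PF.TEXT (stubbed with the permitted no-op behavior), and the linker
symbols are assumptions of the sketch:

#include <stdint.h>

/* Hypothetical wrapper; a real implementation would emit the
 * MISC-MEM/REGION encoding with rd = x0 so the prefetch may proceed
 * asynchronously and complete after SRET. */
static uintptr_t mem_pf_text(uintptr_t base, uintptr_t bound)
{
    (void)bound;
    return base;
}

/* Assumed linker symbols bracketing the supervisor trap-entry code. */
extern char trap_entry_begin[], trap_entry_end[];

/* Called at the end of supervisor trap handling, just before SRET, to
 * warm the (partitioned) S-mode I-cache for the next trap entry. */
void trap_exit_prefetch(void)
{
    mem_pf_text((uintptr_t)trap_entry_begin,
                (uintptr_t)trap_entry_end - 1);   /* inclusive bound */
}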

I agree that having the supervisor trap entry always (or nearly always)
cached could be useful. Again, partitioning closes side-channels and is
likely to be "free" in modern high-density processes where power
dissipation is a more-constraining limit than geometry. Of course,
very-low-power embedded systems probably will not have those
power-dissipation constraints and would actually have an area cost for
additional caches.

The key difference between M-mode-only I-cache pins and general I-cache
pins is that the monitor can pin a bit of code, disable interrupts, and
then do something with that pinned code that temporarily disables main
memory, like rewriting flash on the aforementioned HiFive board.
Less-privileged modes cannot do this, since more-privileged interrupts
are always enabled. Effectively, there are two different types of
I-cache pins possible, one which is guaranteed but only available in
M-mode, and one which can be broken by a more-privileged context switch,
but can be made available all the way down to U-mode.
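
For the flash-rewrite case, the M-mode sequence would be roughly as
follows.  This is only a sketch: icache_pin()/icache_unpin() stand in
for whatever mnemonic the I-cache pin ends up with (stubbed with the
permitted no-op behavior), and the flash routines are placeholders for
the board's actual controller interface:

#include <stdint.h>

/* Hypothetical wrappers for the M-mode I-cache pin/unpin operations. */
static uintptr_t icache_pin(uintptr_t base, uintptr_t bound)   { (void)bound; return base; }
static uintptr_t icache_unpin(uintptr_t base, uintptr_t bound) { (void)bound; return base; }

/* Placeholder flash-controller routines; they (and this function) must
 * lie within the pinned range, since flash is unreadable while the
 * write is in progress. */
extern void flash_program_page(uint32_t page, const uint8_t *data);
extern int  flash_busy(void);

/* Assumed linker symbols bracketing the code that must stay cached. */
extern char flash_writer_begin[], flash_writer_end[];

void mmode_rewrite_flash(uint32_t page, const uint8_t *data)
{
    /* Pin the writer code while flash is still readable. */
    icache_pin((uintptr_t)flash_writer_begin, (uintptr_t)flash_writer_end - 1);

    /* Disable interrupts so no fetch can target unpinned (flash) code. */
    __asm__ volatile ("csrci mstatus, 0x8");   /* clear mstatus.MIE */

    flash_program_page(page, data);
    while (flash_busy())
        ;                                      /* spin inside pinned code */

    /* Flash is readable again:  restore interrupts and drop the pin. */
    __asm__ volatile ("csrsi mstatus, 0x8");   /* set mstatus.MIE */
    icache_unpin((uintptr_t)flash_writer_begin, (uintptr_t)flash_writer_end - 1);
}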


-- Jacob

Jacob Bachmeyer

unread,
Jun 25, 2018, 11:52:01 PM6/25/18
to Paul Miranda, RISC-V ISA Dev, paaron...@gmail.com
MEM.PF.STREAM is expected (on the high end) to activate a separate
prefetch unit with its own prefetch buffer; a "streaming write hint"
would act similarly but use a write-combining buffer, possibly with
associated logic to merge the combined writes into existing data, which
may require allocation of cachelines in an outer cache to perform the merge.

This leads to questions of what MEM.WRHINT should actually do. Perhaps
the simplest option is to force invalidation of any copies held by other
harts in the system, writing back a dirty copy if present and leaving
the region "up for grabs" and currently uncached? That followed by
allocating a cacheline in an "owned, invalid-data" state? Would a
separate write-combining cache be more useful even without write-hints?

I envision that as a third L1 cache: L1I, L1D, L1C; the "C-cache" is a
write-only write-combining element. Each byte in the C-cache has its
own validity flag and the C-cache performs an intervention (merging its
valid bytes with data brought in from L2) when the L1D cache requests a
line from L2 which the C-cache contains. The same line cannot be
present in both L1D and L1C; writes to cachelines present in L1D hit
there instead of arriving at L1C. (This can be performed in parallel if
the C-cache keeps one "new item" line available and simply drops the
"new item" if L1D reports a hit.) Reads from lines present in L1C force
allocation of a cacheline in L1D and the transfer-and-merge of that line
from L2 and L1C to L1D. Completely valid lines can be transferred to
the L1D cache without accessing L2, but then also need to be "written
back" to L2, which L1D should be able to handle on its own since a
C-cache intervention must mark the data "dirty" upon arrival in L1D.
Completely valid lines can be written from the C-cache to L2 at any
time. Evicting a line from the C-cache is a bit more complex, since the
line must be brought into L2 and the valid bytes from L1C merged.
Perhaps allocating a C-cache line could initiate an L2 prefetch? But
this is wasteful if an entire L2 cacheline is overwritten; in that case
the prefetch could have been elided. The C-cache does not intervene on
L1I-cache requests and is flushed to L2 upon execution of FENCE.I.
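
To make the per-byte validity idea concrete, here is a toy software
model of a single C-cache line; the 64-byte line size and the field
layout are assumptions of the sketch, not part of the proposal:

#include <stdint.h>

#define LINE_BYTES 64   /* assumed cacheline size for this sketch */

/* One C-cache line:  data plus a per-byte validity mask. */
struct ccache_line {
    uint8_t  data[LINE_BYTES];
    uint64_t valid;     /* bit i set => data[i] holds a combined write */
};

/* A store into the C-cache deposits bytes and marks them valid.
 * (Caller guarantees off + n <= LINE_BYTES.) */
void ccache_store(struct ccache_line *l, unsigned off,
                  const uint8_t *src, unsigned n)
{
    for (unsigned i = 0; i < n; i++) {
        l->data[off + i] = src[i];
        l->valid |= UINT64_C(1) << (off + i);
    }
}

/* Intervention on an L1D fill from L2:  merge the C-cache's valid bytes
 * over the line fetched from L2; L1D then holds the result dirty. */
void ccache_intervene(const struct ccache_line *l,
                      uint8_t line_from_l2[LINE_BYTES])
{
    for (unsigned i = 0; i < LINE_BYTES; i++)
        if (l->valid & (UINT64_C(1) << i))
            line_from_l2[i] = l->data[i];
}

/* A completely valid line can be written to L2 without reading old data. */
int ccache_line_complete(const struct ccache_line *l)
{
    return l->valid == UINT64_MAX;
}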

And that leads to a "bikeshed" question: what is the best two-letter
abbreviation for "hint"? I currently lean towards "MEM.NT." as a prefix
for write hints.

> MEM.REWRITE is also similar, but is allowed to drop data and again
> gets ownership.

MEM.REWRITE does a bit more than that: cachelines are allocated, filled
with a constant (either all-zeros or all-ones, depending on hardware),
and assigned ownership for that region with all remote copies (even if
dirty!) simply invalidated. (Writeback is permitted but useless, since
the remote dirty cacheline will be overwritten locally.)

> It may be that some systems couldn't accommodate what I'm thinking,
> although it's not really any different than when a noncoherent device
> wants to write to coherent memory.

If there are systems that cannot accommodate what you are thinking, then
I have misunderstood. Please explain.


-- Jacob

Jacob Bachmeyer

unread,
Jun 26, 2018, 1:26:10 AM6/26/18
to Paul A. Clayton, RISC-V ISA Dev
Paul A. Clayton wrote:
> On 6/25/18, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
>> Paul A. Clayton wrote:
>>
>>> On 6/22/18, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>>>
> [snip for async prefetch fault stops or skips faulting addresses]
>
>>> This seems to be hint-like semantics (even though it is only triggered
>>> on a fault).
>>>
>> Please elaborate, although asynchronous prefetches do seem rather
>> hint-like now that you mention it.
>>
>
> By not faulting the asynchronous form is not strictly a directive (unless
> the range exceeded capacity in such a way that either skipped addresses
> would have been overwritten regardless of the skipping or the prefetch
> would stop when capacity was reached).
>

Prefetches always stop when capacity is reached; synchronous prefetches
report where they stopped.

Synchronous prefetches do not fault either -- prefetches are allowed to
run "off the end" of valid memory and into unmapped space; if this were
not so, "LOAD x0" would be a prefetch instruction, but it is not.
Permitting asynchronous prefetch to continue past a fault accommodates
pages being swapped out and is a significant win if the swapped-out page
is not actually accessed. This is also an implementation option --
implementations are permitted equally well to simply stop an
asynchronous prefetch when any fault is encountered. Ideally, an
implementation could do both -- stop if a permission check fails, but
advance prefetch to the next page in the region if a page is not present.
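
The intended software pattern for a synchronous prefetch is therefore to
resume from the produced address rather than to handle a trap.  A
minimal sketch, with mem_pf() as a hypothetical wrapper for one of the
MEM.PF.* instructions (stubbed with the permitted no-op behavior of
returning the base address):

#include <stdint.h>

/* Hypothetical wrapper; real hardware returns the first address after
 * the highest address it actually prefetched. */
static uintptr_t mem_pf(uintptr_t base, uintptr_t bound)
{
    (void)bound;
    return base;
}

/* Prefetch a region in blocks:  consume what was prefetched, then
 * resume from where the previous prefetch stopped. */
void prefetch_in_blocks(uintptr_t base, uintptr_t bound)
{
    uintptr_t next = base;
    while (next <= bound) {
        uintptr_t stopped_at = mem_pf(next, bound);
        if (stopped_at <= next)
            break;                /* no progress (no-op implementation) */
        /* ... consume [next, stopped_at) before continuing ... */
        next = stopped_at;
    }
}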

> [snip]
>
>>> A memory region can have the same number or frequency
>>> of accesses ("heaviness" can refer to "weight" or "density") but
>>> different use lifetimes.
>>>
>>> If this information is intended to communicate temporal locality
>>> (and not some benefit measure of caching), then prefetch once
>>> might be merged as the lowest temporal locality.
>>>
>>> (Utility, persistence, and criticality are different measures that
>>> software may wish to communicate to the memory system.)
>>>
>> Please elaborate on this. I would like to be sure that we are using the
>> same terms here.
>>
>
> Utility is roughly the number of accesses serviced/latency penalty
> avoided due to the prefetch. Persistence refers to the useful lifetime
> of the storage granule/prefetched region. Criticality refers to the
> urgency, e.g., a prefetch operation might be ready early enough
> that hardware can prefer bandwidth or energy efficiency over
> latency without hurting performance.
>

Then the prefetch levels are intended to indicate relative utility.
Persistence is more limited: zero is no prefetch at all, once is the
MEM.PF.ONCE and MEM.PF.STREAM instructions, many times is
MEM.PF.PF[0123]. Criticality is not well-represented in this proposal,
aside from a slight implication that streaming prefetch prefers
bandwidth over latency.

How well are utility and criticality typically correlated? For that
matter, how many levels of each should be distinguished?

I almost want to describe criticality as "yesterday", "now", and
"later". But "yesterday" can be represented by an actual access, so
that leaves "now" and "later" as prefetch options.

> [snip]
>
>>> It may be useful to make a distinction between prefetch for write
>>> where reads are not expected but the region is not guaranteed to
>>> be overwritten. An implementation might support general avoidance
>>> of read-for-ownership (e.g., finer-grained validity indicators) but
>>> still benefit from a write prefetch to establish ownership.
>>>
>>>
>> In other words, something in-between MEM.PF.EXCL (which prefetches the
>> current data in main memory) and MEM.REWRITE (which destroys the current
>> data in main memory)?
>>
>
> Yes.
>

[This has branched off into "write-combining hints".]

>>>> MEM.PF.ONCE ("prefetch once")
>>>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001101}
>>>> Prefetch as much of the region as possible, but expect the prefetched
>>>> data to be used at most once in any order.
>>>>
>>> "used once" may be defined at byte level or at cache block level.
>>>
>> It was intended to be at the level of "something", initially defined as
>> the width of the first access made to the prefetched region.
>>
>
> Defining the granularity of expiration of usefulness by the first
> access seems somewhat complex. Tracking "has been used"
> would seem to require a bit for the smallest possible granule,
> which seems unlikely to be provided (given that caches rarely
> track present or dirty at even 32-bit granularity).
>

Well, the smallest granule that the hardware cares about, which may be
an entire cacheline in practice, but application code does not know the
actual cacheline size. As long as accesses exhibit both spatial and
temporal locality, there is a good chance that marking the accessed
lines as "first to evict" but not actually dropping them until new
cachelines need to be loaded will work in practice.

> [snip]
>
>> I now wonder if MEM.PF.ONCE and MEM.PF.PF[0123] might be most useful in
>> combination, where MEM.PF.ONCE specifies a region and the second
>> prefetch specifies a granularity within that region, although this would
>> make the overall interface less regular.
>>
>
> This could also be an argument for not limiting such instructions to
> 32-bit encodings.
>

It is an argument, but I do not think that fine-grained prefetch has
enough demand to be the first to break beyond 32-bit instructions.

> [snip]
>
>>> It may be useful to include a stride for stream prefetching.
>>>
>> Could the stride be inferred from the subsequent access pattern? If
>> words at X, X+24, and X+48 are subsequently read, skipping the
>> intermediate locations, the prefetcher could infer a stride of 24 and
>> simply remember (1) the last location actually accessed and (2) the next
>> expected access location. The reason to remember the last actual access
>> is to still meet the "minimize cache pollution" goal when a stride is
>> mispredicted.
>>
>
> As others noted, this would delay prefetching or waste bandwidth
> by assuming unit stride.
>

You are correct. The idea is to assume unit stride for the first
"prefetch group" and then adapt to the observed actual access pattern.
Bandwidth is only wasted if the stride exceeds a cacheline however,
since most implementations are expected to prefetch in units of
cachelines and the region is contiguous.

> [snip]
>
>>> With a unified Ln cache, would the instruction prefetching "overflow"
>>> into the Ln cache? (Similar questions apply to data.)
>>>
>> By "overflow" do you mean that once L1 cache is full, additional
>> cachelines would be prefetched into L2 only? I had expected to leave
>> that implementation-defined, since the proposal tries to avoid tying
>> itself to any particular cache structure (or even to actually using
>> caches at all, although I am not aware of any other useful
>> implementations).
>>
>
> This concerns the hardware fulfilling the intent of the software.
> If accesses to the prefetched region are "random", software might
> prefer a large prefetch region to be fetched into L2 rather than
> evicting most of L1.
>

Is there a useful way for software to indicate this intent or should
hardware simply recognize that larger prefetch requests should target
L2? Perhaps lower prefetch levels should prefetch only into outer
caches (a current implementation option)?

> [snip]
>
>>> "as much as possible" and "minimal delay" interact in a more complex
>>> memory system. Would software prefer more of the region be cached
>>> (at some latency or bandwidth penalty) or just have the excess ignored?
>>>
>> The key requirement is "no traffic to main memory". Is the "minimal
>> delay" confusing in this context?
>>
>
> Pinning to an off-chip L4 cache would avoid memory traffic and
> provide substantial capacity (which may be what is desired) but
> the latency (and bandwidth) would be worse than L1 latency (and
> bandwidth).
>

For a region that fits in L4 but no farther in, "minimal delay" would be
the latency of L4, since that is the only place the entire region can
fit. I think a clarification that pinned cachelines must remain cached,
but can be moved within the cache subsystem may be needed. Do you agree?

>>> (While these cache operations are presumably intended for more
>>> microarchitecture-specific tuning, some software may be developed
>>> quickly to work well on one microarchitecture with an expectation
>>> that the software would work decently for a general class of
>>> implementations.)
>>>
>>> Pinning also seems to be a specification of temporal locality:
>>> "perpetual" locality.
>>>
>> More like malloc(3)/free(3), but yes, pinning is "do not evict this
>> until I say so". The catch is that multi-tasking systems might not
>> actually be able to maintain that, so pins can be lost on
>> context-switch. For user mode, this should not be a significant
>> concern, since the pin can be redone on each iteration of an outer loop,
>> which is why pinning/unpinning is idempotent.
>>
>
> Having to repin even with only moderate frequency would remove
> some of the performance advantage. (This could be optimized with
> an approximate conservative filter; even a simple filter could
> quickly convert repinning to a nop if the pinning remained entirely
> intact. However, that adds significant hardware complexity.)
>

Hardware knows the last region pinned (2 XLEN-bit words or 2 XLEN-bit
words per cache ASID-partition) and knows if any pins have been broken
(one bit or one bit per cache ASID-partition; set it when CACHE.PIN is
executed, clear it when any pins are broken). Repinning the same region
would be trivial to recognize. Upon resuming the task that lost its
pins, some cachelines may be loaded normally before the region is
repinned. Pinning simply updates valid cachelines to "valid, pinned"
and loads cachelines not previously present. Remote invalidation can
"poke holes" in a pinned region, but repinning should still work
normally and the "pins intact" bit can be set again.

> [snip]
>
>> The concern here is that implementations where memory accesses *must* be
>> cached can avoid pinning so many cachelines that non-pinned memory
>> becomes inaccessible. There is no requirement that such a safety
>> interlock be provided, and implementations are permitted (but
>> discouraged) to allow cache pinning to be used as a footgun.
>>
>
> It seems that sometimes software would want to pin more cache
> than is "safe". (By the way, presumably uncacheable memory is
> still accessible. Memory for which no space is available for caching
> could be treated as uncached/uncacheable memory.)
>

Perhaps I am mistaken, but I understand that some current processors
cannot do that -- even "uncacheable" accesses must go through the cache,
but are simply immediately invalidated or written back.

Software could also want to pin more cache than exists, but that is
obviously not possible. Setting the limit for pinned cachelines
slightly lower than its physical hard limit is an implementation option
and can allow avoiding that entire can of worms.

> [snip]
>
>> Breaking pins on remote write was added to address complaints that
>> expecting such memory to remain pinned effectively made
>> invalidation-based coherency protocols unusable. The behaviors
>> described are both "may" options precisely because, while such writes
>> must work and must maintain coherency, exactly how writes to cachelines
>> pinned elsewhere are handled is intended to be implementation-defined.
>> Writes to pinned memory by the same hart that pinned it are expected;
>> the problems occur when other harts write to memory pinned on "this" hart.
>>
>
> (Actually the problem occurs when the pinned cache blocks are not
> shared between the pinning hart and the writing hart.)
>
> Implementation defined behavior requires a method of discovery. It
> also seems desirable to have guidelines on families of implementation
> to facilitate portability within a family.
>

The choice of implementation-defined behavior is software-invisible:
all valid options maintain coherency and user cache pins can be dropped
without warning anyway (such as by swapping out pages, although user
programs should use mlock(2) on anything pinned). Those behaviors are
described as an existence proof that cache pins are implementable.

> [snip Icache pinning]
>
>>> Why M-mode-only?
>>>
>> The I-cache pins are M-mode-only because that is the only mode where
>> context-switch can be guaranteed to not occur. These were added to
>> allow using the I-cache as temporary program memory on implementations
>> that normally execute from flash but cannot read from flash while a
>> flash write is in progress. The issue was seen on one of the HiFive
>> boards that was also an M-mode-only implementation.
>>
>
> How is this different from data pinning?
>

[This is branched off into "S-mode I-cache pins".]

>>> If one is supporting such ranged fetches, it seems that support for
>>> memory-copy could be trivially provided.
>>>
>> Maybe. There are scope-creep concerns and there are currently no ranged
>> writes in the proposal. (MEM.REWRITE is not itself a write.)
>>
>
> On the other hand, recognizing that features are closely related
> in implementation seems important.
>

Yes, but I am uncertain how a memory-copy opcode fits in here.
MEM.REWRITE is intended for optimizing memcpy(3) and memset(3), however.
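
For example, a memset-style fill could be written as below; mem_rewrite()
is a hypothetical wrapper for the proposed MEM.REWRITE, stubbed with the
permitted no-op behavior.  Because the instruction itself flushes or
prefetches any partial first/last cachelines, storing every byte
afterward is safe whether or not the hint was honored:

#include <stdint.h>
#include <stddef.h>

/* Hypothetical wrapper; on hardware with caches, MEM.REWRITE allocates
 * the lines constant-filled and owned, so the stores below never cause
 * reads from main memory. */
static uintptr_t mem_rewrite(uintptr_t base, uintptr_t bound)
{
    (void)bound;
    return base;
}

void fill_region(uint8_t *dst, size_t len, uint8_t value)
{
    if (len == 0)
        return;
    mem_rewrite((uintptr_t)dst, (uintptr_t)dst + len - 1);  /* inclusive */
    for (size_t i = 0; i < len; i++)
        dst[i] = value;
}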

>>> TLB prefetching also seems worth considering.
>>>
>> Any suggestions?
>>
>
> Nothing specific. Locking TLB entries is not uncommonly
> supported (original MIPS TLB could set a maximum on
> the random number generated for replacement to lock
> entries; Itanium defined "Translation Registers" which
> could be taken from the translation cache). PTE and
> paging structure prefetching is a natural extension of
> data prefetching.
>

[This also branched into "TLB pinning".]

Is there a use for TLB prefetch/pinning separate from data
prefetch/pinning? Or is this purely a matter of reach, since TLBs can
map much larger regions than caches can store? Should prefetch
instructions be able to (asynchronously) continue TLB prefetching of the
region after reaching cache capacity?

>>> In addition, the above does not seem to consider cache
>>> hierarchies and cache sharing (even temporal sharing through
>>> software context switches). While most of the uses of such
>>> operations would have tighter control over thread allocation,
>>> some uses might expect graceful/gradual decay.
>>>
>> I am making an effort to keep these operations relatively abstract even
>> though that limits the amount of detail that can be specified.
>> Generally, embedded systems are expected to have that sort of tight
>> control and large systems (such as a RISC-V PC) are expected to use
>> ASID-partitioned caches (effectively an independent cache for each ASID)
>> for reasons of performance and security, since Spectre-like attacks
>> enable cache side-channels without the "sender's" cooperation.
>>
[Was a response intended here? The message ended with this quote block.]


-- Jacob

Luke Kenneth Casson Leighton

unread,
Jun 26, 2018, 3:40:30 AM6/26/18
to Jacob Bachmeyer, Paul Miranda, RISC-V ISA Dev, Paul A. Clayton
On Tue, Jun 26, 2018 at 3:42 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

>> if the short-duration task instead used a small subset of the TLB,
>> then despite the short-duration's task being a bit slower, it's quite
>> likely that the longer-duration task would be even worse-affected,
>> particularly if there are significant numbers of smaller TLB entries
>> (heavily-fragmented memory workloads) rather than fewer huge ones [for
>> whatever reason].
>>
>
>
> Either the TLB is also partitioned by ASID, in which case this does not
> matter because the two tasks will have effectively independent TLBs, or TLB
> pinning will not help because pins are broken on context switch to prevent
> one task from denying service to another by tying up a significant part of
> the cache or TLB.

ok so it's not a good idea to attempt, you need to switch the entire
TLB out anyway. i was under the impression that even with
different... argh what's that prefix for TLBs that makes them
separate... tags is it called? i was under the impression that even
with different "tags" it might be possible to leave some entries
untouched. evidently not.

> Failing to break pins on context switch also creates some nasty side
> channels, so pins must be limited by ASID. ASID-partitioning is the only
> way to close cache side-channels generally. Spectre-like attacks can cause
> even innocent processes to "send" on a cache side-channel. The only
> solution is to close the side-channels, and ASID-partitioning has the added
> advantage of being able to pack more cache into the same area by
> interleaving the partitions: since only one partition will be active at a
> time, the power dissipation is that of a single partition but spread over
> the area of multiple partitions, reducing overall thermal power density.

that sounds like a reasonable analysis to me.

>> monero's mining algorithm in particular i understand to have been
>> *deliberately* designed to hit [really large amounts of] memory
>> particularly hard with random accesses, as a deliberate way to ensure
>> that people don't design custom ASICs for it.
>>
>
>
> Does monero use scrypt? That was designed for hashing passwords to resist
> GPU-based cracking.

i don't know if it uses scrypt (and it would likely not be useful for
it to do so): the algorithm is designed specifically so that it's ok
to implement on GPUs but *NOT* in a custom ASIC. i.e. it
*specifically* requires significant numbers of data / table lookups
across really really large amounts of memory (currently 4GB, to be
increased to 8GB in the event that someone actually does try creating
a custom ASIC to mine monero).

by contrast bitcoin relies exclusively on SHA256 which can be
massively parallelised (making it a serious runaway race consuming
vast amounts of power and resources, planet-wide).

so due to the massive deliberate random-access memory pattern monero
is a fair candidate for use-case analysis.

l.

Albert Cahalan

unread,
Jun 26, 2018, 5:30:18 AM6/26/18
to jcb6...@gmail.com, RISC-V ISA Dev
On 6/22/18, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

> Previous discussions suggested that explicit cache-control instructions
> could be useful, but RISC-V has some constraints here that other
> architectures lack, namely that caching must be transparent to the user
> ISA.

You can implement hardware with 32-byte cache lines, but expose it to
software as if it were 512-byte cache lines. Adjust the numbers as you like.

The page size of 4096 bytes is already known. That makes a fine choice
for a software-visible cache line. Make that the case on all hardware, even
if the underlying implementation uses a smaller cache line size. Another fine
choice is 512 bytes, the traditional size of a disk block and thus
relevant for DMA.
The biggest cache line I've heard of was 128 bytes. Going up an extra power of two for
some breathing room gives 256 bytes. That too is a perfectly fine size.

In other words: just pick something.

Paul A. Clayton

unread,
Jun 26, 2018, 11:47:15 AM6/26/18
to jcb6...@gmail.com, RISC-V ISA Dev
On 6/26/18, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> Paul A. Clayton wrote:
>> On 6/25/18, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>>
>>> Paul A. Clayton wrote:
>>>
>>>> On 6/22/18, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
[snip]
>> By not faulting the asynchronous form is not strictly a directive (unless
>> the range exceeded capacity in such a way that either skipped addresses
>> would have been overwritten regardless of the skipping or the prefetch
>> would stop when capacity was reached).
>>
>
> Prefetches always stop when capacity is reached; synchronous prefetches
> report where they stopped.

I was not thinking clearly, emphasizing in my mind "as much of the
chosen region as possible" and ignoring "must terminate and produce
the faulting address" (although the later might be tightened/clarified
to indicate that the range is loaded sequentially until a permission,
validity, invalid data (uncorrectable ECC error) fault (are there other
possible faults?)), particularly considering "end of capacity" as a
fault.

Since the operation as described gives equal weight to all data
in the region and gives no preferred ordering of access, loading
the end (or any arbitrary subset) would meet the goal *if* it was
intended as a single request and not as a conditional partial load
(where unloaded parts may be loaded later). If the access is
"random" within the region, then a single request might be normal
as software might not be able to use blocking to adjust the use
pattern to fit the available capacity.

> Synchronous prefetches do not fault either -- prefetches are allowed to
> run "off the end" of valid memory and into unmapped space; if this were
> not so, "LOAD x0" would be a prefetch instruction, but it is not.

They fault in the sense that they stop operation and return a
value related to the failure, allowing continuation if the access
pattern can be blocked.

> Then the prefetch levels are intended to indicate relative utility.
> Persistence is more limited: zero is no prefetch at all, once is the
> MEM.PF.ONCE and MEM.PF.STREAM instructions, many times is
> MEM.PF.PF[0123]. Criticality is not well-represented in this proposal,
> aside from a slight implication that streaming prefetch prefers
> bandwidth over latency.

So it mostly communicates to hardware how much performance
would be lost by dropping the prefetch? However, if the prefetch is a
directive (especially if limited to a uniform latency cache), how
would hardware be expected to use this information? It cannot
drop the request even at level 0 (e.g., based on bandwidth cost)
because it is specified as a directive. Hardware might have a
"has been used" bit (to better support "use-once" and speculative
hardware prefetching utility determination) and after first use
set replacement information according to expected reuse distance.

There are at least two considerations: priority of prefetch (which
should include being able to drop it entirely, but this is not allowed
in the current specification) and replacement of prefetched data.

> How well are utility and criticality typically correlated? For that
> matter, how many levels of each should be distinguished?

For large regions, utility can be high while criticality could be low
(assuming random access, in which case the first access may
be at the end of the prefetch stream).

> I almost want to describe criticality as "yesterday", "now", and
> "later". But "yesterday" can be represented by an actual access, so
> that leaves "now" and "later" as prefetch options.

One might want to provide "yesterday" as an asynchronous
prefetch; if only part of the expected latency can be hidden by work,
one really would have preferred an earlier prefetch (i.e., "yesterday").
N loads even to x0 is also likely to be more expensive than a
single prefetch instruction for some values of N.

> Well, the smallest granule that the hardware cares about, which may be
> an entire cacheline in practice, but application code does not know the
> actual cacheline size. As long as accesses exhibit both spatial and
> temporal locality, there is a good chance that marking the accessed
> lines as "first to evict" but not actually dropping them until new
> cachelines need to be loaded will work in practice.

One might generally have significant temporal locality for use once
within a cache block, but not for the entire prefetch region (though
this would imply a significant degree of non-random access assuming
there is not knowledge of the block size; a "cache oblivious" algorithm
might try to localize accesses to use the largest likely block size, but
arbitrary blocking sizes are not always practical, especially for "random"
access use once).

>> This could also be an argument for not limiting such instructions to
>> 32-bit encodings.
>>
>
> It is an argument, but I do not think that fine-grained prefetch has
> enough demand to be the first to break beyond 32-bit instructions.

Part of the point of RISC-V encoding is to encourage VLE with
its ability to support extension. A 64-bit encoding would not be
worse than requiring two 32-bit instructions.

[snip]

> Is there a useful way for software to indicate this intent or should
> hardware simply recognize that larger prefetch requests should target
> L2? Perhaps lower prefetch levels should prefetch only into outer
> caches (a current implementation option)?

Size is a significant piece of information, but other information
might be worth communicating.

I do not have the time and energy to work out a good suggestion.
I am not even sure what the intended uses for random
access (non-pinning) ranges are.

> For a region that fits in L4 but no farther in, "minimal delay" would be
> the latency of L4, since that is the only place the entire region can
> fit. I think a clarification that pinned cachelines must remain cached,
> but can be moved within the cache subsystem may be needed. Do you agree?

I suspect use may be more complex. If there is sequential
access with reuse, the first portion might be preferentially fetched
to L1 and hardware could be aware of when prefetches from L2 etc.
should be initiated. Hardware could even retain awareness of the
portion of the region that did not fit and prefetch that to L1 in a
more timely manner.

[snip]
>> It seems that sometimes software would want to pin more cache
>> than is "safe". (By the way, presumably uncacheable memory is
>> still accessible. Memory for which no space is available for caching
>> could be treated as uncached/uncacheable memory.)
>>
>
> Perhaps I am mistaken, but I understand that some current processors
> cannot do that -- even "uncacheable" accesses must go through the cache,
> but are simply immediately invalidated or written back.

That seems unlikely.

[snip]
>> Implementation defined behavior requires a method of discovery. It
>> also seems desirable to have guidelines on families of implementation
>> to facilitate portability within a family.
>>
>
> The choice of implementation-defined behavior is software-invisible:
> all valid options maintain coherency and user cache pins can be dropped
> without warning anyway (such as by swapping out pages, although user
> programs should use mlock(2) on anything pinned). Those behaviors are
> described as an existence proof that cache pins are implementable.

They are architecturally invisible but not performance invisible.
Since prefetches (without pinning) are performance-targeting
operations, such effects may be significant to software.

[snip]

> Is there a use for TLB prefetch/pinning separate from data
> prefetch/pinning? Or is this purely a matter of reach, since TLBs can
> map much larger regions than caches can store? Should prefetch
> instructions be able to (asynchronously) continue TLB prefetching of the
> region after reaching cache capacity?

If one has a large region that is randomly accessed in a
given execution phase, prefetching address translation
information may be useful where data prefetching would
not be (because most of the range might not be accessed
in that phase).

I do not think I can contribute anything to this proposal;
I just had a few thoughts that sprang quickly to mind.


>>>> In addition, the above does not seem to consider cache
>>>> hierarchies and cache sharing (even temporal sharing through
>>>> software context switches). While most of the uses of such
>>>> operations would have tighter control over thread allocation,
>>>> some uses might expect graceful/gradual decay.
>>>>
>>> I am making an effort to keep these operations relatively abstract even
>>> though that limits the amount of detail that can be specified.
>>> Generally, embedded systems are expected to have that sort of tight
>>> control and large systems (such as a RISC-V PC) are expected to use
>>> ASID-partitioned caches (effectively an independent cache for each ASID)
>>> for reasons of performance and security, since Spectre-like attacks
>>> enable cache side-channels without the "sender's" cooperation.
>>>
> [Was a response intended here? The message ended with this quote block.]

No.

Jacob Bachmeyer

unread,
Jun 26, 2018, 10:36:47 PM6/26/18
to Paul A. Clayton, RISC-V ISA Dev
Paul A. Clayton wrote:
> On 6/26/18, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
>> Paul A. Clayton wrote:
>>
>>> On 6/25/18, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>>>
>>>> Paul A. Clayton wrote:
>>>>
>>>>> On 6/22/18, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>>>>>
> [snip]
>
>>> By not faulting the asynchronous form is not strictly a directive (unless
>>> the range exceeded capacity in such a way that either skipped addresses
>>> would have been overwritten regardless of the skipping or the prefetch
>>> would stop when capacity was reached).
>>>
>> Prefetches always stop when capacity is reached; synchronous prefetches
>> report where they stopped.
>>
>
> I was not thinking clearly, emphasizing in my mind "as much of the
> chosen region as possible" and ignoring "must terminate and produce
> the faulting address" (although the later might be tightened/clarified
> to indicate that the range is loaded sequentially until a permission,
> validity, invalid data (uncorrectable ECC error) fault (are there other
> possible faults?)), particularly considering "end of capacity" as a
> fault.
>

Uncorrectable ECC error is expected to be an NMI in RISC-V. This raises
an interesting point: should a prefetch that encounters an
uncorrectable ECC error raise the associated NMI or simply drop the
unusable data and pretend that it did not happen?

> Since the operation as described gives equal weight to all data
> in the region and gives no preferred ordering of access, loading
> the end (or any arbitrary subset) would meet the goal *if* it was
> intended as a single request and not as a conditional partial load
> (where unloaded parts may be loaded later). If the access is
> "random" within the region, then a single request might be normal
> as software might not be able to use blocking to adjust the use
> pattern to fit the available capacity.
>

The intent is that synchronous prefetches load some contiguous prefix of
the requested region, while asynchronous prefetches can load an
arbitrary subset of the requested region, skipping over pages that are
not present.

>> Synchronous prefetches do not fault either -- prefetches are allowed to
>> run "off the end" of valid memory and into unmapped space; if this were
>> not so, "LOAD x0" would be a prefetch instruction, but it is not.
>>
>
> They fault in the sense that they stop operation and return a
> value related to the failure, allowing continuation if the access
> pattern can be blocked.
>

They do not fault in the sense of raising an exception and taking a trap.

>> Then the prefetch levels are intended to indicate relative utility.
>> Persistence is more limited: zero is no prefetch at all, once is the
>> MEM.PF.ONCE and MEM.PF.STREAM instructions, many times is
>> MEM.PF.PF[0123]. Criticality is not well-represented in this proposal,
>> aside from a slight implication that streaming prefetch prefers
>> bandwidth over latency.
>>
>
> So it mostly communicates to hardware how much performance
> would be lost by dropping the prefetch? However, if the prefetch is a
> directive (especially if limited to a uniform latency cache), how
> would hardware be expected to use this information? It cannot
> drop the request even at level 0 (e.g., based on bandwidth cost)
> because it is specified as a directive. Hardware might have a
> "has been used" bit (to better support "use-once" and speculative
> hardware prefetching utility determination) and after first use
> set replacement information according to expected reuse distance.
>
> There are at least two considerations: priority of prefetch (which
> should include being able to drop it entirely, but this is not allowed
> in the current specification) and replacement of prefetched data.
>

The intention is that prefetches must be placed into the queue, but the
queue may be "leaky" -- that detail is unspecified. The original
concept was that prefetch levels indicated expected frequency of use on
some fuzzy scale and implementations could map them to prefetching into
various cache levels. I now know that that model is a bit ...
simplistic, although I believe that x86 uses a similar approach.

>> How well are utility and criticality typically correlated? For that
>> matter, how many levels of each should be distinguished?
>>
>
> For large regions, utility can be high while criticality could be low
> (assuming random access, in which case the first access may
> be at the end of the prefetch stream).
>

So prefetches should be defined for various combinations of utility and
criticality? Is there a simple algorithm that allows hardware to
untangle those into a prefetch queue ordering and can this instead be
moved to "compile-time", flattening that 2D space into a 1D "conflated
prefetch level"?

>> I almost want to describe criticality as "yesterday", "now", and
>> "later". But "yesterday" can be represented by an actual access, so
>> that leaves "now" and "later" as prefetch options.
>>
>
> One might want to provide "yesterday" as an asynchronous
> prefetch; if only part of the expected latency can be hidden by work,
> one really would have preferred an earlier prefetch (i.e., "yesterday").
> N loads even to x0 is also likely to be more expensive than a
> single prefetch instruction for some values of N.
>

Yes. I had the idea to have effectively two prefetch priorities and
actual accesses as a higher priority than any prefetch.


-- Jacob

Paul Miranda

unread,
Jun 28, 2018, 3:49:04 PM6/28/18
to RISC-V ISA Dev
While the gold standard in separating threads is to partition the TLB and cache by ASID (and that was my first thought as well when I first read the spec), I don't think it's strictly required by the RISC-V ISA, allowing for a lighter-weight implementation that does not necessarily provide this hard partitioning. The only useful way I have thought of applying this is to keep selected ASID==0 TLB and cache entries pinned (or simply not blindly flushed by an SFENCE.VMA with a nonzero ASID), and everything else is subject to flushing.
There are definitely applications that care more about latency and cost than protecting data against side channel attacks. 
Even if you do want to enforce ASID partitioning, I can imagine implementations that don't have a 1-to-1 mapping of hardware partitions to ASID values, so long as context-switch timing doesn't vary with data values. (Probably easier said than proven correct!)


On Monday, June 25, 2018 at 9:42:49 PM UTC-5, Jacob Bachmeyer wrote:
Note that cache pins are broken upon context switch unless the cache is
ASID-partitioned -- each task must appear to have the entire cache
available.  User tasks that use D-cache pins are expected to re-pin
their working buffer frequently.  Of course, if the working buffer is
"marching" through the address space, the problem solves itself as the
scratchpad advances (unpin old, pin new) through the address space.
...
 

Jacob Bachmeyer

unread,
Jun 28, 2018, 7:24:46 PM6/28/18
to Paul Miranda, RISC-V ISA Dev
Paul Miranda wrote:
> While the gold standard in separating threads is to partition TLB and
> Cache by ASID (and that was my first thought as well when I first read
> the spec), I don't think it's strictly required by the RISC-V ISA,
> allowing for a lighter-weight implementation that does not necessarily
> provide this hard partitioning. The only useful way I have thought of
> applying this is to keep selected ASID==0 TLB and cache entries pinned
> (or simply not blindly flushed by a SFENCE.VMA with nonzero ASID), and
> everything else is subject to flushing.
> There are definitely applications that care more about latency and
> cost than protecting data against side channel attacks.
> Even if you do want to enforce ASID partitioning, I can imagine
> implementations that don't have a 1-to-1 mapping of hardware
> partitions to ASID values, so long as context switch timing doesn't
> vary with data values. (probably easier said than done provably correct!)

ASIDs effectively partition the TLB; that is their purpose. The only
difference between a hard-partitioned TLB and a plain ASID-capable TLB
is that the hard-partitioned TLB reserves each slot for some particular
ASID, while the plain TLB stores an ASID in every slot and can assign
the same slot to different ASIDs at different times. This could
actually make the hard-partitioned TLB *less* complex in a sense, since
the ASID CAM columns are unneeded. RISC-V implementations are allowed
to support subsets of the ISA-defined ASID space, so 1-to-1 mapping of
ASID to hardware cache/TLB partitions is not unreasonable to expect,
although neither is it mandatory.
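
To make the distinction concrete, a minimal C sketch (purely illustrative;
the type names and sizes are hypothetical, not part of the proposal):

#include <stdint.h>
#include <stdbool.h>

#define TLB_SLOTS 64

/* Plain ASID-capable TLB: every slot carries an ASID tag that must be
   compared on each lookup (the "ASID CAM columns" above), and any slot
   may be reassigned to a different ASID over time. */
struct tlb_entry {
    uint16_t  asid;
    uintptr_t vpn, ppn;
    bool      valid;
};
static struct tlb_entry tlb[TLB_SLOTS];

/* Hard-partitioned TLB: each supported ASID owns a fixed group of slots,
   so the current ASID selects a partition and no per-entry ASID tag or
   comparison is needed. */
struct tlb_slot {
    uintptr_t vpn, ppn;
    bool      valid;
};
static struct tlb_slot tlb_part[4][TLB_SLOTS / 4];  /* e.g. 4 supported ASIDs */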

Caches are a different matter, but for some cache topologies
(particularly simple VIPT caches that are limited by the page size),
partitioning can allow the system to have larger caches than any
individual task can directly use, in addition to its power density
reduction benefits.

Partitioned caches are never mandated in RISC-V, but implementations
that cannot provide fully-independent per-task caches must take steps to
isolate tasks that will affect some of the proposed features. In
particular, pinned cachelines must be unpinned if they would affect
another task, since leaving them pinned could itself open a side channel
due to the apparent reduction in cache size and its performance
effects. ASID-partitioned caches are a performance feature, since
non-partitioned caches must be flushed on every context switch. Further
partitioning the cache by privilege level is both a performance and
security feature, but again is never mandated. I expect that small
embedded systems will resort to flush-on-context-switch or even accept
the insecurity of non-isolated caches, which can be managed if the
complete software set is known, as it often is in embedded systems.
Larger, general-purpose, systems will want hard-partitioned caches and
TLBs for both security and performance.


-- Jacob

Paul Miranda

unread,
Jun 28, 2018, 11:12:26 PM6/28/18
to Jacob Bachmeyer, RISC-V ISA Dev
"non-partitioned caches must be flushed on every context switch."

I have looked for a statement to that effect (or to the contrary) in the RISC-V priv. spec and never found one.

It is clear that the TLB must be flushed on an ASID or PPN change, and there is the explicit SFENCE.VMA to indicate when.
Similarly there is FENCE.I for flushing instruction cache.
Architecturally I can't find any statement about when or how a data cache should be flushed. (The how can be covered by the proposed instructions quite well, but I think the when is open to different usage scenarios.)
I have been thinking about the small embedded core case often, and would advocate limited data cache flushing.

Jacob Bachmeyer

unread,
Jun 28, 2018, 11:46:30 PM6/28/18
to Paul Miranda, RISC-V ISA Dev
The RISC-V ISA spec is silent on the matter, but failing to flush on
context switch opens side channels. Some systems may be able to tolerate
these side channels
(example: embedded systems with known software) while others will have
big problems from them.

Some cache architectures may also require caches to be flushed or other
measures taken to prevent cached data from one task appearing in
another. Some (embedded) implementations may be able to tolerate such
leakage, however, so the RISC-V spec does not explicitly prohibit it.

My personal experience is with (1) PC-class hardware and (2) very small
microcontrollers (AVR, PIC32) so the proposal may have some blind spots
as it is oriented more towards large systems. Suggestions for
improvement on this issue are welcome, of course.


-- Jacob

Bill Huffman

unread,
Jul 6, 2018, 3:41:13 PM7/6/18
to RISC-V ISA Dev, jcb6...@gmail.com
Thank you, Jacob, for the work of putting together this proposal. RISC-V absolutely needs cache controls of some sort and I very much like the use of regions for the reasons you've articulated.

I have several separate comments I'll put under headings below.  I'm new to the discussion, so please forgive me if I'm rehashing any old issues.

========== asynchronous operations ==========================
In the proposal, setting rd to x0 indicates (potential) asynchronous operation.  The statement is that this can be used when "the instruction has no directly visible effects."  I'm guessing this means when the instruction has no required functional results (which probably means it's done for performance reasons).  A MEM.DISCARD before a non-coherent I/O block write, for example, cannot be done asynchronously because the result is functionally required and because the thread would have no way to know when the operation was complete. These two are tied together by the use of x0 as the destination register, I think.

If this is the thinking behind the asynchronous operations, I suggest a little more along this line be said in the spec.  I find myself wondering as well what happens to an asynchronous operation when another cacheop (sync or async) is executed.  Is the old one dropped?  Is the new one stalled until the existing one is completed?  Does anything happen on an interrupt?  What if the interrupt executes a new cacheop (sync or async)?  A full context switch?  Debug?

Perhaps all of these could be implementation defined, but I suspect some requirements need to be stated.  We might need an instruction to force the asynchronous machine to stop, for example.  We certainly need an interrupt to be able to occur in the middle of a synchronous operation and there needs to be a way for any cacheops in the interrupt handler to have priority over any ongoing cacheop.  Probably a synchronous cacheop stops in the middle on interrupt with rd pointing to the remaining region.  Perhaps an asynchronous cacheop is killed on interrupt, or is killed if the interrupt does a cacheop.

I also find myself wondering why MEM.DISCARD and MEM.REWRITE are specified to be synchronous only.  It's clear why they need a warning to the implementer that writing into the region after executing asynchronous versions *must* appear to write after the operations are complete.  But this could be accomplished in an implementation by a variety of mechanisms, such as doing the operations synchronously anyway, ending the operation if a write hit the remaining region, or stalling the write until the danger was past.

========= whole cache operations =========================
The proposal doesn't appear to include any plan for operations on the entire cache.  The most prominent needs here are probably initialization and, prior to disabling or changing size, flushing the entire cache.  It would be convenient to be able to use the same logic for these kinds of operations.  Has there been any consideration of how this might be included?

These can be M-mode only and don't need operands.  They might be considered to belong only under the SYSTEM major opcode, but I wanted to ask what thinking had been done.

========== pinned cache regions and cache-coherence =========
There has been discussion of what to do in a coherent system when a pinned region of the cache is written by another hart.  It seems like it would work to allow a pinned line to be valid or not.  The idea is that it is the address that is pinned, but the validity (and data) can still be taken away by the coherence protocol.  I think this would allow pinning and coherence to peacefully coexist as orthogonal concepts. This might also allow the pinning process to be separated into the "pin" itself and a prefetch that is expected to fill the data for the pinned range.  A scratchpad in a coherent system would then have "sectored cache" characteristics with many valid bits but one or a few tags.

This would solve some of the open issues with pinning, but I still find myself wondering about several issues.  What happens when a pin cannot be accomplished?  This could be because there are not enough cache ways, or because register pairs are used to hold pinned ranges and there are no more register pairs.  What happens when switching threads and executing more pin operations?  What guarantees are there that a region will remain pinned?  Any?  Why does a scratchpad implementation write back on an unpin operation where a cache implementation does not need to?  And why does it write back "if possible?"  If there's uncertainty, how does a scratchpad implementation work in a known fashion?

Also, is there the possibility of a performance related pin operation for the I-Cache in addition to the absolute, M-mode pin discussed?

========== memory obsolescence issues ========================
On the "foreign data" question wrt MEM.REWRITE, I was impressed that someone worked out how just being allowed to read the line being changed was not enough.  Thank you.  But it still bothers me to have to do all the zeroing.  It's not the gates that bother me.  It's the cycles in a machine with a memory access much narrower than the size of the cache line, where several accesses are required to the data portion of the cache to zero a line while only one read and one write are required to the tag otherwise.

I think you are suggesting that a line that is a MEM.REWRITE target and is already in the cache can remain unchanged.  I think this is an important property to keep because in a hart with a large cache which runs a single thread for a long time, it will often be the case that the target line of the MEM.REWRITE is already in the cache in an exclusive state.  Without this, for performance reasons, users (or memcpy/memset) might need to try to determine whether much of the data will already be in cache before deciding whether or not to execute a MEM.REWRITE.

I also wonder whether it might be OK, in an M-mode only implementation, to not zero any lines, ever, on MEM.REWRITE.

I wonder why destructive operations are nops when they specify one byte (same register in rs1 and rs2).  It seems like they should do the non-destructive version the way partial cachelines at the ends of a region would.

On the question of whether to turn MEM.REWRITE into MEM.CLEAR instead, I would suggest not, for the reasons above.  But we might consider adding MEM.CLEAR as a separate operation, since all the hardware for that will already be there.  On the other hand, if MEM.REWRITE is implemented as a nop or possibly the zeroing isn't done in an M-mode only implementation, this might be an issue.  Perhaps rd==rs1 could signal that the clear has not been done and needs to be done by another mechanism.

========== returning rs1 ======================================
The statement is made that these instructions can be implemented as nops and return rs1.  I don't fully understand this provision.  Is this to be done when there is no cache hardware (or it's disabled)?  If so, shouldn't the instruction return success (rd <- rs2+1) rather than what seems like a kind of failure, or at least has to be separately tested for?  If there is an enabled cache and the instructions don't work, shouldn't they take an illegal/unimplemented instruction exception?

The "hint" description seems to be involved here and seems like it either needs a more exact description or I think it should be somewhat different.  I don't understand the statement that prefetches cannot be nops.  Flush and several others probably can't be nops, but prefetches can be.  I think MEM.REWRITE can be a nop.

In general, it seems "downgrades" (flush, discard, etc.) must have their defined behavior while "upgrades" (prefetch, rewrite, etc.) can be nops (returning rs2+1) in an implementation that doesn't wish to make them available.

======== rs2 inclusive or exclusive ====================
The definition given for the upper bound is inclusive rather than exclusive.  This has two advantages I can think of: if the top of the region is the top of memory, there's no wrap to zero issue, and it's possible to use the same register twice for a one line cache operation.

Maybe you've considered this earlier, but there are some reasons to have rs2 be exclusive.  (In providing similar instructions in the past, we used a size, but the partial completion capability here needs to specify a bound.)

** It is consistent with the "exclusive," if you will, return in rd of the first byte address that has not been operated on.

** It is more straight-forward to compute (as rs1 plus size).

** Upon correct completion, rd <- rs2 instead of rs2+1

** Testing for completion is simpler (rd==rs2).

If this representation were used, we might set rs2 to x0 to signify operating on the smallest unit containing rs1.  Or we might simply use rs2 <- rs1+1 if providing a one-byte cacheop is not especially important.

For the wrap from high memory to zero case, maybe the highest memory address can't be used.  Or maybe the wrap doesn't matter.  And, in addition, the wrap already exists for the return value in rd.  It may be confusing to have the wrap to zero in one case and not the other.

To me, the advantages of exclusive out-weigh the advantages of inclusive.  Maybe you can help more with why inclusive or maybe you can consider exclusive.

========== cache hierarchies ===================================
It seems to me that there ought to be some provision, or at least a thought of how it might be added, to control how far "in" to a cache hierarchy (or how close to the processor) a prefetch should operate. There might also be a need for controlling how far "out" in a cache hierarchy a flush should operate (depending on which non-coherent, or partially coherent, masters need to see it).

========== required capabilities ================================
I think the proposal should state somewhere exactly what capability (R, W, or X, I assume) is required of the process before it can make whatever change to any given cache line.  For example, it might need read privilege to do non-destructive operations and write privilege to do destructive operations.

     Bill

Jose Renau

unread,
Jul 6, 2018, 4:17:40 PM7/6/18
to Jacob Bachmeyer, Paul A. Clayton, RISC-V ISA Dev
To make it more deterministic, I would not raise NMI from ECC errors in prefetch requests.

Otherwise, depending on the system load, a single-threaded application may or may not raise an NMI.


Luke Kenneth Casson Leighton

unread,
Jul 6, 2018, 4:22:07 PM7/6/18
to Bill Huffman, RISC-V ISA Dev, Jacob Bachmeyer
On Fri, Jul 6, 2018 at 8:41 PM, Bill Huffman <huf...@cadence.com> wrote:

> On the question of whether to turn MEM.REWRITE into MEM.CLEAR instead, I
> would suggest not, for the reasons above. But we might consider adding
> MEM.CLEAR as a separate operation, since all the hardware for that will
> already be there. On the other hand, if MEM.REWRITE is implemented as a nop
> or possibly the zeroing isn't done in an M-mode only implementation, this
> might be an issue. Perhaps rd==rs1 could signal that the clear has not been
> done and needs to be done by another mechanism.

whenever there has been the possibility of polarisation due to one
implementor preferring (or requiring) one scheme and another requiring
a mutually-exclusive alternative, to create a win-win situation i've
always advocated that implementors be required to support both. this
seems to be what you are suggesting, bill, which is encouraging.

however... allowing implementors to choose *not* to implement parts
of a standard has ramifications: the X25 standard made that mistake
decades ago, by making hardware-control-lines optional and allowing a
"fall-back" position in software... consequently everyone *had* to
implement the software mechanism, thus making the hardware control
lines totally redundant. given that X25 did not have an XMIT clock it
made the standard expensive to deploy vs RS232: external clock box
with external PSU vs a $1 cable.

my feeling is, therefore, that it would be better to make it
mandatory, *but* that implementors are advised that they can choose
*not* to actually put in the hardware but instead *must* throw an
exception... which can be caught.... and in the trap handler the clear
may be explicitly done by any mechanism that the implementor chooses.

this saves on instructions (and cycles) for implementors that choose
to implement the clear in hardware, but without software libraries
being forced to support the "fall-back" position... *just* in case
widely and publicly distributed binaries happen to run on
widely-disparate systems.

in essence if clear is not made mandatory at the hardware level, the
requirement to action the clear at the *software* level becomes a
de-facto mandatory, if indirect, part of the standard, even if that
was not intentional.

and if that's going to be the case it's much better to be done via a
trap than be left in the program. of course... there is a caveat
here: traps cause context-switching. context-switching may have
unintended side-effects on the cache... so it's not as clear-cut in my
mind as the logical reasoning above would imply.

l.

Bill Huffman

unread,
Jul 6, 2018, 4:58:08 PM7/6/18
to RISC-V ISA Dev, huf...@cadence.com, jcb6...@gmail.com
Yes, Luke, I was a little fuzzy there.  I think the standard must avoid uncertainty.  In this case
I can think of the following ways to do that:

* Of course we can decide not to define MEM.CLEAR.

* We can mandate that MEM.CLEAR clear memory.

* We can mandate that MEM.CLEAR either clear memory or raise an exception.

* We can include in MEM.CLEAR a set of responses in rd that the program is required to deal
with.  The partial completion responses already require re-executing the instruction.  Maybe
other responses are possible as well.

But things cannot simply happen or not happen!

On the other hand, things such as prefetch, which I view as having no functional results, can
just happen or not.  The current statement that they must happen surprises me.

      Bill

Jacob Bachmeyer

unread,
Jul 6, 2018, 8:29:49 PM7/6/18
to Bill Huffman, RISC-V ISA Dev
Bill Huffman wrote:
> Thank you, Jacob, for the work of putting together this proposal.
> RISC-V absolutely needs cache controls of some sort and I very much
> like the use of regions for the reasons you've articulated.
>
> I have several separate comments I'll put under headings below. I'm
> new to the discussion, so please forgive me if I'm rehashing any old
> issues.
>
> ========== asynchronous operations ==========================
> In the proposal, setting rd to x0 indicates (potential) asynchronous
> operation. The statement is that this can be used when "the
> instruction has no directly visible effects." I'm guessing this means
> when the instruction has no required functional results (which
> probably means it's done for performance reasons).

You seem to be correct. The intent is that asynchronous operations are
possible when the result in rd would be the only software-visible
effect. If that effect is discarded by setting rd to x0, then execution
can continue before the result of the operation is known.

> A MEM.DISCARD before a non-coherent I/O block write, for example,
> cannot be done asynchronously because the result is functionally
> required *and* because the thread would have no way to know when the
> operation was complete. These two are tied together by the use of x0
> as the destination register, I think.

Exactly. There is also a rule that region operations can complete
partially; for synchronous operations partial completion means that some
prefix of the requested region was affected and the value produced can
be used to repeat the instruction for the next part of the region.

> If this is the thinking behind the asynchronous operations, I suggest
> a little more along this line be said in the spec. I find myself
> wondering as well what happens to an asynchronous operation when
> another cacheop (sync or async) is executed. Is the old one dropped?
> Is the new one stalled until the existing one is completed? Does
> anything happen on an interrupt? What if the interrupt executes a new
> cacheop (sync or async)? A full context switch? Debug?
>
> Perhaps all of these could be implementation defined, but I suspect
> some requirements need to be stated. We might need an instruction to
> force the asynchronous machine to stop, for example. We certainly
> need an interrupt to be able to occur in the middle of a synchronous
> operation and there needs to be a way for any cacheops in the
> interrupt handler to have priority over any ongoing cacheop. Probably
> a synchronous cacheop stops in the middle on interrupt with rd
> pointing to the remaining region. Perhaps an asynchronous cacheop is
> killed on interrupt, or is killed if the interrupt does a cacheop.

These are all implementation-defined behaviors: a simple implementation
can always drop pending asynchronous cacheops upon interrupt or even
execute them as written in the first place -- synchronous cacheops with
the result sent to x0. You are correct that "stop in the middle and
return progress made" is the intended effect of interrupts on
synchronous region operations.

These behaviors must be implementation-defined because more capable
implementations might actually be able to continue a previous task's
prefetches even after a context switch or even a VM world switch. (As
an example, a machine with ASID-partitioned caches could easily use any
otherwise-idle timeslots on the memory bus to "pre-warm" another
context's cache if addresses of needed data are known to the memory
hardware. Obviously this requires the prefetch queues to carry ASID tags.)

> I also find myself wondering why MEM.DISCARD and MEM.REWRITE are
> specified to be synchronous only. It's clear why they need a warning
> to the implementer that writing into the region after executing
> asynchronous versions *must* appear to write after the operations are
> complete. But this could be accomplished in an implementation by a
> variety of mechanisms, such as doing the operations synchronously
> anyway, ending the operation if a write hit the remaining region, or
> stalling the write until the danger was past.

The reason that the destructive operations can only be synchronous is
that software actually needs the result of those operations -- it is the
block size that the hardware is willing to commit at the moment. For
example, using MEM.REWRITE in memset(3) produces pseudo-code along the
lines of:

char * base;
void * bound = base + len;
void * next;

while ((MEM.REWRITE next, base, bound), next < bound)
    while (base < next) *(base++) = value;

In short, the outer loop marks a region to be rewritten, asking for the
entire block to be so marked, while the inner loop iterates over the
region that hardware has actually prepared for overwrite. The outer
loop then marks the next part of the block for overwrite and the cycle
repeats until the entire block is overwritten.

For similar reasons, MEM.DISCARD must be synchronous. If used to
invalidate a software-coherent cache prior to non-coherent DMA input,
software must iterate until the entire DMA target block is invalidated
from cache.

These rules stem from an implementation option that permits simple
implementations to elide the iteration hardware and only process a
single cacheline per operation. Software is then expected to iterate
until the entire operation it actually wants is accomplished.
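
As a sketch of that software iteration for the DMA case (C pseudocode in
the same spirit as the memset example above; mem_discard() is a
hypothetical wrapper around the proposed MEM.DISCARD instruction, not an
existing API):

#include <stdint.h>
#include <stddef.h>

/* Discards cached copies of as much of the inclusive region [base, bound]
   as the hardware will commit to in one step; returns the first address
   after the highest address affected (the proposed rd value). */
extern uintptr_t mem_discard(uintptr_t base, uintptr_t bound);

/* Invalidate a software-coherent cache over a DMA input buffer, iterating
   until the whole buffer has been covered. */
static void discard_dma_buffer(void *buf, size_t len)
{
    uintptr_t base  = (uintptr_t)buf;
    uintptr_t bound = base + len - 1;    /* inclusive upper bound */

    while (base <= bound) {
        uintptr_t next = mem_discard(base, bound);
        if (next == base)                /* no progress: nothing (more) to do */
            break;
        if (next == 0)                   /* wrapped past the top of memory */
            break;
        base = next;
    }
}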

> ========= whole cache operations =========================
> The proposal doesn't appear to include any plan for operations on the
> entire cache. The most prominent needs here are probably
> initialization and, prior to disabling or changing size, flushing the
> entire cache. It would be convenient to be able to use the same logic
> for these kinds of operations. Has there been any consideration of
> how this might be included?
>
> These can be M-mode only and don't need operands. They might be
> considered to belong only under the SYSTEM major opcode, but I wanted
> to ask what thinking had been done.

The thought here was that a region covering the entire address space
obviously affects the entire cache.
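
For example, a whole-cache flush can be sketched in C on top of the region
form (cache_flush() here is a hypothetical wrapper for the proposed
CACHE.FLUSH, not an existing API):

#include <stdint.h>

/* Flushes as much of the inclusive region [base, bound] as the hardware
   handles in one step; returns the first address after the highest
   address affected. */
extern uintptr_t cache_flush(uintptr_t base, uintptr_t bound);

/* Flush the entire data cache by naming the whole address space. */
static void flush_entire_cache(void)
{
    uintptr_t base  = 0;
    uintptr_t bound = (uintptr_t)-1;     /* inclusive top of the address space */

    for (;;) {
        uintptr_t next = cache_flush(base, bound);
        if (next == base || next == 0)   /* no progress, or wrapped past the top */
            break;
        base = next;
    }
}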

> ========== pinned cache regions and cache-coherence =========
> There has been discussion of what to do in a coherent system when a
> pinned region of the cache is written by another hart. It seems like
> it would work to allow a pinned line to be valid or not. The idea is
> that it is the address that is pinned, but the validity (and data) can
> still be taken away by the coherence protocol. I think this would
> allow pinning and coherence to peacefully coexist as orthogonal
> concepts. This might also allow the pinning process to be separated
> into the "pin" itself and a prefetch that is expected to fill the data
> for the pinned range. A scratchpad in a coherent system would then
> have "sectored cache" characteristics with many valid bits but one or
> a few tags.

Would this involve arranging for the eventual writeback from the remote
cache to land directly in the local cache, in a cacheline that has been
kept reserved for that address? I *think* that this is a valid
implementation of the draft 6 pinning semantics, a sort of hybrid
between invalidation-based and update-based coherency.

> This would solve some of the open issues with pinning, but I still
> find myself wondering about several issues. What happens when a pin
> cannot be accomplished? This could be because there are not enough
> cache ways, or because register pairs are used to hold pinned ranges
> and there are no more register pairs.

Pins are expected to be synchronous operations, although an asynchronous
pin could be used to "pre-pin" a region in advance of a subsequent
synchronous pin. The general rule for partial completion applies:
hardware pins as much of a prefix of the requested region as possible
and returns the appropriate result. Software can then iterate to see if
more space can be pinned or fail if the hardware simply cannot grant a
pin the software needs.
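
A sketch of that iteration in C (cache_pin() is a hypothetical wrapper for
the proposed CACHE.PIN, not an existing API):

#include <stdint.h>
#include <stddef.h>

/* Pins as much of a prefix of the inclusive region [base, bound] as the
   hardware will grant; returns the first address after the highest
   address actually pinned. */
extern uintptr_t cache_pin(uintptr_t base, uintptr_t bound);

/* Returns 1 if the whole buffer could be pinned, 0 if the hardware stopped
   granting pin space before the region was fully covered. */
static int pin_working_buffer(void *buf, size_t len)
{
    uintptr_t base  = (uintptr_t)buf;
    uintptr_t bound = base + len - 1;    /* inclusive upper bound */

    while (base <= bound) {
        uintptr_t next = cache_pin(base, bound);
        if (next == base)                /* no further pin space granted */
            return 0;
        if (next == 0)                   /* wrapped past the top of memory */
            break;
        base = next;
    }
    return 1;
}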

> What happens when switching threads and executing more pin
> operations? What guarantees are there that a region will remain
> pinned? Any?

Generally, task switches break pins unless the cache is
ASID-partitioned, in which case each task effectively has its own
cache. The effects of switching threads within a task depend on how
much of the cache was pinned by the previous thread. If hardware can
maintain both requests, it will. If the two threads combined ask to pin
more cachelines than hardware can grant, the later request is rejected.
The second thread can ask to unpin the entire address space to clear out
a previous thread's cache pins, but the correctness of such an approach
depends on the overall system design. Such an approach is safe when
pinning for performance on a RISC-V PC, but may cause problems in an
embedded system.

> Why does a scratchpad implementation write back on an unpin operation
> where a cache implementation does not need to?

Because the cache has its own writeback logic and cachelines remain
mapped to those addresses after being unpinned. A scratchpad
"disappears" after it is unmapped, so writeback must occur before the
scratchpad is released. (If the scratchpad RAM is physically stored in
rows taken from cache, it really does disappear, as the cache will
quickly overwrite its contents after it is unmapped.)

> And why does it write back "if possible?" If there's uncertainty, how
> does a scratchpad implementation work in a known fashion?

The "if possible" wording allows pinned cachelines to be writable
shadows of ROM. Writeback will obviously fail in that case. This
should probably be tightened to only ignoring PMA violations, since the
inability to write to ROM is a physical memory attribute.

> Also, is there the possibility of a performance related pin operation
> for the I-Cache in addition to the absolute, M-mode pin discussed?

There is a possibility, although it is also possible to regularly issue
MEM.PF.TEXT to achieve a similar effect, so I am less certain of the
utility of performance-related I-cache pins. Do you have a scenario in
mind?

> ========== memory obsolescence issues ========================
> On the "foreign data" question wrt MEM.REWRITE, I was impressed that
> someone worked out how just being allowed to read the line being
> changed was not enough. Thank you. But it still bothers me to have
> to do all the zeroing. It's not the gates that bother me. It's the
> cycles in a machine with a memory access much narrower than the size
> of the cache line, where several accesses are required to the data
> portion of the cache to zero a line while only one read and one write
> are required to the tag otherwise.

This is an interesting problem, since I have been mostly considering
systems where the memory bus is effectively the same width as a
cacheline, both of which are much wider than the processor registers.
Are you suggesting an implementation where, for example, cachelines are
32 bytes, but the cache array is accessed in 4-byte units on both the
CPU and memory sides?

One option would be to implement a write-combining buffer and use a
monitor trap (or microcode that stalls the pipeline) if the access
pattern is such that actually zeroing the cacheline may be required, but
the recommended "blind sequential overwrite" pattern results in the line
only being written once.

> I think you are suggesting that a line that is a MEM.REWRITE target
> and is already in the cache can remain unchanged. I think this is an
> important property to keep because in a hart with a large cache which
> runs a single thread for a long time, it will often be the case that
> the target line of the MEM.REWRITE is already in the cache in an
> exclusive state. Without this, for performance reasons, users (or
> memcpy/memset) might need to try to determine whether much of the data
> will already be in cache before deciding whether or not to execute a
> MEM.REWRITE.

Exactly! The hardware *knows* what is in the cache; software should not
need to guess about the cache contents, ever. Valid cachelines are even
allowed to be read after a MEM.REWRITE, since their contents are what
was last written, but software cannot assume that any given cacheline is
valid, so must not do that.

> I also wonder whether it might be OK, in an M-mode only
> implementation, to not zero any lines, ever, on MEM.REWRITE.

Cesar Eduardo Barros answered that in message-id
<54427d45-b3aa-d628...@cesarb.eti.br>
<URL:https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/54427d45-b3aa-d628-55bd-3b7b0e20738a%40cesarb.eti.br>,
start reading at "ghost from the past" for the counter-example mentioned
in the "thanks" section for draft 6.

> I wonder why destructive operations are nops when they specify one
> byte (same register in rs1 and rs2). It seems like they should do the
> non-destructive version the way partial cachelines at the ends of a
> region would.

That was probably from a general caution when specifying instructions.
It is clear that destructive operations cannot operate normally in that
case.

> On the question of whether to turn MEM.REWRITE into MEM.CLEAR instead,
> I would suggest not, for the reasons above. But we might consider
> adding MEM.CLEAR as a separate operation, since all the hardware for
> that will already be there. On the other hand, if MEM.REWRITE is
> implemented as a nop or possibly the zeroing isn't done in an M-mode
> only implementation, this might be an issue. Perhaps rd==rs1 could
> signal that the clear has not been done and needs to be done by
> another mechanism.

Others have argued that MEM.CLEAR more properly should be a function of
a DMA element than a CPU instruction. I am inclined to agree with that.

> ========== returning rs1 ======================================
> The statement is made that these instructions can be implemented as
> nops and return rs1. I don't fully understand this provision. Is
> this to be done when there is no cache hardware (or it's disabled)?
> If so, shouldn't the instruction return success (rd <- rs2+1) rather
> than what seems like a kind of failure, or at least has to be
> separately tested for? If there is an enabled cache and the
> instructions don't work, shouldn't they take an illegal/unimplemented
> instruction exception?

Returning rs1 indicates that nothing was done: the first address after
the highest address affected is the base address, so nothing was
affected. This can also happen for some operations due to transient
lack of resources, so it is not a new case in software.

> The "hint" description seems to be involved here and seems like it
> either needs a more exact description or I think it should be somewhat
> different. I don't understand the statement that prefetches cannot be
> nops. Flush and several others probably can't be nops, but prefetches
> can be. I think MEM.REWRITE can be a nop.

It will need to be rewritten when the proposal is aligned with the new
memory model. There was some discussion in earlier RVWMO drafts about
supporting cache-control hints, but this proposal has some instructions
that are clearly not hints and cannot be hints in some implementations.
That description was a bit of pushback against poor wording in earlier
drafts of the new memory model.

> In general, it seems "downgrades" (flush, discard, etc.) must have
> their defined behavior while "upgrades" (prefetch, rewrite, etc.) can
> be nops (returning rs2+1) in an implementation that doesn't wish to
> make them available.

This suggests that there can be both "do nothing" (rs1 -> rd) no-ops and
"succeed" (rs2+1 -> rd) no-ops, depending on an implementation.

> ======== rs2 inclusive or exclusive ====================
> The definition given for the upper bound is inclusive rather than
> exclusive. This has two advantages I can think of: if the top of the
> region is the top of memory, there's no wrap to zero issue, and it's
> possible to use the same register twice for a one line cache operation.

Both of these were motivations, but the big reason was to support easy
software iteration for larger regions. (Remember that hardware may
choose to only process one cacheline at a time, regardless of what
software requests.)

> Maybe you've considered this earlier, but there are some reasons to
> have rs2 be exclusive. (In providing similar instructions in the
> past, we used a size, but the partial completion capability here needs
> to specify a bound.)

I will try addressing each of these points. You are correct that the
use of a bound instead of a size was intended to support iteration on
partial completion.

> ** It is consistent with the "exclusive," if you will, return in rd of
> the first byte address that has not been operated on.

I think that the "inconsistency" here effectively hides an iteration
increment, saving a step in the software iteration case. I need to look
more closely at that.

> ** It is more straight-forward to compute (as rs1 plus size).

I do not believe that this is likely to be an issue, since size is
expected to usually be a constant that the compiler can adjust. Even in
the dynamic case, it is, at worst, one extra ADDI in the loop setup.

> ** Upon correct completion, rd <- rs2 instead of rs2+1

Except that full completion is not copying rs2 to rd; it is copying some
internal register used for hardware iteration to rd. One last ADD in
hardware is simply a matter of "tapping" the value after the increment
instead of before the increment. And the result is not rs2 + 1, unless
rs2 happens to point to the end of a cacheline -- it is the base address
of the next cacheline. Remember that regions are expanded to meet
hardware granularity requirements.

> ** Testing for completion is simpler (rd==rs2).

This appears so if the implementation iterates over addresses, but
actual implementations are expected to iterate over cachelines
internally, so will need a (step < bound) comparison anyway.

> If this representation were used, we might set rs2 to x0 to signify
> operating on the smallest unit containing rs1. Or we might simply use
> rs2 <- rs1+1 if providing a one-byte cacheop is not especially important.

The current proposal encodes "smallest unit containing an address" in
the static case as using the *same* *register* for both rs1 and rs2. I
would see no reason to change that in either case, since it would be a
otherwise-meaningless encoding if the upper bound were exclusive.

> For the wrap from high memory to zero case, maybe the highest memory
> address can't be used. Or maybe the wrap doesn't matter. And, in
> addition, the wrap already exists for the return value in rd. It may
> be confusing to have the wrap to zero in one case and not the other.

Generally, crossing the top of the address space gets "interesting" in a
bad way quickly.

> To me, the advantages of exclusive out-weigh the advantages of
> inclusive. Maybe you can help more with why inclusive or maybe you
> can consider exclusive.

I will have to think more about this; I think that an exclusive upper
bound was considered at one point but do not immediately recall why it
was rejected. Perhaps avoiding the need for applications to know about
cacheline boundaries was part of the reason?

> ========== cache hierarchies ===================================
> It seems to me that there ought to be some provision, or at least a
> thought of how it might be added, to control how far "in" to a cache
> hierarchy (or how close to the processor) a prefetch should operate.
> There might also be a need for controlling how far "out" in a cache
> hierarchy a flush should operate (depending on which non-coherent, or
> partially coherent, masters need to see it).

This was the original intent behind the prefetch levels: higher levels
prefetch farther "in" to the cache hierarchy nearer to the processor.

I am less certain about partial flushes simply because the very idea
looks like an excellent, very ornate, very large, thermonuclear footgun
to me at first glance.

> ========== required capabilities ================================
> I think the proposal should state somewhere exactly what capability
> (R, W, or X, I assume) is required of the process before it can make
> whatever change to any given cache line. For example, it might need
> read privilege to do non-destructive operations and write privilege to
> do destructive operations.

That is probably a good point. Prefetches require no access, but are
ignored if the target is not readable. (This allows speculative
prefetch that may run past the end of an array and into unmapped address
space without causing SIGSEGV.) Currently, the reasoning is that cache
operations are "soft" and the subsequent "hard" accesses should take any
relevant traps. On the other hand, actually loading inaccessible data
into the cache is halfway to Meltdown, so that needs to be avoided.
This will need to be addressed in later drafts.


-- Jacob

Luke Kenneth Casson Leighton

unread,
Jul 6, 2018, 11:27:04 PM7/6/18
to Jacob Bachmeyer, Bill Huffman, RISC-V ISA Dev
On Sat, Jul 7, 2018 at 1:29 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

> Others have argued that MEM.CLEAR more properly should be a function of a
> DMA element than a CPU instruction. I am inclined to agree with that.

apologies i must have missed such discussions, i have been
skim-reading some of this proposal discussion.

a memory clear operation as a DMA engine's role would be an
implementation-specific detail that, interestingly, would also have
latency and synchronisation issues to contend with. also, it cannot
be *guaranteed* that any given implementation would have such DMA
functionality.

an explicit MEM.CLEAR would be required to be synchronous, i.e.
definitely right there and then, and as such would have certain
critical advantageous characteristics over any DMA engine... which
might not exist.

l.

Bill Huffman

unread,
Jul 12, 2018, 9:56:13 PM7/12/18
to jcb6...@gmail.com, RISC-V ISA Dev
Thanks for some clarifications, Jacob. As I stand back and look at
this, I think it is helpful to consider what portable code might look
like to accomplish a particular result - say, CACHE.FLUSH.

The simplest reasonable loop, I think, would be:

        // t0 = start, t1 = (exclusive) end
LOOP:   CACHE.FLUSH   t0, t0, t1
        BNE           t0, t1, LOOP

This simple loop requires 1) that rd end at one byte past the end of the
region (rather than the end of the cache line), 2) that rs2 be
exclusive, and 3) that cacheops always make forward progress. If it's
really necessary, making forward progress can be relaxed to say they can
return rs1 under a transient condition that won't unreasonably stall the
loop such as getting interrupted before any activity.

The forward progress requirement would mean that hardware without caches
(or with disabled caches) would set rd <- rs2 while hardware with caches
but without these instructions would need to trap (as do most
unimplemented instructions).

For hardware where CACHE.FLUSH operates one line at a time and branches
aren't predicted, it would likely help performance to use a loop where
the percentage of overhead is lower, such as:

        // t0 = start, t1 = (exclusive) end
LOOP:   CACHE.FLUSH   t0, t0, t1
        CACHE.FLUSH   t0, t0, t1
        CACHE.FLUSH   t0, t0, t1
        CACHE.FLUSH   t0, t0, t1
        BNE           t0, t1, LOOP

In addition to the above requirements, this loop requires 4) that when
rs1==rs2, nothing is done since the line containing them can be outside
of the address range.

Changing any of the above 4 conditions, except possibly #2, would, I
think, add instructions to the inner loop. Adding instructions outside
the loop might be OK, but adding instructions inside the loop will
likely have performance losses in many implementations.

Bill

Luke Kenneth Casson Leighton

unread,
Jul 12, 2018, 10:21:09 PM7/12/18
to Bill Huffman, Jacob Bachmeyer, RISC-V ISA Dev
On Fri, Jul 13, 2018 at 2:56 AM, Bill Huffman <huf...@cadence.com> wrote:

> [...]
> with disabled caches) would set rd <- rs2 while hardware with caches but
> without these instructions would need to trap (as do most unimplemented
> instructions).

bill, i was caught out by this one: it's a known point of contention
(i.e. against everyone's expectations) that unimplemented instructions
are *not* required by the RISC-V spec to throw an exception. i can't
recall the precise details; other people will have better recall than
i do on this, but i believe it's left to implementors to decide.

leaving it up to implementors is unfortunately one of those mistakes
in the RISC-V Specification that is identical to the mistake made by
the designers of the X.25 Standard, and it will have the same
consequences and implications: due to the ambiguity / choice, nobody
knows what to do, so they have to code for the worst case, which makes
even *having* the best case completely redundant and pointless.

l.

Andrew Waterman

unread,
Jul 12, 2018, 10:30:37 PM7/12/18
to Luke Kenneth Casson Leighton, Bill Huffman, Jacob Bachmeyer, RISC-V ISA Dev
The ISA spec doesn't impose this stricture because it's a
platform-specification issue, not an ISA-specification issue. Some
platforms (like the Unix platform) will require that unimplemented
opcodes in the standard encoding space raise illegal-instruction
exceptions.

>
> l.

Jacob Bachmeyer

unread,
Jul 12, 2018, 10:57:10 PM7/12/18
to Bill Huffman, RISC-V ISA Dev
Bill Huffman wrote:
> Thanks for some clarifications, Jacob. As I stand back and look at
> this, I think it is helpful to consider what portable code might look
> like to accomplish a particular result - say, CACHE.FLUSH.
>
> The simplest reasonable loop, I think, would be:
>
> // t0 = start, t1 = (exclusive) end
> LOOP: CACHE.FLUSH t0, t0, t1
> BNE t0, t1, LOOP
>
> This simple loop requires 1) that rd end at one byte past the end of
> the region (rather than the end of the cache line), 2) that rs2 be
> exclusive, and 3) that cacheops always make forward progress. If it's
> really necessary, making forward progress can be relaxed to say they
> can return rs1 under a transient condition that won't unreasonably
> stall the loop such as getting interrupted before any activity.

With the current inclusive bound, essentially the same loop works:

        // t0 = start, t1 = (inclusive) end
LOOP:   CACHE.FLUSH   t0, t0, t1
        BLTU          t0, t1, LOOP

This works even if hardware extends the bound to complete a "hardware
granule" (cacheline) and reports the extended result, which also allows
software to determine the hardware granularity dynamically. The loop
exits when the first address not affected is past the inclusive bound.
The third requirement that cacheops always make progress or trap is
shared for this loop, and perhaps should be added to the next draft; the
alternative is to keep the base in t2, update it after each step, and
add a "BEQ t0, t2, ERROR" inside the loop.

> The forward progress requirement would mean that hardware without
> caches (or with disabled caches) would set rd <- rs2 while hardware
> with caches but without these instructions would need to trap (as do
> most unimplemented instructions).

This could also work with inclusive bounds. This would remove
"do-nothing" no-ops from the proposal. It is possible that a forward
progress guarantee may only apply to some instructions and not others.
For example, CACHE.PIN cannot be guaranteed to make forward progress.

> For hardware where CACHE.FLUSH operates one line at a time and
> branches aren't predicted, it would likely help performance to use a
> loop where the percentage of overhead is lower, such as:
>
> // t0 = start, t1 = (exclusive) end
> LOOP: CACHE.FLUSH t0, t0, t1
> CACHE.FLUSH t0, t0, t1
> CACHE.FLUSH t0, t0, t1
> CACHE.FLUSH t0, t0, t1
> BNE t0, t1, LOOP
>
> In addition to the above requirements, this loop requires 4) that when
> rs1==rs2, nothing is done since the line containing them can be
> outside of the address range.

With inclusive bounds, this would make the presently-reserved case of
"backwards" bounds (rs1 > rs2) a no-op. Should that no-op status be
limited to the range that hardware can produce within such a loop?
(Specifically: If the lower bound is greater than the upper bound, but
both bounds refer to the same hardware granule ("cacheline"), the
instruction is a no-op. All other "backwards" bounds cases remain
reserved.)

> Changing any of the above 4 conditions, except possibly #2, would, I
> think, add instructions to the inner loop. Adding instructions
> outside the loop might be OK, but adding instructions inside the loop
> will likely have performance losses in many implementations.

Condition (1) differs from the current draft; the effect on the loop
under the current draft's rules is to replace BNE with BLTU. Condition
(2) is similarly handled. Condition (4) requires defining part of a
presently-reserved case. Condition (3) will probably be added to the
next draft. How difficult is a forward progress guarantee likely to be
for implementors?


-- Jacob

Jacob Bachmeyer

unread,
Jul 12, 2018, 10:57:51 PM7/12/18
to Luke Kenneth Casson Leighton, Bill Huffman, RISC-V ISA Dev
Luke Kenneth Casson Leighton wrote:
> On Fri, Jul 13, 2018 at 2:56 AM, Bill Huffman <huf...@cadence.com> wrote:
>
>> [...]
>> with disabled caches) would set rd <- rs2 while hardware with caches but
>> without these instructions would need to trap (as do most unimplemented
>> instructions).
>>
> bill, i was caught out by this one: it's a known-point of contention
> (i.e. against everyone's expectations) that unimplemented instructions
> are *not* required by the RISC-V spec to throw an exception. i can't
> recall the precise details, other people will have better recall than
> i on this, i believe it's left to implementors to decide.
>

The reason that RISC-V cannot guarantee "unimplemented" instructions
will trap is that any "unimplemented" encoding can potentially be used
for a non-standard extension. So it *is* implemented, just not what was
expected. (The extensible assembler database is the solution I propose
for this issue.)

-- Jacob

Luke Kenneth Casson Leighton

unread,
Jul 12, 2018, 11:41:18 PM7/12/18
to Andrew Waterman, Bill Huffman, Jacob Bachmeyer, RISC-V ISA Dev
ok cool, thank you for clarifying. so that makes sense, then, that
unix platforms would require mandatory illegal-instruction exceptions.
and it also makes sense that embedded platform ones do not (so as to
be able to save decode space).

so, continuing the logic, here: if the context of the discussion is
about unix platforms then we're good (bill) with what you said about
cache management, and if it's an embedded platform we're in murky
territory but embedded is an application-specific area, so nothing to
be concerned about.

thanks andrew. sorry, do carry on, bill :)

l.

Bill Huffman

unread,
Jul 13, 2018, 12:10:16 AM7/13/18
to jcb6...@gmail.com, RISC-V ISA Dev
Yes, I considered that but didn't want to use the BLTU loop because it
doesn't work if the memory region specified includes the highest
numbered cacheline in memory (because the result wraps to zero).
Forbidding the use of the last part of memory seems awkward on its own
and likely to cause additional awkwardness in the future.

>
>> The forward progress requirement would mean that hardware without
>> caches (or with disabled caches) would set rd <- rs2 while hardware
>> with caches but without these instructions would need to trap (as do
>> most unimplemented instructions).
>
> This could also work with inclusive bounds.  This would remove
> "do-nothing" no-ops from the proposal.  It is possible that a forward
> progress guarantee may only apply to some instructions and not others.
> For example, CACHE.PIN cannot be guaranteed to make forward progress.

I agree with the forward progress requirement as the extra branch would
be a performance loss.

I can see the possibility that CACHE.PIN might not be able to guarantee
forward progress, but I have enough uncertainties about its
functionality not to be sure (a different discussion :-) ).

>
>> For hardware where CACHE.FLUSH operates one line at a time and
>> branches aren't predicted, it would likely help performance to use a
>> loop where the percentage of overhead is lower, such as:
>>
>>         // t0 = start, t1 = (exclusive) end
>> LOOP:   CACHE.FLUSH   t0, t0, t1
>>         CACHE.FLUSH   t0, t0, t1
>>         CACHE.FLUSH   t0, t0, t1
>>         CACHE.FLUSH   t0, t0, t1
>>         BNE           t0, t1, LOOP
>>
>> In addition to the above requirements, this loop requires 4) that when
>> rs1==rs2, nothing is done since the line containing them can be
>> outside of the address range.
>
> With inclusive bounds, this would make the presently-reserved case of
> "backwards" bounds (rs1 > rs2) a no-op.  Should that no-op status be
> limited to the range that hardware can produce within such a loop?
> (Specifically:  If the lower bound is greater than the upper bound, but
> both bounds refer to the same hardware granule ("cacheline"), the
> instruction is a no-op.  All other "backwards" bounds cases remain
> reserved.)

I don't think the "same hardware granule" condition is sufficient with
inclusive rs2 and "extended results" as the result will be in the
following cacheline after rs2. There is also no hard upper limit for
how much greater than rs2 it can be.

Worse, when the range includes the highest cacheline, the result will
wrap to zero and all of memory will be flushed if there's another
CACHE.FLUSH remaining in the loop.

With an exclusive rs2 and no "extended result" rs1 and rs2 equal is a
nop but rs1 > rs2 is still reserved.

>
>> Changing any of the above 4 conditions, except possibly #2, would, I
>> think, add instructions to the inner loop.  Adding instructions
>> outside the loop might be OK, but adding instructions inside the loop
>> will likely have performance losses in many implementations.
>
> Condition (1) differs from the current draft; the effect on the loop
> under the current draft's rules is to replace BNE with BLTU.  Condition
> (2) is similarly handled.  Condition (4) requires defining part of a
> presently-reserved case.  Condition (3) will probably be added to the

Because it doesn't work for the last cacheline of memory and because it
seems awkward and likely to cause future anomalies, I'm not satisfied by
the answer to (1), (2), or (4).

> next draft.  How difficult is a forward progress guarantee likely to be
> for implementors?

In the past, I haven't seen any trouble guaranteeing forward progress.
If the instruction is interrupted or raises an exception, it doesn't
complete (which is fine). Otherwise it can make some progress in much
the same way that a memory load or store needs to "make progress." In
the case of a memory load or store, making progress means completing one
memory operation, which might, at any time, include evicting one
cacheline and bringing in another. I think the requirement on the
CACHE.FLUSH and most other cacheops doesn't exceed that requirement. If
there are conditions where a memory access or cacheop might not have
enough resources, either can stall in an interruptible state until
the condition is no longer present.

A question that comes to mind: In RV64I, virtual addresses are required
to have some number of upper bits identical. I think something needs to
be said about the case where the address range for a cacheop includes
addresses that don't meet this requirement. Are they ignored?

Another question: Is there an expectation that cacheops on ranges
larger than the cache can traverse the cache once checking each tag for
whether it's in the range? This only works where there is no
translation or pages are larger than cache, but it's much quicker for
those cases. Right now, I think it can only be done if it starts over
after an interrupt, which isn't great.

Bill

Luke Kenneth Casson Leighton

unread,
Jul 17, 2018, 10:25:06 AM7/17/18
to Bill Huffman, jcb6...@gmail.com, RISC-V ISA Dev
Jacob et al,

So the libre RISC-V 3D GPGPU needs scratch RAM; the best general-purpose idea that's free of proprietary sue-happy GPU 800lb gorillas with patents to burn is the cache control proposal.

What would the minimum implementation need to be, to be able to pin some cache and turn it effectively into persistent scratch RAM?

Don't need anything fancy. If the core context-switches, happy for cache pins to be evicted.

L.


--
---
crowd-funded eco-conscious hardware: https://www.crowdsupply.com/eoma68

Bill Huffman

unread,
Jul 20, 2018, 6:45:29 PM7/20/18
to RISC-V ISA Dev
I'd like to go up a level and imagine that we've defined a function that
does a cacheop, say CACHE.FLUSH, over an address range. Given that
various things in OS-land deal with base/size combinations, I'll assume
that and suggest that we want to implement:

void flush_dcache_range(void *ptr, unsigned size);

I will make the following assumptions about the function:

* ptr==0 is allowed.  The address range starts at zero.
* size==0 always means do nothing.
* the last byte of address space can be included - ptr+size
     wraps to zero.
* otherwise ptr+size is not allowed to wrap around, but this
     is not checked.
* size is XLEN bits wide and so cannot quite represent the entire
     address space; it must skip at least one byte.
* very large sizes are not handled in a fully portable
     way.  They may trap and flush the entire cache, for
     example.  If they operate naively, they will be
     extremely slow.

I will also assume that CACHE.FLUSH ensures forward progress. That is,
as long as rs2 > rs1, rd > rs1 (except for wrap around to zero).

Below is the code I expect to see with and without two conditions. I
also want to show two possible code sequences for each case. The first
is a short code sequence that always works. The second is a bit longer
and also always works, but is optimized for the case where the
micro-architecture has CACHE.FLUSH affecting only a single line and/or
lacks branch prediction. The two conditions are:

1) When the range has completed, rd is one byte past the end of
the region requested (rather than returning one byte past
the end of the last cache line affected).
2) rs2 is exclusive (rather than inclusive) and rs2==0 means the
range ends with the last byte of memory.

With the two conditions, the code that might be expected is:

flush_dcache_range:               // a0=ptr, a1=size
        ADD          t0, a0, a1
loop:   CACHE.FLUSH  a0, a0, t0
        BNE          a0, t0, loop
        RET

If rs2==rs1 is a nop, then the following code may be used as optimal for
CACHE.FLUSH affecting only a single line and no branch prediction.

flush_dcache_range:               // a0=ptr, a1=size
        ADD          t0, a0, a1
loop:   CACHE.FLUSH  a0, a0, t0
        CACHE.FLUSH  a0, a0, t0
        CACHE.FLUSH  a0, a0, t0
        CACHE.FLUSH  a0, a0, t0
        BNE          a0, t0, loop
        RET
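
Restating the same contract in C may make it clearer (a sketch only;
__cache_flush_excl() is a hypothetical intrinsic modelling CACHE.FLUSH
under conditions #1 and #2, with base==end a no-op that returns base,
and end==0 meaning "through the last byte of memory"):

#include <stdint.h>

uintptr_t __cache_flush_excl(uintptr_t base, uintptr_t end);  /* hypothetical */

void flush_dcache_range(void *ptr, uintptr_t size)  /* size taken as XLEN-wide */
{
    uintptr_t a   = (uintptr_t)ptr;
    uintptr_t end = a + size;        /* wraps to 0 if the range ends at the top */
    do {
        a = __cache_flush_excl(a, end);
    } while (a != end);
}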

Without these two conditions (as I understand the current draft spec),
the code that might be expected is, instead:

flush_dcache_range:               // a0=ptr, a1=size
        BEQZ         a1, done     // so a0=a1=0 doesn't flush
                                  // entire address range
        ADD          t0, a0, a1
        ADDI         t0, t0, -1
loop:   CACHE.FLUSH  a0, a0, t0
        BEQZ         a0, done     // wrap would start over where
                                  // t0 is near the end of memory
        BLEU         a0, t0, loop
done:   RET
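
For comparison, the same function restated in C (again a sketch;
__cache_flush_incl() is a hypothetical intrinsic modelling the current
draft's inclusive-bound CACHE.FLUSH, whose result wraps to 0 at the top
of memory):

#include <stdint.h>

uintptr_t __cache_flush_incl(uintptr_t base, uintptr_t bound);  /* hypothetical */

void flush_dcache_range(void *ptr, uintptr_t size)  /* size taken as XLEN-wide */
{
    uintptr_t a = (uintptr_t)ptr;
    uintptr_t bound;

    if (size == 0)
        return;                      /* so ptr==0, size==0 doesn't flush everything */
    bound = a + size - 1;            /* inclusive upper bound */
    do {
        a = __cache_flush_incl(a, bound);
    } while (a != 0 && a <= bound);  /* a == 0 means the result wrapped */
}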

Here, if rs2<rs1 is a nop, then the following code may be used as
optimal for CACHE.FLUSH affecting a single line and no branch prediction:

flush_dcache_range:               // a0=ptr, a1=size
        BEQZ         a1, done     // so a0=a1=0 doesn't flush
                                  // entire address range
        ADD          t0, a0, a1
        ADDI         t0, t0, -1
loop:   CACHE.FLUSH  a0, a0, t0
        BEQZ         a0, done     // wrap would start over where
                                  // t0 is near the end of memory
        CACHE.FLUSH  a0, a0, t0
        BEQZ         a0, done     // wrap would start over where
                                  // t0 is near the end of memory
        CACHE.FLUSH  a0, a0, t0
        BEQZ         a0, done     // wrap would start over where
                                  // t0 is near the end of memory
        CACHE.FLUSH  a0, a0, t0
        BEQZ         a0, done     // wrap would start over where
                                  // t0 is near the end of memory
        BLEU         a0, t0, loop
done:   RET

The significant improvement in length, performance, and
comprehensibility of the function with conditions #1 and #2 above leads
me to believe that they are a valuable addition to the spec.

Bill

Rogier Brussee

unread,
Jul 24, 2018, 8:41:25 AM7/24/18
to RISC-V ISA Dev


Op zaterdag 21 juli 2018 00:45:29 UTC+2 schreef Bill Huffman:
I'd like to go up a level and imagine that we've defined a function that
does a cacheop, say CACHE.FLUSH, over an address range.  Given that
various things in OS-land deal with base/size combinations, I'll assume
that and suggest that we want to implement:

void flush_dcache_range(void *ptr, unsigned size);


It seems two functions, using the C++ convention of a begin/end iterator pair, fit even better:
 
/* flush from begin to end _exclusive_; if (uintptr_t)end <= (uintptr_t)begin, do nothing */
void flush_dcache_range(void* begin, void* end)

The one case where modular arithmetic seems to bite you is when you want to include the whole address range, so just make that a special case:

/* flush from begin to end of address range _inclusive_ */
void flush_dcache_from(void* begin)
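
A base/size caller can then sit on top of that pair with a single
dynamic test (a sketch; flush_dcache_base_size() is just an
illustrative wrapper, not part of the suggested interface):

#include <stdint.h>

void flush_dcache_range(void *begin, void *end);  /* as above */
void flush_dcache_from(void *begin);              /* as above */

void flush_dcache_base_size(void *ptr, uintptr_t size)
{
    uintptr_t b = (uintptr_t)ptr;
    if (size == 0)
        return;
    if (b + size == 0)               /* range ends with the last byte of memory */
        flush_dcache_from(ptr);
    else
        flush_dcache_range(ptr, (void *)(b + size));
}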

 

I will make the following assumptions about the function:

* ptr==0 is allowed.  The address range starts at zero
* size==0 always means do nothing.

(uintptr_t)end <= (uintptr_t)begin always means do nothing.
 
* the last byte of address space can be included - ptr+size
     wrap to zero

use 

flush_dcache_from((void*)0x0)

to flush the whole address range, or 

flush_dcache_from((void*)0xabcdef)

to flush the whole address range from address 0xabcdef inclusive.
 
* otherwise ptr+size is not allowed to wrap-around, but this
     is not checked

end is what it is, (uintptr_t)end <= (uintptr_t)begin always means do nothing.
 

* size is XLEN and so cannot quite represent the entire
     address space but must skip at least one byte
* very large sizes are not handled in a fully portable
     way.  They may trap and flush the entire cache, for
     example.  If they operate naively, they will be
     extremely slow.

I will also assume that CACHE.FLUSH ensures forward progress.  That is,
as long as rs2 > rs1, rd > rs1 (except for wrap around to zero).

Below is the code I expect to see with and without two conditions.  I
also want to show two possible code sequences for each case.  The first
is a short code sequence that always works.  The second is a bit longer
and also always works, but is optimized for the case where the
micro-architecture has CACHE.FLUSH affecting only a single line and/or
lacks branch prediction.  The two conditions are:

1) When the range has completed, rd is one byte past the end of
      the region requested (rather than returning one byte past
      the end of the last cache line affected).

I think rd should be the following:

if rs2 != x0:
    if (uintptr_t)rs1 >= (uintptr_t)rs2:
        rd = rs2
    else:
        rd = an address a in the cache with rs1 < a <= rs2 such that
             all cache for addresses [rs1, a) has been flushed
 

2) rs2 is exclusive (rather than inclusive) and rs2==0 means the
      range ends with the last byte of memory.


# Register x0 means end of address space.  If rs2 is a register other than x0 and contains the value 0, it just means byte 0.
if rs2 == x0:
    rd = an address a in the cache with rs1 < a (in particular, a != NULL) such that
         all cache for addresses [rs1, a) has been flushed, or NULL if [rs1, end of address space] has been flushed

Of course this effectively makes CACHE.FLUSH with rs2 == x0 a related but different instruction.

With the two conditions, the code that might be expected is:

flush_dcache_range:               // a0=ptr, a1=size
         ADD          t0, a0, a1
loop:   CACHE.FLUSH  a0, a0, t0
         BNE          a0, t0, loop
         RET


That would become: 

flush_dcache_range:               // a0=begin, a1=end

loop:   CACHE.FLUSH  a0, a0, a1
        BNE          a0, a1, loop
        RET

flush_dcache_from:                // a0=begin

loop:   CACHE.FLUSH  a0, a0, x0
        BNEZ         a0, loop
        RET


If rs2==rs1 is a nop, then the following code may be used as optimal for
CACHE.FLUSH affecting only a single line and no branch prediction.

flush_dcache_range:               // a0=ptr, a1=size
         ADD          t0, a0, a1
loop: CACHE.FLUSH  a0, a0, t0
         CACHE.FLUSH  a0, a0, t0
         CACHE.FLUSH  a0, a0, t0
         CACHE.FLUSH  a0, a0, t0
         BNE          a0, t0, loop
         RET

 
If you want to unroll loops that would be

flush_dcache_range:               // a0=begin, a1=end
loop:   CACHE.FLUSH  a0, a0, a1
        CACHE.FLUSH  a0, a0, a1
        CACHE.FLUSH  a0, a0, a1
        CACHE.FLUSH  a0, a0, a1
        BNE          a0, a1, loop
        RET

flush_dcache_from:                // a0=begin
        LI           t0, -1
loop:   CACHE.FLUSH  a0, a0, t0
        CACHE.FLUSH  a0, a0, t0
        CACHE.FLUSH  a0, a0, t0
        CACHE.FLUSH  a0, a0, x0
        BNEZ         a0, loop
        RET

Bill Huffman

unread,
Jul 24, 2018, 11:21:41 AM7/24/18
to isa...@groups.riscv.org

If I understand correctly, you're agreeing with the two concepts I'd like to see (rd is related to inputs not to cache lines and rs2 is exclusive) and suggesting a way to take away the ugly part I have because wrap-to-zero needs to be allowed.  The suggestion is to use x0 to demarcate the wrap.  I think everything else follows from that.

I certainly see the value of not having wrap-to-zero.  But it seems to me that having two functions requires a caller to test the required endpoint dynamically and call one function or the other.  That makes the ugliness show itself at a higher level.

Putting the ugliness at an intermediate level might mean one function call with an internal test and two loops, using the instruction-level semantics you've suggested for the one that ends at the end of memory.  But I would rather see the wrap-to-zero than two loops.

I would rather cover any ugliness at the lowest reasonable level.  Thus one routine with begin and size (without built-in wrap involved with end).  And instructions which allow a single loop.

      Bill


Guy Lemieux

unread,
Jul 24, 2018, 11:56:32 AM7/24/18
to Bill Huffman, isa...@groups.riscv.org
I haven't followed this thread in great detail, but it appears some people are trying to make the software as elegant as possible by saving a few instructions.

The overriding goal should be to keep the hardware simple. Using (start,length) is a nice idea, but requires an extra adder in the hardware.

As for exclusive vs inclusive, the biggest advantage of inclusive is that it fits an entire address range in software without using extra address bits to wrap around. E.g., in a 64K address space, it ranges from 0x0000 to 0xffff inclusive, which makes intuitive sense from a hardware perspective. The exclusive range would be 0x0000 to 0x10000, which needs 17 bits and would be wrapped to 0x0000 in software. In hardware, we can always add an extra bit to the address calculations if it makes sense, but the software and ISA layers don't have access.

The downfall to inclusion, of course, is that returning “one past” the last address affected would return 0x0000, the same as if nothing had been done. This only happens when the entire address range is specified. I believe the recommendation from Jacob was to have software always split the full address range in half so there is no ambiguity in the result. This is an unfortunate but necessary compromise — as you can see there are many potential variations but each had a downfall one way or the other.

I’m a big advocate of keeping hardware simple. Yet, there is a huge performance penalty when large address ranges are specified for cache ops if the ISA does just one cache line and returns an incremented address. Hence, even in small processors, I advocate fully handling the address range in hardware. Only hardware knows the precise cache structure, and can keep things optimized by only iterating through the cache lines once (possibly handling multiple sets in parallel every cycle) while applying the full address range filter.

Guy


Rogier Brussee

unread,
Jul 24, 2018, 4:19:24 PM7/24/18
to RISC-V ISA Dev


Op dinsdag 24 juli 2018 17:21:41 UTC+2 schreef Bill Huffman:

If I understand correctly, you're agreeing with the two concepts I'd like to see (rd is related to inputs not to cache lines and rs2 is exclusive) and suggesting a way to take away the ugly part


Quite possibly bothering about inclusive and exclusive wrapping is not actually necessary. 

The inquisitive reader can find plenty of examples of actual cache-flush routines.


The ARM64 Linux interface, for example:
/*
 *	MM Cache Management
 *	===================
 *
 *	The arch/arm64/mm/cache.S implements these methods.
 *
 *	Start addresses are inclusive and end addresses are exclusive; start
 *	addresses should be rounded down, end addresses up.
 *
 *	See Documentation/cachetlb.txt for more information. Please note that
 *	the implementation assumes non-aliasing VIPT D-cache and (aliasing)
 *	VIPT I-cache.
 *
 *	flush_cache_mm(mm)
 *
 *		Clean and invalidate all user space cache entries
 *		before a change of page tables.
 *
 *	flush_icache_range(start, end)
 *
 *		Ensure coherency between the I-cache and the D-cache in the
 *		region described by start, end.
 *		- start  - virtual start address
 *		- end    - virtual end address
 *
 *	invalidate_icache_range(start, end)
 *
 *		Invalidate the I-cache in the region described by start, end.
 *		- start  - virtual start address
 *		- end    - virtual end address
 *
 *	__flush_cache_user_range(start, end)
 *
 *		Ensure coherency between the I-cache and the D-cache in the
 *		region described by start, end.
 *		- start  - virtual start address
 *		- end    - virtual end address
 *
 *	__flush_dcache_area(kaddr, size)
 *
 *		Ensure that the data held in page is written back.
 *		- kaddr  - page address
 *		- size   - region size
 */
extern void flush_icache_range(unsigned long start, unsigned long end);
extern int  invalidate_icache_range(unsigned long start, unsigned long end);
extern void __flush_dcache_area(void *addr, size_t len);
extern void __inval_dcache_area(void *addr, size_t len);
extern void __clean_dcache_area_poc(void *addr, size_t len);
extern void __clean_dcache_area_pop(void *addr, size_t len);
extern void __clean_dcache_area_pou(void *addr, size_t len);
extern long __flush_cache_user_range(unsigned long start, unsigned long end);
extern void sync_icache_aliases(void *kaddr, unsigned long len);

 

I have because wrap-to-zero needs to be allowed.  The suggestion is to use x0 to demarcate the wrap.


The suggestion is to accept that modular arithmetic makes it impossible to cleanly distinguish between [0,0) as an empty range and [0,0) as the full range, unless you have a separate way to say that the upper bound of the latter interval wraps and is 1 << XLEN (much like the difference between 0 and 360 degrees). That is what the distinction between a register containing 0 and the register x0 does for you.
 

  I think everything else follows from that.

I certainly see the value of not having wrap-to-zero.  But it seems to me that having two functions requires a caller to test the required endpoint dynamically and call one function or the other. 


They seem like two different use cases anyway.  The kernel seems to consider a wrap a bug, so it may not even be interested in flushing all of the cache, or all cache starting from an address.  I suggested using begin/end range semantics so that if you really, really want to wrap, you have to take measures above the interface boundary.
 

That makes the ugliness show itself at a higher level.


Yes, by design, so you have to think about it and have software do the heavy lifting if needed.  But fortunately, this gets deeply buried in OS kernels and compilers.
 


Putting the ugliness at an intermediate level might mean one function call with an internal test and two loops using the instruction level semantics you've suggested for one that ends at the end of memory.  But I would rather see the wrap-to-zero than two loops.


I sort of hoped my suggestion would be easier to understand and easier to implement in hardware, YMMV.

Best

Rogier

Rogier Brussee

unread,
Jul 24, 2018, 4:48:24 PM7/24/18
to RISC-V ISA Dev, huf...@cadence.com


Op dinsdag 24 juli 2018 17:56:32 UTC+2 schreef glemieux:
I haven't followed this thread in great detail, but it appears some people are trying to make the software as elegant as possible by saving a few instructions.

The overriding goal should be to keep the hardware simple. Using (start,length) is a nice idea, but requires an extra adder in the hardware.

As for exclusive vs inclusive, the biggest advantage of inclusive is that it fits an entire address range in software without using extra address bits to wrap around. E.g., in a 64K address space, it ranges from 0x0000 to 0xffff inclusive, which makes intuitive sense from a hardware perspective. The exclusive range would be 0x0000 to 0x10000, which needs 17 bits and would be wrapped to 0x0000 in software. In hardware, we can always add an extra bit to the address calculations if it makes sense, but the software and ISA layers don't have access.

The downfall to inclusion, of course, is that returning “one past” the last address affected would return 0x0000, the same as if nothing had been done. This only happens when the entire address range is specified.

That's why I suggested using x0 for end of address range inclusive.
 
I believe the recommendation from Jacob was to have software always split the full address range in half so there is no ambiguity in the result. This is an unfortunate but necessary compromise — as you can see there are many potential variations but each had a downfall one way or the other.

I’m a big advocate of keeping hardware simple. Yet, there is a huge performance penalty when large address ranges are specified for cache ops if the ISA does just one cache line and returns an incremented address. Hence, even in small processors, I advocate fully handling the address range in hardware. Only hardware knows the precise cache structure, and can keep things optimized by only iterating through the cache lines once (possibly handling multiple sets in parallel every cycle) while applying the full address range filter.


That's what I hoped my proposal allowed, by leaving leeway for how much the CPU has to flush when calling CACHE.FLUSH once, and allowing the CPU "to take a breath" when flushing lots of cache, to keep interrupt latencies decent.