Proposal: Explicit cache-control instructions (draft 6 after feedback)


Jacob Bachmeyer

Jun 22, 2018, 9:18:42 PM
to RISC-V ISA Dev
Previous discussions suggested that explicit cache-control instructions
could be useful, but RISC-V has some constraints here that other
architectures lack, namely that caching must be transparent to the user ISA.

I propose a new minor opcode REGION := 3'b001 within the existing
MISC-MEM major opcode. Instructions in REGION are R-type and use rs1 to
indicate a base address, use rs2 to indicate an (inclusive) upper bound
address, and produce a result in rd that is the first address after the
highest address affected by the operation. If an operation cannot be
applied to the entire requested region at once, an implementation must
reduce the upper bound to encompass a region to which the operation can
be applied at once and the produced value must reflect this reduction.
If rd is x0, the instruction has no directly visible effects and can be
executed entirely asynchronously as an implementation option.
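
(Non-normative illustration: the wrapper type below is hypothetical and
only models the rs1/rs2/rd convention just described.  Software that
must cover an arbitrarily large region can simply retry from the
produced address, treating "no forward progress" as the no-op case.)

    #include <stdint.h>

    /* Hypothetical signature shared by the MISC-MEM/REGION operations
     * in this draft: rs1 = base, rs2 = inclusive upper bound, rd = the
     * first address after the highest address affected (== base when
     * the operation was treated as a no-op). */
    typedef uintptr_t (*region_op_fn)(uintptr_t base, uintptr_t bound);

    /* Cover an arbitrarily large region even when the implementation
     * shortens each request, by restarting from the produced address. */
    static void apply_to_region(region_op_fn op,
                                uintptr_t base, uintptr_t bound)
    {
        uintptr_t next = base;
        while (next <= bound) {
            uintptr_t done = op(next, bound);
            if (done <= next)      /* no progress: no-op implementation */
                break;
            next = done;
        }
    }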

Non-destructive operations permit an implementation to expand a
requested region on both ends to meet hardware granularity for the
operation. An application can infer alignment from the produced value
if it is a concern. As a practical matter, cacheline lengths are
expected to be declared in the processor configuration ROM.

Destructive operations are a thornier issue, and are resolved by
requiring any partial cachelines (at most 2 -- first and last) to be
flushed or prefetched instead of performing the requested operation on
those cachelines. Implementations may perform the destructive operation
on the parts of these cachelines included in the region, or may simply
flush or prefetch the entire cacheline.

If the upper and lower bounds are specified by the same register, the
smallest region that can be affected that includes the lower bound is
affected if the operation is non-destructive; destructive operations are
no-ops. Otherwise, the upper bound must be greater than the lower
bound; the contrary case is reserved.

??? Issue for discussion: what happens if this reserved case is executed?

In general, this proposal uses "cacheline" to describe the hardware
granularity for an operation that affects multiple words of memory or
address space. Where these operations are implemented using traditional
caches, the use of the term "cacheline" is entirely accurate, but this
proposal does not prevent implementations from using other means to
implement these instructions.

Similarly, this proposal uses "main memory" to refer to any ultimate
memory bus target, including MMIO or other hardware.

Instructions in MISC-MEM/REGION may be implemented as no-ops if an
implementation lacks the corresponding hardware. The value produced in
this case is the base address.

The instructions defined in this proposal are *not* hints, however: if
caches exist, the CACHE.* instructions *must* have their defined
effects. Similarly, the prefetch instructions can be no-ops only if the
implementation has neither caches nor prefetch buffers. Likewise for
MEM.DISCARD and MEM.REWRITE: they *must* actually have the stated
effects if hardware such as caches or write buffers is present.

Synchronous operations imply all relevant fences: the effect of a
synchronous CACHE.FLUSH instruction must be globally visible before any
subsequent memory accesses begin, for example.

??? Issue for discussion: exactly what fences are implied and can any
of these instructions be defined purely in terms of fences or other
ordering constraints applied to the memory model? Note that this
proposal predates the new memory model and has not yet been aligned with
the new model.

The new MISC-MEM/REGION space will have room for 128 opcodes, one of
which is the existing FENCE.I. I propose:

[note as of draft 6: preliminary implementations have been announced
that support CACHE.WRITEBACK, CACHE.FLUSH, and MEM.DISCARD from draft 5.]


===Fences===

[function 7'b0000000 is the existing FENCE.I instruction]

[function 7'b0000001 reserved]

FENCE.RD ("range data fence")
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000010}
Perform a conservative fence affecting only data accesses to the
chosen region. This instruction always has visible effects on memory
consistency and is therefore synchronous in all cases. Fences must
fully complete and are not permitted to fail.

FENCE.RI ("range instruction fence")
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000011}
Perform the equivalent of FENCE.I, but affecting only instruction fetches
from the chosen region. This instruction always has visible effects on
memory consistency and is therefore synchronous in all cases. Fences
must fully complete and are not permitted to fail.
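
(Non-normative illustration: assuming the draft encodings above and the
standard MISC-MEM major opcode 7'b0001111, a toolchain without
dedicated mnemonics could emit FENCE.RD from C with the GNU assembler's
".insn" directive.  The wrapper name is hypothetical and the encoding
is this draft's, not an accepted instruction.)

    #include <stdint.h>

    /* FENCE.RD sketch: opcode MISC-MEM (0x0f), funct3 REGION (0b001),
     * funct7 0b0000010; rs1 = base, rs2 = inclusive bound, rd = first
     * address after the affected region.  Draft encoding only. */
    static inline uintptr_t fence_rd(uintptr_t base, uintptr_t bound)
    {
        uintptr_t produced;
        __asm__ volatile (".insn r 0x0f, 0x1, 0x02, %0, %1, %2"
                          : "=r"(produced)
                          : "r"(base), "r"(bound)
                          : "memory");
        return produced;
    }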

===Non-destructive cache control===

====Prefetch====

All prefetch instructions ignore page faults and other access faults.
In general use, applications should use rd == x0 for prefetching,
although this is not required. If a fault occurs during a synchronous
prefetch (rd != x0), the operation must terminate and produce the
faulting address. A fault occurring during an asynchronous prefetch (rd
== x0) may cause the prefetching to stop or the implementation may
attempt to continue prefetching past the faulting location.

MEM.PF0 - MEM.PF3 ("prefetch, levels 0 - 3")
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001000}
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001001}
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001010}
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001011}
Load as much of the chosen region as possible into the data cache,
with varying levels of expected temporal access locality. The number in
the opcode is proportionate to the expected frequency of access to the
prefetched data: MEM.PF3 is for data that will be very heavily used.

MEM.PF.EXCL ("prefetch exclusive")
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001100}
Load as much of the chosen region as possible into the data cache,
with the expectation that future stores will soon occur to this region.
In a cache-coherent system, any locks required for writing the affected
cachelines should be acquired.

MEM.PF.ONCE ("prefetch once")
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001101}
Prefetch as much of the region as possible, but expect the prefetched
data to be used at most once in any order.

MEM.PF.STREAM ("prefetch stream")
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001110}
Initiate streaming prefetch of the region, expecting the prefetched
data to be used at most once and in sequential order, while minimizing
cache pollution. This operation may activate a prefetch unit and
prefetch the region incrementally if rd is x0. Software is expected to
access the region sequentially, starting at the base address.
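
(Non-normative illustration: the wrapper below is hypothetical.  A
read-once pass over a large buffer might issue MEM.PF.STREAM
asynchronously and then read sequentially from the base address, as
the definition above expects.)

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical wrapper for an asynchronous MEM.PF.STREAM (rd = x0):
     * hints that [base, bound] will be read once, in order. */
    extern void mem_pf_stream(const void *base, const void *bound);

    /* Sum a large buffer: issue the streaming hint once, then read
     * sequentially starting at the base address. */
    double sum_samples(const double *buf, size_t n)
    {
        double acc = 0.0;
        if (n != 0)
            mem_pf_stream(buf, &buf[n - 1]);   /* bound is inclusive */
        for (size_t i = 0; i < n; i++)
            acc += buf[i];
        return acc;
    }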

MEM.PF.TEXT ("prefetch program text")
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001111}
Load as much of the chosen region as possible into the instruction cache.

====Cacheline pinning====

??? Issue for discussion: should a page fault while pinning cachelines
cause a trap to be taken or simply cause the operation to stop or fail?
Should CACHE.PIN use the same approach to TLB fills as MEM.REWRITE uses?
??? Issue for discussion: what if another processor attempts to write
to an address in a cacheline pinned on this processor? [partially addressed]

CACHE.PIN
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0010000}
Arrange for as much of the chosen region as possible to be accessible
with minimal delay and no traffic to main memory. Pinning a region is
idempotent and an implementation may pin a larger region than requested,
provided that an unpin operation with the same base and bound will also
unpin the larger region.
One possible implementation is to load as much of the chosen region as
possible into the data cache and keep it there until unpinned. Another
implementation is to configure a scratchpad RAM and map it over at least
the chosen region, preloading it with data from main memory.
Scratchpads may be processor-local, but writes to a scratchpad mapped
with CACHE.PIN must be visible to other nodes in a coherent system.
Implementations are expected to ensure that pinned cachelines will not
impede the efficacy of a cache. Implementations with fully-associative
caches may permit any number of pins, provided that at least one
cacheline remains available for normal use. Implementations with N-way
set associative caches may support pinning up to (N-1) ways within each
set, provided that at least one way in each set remains available for
normal use. Implementations with direct-mapped caches should not pin
cachelines, but may still use CACHE.PIN to configure an overlay
scratchpad, which may itself use storage shared with caches, such that
mapping the scratchpad decreases the size of the cache.

Implementations may support both cacheline pinning and scratchpads,
choosing which to use to perform a CACHE.PIN operation in an
implementation-defined manner.

Pinned dirty cachelines may be written back at any time, after which
they are clean but remain valid. Pinned cachelines may be used as
writable scratchpad storage overlaying ROM; any errors writing back a
pinned cacheline are ignored.

CACHE.UNPIN
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0010001}
Explicitly release a pin set with CACHE.PIN. Pinned regions are also
implicitly released if the memory protection and virtual address mapping
is changed. (Specifically, a write to the current satp CSR or an
SFENCE.VM will unpin all cachelines as a side effect, unless the
implementation partitions its cache by ASID. Even with ASID-partitioned
caches, changing the root page number associated with an ASID unpins all
cachelines belonging to that ASID.) Unpinning a region does not
immediately remove it from the cache. Unpinning a region always
succeeds, even if parts of the region were not pinned. For an
implementation that implements CACHE.PIN using scratchpad RAM, unpinning
a region that uses a scratchpad causes the current contents of the
scratchpad to be written to main memory if possible.

An implementation with hardware-enforced cache coherency using an
invalidation-based coherency protocol may force pinned cachelines to be
written back and unpinned if another processor attempts to write to a
cacheline pinned locally. Implementations using an update-based
coherency protocol may update pinned cachelines "in-place" when another
processor attempts to write to a cacheline pinned locally. Either
solution adversely impacts performance and software should avoid writing
to pinned cachelines on remote harts.
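
(Non-normative illustration of the pin/unpin usage model, with
hypothetical wrappers.  Because pinning is idempotent and a pin may be
lost, for example across a context switch, software can simply reissue
the pin before each batch of work and unpin once with the same base
and bound that it used to pin.)

    #include <stdint.h>

    /* Hypothetical wrappers for CACHE.PIN / CACHE.UNPIN; both take the
     * base and inclusive bound and return the first address after the
     * region actually affected. */
    extern uintptr_t cache_pin(uintptr_t base, uintptr_t bound);
    extern uintptr_t cache_unpin(uintptr_t base, uintptr_t bound);

    #define TABLE_WORDS 1024
    static uint32_t lut[TABLE_WORDS];

    /* Keep a hot lookup table resident while processing batches. */
    void process_batches(const uint32_t *in, uint32_t *out,
                         int batches, int len)
    {
        uintptr_t lo = (uintptr_t)&lut[0];
        uintptr_t hi = (uintptr_t)&lut[TABLE_WORDS - 1];

        for (int b = 0; b < batches; b++) {
            (void)cache_pin(lo, hi);        /* re-pin; idempotent */
            for (int i = 0; i < len; i++)
                out[b * len + i] = lut[in[b * len + i] % TABLE_WORDS];
        }
        (void)cache_unpin(lo, hi);          /* same base/bound as the pin */
    }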

And two M-mode-only privileged instructions:

CACHE.PIN.I
{opcode, funct3, funct7, MODE} = {$MISC-MEM, $REGION, 7'b1010000, 2'b11}
Arrange for code to execute from as much of the chosen region as
possible without traffic to main memory. Pinning a region is idempotent.

CACHE.UNPIN.I
{opcode, funct3, funct7, MODE} = {$MISC-MEM, $REGION, 7'b1010001, 2'b11}
Release resources pinned with CACHE.PIN.I. Pins are idempotent. One
unpin instruction will unpin the chosen region completely, regardless of
how many times it was pinned. Unpinning always succeeds.

====Flush====

CACHE.WRITEBACK
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000100}
Write back any cachelines in the requested region.

CACHE.FLUSH
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000101}
Write back any cachelines in the requested region (as by CACHE.WRITEBACK),
then mark them invalid (as by MEM.DISCARD). Flushed cachelines
are automatically unpinned.

Rationale for including CACHE.FLUSH: small implementations may
significantly benefit from combining CACHE.WRITEBACK and MEM.DISCARD;
the implementations that most benefit lack the infrastructure to achieve
such combination by macro-op fusion.


===Declaring data obsolescence===

These operations declare data to be obsolete and unimportant. In
fully-coherent systems, they are two sides of the same coin:
MEM.DISCARD declares data not yet written to main memory to be obsolete,
while MEM.REWRITE declares data in main memory to be obsolete and
indicates that software on this hart will soon overwrite the region.
These operations are useful in general: a function prologue could use
MEM.REWRITE to allocate a stack frame, while a function epilogue could
use MEM.DISCARD to release a stack frame without requiring that the
now-obsolete local variables ever be written back to main memory. In
non-coherent systems, MEM.DISCARD is also an important tool for
software-enforced coherency, since its semantics provide an invalidate
operation on all caches on the path between a hart and main memory.
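
(Non-normative illustration of the stack-frame idea above, written with
hypothetical C wrappers rather than compiler-generated prologue and
epilogue code; partial completion of MEM.REWRITE is ignored here for
brevity.)

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical wrappers: base and inclusive bound in, first address
     * after the region actually affected out. */
    extern uintptr_t mem_rewrite(uintptr_t base, uintptr_t bound);
    extern uintptr_t mem_discard(uintptr_t base, uintptr_t bound);

    /* A function with a large scratch buffer: declare intent to
     * overwrite it on entry (no stale data need be fetched) and declare
     * it obsolete on exit (dead locals need not be written back). */
    int checksum_block(const uint8_t *src, int n)
    {
        uint8_t scratch[4096];
        uintptr_t lo = (uintptr_t)&scratch[0];
        uintptr_t hi = (uintptr_t)&scratch[sizeof scratch - 1];
        int limit = n < (int)sizeof scratch ? n : (int)sizeof scratch;

        (void)mem_rewrite(lo, hi);     /* "prologue": about to overwrite */
        memcpy(scratch, src, (size_t)limit);

        int sum = 0;
        for (int i = 0; i < limit; i++)   /* read only what was written */
            sum += scratch[i];

        (void)mem_discard(lo, hi);     /* "epilogue": contents now dead */
        return sum;
    }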

The declarations of obsolescence produced by these instructions are
global and affect all caches on the path between a hart and main memory
and all caches coherent with those caches, but are not required to
affect non-coherent caches not on the direct path between a hart and
main memory. Implementations depending on software to maintain
coherency in such situations must provide other means (MMIO control
registers, for example) to force invalidations in remote non-coherent
caches.

These instructions create regions with undefined contents and share a
requirement that foreign data never be introduced. Foreign data is,
simply, data that was not previously visible to the current hart at the
current privilege level at any address. Operating systems zero pages
before attaching them to user address spaces to prevent foreign data
from appearing in freshly-allocated pages. Implementations must ensure
that these instructions do not cause foreign data to leak through caches
or other structures.

These instructions must be executed synchronously. If executed with x0
as rd, they are no-ops.

MEM.DISCARD
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000110}
Declare the contents of the region obsolete, dropping any copies
present between the hart's load/store unit and main memory without
performing writes to main memory. The contents of the region are
undefined after the operation completes, but shall be data that was
previously written to the region and shall not include foreign data.
If the region does not align with cacheline boundaries, any partial
cachelines are written back. If hardware requires such, the full
contents of a cacheline partially included may be written back,
including data just declared obsolete. In a non-coherent system,
partial cachelines written back are also invalidated. In a system with
hardware cache coherency, partial cachelines must be written back, but
may remain valid.
Any cachelines fully affected by MEM.DISCARD are automatically unpinned.
NOTE WELL: MEM.DISCARD is *not* an "undo" operation for memory writes
-- an implementation is permitted to aggressively writeback dirty
cachelines, or even to omit caches entirely. *ANY* combination of "old"
and "new" data may appear in the region after executing MEM.DISCARD.

MEM.REWRITE
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000111}
Declare the contents of the region obsolete, indicating that the
current hart will soon overwrite the entire region. Reading any datum
from the region before the current hart has written that datum (or other
data fully overlapping that datum) is incorrect behavior and produces
undefined results, but shall return data as described below for security
reasons. Note that undefined results may include destruction of nearby
data within the region. For optimal performance, software should write
the entire region before reading any part of the region and should do so
sequentially, starting at the base address and moving towards the
address produced by MEM.REWRITE.
For security reasons, implementations must ensure that cachelines
allocated by MEM.REWRITE appear to contain either all-zeros or all-ones
if invalidly read. The choice of all-zeros or all-ones is left to
implementation convenience, but must be consistent and fixed for any
particular hart.
TLB fills occur normally as for writes to the region and must appear
to occur sequentially, starting at the base address. A page fault in
the middle of the region causes the operation to stop and produce the
faulting address. A page fault at the base address causes a page fault
trap to be taken.
Implementations with coherent caches should arrange for all cachelines
in the region to be in a state that permits the current hart to
immediately overwrite the region with no further delay. In common
cache-coherency protocols, this is an "exclusive" state.
An implementation may impose a maximum size on the region that can
have a rewrite pending. If software declares intent to overwrite a larger
region than the implementation can prepare at once, the operation must
complete partially and return the first address beyond the region
immediately prepared for overwrite. Software is expected to overwrite
the region prepared, then iterate for the next part of the region that
software intends to overwrite until the entire larger region is overwritten.
If the region does not align with cacheline boundaries, any partial
cachelines are prefetched as by MEM.PF.EXCL. If hardware requires such,
the full contents of a cacheline partially included may be loaded from
memory, including data just declared obsolete.
NOTE WELL: MEM.REWRITE is *not* memset(3) -- any portion of the
region prepared for overwrite already present in cache will *retain* its
previously-visible contents.
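
(Non-normative illustration: a memset(3)-style loop that handles the
partial completion described above by iterating on the produced value,
using a hypothetical wrapper.)

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical wrapper: rs1 = base, rs2 = inclusive bound; returns
     * the first address after the part of the region actually prepared
     * for overwrite (== base for a no-op implementation). */
    extern uintptr_t mem_rewrite(uintptr_t base, uintptr_t bound);

    /* Prepare a chunk, overwrite exactly that chunk, then iterate. */
    void fill_bytes(uint8_t *buf, size_t n, uint8_t value)
    {
        if (n == 0)
            return;
        uintptr_t next  = (uintptr_t)buf;
        uintptr_t bound = (uintptr_t)buf + n - 1;
        while (next <= bound) {
            uintptr_t done = mem_rewrite(next, bound);
            if (done <= next || done > bound + 1)
                done = bound + 1;        /* no-op or out of range: finish */
            for (uint8_t *p = (uint8_t *)next; p != (uint8_t *)done; p++)
                *p = value;              /* overwrite before any read */
            next = done;
        }
    }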

MEM.REWRITE appears to be a relatively novel operation and previous
iterations of this proposal have produced considerable confusion. While
the above semantics are the required behavior, there are different ways
to implement them. One simple option is to temporarily mark the region
as "write-through" in internal configuration. Another option is to
allocate cachelines, but load the allocated cachelines with all-zeros or
all-ones instead of fetching contents from memory. A third option is to
track whether cachelines have been overwritten and use a monitor trap to
zero cachelines that software attempts to invalidly read. A fourth
option is to provide dedicated write-combining buffers for MEM.REWRITE.
In systems that implement MEM.REWRITE using cache operations,
MEM.REWRITE allocates cachelines, marking them "valid, exclusive, dirty"
and filling them with a constant without reading from main memory.
Other cachelines may be evicted to make room if needed but
implementations should avoid evicting data recently fetched with
MEM.PF.ONCE or MEM.PF.STREAM, as software may intend to copy that data
into the region. Implementations are recommended to permit at most half
of a data cache to be allocated for MEM.REWRITE if data has been
recently prefetched into the cache to aid in optimizing memcpy(3), but
may permit the full data cache to be used to aid in optimizing
memset(3). In particular, an active asynchronous MEM.PF.ONCE or
MEM.PF.STREAM ("active" meaning that the data prefetched has not yet
been read) can be taken as a hint that MEM.REWRITE is preparing to copy
data and should use at most half or so of the data cache.
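
(Non-normative illustration of the memcpy(3) pattern discussed above,
with hypothetical wrappers.  Handling of partial MEM.REWRITE completion
is omitted here; a production copy would iterate on the produced value
as shown for the memset(3) case.)

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical wrappers for the draft instructions used below. */
    extern void      mem_pf_stream(const void *base, const void *bound);
    extern uintptr_t mem_rewrite(uintptr_t base, uintptr_t bound);

    /* Stream the source in once (so the rewrite is taken as a copy, not
     * a clear), prepare the destination for overwrite, then copy
     * sequentially from the base address upward. */
    void copy_bytes(uint8_t *dst, const uint8_t *src, size_t n)
    {
        if (n == 0)
            return;
        mem_pf_stream(src, src + n - 1);            /* read-once, in order */
        (void)mem_rewrite((uintptr_t)dst, (uintptr_t)dst + n - 1);
        for (size_t i = 0; i < n; i++)              /* write before reading */
            dst[i] = src[i];
    }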


??? Issue for discussion: the requirements necessary for MEM.REWRITE to
be safe come very close to implementing memset(3) for the special case
of a zero value. Is there significant incremental cost to going the
rest of the way and changing MEM.REWRITE to MEM.CLEAR?


=== ===

Thoughts?

Thanks to:
[draft 1]
Bruce Hoult for citing a problem with the HiFive board that inspired
the I-cache pins.
[draft 2]
Stefan O'Rear for suggesting the produced value should point to the
first address after the affected region.
Alex Elsayed for pointing out serious problems with expanding the
region for a destructive operation and suggesting that "backwards"
bounds be left reserved.
Guy Lemieux for pointing out that pinning was insufficiently specified.
Andrew Waterman for suggesting that MISC-MEM/REGION could be encoded
around the existing FENCE.I instruction.
Allen Baum for pointing out the incomplete handling of page faults.
[draft 3]
Guy Lemieux for raising issues that inspired renaming PREZERO to RELAX.
Chuanhua Chang for suggesting that explicit invalidation should unpin
cachelines.
Guy Lemieux for persistently asking for CACHE.FLUSH and giving
enough evidence to support that position.
Guy Lemieux and Andrew Waterman for discussion that led to rewriting a
more general description for pinning.
[draft 4]
Guy Lemieux for suggesting that CACHE.WRITE be renamed CACHE.WRITEBACK.
Allen J. Baum and Guy Lemieux for suggestions that led to rewriting the
destructive operations.
[draft 5]
Allen Baum for offering a clarification for the case of using the same
register for both bounds to select a minimal-length region.
[draft 6]
Aaron Serverance for highlighting possible issues with the new memory
model and other minor issues.
Albert Cahalan for suggesting a use case for pinned cachelines making
ROM appear writable.
Christoph Hellwig for pointing out poor wording in various parts of the
proposal.
Guy Lemieux for pointing out serious potential issues with cacheline
pinning and coherency.
Cesar Eduardo Barros for providing the example that proved that
MEM.REWRITE must appear to load a constant when allocating cachelines.
Others had raised concerns, but Cesar Eduardo Barros provided the
counter-example that proved the previous semantics to be unsafe.
Richard Herveille for raising issues related to streaming data access
patterns.


-- Jacob

Paul A. Clayton

Jun 24, 2018, 11:51:15 PM
to jcb6...@gmail.com, RISC-V ISA Dev
On 6/22/18, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
[snip]
> ====Prefetch====
>
> All prefetch instructions ignore page faults and other access faults.
> In general use, applications should use rd == x0 for prefetching,
> although this is not required. If a fault occurs during a synchronous
> prefetch (rd != x0), the operation must terminate and produce the
> faulting address. A fault occurring during an asynchronous prefetch (rd
> == x0) may cause the prefetching to stop or the implementation may
> attempt to continue prefetching past the faulting location.

This seems to be hint-like semantics (even though it is only triggered
on a fault).

> MEM.PF0 - MEM.PF3 ("prefetch, levels 0 - 3")
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001000}
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001001}
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001010}
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001011}
> Load as much of the chosen region as possible into the data cache,
> with varying levels of expected temporal access locality. The number in
> the opcode is proportionate to the expected frequency of access to the
> prefetched data: MEM.PF3 is for data that will be very heavily used.

"heavily used" seems an improper term with respect to temporal
locality. A memory region can have the same number or frequency
of accesses ("heaviness" can refer to "weight" or "density") but
different use lifetimes.

If this information is intended to communicate temporal locality
(and not some benefit measure of caching), then prefetch once
might be merged as the lowest temporal locality.

(Utility, persistence, and criticality are different measures that
software may wish to communicate to the memory system.)

> MEM.PF.EXCL ("prefetch exclusive")
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001100}
> Load as much of the chosen region as possible into the data cache,
> with the expectation that future stores will soon occur to this region.
> In a cache-coherent system, any locks required for writing the affected
> cachelines should be acquired.

It may be useful to make a distinction between prefetch for write
where reads are not expected but the region is not guaranteed to
be overwritten. An implementation might support general avoidance
of read-for-ownership (e.g., finer-grained validity indicators) but
still benefit from a write prefetch to establish ownership.

> MEM.PF.ONCE ("prefetch once")
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001101}
> Prefetch as much of the region as possible, but expect the prefetched
> data to be used at most once in any order.

"used once" may be defined at byte level or at cache block level.
If the use does not have considerable spatial locality at cache block
granularity, use once at byte level would have a different intent
than use once at a coarser granularity. With cache-block granular
access, the cache can evict after the first access; with byte-granular
access and lower spatio-temporal locality, another byte within a
cache block may be accessed relatively long after the first access
to the cache block (so software would not want the block to be
marked for eviction after the first block access).

> MEM.PF.STREAM ("prefetch stream")
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001110}
> Initiate streaming prefetch of the region, expecting the prefetched
> data to be used at most once and in sequential order, while minimizing
> cache pollution. This operation may activate a prefetch unit and
> prefetch the region incrementally if rd is x0. Software is expected to
> access the region sequentially, starting at the base address.

It may be useful to include a stride for stream prefetching.

> MEM.PF.TEXT ("prefetch program text")
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001111}
> Load as much of the chosen region as possible into the instruction cache.

With a unified Ln cache, would the instruction prefetching "overflow"
into the Ln cache? (Similar questions apply to data.)

> ====Cacheline pinning====
>
> ??? Issue for discussion: should a page fault while pinning cachelines
> cause a trap to be taken or simply cause the operation to stop or fail?
> Should CACHE.PIN use the same approach to TLB fills as MEM.REWRITE uses?
> ??? Issue for discussion: what if another processor attempts to write
> to an address in a cacheline pinned on this processor? [partially
> addressed]
>
> CACHE.PIN
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0010000}
> Arrange for as much of the chosen region as possible to be accessible
> with minimal delay and no traffic to main memory. Pinning a region is
> idempotent and an implementation may pin a larger region than requested,
> provided that an unpin operation with the same base and bound will also
> unpin the larger region.

"as much as possible" and "minimal delay" interact in a more complex
memory system. Would software prefer more of the region be cached
(at some latency or bandwidth penalty) or just have the excess ignored?
(While these cache operations are presumably intended for more
microarchitecture-specific tuning, some software may be developed
quickly to work well on one microarchitecture with an expectation
that the software would work decently for a general class of
implementations.)

Pinning also seems to be a specification of temporal locality:
"perpetual" locality.

> One possible implementation is to load as much of the chosen region as
> possible into the data cache and keep it there until unpinned. Another
> implementation is to configure a scratchpad RAM and map it over at least
> the chosen region, preloading it with data from main memory.
> Scratchpads may be processor-local, but writes to a scratchpad mapped
> with CACHE.PIN must be visible to other nodes in a coherent system.
> Implementations are expected to ensure that pinned cachelines will not
> impede the efficacy of a cache. Implementations with fully-associative
> caches may permit any number of pins, provided that at least one
> cacheline remains available for normal use. Implementations with N-way
> set associative caches may support pinning up to (N-1) ways within each
> set, provided that at least one way in each set remains available for
> normal use. Implementations with direct-mapped caches should not pin
> cachelines, but may still use CACHE.PIN to configure an overlay
> scratchpad, which may itself use storage shared with caches, such that
> mapping the scratchpad decreases the size of the cache.

What about overlaid skewed associative caches (rf. "Concurrent Support of
Multiple Page Sizes On a Skewed Associative TLB")? In such a design
the capacity is not isolated by ways and enforcing a guarantee that any
cache block could be allocated might be somewhat expensive. (The
easiest method might be to provide an equal-latency side cache which
might otherwise be used as a victim cache, prefetch buffer, stream
cache, or provide other functionality.)

It might also be noted that with overlaid skewed associativity, different
block alignments would be practical with alignments associated with
ways (like different page sizes in the Seznec paper).

(Another consideration is that cache block sizes may be diverse and
non-constant. E.g., an implementation that allowed half of L1 cache
to be mapped as a scratchpad might use the extra tags to support
twice as many smaller cache blocks (requiring only an extra bit per
tag).)
Even an invalidation-based coherency protocol could provide relatively
low delay writeback (at the cost of modest hardware complexity, e.g., an
additional block state similar to "owned" in MOESI and a mechanism
to determine the home of the pinned memory and poor performance
under significant reuse, i.e., converting "don't write to pinned memory"
to "don't repeatedly write to the same pinned block"). Support for
limited scope update coherence can be useful for relatively static
producer-consumer relationships.

> And two M-mode-only privileged instructions:
>
> CACHE.PIN.I
> {opcode, funct3, funct7, MODE} = {$MISC-MEM, $REGION, 7'b1010000, 2'b11}
> Arrange for code to execute from as much of the chosen region as
> possible without traffic to main memory. Pinning a region is idempotent.

Why M-mode-only?

If one is supporting such ranged fetches, it seems that support for
memory-copy could be trivially provided.

TLB prefetching also seems worth considering.

In addition, the above does not seem to consider cache
hierarchies and cache sharing (even temporal sharing through
software context switches). While most of the uses of such
operations would have tighter control over thread allocation,
some uses might expect graceful/gradual decay.

Jacob Bachmeyer

Jun 25, 2018, 1:09:52 AM
to Paul A. Clayton, RISC-V ISA Dev
Paul A. Clayton wrote:
> On 6/22/18, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> [snip]
>
>> ====Prefetch====
>>
>> All prefetch instructions ignore page faults and other access faults.
>> In general use, applications should use rd == x0 for prefetching,
>> although this is not required. If a fault occurs during a synchronous
>> prefetch (rd != x0), the operation must terminate and produce the
>> faulting address. A fault occurring during an asynchronous prefetch (rd
>> == x0) may cause the prefetching to stop or the implementation may
>> attempt to continue prefetching past the faulting location.
>>
>
> This seems to be hint-like semantics (even though it is only triggered
> on a fault).
>

Please elaborate, although asynchronous prefetches do seem rather
hint-like now that you mention it.

>> MEM.PF0 - MEM.PF3 ("prefetch, levels 0 - 3")
>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001000}
>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001001}
>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001010}
>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001011}
>> Load as much of the chosen region as possible into the data cache,
>> with varying levels of expected temporal access locality. The number in
>> the opcode is proportionate to the expected frequency of access to the
>> prefetched data: MEM.PF3 is for data that will be very heavily used.
>>
>
> "heavily used" seems an improper term with respect to temporal
> locality.

That part is intended to define which end of the scale is which in a
generic manner. The region instructions are intended to be independent
of actual cache structure, which is why address-space-extents are used
instead of cacheline addresses.

> A memory region can have the same number or frequency
> of accesses ("heaviness" can refer to "weight" or "density") but
> different use lifetimes.
>
> If this information is intended to communicate temporal locality
> (and not some benefit measure of caching), then prefetch once
> might be merged as the lowest temporal locality.
>
> (Utility, persistence, and criticality are different measures that
> software may wish to communicate to the memory system.)
>

Please elaborate on this. I would like to be sure that we are using the
same terms here.

>> MEM.PF.EXCL ("prefetch exclusive")
>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001100}
>> Load as much of the chosen region as possible into the data cache,
>> with the expectation that future stores will soon occur to this region.
>> In a cache-coherent system, any locks required for writing the affected
>> cachelines should be acquired.
>>
>
> It may be useful to make a distinction between prefetch for write
> where reads are not expected but the region is not guaranteed to
> be overwritten. An implementation might support general avoidance
> of read-for-ownership (e.g., finer-grained validity indicators) but
> still benefit from a write prefetch to establish ownership.
>

In other words, something in-between MEM.PF.EXCL (which prefetches the
current data in main memory) and MEM.REWRITE (which destroys the current
data in main memory)?

>> MEM.PF.ONCE ("prefetch once")
>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001101}
>> Prefetch as much of the region as possible, but expect the prefetched
>> data to be used at most once in any order.
>>
>
> "used once" may be defined at byte level or at cache block level.
>

It was intended to be at the level of "something", initially defined as
the width of the first access made to the prefetched region.

> If the use does not have considerable spatial locality at cache block
> granularity, use once at byte level would have a different intent
> than use once at a coarser granularity. With cache-block granular
> access, the cache can evict after the first access; with byte-granular
> access and lower spatio-temporal locality, another byte within a
> cache block may be accessed relatively long after the first access
> to the cache block (so software would not want the block to be
> marked for eviction after the first block access).
>

I now wonder if MEM.PF.ONCE and MEM.PF[0123] might be most useful in
combination, where MEM.PF.ONCE specifies a region and the second
prefetch specifies a granularity within that region, although this would
make the overall interface less regular.

>> MEM.PF.STREAM ("prefetch stream")
>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001110}
>> Initiate streaming prefetch of the region, expecting the prefetched
>> data to be used at most once and in sequential order, while minimizing
>> cache pollution. This operation may activate a prefetch unit and
>> prefetch the region incrementally if rd is x0. Software is expected to
>> access the region sequentially, starting at the base address.
>>
>
> It may be useful to include a stride for stream prefetching.
>

Could the stride be inferred from the subsequent access pattern? If
words at X, X+24, and X+48 are subsequently read, skipping the
intermediate locations, the prefetcher could infer a stride of 24 and
simply remember (1) the last location actually accessed and (2) the next
expected access location. The reason to remember the last actual access
is to still meet the "minimize cache pollution" goal when a stride is
mispredicted.

>> MEM.PF.TEXT ("prefetch program text")
>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001111}
>> Load as much of the chosen region as possible into the instruction cache.
>>
>
> With a unified Ln cache, would the instruction prefetching "overflow"
> into the Ln cache? (Similar questions apply to data.)
>

By "overflow" do you mean that once L1 cache is full, additional
cachelines would be prefetched into L2 only? I had expected to leave
that implementation-defined, since the proposal tries to avoid tying
itself to any particular cache structure (or even to actually using
caches at all, although I am not aware of any other useful implementations).

>> ====Cacheline pinning====
>>
>> ??? Issue for discussion: should a page fault while pinning cachelines
>> cause a trap to be taken or simply cause the operation to stop or fail?
>> Should CACHE.PIN use the same approach to TLB fills as MEM.REWRITE uses?
>> ??? Issue for discussion: what if another processor attempts to write
>> to an address in a cacheline pinned on this processor? [partially
>> addressed]
>>
>> CACHE.PIN
>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0010000}
>> Arrange for as much of the chosen region as possible to be accessible
>> with minimal delay and no traffic to main memory. Pinning a region is
>> idempotent and an implementation may pin a larger region than requested,
>> provided that an unpin operation with the same base and bound will also
>> unpin the larger region.
>>
>
> "as much as possible" and "minimal delay" interact in a more complex
> memory system. Would software prefer more of the region be cached
> (at some latency or bandwidth penalty) or just have the excess ignored?
>

The key requirement is "no traffic to main memory". Is the "minimal
delay" confusing in this context?

> (While these cache operations are presumably intended for more
> microarchitecture-specific tuning, some software may be developed
> quickly to work well on one microarchitecture with an expectation
> that the software would work decently for a general class of
> implementations.)
>
> Pinning also seems to be a specification of temporal locality:
> "perpetual" locality.
>

More like malloc(3)/free(3), but yes, pinning is "do not evict this
until I say so". The catch is that multi-tasking systems might not
actually be able to maintain that, so pins can be lost on
context-switch. For user mode, this should not be a significant
concern, since the pin can be redone on each iteration of an outer loop,
which is why pinning/unpinning is idempotent.

>> One possible implementation is to load as much of the chosen region as
>> possible into the data cache and keep it there until unpinned. Another
>> implementation is to configure a scratchpad RAM and map it over at least
>> the chosen region, preloading it with data from main memory.
>> Scratchpads may be processor-local, but writes to a scratchpad mapped
>> with CACHE.PIN must be visible to other nodes in a coherent system.
>> Implementations are expected to ensure that pinned cachelines will not
>> impede the efficacy of a cache. Implementations with fully-associative
>> caches may permit any number of pins, provided that at least one
>> cacheline remains available for normal use. Implementations with N-way
>> set associative caches may support pinning up to (N-1) ways within each
>> set, provided that at least one way in each set remains available for
>> normal use. Implementations with direct-mapped caches should not pin
>> cachelines, but may still use CACHE.PIN to configure an overlay
>> scratchpad, which may itself use storage shared with caches, such that
>> mapping the scratchpad decreases the size of the cache.
>>
>
> What about overlaid skewed associative caches (rf. "Concurrent Support of
> Multiple Page Sizes On a Skewed Associative TLB")? In such a design
> the capacity is not isolated by ways and enforcing a guarantee that any
> cache block could be allocated might be somewhat expensive. (The
> easiest method might be to provide an equal-latency side cache which
> might otherwise be used as a victim cache, prefetch buffer, stream
> cache, or provide other functionality.)
>

The concern here is that implementations where memory accesses *must* be
cached can avoid pinning so many cachelines that non-pinned memory
becomes inaccessible. There is no requirement that such a safety
interlock be provided, and implementations are permitted (but
discouraged) to allow cache pinning to be used as a footgun.

> It might also be noted that with overlaid skewed associativity, different
> block alignments would be practical with alignments associated with
> ways (like different page sizes in the Seznec paper).
>
> (Another consideration is that cache block sizes may be diverse and
> non-constant. E.g., an implementation that allowed half of L1 cache
> to be mapped as a scratchpad might use the extra tags to support
> twice as many smaller cache blocks (requiring only an extra bit per
> tag).)
>

This is a good reason to keep cacheline size out of the actual
instructions and was one of the motivating factors for the "region"
model, although the original assumption was that cacheline sizes would
only "vary" due to migrations in heterogeneous multiprocessor systems.
This shows that even uniprocessor systems can benefit from this
abstraction.
Breaking pins on remote write was added to address complaints that
expecting such memory to remain pinned effectively made
invalidation-based coherency protocols unusable. The behaviors
described are both "may" options precisely because, while such writes
must work and must maintain coherency, exactly how writes to cachelines
pinned elsewhere are handled is intended to be implementation-defined.
Writes to pinned memory by the same hart that pinned it are expected;
the problems occur when other harts write to memory pinned on "this" hart.

>> And two M-mode-only privileged instructions:
>>
>> CACHE.PIN.I
>> {opcode, funct3, funct7, MODE} = {$MISC-MEM, $REGION, 7'b1010000, 2'b11}
>> Arrange for code to execute from as much of the chosen region as
>> possible without traffic to main memory. Pinning a region is idempotent.
>>
>
> Why M-mode-only?
>

The I-cache pins are M-mode-only because that is the only mode where
context-switch can be guaranteed to not occur. These were added to
allow using the I-cache as temporary program memory on implementations
that normally execute from flash but cannot read from flash while a
flash write is in progress. The issue was seen on one of the HiFive
boards that was also an M-mode-only implementation.

> If one is supporting such ranged fetches, it seems that support for
> memory-copy could be trivially provided.
>

Maybe. There are scope-creep concerns and there are currently no ranged
writes in the proposal. (MEM.REWRITE is not itself a write.)

> TLB prefetching also seems worth considering.
>

Any suggestions?

> In addition, the above does not seem to consider cache
> hierarchies and cache sharing (even temporal sharing through
> software context switches). While most of the uses of such
> operations would have tighter control over thread allocation,
> some uses might expect graceful/gradual decay.
>

I am making an effort to keep these operations relatively abstract even
though that limits the amount of detail that can be specified.
Generally, embedded systems are expected to have that sort of tight
control and large systems (such as a RISC-V PC) are expected to use
ASID-partitioned caches (effectively an independent cache for each ASID)
for reasons of performance and security, since Spectre-like attacks
enable cache side-channels without the "sender's" cooperation.


-- Jacob

Paul Miranda

Jun 25, 2018, 11:19:30 AM
to RISC-V ISA Dev, paaron...@gmail.com, jcb6...@gmail.com


On Monday, June 25, 2018 at 12:09:52 AM UTC-5, Jacob Bachmeyer wrote:
Paul A. Clayton wrote:
> It may be useful to make a distinction between prefetch for write
> where reads are not expected but the region is not guaranteed to
> be overwritten. An implementation might support general avoidance
> of read-for-ownership (e.g., finer-grained validity indicators) but
> still benefit from a write prefetch to establish ownership.
>  

In other words, something in-between MEM.PF.EXCL (which prefetches the
current data in main memory) and MEM.REWRITE (which destroys the current
data in main memory)?

Yes. I believe it assumes the ability to hold partially dirty data in a cacheline, or cachelines small enough to be fully written in the coming instructions, or a memory system that can merge partially dirty data from one cache with clean data from the next level down. Or something else I haven't thought of that isn't supported in simple protocols like ACE.
 
>> MEM.PF.STREAM ("prefetch stream")
>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001110}
>>   Initiate streaming prefetch of the region, expecting the prefetched
>> data to be used at most once and in sequential order, while minimizing
>> cache pollution.  This operation may activate a prefetch unit and
>> prefetch the region incrementally if rd is x0.  Software is expected to
>> access the region sequentially, starting at the base address.
>>    
>
> It may be useful to include a stride for stream prefetching.
>  

Could the stride be inferred from the subsequent access pattern?  If
words at X, X+24, and X+48 are subsequently read, skipping the
intermediate locations, the prefetcher could infer a stride of 24 and
simply remember (1) the last location actually accessed and (2) the next
expected access location.  The reason to remember the last actual access
is to still meet the "minimize cache pollution" goal when a stride is
mispredicted.

If hardware has to wait to see the access pattern then the utility of the MEM.PF is reduced, although I suppose there isn't a good way of indicating stride in this format?


>> And two M-mode-only privileged instructions: 
>
> Why M-mode-only?
>  

The I-cache pins are M-mode-only because that is the only mode where
context-switch can be guaranteed to not occur.  These were added to
allow using the I-cache as temporary program memory on implementations
that normally execute from flash but cannot read from flash while a
flash write is in progress.  The issue was seen on one of the HiFive
boards that was also an M-mode-only implementation.

I can see a use for S-mode pinning. I might only use it for ASID=0 (global) lines, but I can see a use for multiple ASIDs if the implementation can handle it efficiently.
Bottom-line, I'd like to see S-mode pinning, although it might not be useful in all systems.

 
> TLB prefetching also seems worth considering.
>  

Any suggestions?

Definitely seems useful... just another flavor of MEM.PF, I think, but with no worries about getting ownership, I think.
TLB pinning could also be useful, IMO.
 
Thanks for putting all of this together. I was thinking about trying to do it all with nonstandard CSR registers but having instructions should be broadly useful.


One thing still missing from RISC-V (unless I myself missed something along the way) is a streaming-store hint. Right now I believe the only way to avoid cache pollution is through PMA's but that's not a very fine-grained tool (if it exists at all) in all systems.

Jacob Bachmeyer

Jun 25, 2018, 7:22:31 PM
to Paul Miranda, RISC-V ISA Dev, paaron...@gmail.com
Paul Miranda wrote:
> On Monday, June 25, 2018 at 12:09:52 AM UTC-5, Jacob Bachmeyer wrote:
>
> Paul A. Clayton wrote:
> > It may be useful to make a distinction between prefetch for write
> > where reads are not expected but the region is not guaranteed to
> > be overwritten. An implementation might support general avoidance
> > of read-for-ownership (e.g., finer-grained validity indicators) but
> > still benefit from a write prefetch to establish ownership.
> >
>
> In other words, something in-between MEM.PF.EXCL (which prefetches
> the
> current data in main memory) and MEM.REWRITE (which destroys the
> current
> data in main memory)?
>
>
> Yes. I believe it assumes the ability to hold partially dirty data in
> a cacheline, or cachelines small enough to be fully written in the
> coming instructions, or a memory system that can merge partially dirty
> data from one cache with clean data from the next level down. Or
> something else I haven't thought of that isn't supported in simple
> protocols like ACE.

For now, I will call it MEM.WRHINT and think about how to actually
define such an operation.

> >> MEM.PF.STREAM ("prefetch stream")
> >> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001110}
> >> Initiate streaming prefetch of the region, expecting the
> prefetched
> >> data to be used at most once and in sequential order, while
> minimizing
> >> cache pollution. This operation may activate a prefetch unit and
> >> prefetch the region incrementally if rd is x0. Software is
> expected to
> >> access the region sequentially, starting at the base address.
> >>
> >
> > It may be useful to include a stride for stream prefetching.
> >
>
> Could the stride be inferred from the subsequent access pattern? If
> words at X, X+24, and X+48 are subsequently read, skipping the
> intermediate locations, the prefetcher could infer a stride of 24 and
> simply remember (1) the last location actually accessed and (2)
> the next
> expected access location. The reason to remember the last actual
> access
> is to still meet the "minimize cache pollution" goal when a stride is
> mispredicted.
>
>
> If hardware has to wait to see the access pattern then the utility of
> the MEM.PF is reduced, although I suppose there isn't a good way of
> indicating stride in this format?

We only have 2 source operands in the instruction, so there is no good
way to add an explicit stride parameter. However, for streaming
prefetch, that should be less of a concern: the hardware can load some
initial block into a prefetch buffer and use the inferred stride to make
more efficient use of later prefetches. In other words, initially
assume that streaming data is packed and revise that assumption as the
access pattern indicates.
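
Roughly, the prefetch unit would keep state like this (a toy software
model, purely illustrative and not part of the proposal):

    #include <stdint.h>

    /* Toy model of the stride inference described above: remember the
     * last address actually accessed and the next expected address,
     * revising the stride whenever the observed pattern changes. */
    struct stream_state {
        uintptr_t last_access;    /* last address actually touched    */
        uintptr_t next_expected;  /* next address to prefetch         */
        intptr_t  stride;         /* currently inferred stride        */
    };

    static void observe_access(struct stream_state *s, uintptr_t addr)
    {
        intptr_t observed = (intptr_t)(addr - s->last_access);
        if (observed != 0 && observed != s->stride)
            s->stride = observed;          /* mispredicted: revise     */
        s->last_access   = addr;
        s->next_expected = addr + s->stride;
    }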

> >> And two M-mode-only privileged instructions:
> >
> > Why M-mode-only?
>
> The I-cache pins are M-mode-only because that is the only mode where
> context-switch can be guaranteed to not occur. These were added to
> allow using the I-cache as temporary program memory on
> implementations
> that normally execute from flash but cannot read from flash while a
> flash write is in progress. The issue was seen on one of the HiFive
> boards that was also an M-mode-only implementation.
>
>
> I can see a use for S-mode pinning. I might only use it for ASID=0
> (global) lines, but I can see a use for multiple ASIDs if the
> implementation can handle it efficiently.
> Bottom-line, I'd like to see S-mode pinning, although it might not be
> useful in all systems.

I can see uses for U-mode pinning, but the problem goes back to the
original motivation for I-cache pins: holding some (small) amount of
code in the I-cache to handle a known period where the main program
store is not accessible. Interrupts must be disabled for this to work
and that means that I-cache pinning must be restricted to M-mode, since
no other mode can truly disable interrupts. Frequent use of MEM.PF.TEXT
can allow less-privileged modes to "quasi-pin" certain code but cannot
provide the same guarantee.

The difference stems from what happens when a cache pin is broken: for
a data cache pin, the relevant information will be reloaded when needed
and re-pinned on the next iteration of some loop; for an instruction
cache pin, an interrupt may result in a branch to program text that is
temporarily inaccessible until the (interrupted) pinned code completes,
which will not happen because an interrupt occurred, leading to a
temporal contradiction (interrupt handler cannot be fetched until pinned
code completes; pinned code cannot continue until interrupt is handled)
that deadlocks (reads block) or crashes (reads return garbage) the system.

> > TLB prefetching also seems worth considering.
>
> Any suggestions?
>
>
> Definitely seems useful... just another flavor of MEM.PF, I think, but
> with no worries about getting ownership, I think.

Perhaps MEM.PF.MAP?

> TLB pinning could also be useful, IMO.

Cache pins provide an otherwise-unavailable ability to use (part of) the
cache as a scratchpad. What new ability do we get from pinning TLB entries?

> Thanks for putting all of this together. I was thinking about trying
> to do it all with nonstandard CSR registers but having instructions
> should be broadly useful.

The goal is for this proposal to have enough broad community support to
be adopted as a standard extension and/or rolled into a future baseline.

> One thing still missing from RISC-V (unless I myself missed something
> along the way) is a streaming-store hint. Right now I believe the only
> way to avoid cache pollution is through PMA's but that's not a very
> fine-grained tool (if it exists at all) in all systems.

For the case of a packed streaming-store (every octet will be
overwritten), there is MEM.REWRITE, but that is also a "prefetch
constant" and allocates cachelines. Could a combination of
MEM.PF.STREAM and MEM.WRHINT address this generally? Or is a
MEM.SPARSEWRITE a better option? How to (conceptually) distinguish
MEM.SPARSEWRITE and MEM.WRHINT?

For that matter, is a general rule that "reads are prefetched while
writes are hinted" a good dividing line?

Should MEM.PF.STREAM be more of a modifier to another prefetch or hint
instead of its own prefetch instruction?


-- Jacob

Luke Kenneth Casson Leighton

Jun 25, 2018, 7:52:32 PM
to Jacob Bachmeyer, Paul Miranda, RISC-V ISA Dev, Paul A. Clayton
On Tue, Jun 26, 2018 at 12:22 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

>> TLB pinning could also be useful, IMO.
>
>
> Cache pins provide an otherwise-unavailable ability to use (part of) the
> cache as a scratchpad.

oo now that's *very* interesting, particularly given that a GPU has
such a strong requirement to process relatively large amounts of data
(4x4 blocks of 32-bit pixels) *without* that going back to L2 and
definitely not to main memory before the work's completely done.

scratchpads are a bit of a pain as they need to be context-switched
(or an entire core hard-dedicated to one task). if L1 cache can
double-duty as a scratchpad that would be *great*, as all the things
that an L1 cache has to take care of for context-switching and so on
are already well-known and solved.

> What new ability do we get from pinning TLB entries?

working sets. reduced thrashing. batch processing.

(if there is anyone who believes that the scenario below is not
realistic please feel free to alter it and contribute an improvement
or alternative that is).

it should be fairly easy to envisage a case where a long-running
process that needs regular but short-duration assistance of a process
that requires some (but not a lot) of memory could have its
performance adversely affected by the short-duration task due to the
short-duration task pushing out significant numbers of TLBs for the
long-running process.

network packets coming in on database servers easily fits that scenario.

if the short-duration task instead used a small subset of the TLB,
then despite the short-duration's task being a bit slower, it's quite
likely that the longer-duration task would be even worse-affected,
particularly if there are significant numbers of smaller TLB entries
(heavily-fragmented memory workloads) rather than fewer huge ones [for
whatever reason].

monero's mining algorithm in particular i understand to have been
*deliberately* designed to hit [really large amounts of] memory
particularly hard with random accesses, as a deliberate way to ensure
that people don't design custom ASICs for it.

l.

Paul Miranda

Jun 25, 2018, 9:15:58 PM
to jcb6...@gmail.com, RISC-V ISA Dev, paaron...@gmail.com


On Mon, Jun 25, 2018 at 6:22 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
...

I can see uses for U-mode pinning, but the problem goes back to the original motivation for I-cache pins:  holding some (small) amount of code in the I-cache to handle a known period where the main program store is not accessible.  Interrupts must be disabled for this to work and that means that I-cache pinning must be restricted to M-mode, since no other mode can truly disable interrupts.  Frequent use of MEM.PF.TEXT can allow less-privileged modes to "quasi-pin" certain code but cannot provide the same guarantee.

The difference stems from what happens when a cache pin is broken:  for a data cache pin, the relevant information will be reloaded when needed and re-pinned on the next iteration of some loop; for an instruction cache pin, an interrupt may result in a branch to program text that is temporarily inaccessible until the (interrupted) pinned code completes, which will not happen because an interrupt occurred, leading to a temporal contradiction (interrupt handler cannot be fetched until pinned code completes; pinned code cannot continue until interrupt is handled) that deadlocks (reads block) or crashes (reads return garbage) the system.

    > TLB prefetching also seems worth considering.

    Any suggestions?


Definitely seems useful... just another flavor of MEM.PF, I think, but with no worries about getting ownership, I think.

Perhaps MEM.PF.MAP?

TLB pinning could also be useful, IMO.

Cache pins provide an otherwise-unavailable ability to use (part of) the cache as a scratchpad.  What new ability do we get from pinning TLB entries?

My desire for both of these is for providing reliably low latency interrupt handling for certain vectors despite the presence of caching and virtual memory. I want to pin I cache and I TLB (Data side would be useful too) to allow an S-mode handler to operate even if it isn't the most recently used code. I am assuming >N+1 associativity so that N pins could never lock out other threads or modes from caching completely, which I think would address your concern but I might not have understood it completely.


 
One thing still missing from RISC-V (unless I myself missed something along the way) is a streaming-store hint. Right now I believe the only way to avoid cache pollution is through PMAs, but that's not a very fine-grained tool (if it exists at all) in all systems.

For the case of a packed streaming-store (every octet will be overwritten), there is MEM.REWRITE, but that is also a "prefetch constant" and allocates cachelines.  Could a combination of MEM.PF.STREAM and MEM.WRHINT address this generally?  Or is a MEM.SPARSEWRITE a better option?  How to (conceptually) distinguish MEM.SPARSEWRITE and MEM.WRHINT?

For that matter, is a general rule that "reads are prefetched while writes are hinted" a good dividing line?

Should MEM.PF.STREAM be more of a modifier to another prefetch or hint instead of its own prefetch instruction?

MEM.WRHINT is similar to what I'm thinking... get ownership of a line but not data...however I didn't even want to burn a tag entry for a streaming store. I think all of the hints talked about before hold the line in some way. 
MEM.REWRITE is also similar, but is allowed to drop data and again gets ownership.
It may be that some systems couldn't accommodate what I'm thinking, although it's not really any different than when a noncoherent device wants to write to coherent memory.

 

-- Jacob

Paul A. Clayton

unread,
Jun 25, 2018, 10:36:23 PM6/25/18
to jcb6...@gmail.com, RISC-V ISA Dev
On 6/25/18, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> Paul A. Clayton wrote:
>> On 6/22/18, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
[snip for async prefetch fault stops or skips faulting addresses]

>> This seems to be hint-like semantics (even though it is only triggered
>> on a fault).
>
> Please elaborate, although asynchronous prefetches do seem rather
> hint-like now that you mention it.

By not faulting the asynchronous form is not strictly a directive (unless
the range exceeded capacity in such a way that either skipped addresses
would have been overwritten regardless of the skipping or the prefetch
would stop when capacity was reached).

[snip]
>> A memory region can have the same number or frequency
>> of accesses ("heaviness" can refer to "weight" or "density") but
>> different use lifetimes.
>>
>> If this information is intended to communicate temporal locality
>> (and not some benefit measure of caching), then prefetch once
>> might be merged as the lowest temporal locality.
>>
>> (Utility, persistence, and criticality are different measures that
>> software may wish to communicate to the memory system.)
>
> Please elaborate on this. I would like to be sure that we are using the
> same terms here.

Utility is roughly the number of accesses serviced/latency penalty
avoided due to the prefetch. Persistence refers to the useful lifetime
of the storage granule/prefetched region. Criticality refers to the
urgency, e.g., a prefetch operation might be ready early enough
that hardware can prefer bandwidth or energy efficiency over
latency without hurting performance.

[snip]
>> It may be useful to make a distinction between prefetch for write
>> where reads are not expected but the region is not guaranteed to
>> be overwritten. An implementation might support general avoidance
>> of read-for-ownership (e.g., finer-grained validity indicators) but
>> still benefit from a write prefetch to establish ownership.
>>
>
> In other words, something in-between MEM.PF.EXCL (which prefetches the
> current data in main memory) and MEM.REWRITE (which destroys the current
> data in main memory)?

Yes.

>>> MEM.PF.ONCE ("prefetch once")
>>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001101}
>>> Prefetch as much of the region as possible, but expect the prefetched
>>> data to be used at most once in any order.
>>>
>>
>> "used once" may be defined at byte level or at cache block level.
>>
>
> It was intended to be at the level of "something", initially defined as
> the width of the first access made to the prefetched region.

Defining the granularity of expiration of usefulness by the first
access seems somewhat complex. Tracking "has been used"
would seem to require a bit for the smallest possible granule,
which seems unlikely to be provided (given that caches rarely
track present or dirty at even 32-bit granularity).

[snip]
> I now wonder if MEM.PF.ONCE and MEM.PF.PF[0123] might be most useful in
> combination, where MEM.PF.ONCE specifies a region and the second
> prefetch specifies a granularity within that region, although this would
> make the overall interface less regular.

This could also be an argument for not limiting such instructions to
32-bit encodings.

[snip]
>> It may be useful to include a stride for stream prefetching.
>>
>
> Could the stride be inferred from the subsequent access pattern? If
> words at X, X+24, and X+48 are subsequently read, skipping the
> intermediate locations, the prefetcher could infer a stride of 24 and
> simply remember (1) the last location actually accessed and (2) the next
> expected access location. The reason to remember the last actual access
> is to still meet the "minimize cache pollution" goal when a stride is
> mispredicted.

As others noted, this would delay prefetching or waste bandwidth
by assuming unit stride.

[snip]
>> With a unified Ln cache, would the instruction prefetching "overflow"
>> into the Ln cache? (Similar questions apply to data.)
>>
>
> By "overflow" do you mean that once L1 cache is full, additional
> cachelines would be prefetched into L2 only? I had expected to leave
> that implementation-defined, since the proposal tries to avoid tying
> itself to any particular cache structure (or even to actually using
> caches at all, although I am not aware of any other useful
> implementations).

This concerns the hardware fulfilling the intent of the software.
If accesses to the prefetched region are "random", software might
prefer a large prefetch region to be fetched into L2 rather than
evicting most of L1.

[snip]
>> "as much as possible" and "minimal delay" interact in a more complex
>> memory system. Would software prefer more of the region be cached
>> (at some latency or bandwidth penalty) or just have the excess ignored?
>>
>
> The key requirement is "no traffic to main memory". Is the "minimal
> delay" confusing in this context?

Pinning to an off-chip L4 cache would avoid memory traffic and
provide substantial capacity (which may be what is desired) but
the latency (and bandwidth) would be worse than L1 latency (and
bandwidth).

>> (While these cache operations are presumably intended for more
>> microarchitecture-specific tuning, some software may be developed
>> quickly to work well on one microarchitecture with an expectation
>> that the software would work decently for a general class of
>> implementations.)
>>
>> Pinning also seems to be a specification of temporal locality:
>> "perpetual" locality.
>>
>
> More like malloc(3)/free(3), but yes, pinning is "do not evict this
> until I say so". The catch is that multi-tasking systems might not
> actually be able to maintain that, so pins can be lost on
> context-switch. For user mode, this should not be a significant
> concern, since the pin can be redone on each iteration of an outer loop,
> which is why pinning/unpinning is idempotent.

Having to repin even with only moderate frequency would remove
some of the performance advantage. (This could be optimized with
an approximate conservative filter; even a simple filter could
quickly convert repinning to a nop if the pinning remained entirely
intact. However, that adds significant hardware complexity.)

[snip]
> The concern here is that implementations where memory accesses *must* be
> cached can avoid pinning so many cachelines that non-pinned memory
> becomes inaccessible. There is no requirement that such a safety
> interlock be provided, and implementations are permitted (but
> discouraged) to allow cache pinning to be used as a footgun.

It seems that sometimes software would want to pin more cache
than is "safe". (By the way, presumably uncacheable memory is
still accessible. Memory for which no space is available for caching
could be treated as uncached/uncacheable memory.)

[snip]

> Breaking pins on remote write was added to address complaints that
> expecting such memory to remain pinned effectively made
> invalidation-based coherency protocols unusable. The behaviors
> described are both "may" options precisely because, while such writes
> must work and must maintain coherency, exactly how writes to cachelines
> pinned elsewhere are handled is intended to be implementation-defined.
> Writes to pinned memory by the same hart that pinned it are expected;
> the problems occur when other harts write to memory pinned on "this" hart.

(Actually the problem occurs when the pinned cache blocks are not
shared between the pinning hart and the writing hart.)

Implementation defined behavior requires a method of discovery. It
also seems desirable to have guidelines on families of implementation
to facilitate portability within a family.

[snip Icache pinning]
>> Why M-mode-only?
>>
>
> The I-cache pins are M-mode-only because that is the only mode where
> context-switch can be guaranteed to not occur. These were added to
> allow using the I-cache as temporary program memory on implementations
> that normally execute from flash but cannot read from flash while a
> flash write is in progress. The issue was seen on one of the HiFive
> boards that was also an M-mode-only implementation.

How is this different from data pinning?

>> If one is supporting such ranged fetches, it seems that support for
>> memory-copy could be trivially provided.
>>
>
> Maybe. There are scope-creep concerns and there are currently no ranged
> writes in the proposal. (MEM.REWRITE is not itself a write.)

On the other hand, recognizing that features are closely related
in implementation seems important.

>> TLB prefetching also seems worth considering.
>>
>
> Any suggestions?

Nothing specific. Locking TLB entries is not uncommonly
supported (the original MIPS TLB could set a minimum on
the random index generated for replacement, locking low-numbered
entries; Itanium defined "Translation Registers" which
could be taken from the translation cache). PTE and
paging structure prefetching is a natural extension of
data prefetching.

Jacob Bachmeyer

unread,
Jun 25, 2018, 10:42:49 PM6/25/18
to Luke Kenneth Casson Leighton, Paul Miranda, RISC-V ISA Dev, Paul A. Clayton
Luke Kenneth Casson Leighton wrote:
> On Tue, Jun 26, 2018 at 12:22 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
>>> TLB pinning could also be useful, IMO.
>>>
>> Cache pins provide an otherwise-unavailable ability to use (part of) the
>> cache as a scratchpad.
>>
>
> oo now that's *very* interesting, particularly given that a GPU has
> such a strong requirement to process relatively large amounts of data
> (4x4 blocks of 32-bit pixels) *without* that going back to L2 and
> definitely not to main memory before the work's completely done.
>
> scratchpads are a bit of a pain as they need to be context-switched
> (or an entire core hard-dedicated to one task). if L1 cache can
> double-duty as a scratchpad that would be *great*, as all the things
> that an L1 cache has to take care of for context-switching and so on
> are already well-known and solved.
>

Note that cache pins are broken upon context switch unless the cache is
ASID-partitioned -- each task must appear to have the entire cache
available. User tasks that use D-cache pins are expected to re-pin
their working buffer frequently. Of course, if the working buffer is
"marching" through the address space, the problem solves itself as the
scratchpad advances (unpin old, pin new) through the address space.
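Something like the following sketch captures that pattern.  The
cache_pin()/cache_unpin() wrappers are hypothetical stand-ins for the
proposed CACHE.PIN/CACHE.UNPIN region instructions; the stubs below just
behave like the permitted no-op implementation and return the base
address.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical wrappers: a real build would emit the proposed CACHE.PIN
 * and CACHE.UNPIN region instructions here (rs1 = base, rs2 = inclusive
 * bound).  These stubs mimic the permitted no-op implementation. */
static inline uintptr_t cache_pin(uintptr_t base, uintptr_t bound)
{
    (void)bound;
    return base;
}

static inline uintptr_t cache_unpin(uintptr_t base, uintptr_t bound)
{
    (void)bound;
    return base;
}

/* Process a large buffer through a pinned window that marches through
 * the address space: pin the new window, work on it, unpin it, move on.
 * A pin broken by a context switch simply gets re-established when the
 * next window is pinned. */
void process_buffer(uint8_t *buf, size_t len, size_t window,
                    void (*work)(uint8_t *, size_t))
{
    for (size_t off = 0; off < len; off += window) {
        size_t n = (len - off < window) ? (len - off) : window;
        uintptr_t lo = (uintptr_t)(buf + off);
        uintptr_t hi = lo + n - 1;        /* inclusive upper bound */

        cache_pin(lo, hi);                /* idempotent, cheap to redo */
        work(buf + off, n);
        cache_unpin(lo, hi);              /* unpin old before pinning new */
    }
}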

>> What new ability do we get from pinning TLB entries?
>>
>
> working sets. reduced thrashing. batch processing.
>
> (if there is anyone who believes that the scenario below is not
> realistic please feel free to alter it and contribute an improvement
> or alternative that is).
>
> it should be fairly easy to envisage a case where a long-running
> process that needs regular but short-duration assistance of a process
> that requires some (but not a lot) of memory could have its
> performance adversely affected by the short-duration task due to the
> short-duration task pushing out significant numbers of TLBs for the
> long-running process.
>
> network packets coming in on database servers easily fits that scenario.
>
> if the short-duration task instead used a small subset of the TLB,
> then despite the short-duration's task being a bit slower, it's quite
> likely that the longer-duration task would be even worse-affected,
> particularly if there are significant numbers of smaller TLB entries
> (heavily-fragmented memory workloads) rather than fewer huge ones [for
> whatever reason].
>

Either the TLB is also partitioned by ASID, in which case this does not
matter because the two tasks will have effectively independent TLBs, or
TLB pinning will not help because pins are broken on context switch to
prevent one task from denying service to another by tying up a
significant part of the cache or TLB.

Failing to break pins on context switch also creates some nasty side
channels, so pins must be limited by ASID. ASID-partitioning is the
only way to close cache side-channels generally. Spectre-like attacks
can cause even innocent processes to "send" on a cache side-channel.
The only solution is to close the side-channels.  ASID-partitioning
also has the added advantage of being able to pack more cache into the
same area by interleaving the partitions:  since only one partition
will be active at a time, the power dissipation is that of a single
partition spread over the area of multiple partitions, reducing overall
thermal power density.

> monero's mining algorithm in particular i understand to have been
> *deliberately* designed to hit [really large amounts of] memory
> particularly hard with random accesses, as a deliberate way to ensure
> that people don't design custom ASICs for it.
>

Does monero use scrypt? That was designed for hashing passwords to
resist GPU-based cracking.


-- Jacob

Jacob Bachmeyer

unread,
Jun 25, 2018, 11:11:19 PM6/25/18
to Paul Miranda, RISC-V ISA Dev, paaron...@gmail.com
> ...just another flavor of MEM.PF, I think, but with no worries about getting
> ownership, I think.
>
>
> Perhaps MEM.PF.MAP?
>
> TLB pinning could also be useful, IMO.
>
>
> Cache pins provide an otherwise-unavailable ability to use (part
> of) the cache as a scratchpad. What new ability do we get from
> pinning TLB entries?
>
> My desire for both of these is for providing reliably low latency
> interrupt handling for certain vectors despite the presence of caching
> and virtual memory. I want to pin I cache and I TLB (Data side would
> be useful too) to allow an S-mode handler to operate even if it isn't
> the most recently used code. I am assuming >N+1 associativity so that
> N pins could never lock out other threads or modes from caching
> completely, which I think would address your concern but I might not
> have understood it completely.

I think that pinning a cacheline needs to implicitly pin the associated
TLB entry, since invalidation of the latter must also write back and
invalidate the former. Is there any use for pinning TLB entries
separately from cachelines?

The associativity assumption should not be a problem: hardware is
expected to reject requests to pin so many cachelines as to preclude
other data from being cached at all.

A better option here would probably be to partition the cache, providing
some number of cachelines exclusively for S-mode. With a partitioned
I-cache, the supervisor could simply include MEM.PF.TEXT in its
trap-exit code to prefetch the trap entry code back into the supervisor
I-cache. The supervisor could also use MEM.PF.EXCL to prefetch the
saved context area into the data cache, but that would be redundant as
restoring the user context will ensure that the saved context area is in
the supervisor D-cache.  The supervisor caches simply hold their
contents while executing in user mode (and get blown away anyway on a
hypervisor VM switch unless appropriately partitioned) so MEM.PF.TEXT
will ensure that the supervisor trap entry is cached when the next trap
is taken. (The S-mode prefetch can continue and complete after user
execution has resumed.)
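A sketch of that trap-exit step, with a hypothetical
mem_pf_text_async() wrapper (a real build would emit MEM.PF.TEXT with
rd = x0 for the asynchronous form) and assumed linker symbols
bracketing the trap-entry text:

#include <stdint.h>

/* Hypothetical wrapper for the proposed MEM.PF.TEXT region instruction;
 * the stub only marks where the real encoding would go.  With rd = x0
 * the prefetch is asynchronous and may complete after user execution
 * has resumed. */
static inline void mem_pf_text_async(uintptr_t base, uintptr_t bound)
{
    (void)base;
    (void)bound;
}

/* Assumed linker symbols bracketing the supervisor trap-entry text. */
extern char trap_entry_start[], trap_entry_end[];

/* Called on the supervisor trap-exit path, just before SRET, so the
 * trap entry code is (re)cached before the next trap is taken. */
void quasi_pin_trap_entry(void)
{
    mem_pf_text_async((uintptr_t)trap_entry_start,
                      (uintptr_t)trap_entry_end - 1);  /* inclusive bound */
}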

I agree that having the supervisor trap entry always (or nearly always)
cached could be useful. Again, partitioning closes side-channels and is
likely to be "free" in modern high-density processes where power
dissipation is a more-constraining limit than geometry. Of course,
very-low-power embedded systems probably will not have those
power-dissipation constraints and would actually have an area cost for
additional caches.

The key difference between M-mode-only I-cache pins and general I-cache
pins is that the monitor can pin a bit of code, disable interrupts, and
then do something with that pinned code that temporarily disables main
memory, like rewriting flash on the aforementioned HiFive board.
Less-privileged modes cannot do this, since more-privileged interrupts
are always enabled. Effectively, there are two different types of
I-cache pins possible, one which is guaranteed but only available in
M-mode, and one which can be broken by a more-privileged context switch,
but can be made available all the way down to U-mode.
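Roughly, the intended M-mode pattern is the following sketch.  The
wrapper names, the stubs, and the linker symbols bracketing the
programming loop are all assumptions for illustration, not part of the
proposal.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical wrappers: a real build would emit the proposed M-mode
 * I-cache pin/unpin instructions and toggle mstatus.MIE with CSR ops.
 * The stubs only mark where those operations go. */
static inline void icache_pin(uintptr_t base, uintptr_t bound)   { (void)base; (void)bound; }
static inline void icache_unpin(uintptr_t base, uintptr_t bound) { (void)base; (void)bound; }
static inline void m_disable_interrupts(void) { }
static inline void m_enable_interrupts(void)  { }

/* Assumed linker symbols bracketing this routine's program text. */
extern char flash_writer_start[], flash_writer_end[];

/* M-mode only: pin the programming loop into the I-cache, disable
 * interrupts, then run entirely from the cache while the flash array is
 * busy and cannot be fetched from. */
void rewrite_flash(volatile uint32_t *dst, const uint32_t *src, size_t words)
{
    uintptr_t lo = (uintptr_t)flash_writer_start;
    uintptr_t hi = (uintptr_t)flash_writer_end - 1;   /* inclusive bound */

    icache_pin(lo, hi);         /* guaranteed pin: no trap can evict it */
    m_disable_interrupts();
    for (size_t i = 0; i < words; i++)
        dst[i] = src[i];        /* stand-in for the real flash program sequence */
    m_enable_interrupts();
    icache_unpin(lo, hi);
}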


-- Jacob

Jacob Bachmeyer

unread,
Jun 25, 2018, 11:52:01 PM6/25/18
to Paul Miranda, RISC-V ISA Dev, paaron...@gmail.com
MEM.PF.STREAM is expected (on the high end) to activate a separate
prefetch unit with its own prefetch buffer; a "streaming write hint"
would act similarly but use a write-combining buffer, possibly with
associated logic to merge the combined writes into existing data, which
may require allocation of cachelines in an outer cache to perform the merge.

This leads to questions of what MEM.WRHINT should actually do. Perhaps
the simplest option is to force invalidation of any copies held by other
harts in the system, writing back a dirty copy if present and leaving
the region "up for grabs" and currently uncached? That followed by
allocating a cacheline in an "owned, invalid-data" state? Would a
separate write-combining cache be more useful even without write-hints?

I envision that as a third L1 cache: L1I, L1D, L1C; the "C-cache" is a
write-only write-combining element. Each byte in the C-cache has its
own validity flag and the C-cache performs an intervention (merging its
valid bytes with data brought in from L2) when the L1D cache requests a
line from L2 which the C-cache contains. The same line cannot be
present in both L1D and L1C; writes to cachelines present in L1D hit
there instead of arriving at L1C. (This can be performed in parallel if
the C-cache keeps one "new item" line available and simply drops the
"new item" if L1D reports a hit.) Reads from lines present in L1C force
allocation of a cacheline in L1D and the transfer-and-merge of that line
from L2 and L1C to L1D. Completely valid lines can be transferred to
the L1D cache without accessing L2, but then also need to be "written
back" to L2, which L1D should be able to handle on its own since a
C-cache intervention must mark the data "dirty" upon arrival in L1D.
Completely valid lines can be written from the C-cache to L2 at any
time. Evicting a line from the C-cache is a bit more complex, since the
line must be brought into L2 and the valid bytes from L1C merged.
Perhaps allocating a C-cache line could initiate an L2 prefetch? But
this is wasteful if an entire L2 cacheline is overwritten; in that case
the prefetch could have been elided. The C-cache does not intervene on
L1I-cache requests and is flushed to L2 upon execution of FENCE.I.
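A toy software model of the intervention merge, just to illustrate the
per-byte valid mask (the line size and types here are arbitrary, not a
real implementation):

#include <stdint.h>
#include <string.h>

#define LINE_BYTES 64

/* One C-cache entry: write-only data plus a per-byte valid mask. */
struct ccache_line {
    uint8_t  data[LINE_BYTES];
    uint64_t valid;              /* bit i set => data[i] holds a new byte */
};

/* Intervention: merge the C-cache's valid bytes over a line fetched
 * from L2, producing the line that L1D will receive (and mark dirty). */
static void ccache_merge(const struct ccache_line *c,
                         const uint8_t *from_l2, uint8_t *to_l1d)
{
    memcpy(to_l1d, from_l2, LINE_BYTES);
    for (int i = 0; i < LINE_BYTES; i++)
        if (c->valid & (UINT64_C(1) << i))
            to_l1d[i] = c->data[i];
}

/* A completely valid line needs no L2 read before being written back. */
static int ccache_line_complete(const struct ccache_line *c)
{
    return c->valid == UINT64_MAX;
}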

And that leads to a "bikeshed" question: what is the best two-letter
abbreviation for "hint"? I currently lean towards "MEM.NT." as a prefix
for write hints.

> MEM.REWRITE is also similar, but is allowed to drop data and again
> gets ownership.

MEM.REWRITE does a bit more than that: cachelines are allocated, filled
with a constant (either all-zeros or all-ones, depending on hardware),
and assigned ownership for that region with all remote copies (even if
dirty!) simply invalidated. (Writeback is permitted but useless, since
the remote dirty cacheline will be overwritten locally.)
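As a usage sketch, a memset(3)-style routine might drive it like this.
The mem_rewrite() wrapper is hypothetical; its stub behaves like the
permitted no-op implementation and returns the base address, which the
loop handles naturally.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wrapper for MEM.REWRITE (rs1 = base, rs2 = inclusive
 * bound); returns the first address past the highest address handled.
 * The stub mimics the permitted no-op implementation. */
static inline uintptr_t mem_rewrite(uintptr_t base, uintptr_t bound)
{
    (void)bound;
    return base;
}

/* memset(3)-style fill: claim ownership of the destination without
 * reading the old data from memory, then store the actual bytes. */
void *fast_memset(void *dst, int c, size_t n)
{
    uintptr_t p = (uintptr_t)dst, end = p + n;

    while (p < end) {
        uintptr_t next = mem_rewrite(p, end - 1);   /* inclusive bound */
        if (next <= p)
            break;          /* no-op implementation: nothing was claimed */
        p = next;           /* hardware may have clamped the region */
    }
    memset(dst, c, n);      /* these stores now hit owned cachelines */
    return dst;
}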

> It may be that some systems couldn't accommodate what I'm thinking,
> although it's not really any different than when a noncoherent device
> wants to write to coherent memory.

If there are systems that cannot accommodate what you are thinking, then
I have misunderstood. Please explain.


-- Jacob

Jacob Bachmeyer

unread,
Jun 26, 2018, 1:26:10 AM6/26/18
to Paul A. Clayton, RISC-V ISA Dev
Paul A. Clayton wrote:
> On 6/25/18, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
>> Paul A. Clayton wrote:
>>
>>> On 6/22/18, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>>>
> [snip for async prefetch fault stops or skips faulting addresses]
>
>>> This seems to be hint-like semantics (even though it is only triggered
>>> on a fault).
>>>
>> Please elaborate, although asynchronous prefetches do seem rather
>> hint-like now that you mention it.
>>
>
> By not faulting the asynchronous form is not strictly a directive (unless
> the range exceeded capacity in such a way that either skipped addresses
> would have been overwritten regardless of the skipping or the prefetch
> would stop when capacity was reached).
>

Prefetches always stop when capacity is reached; synchronous prefetches
report where they stopped.

Synchronous prefetches do not fault either -- prefetches are allowed to
run "off the end" of valid memory and into unmapped space; if this were
not so, "LOAD x0" would be a prefetch instruction, but it is not.
Permitting asynchronous prefetch to continue past a fault accommodates
pages being swapped out and is a significant win if the swapped-out page
is not actually accessed. This is also an implementation option --
implementations are permitted equally well to simply stop an
asynchronous prefetch when any fault is encountered. Ideally, an
implementation could do both -- stop if a permission check fails, but
advance prefetch to the next page in the region if a page is not present.
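For the synchronous form, the produced value is what lets software
block its accesses.  A sketch, using the MEM.PF.ONCE encoding given
earlier via a GNU as .insn directive (a RISC-V GCC/binutils toolchain
is assumed; the chunked-processing loop itself is only an
illustration):

#include <stddef.h>
#include <stdint.h>

/* MEM.PF.ONCE: MISC-MEM (0x0f), funct3 REGION (0x1), funct7 7'b0001101.
 * rd is not x0, so this is the synchronous form; it returns the first
 * address after the highest address prefetched (the base address on a
 * no-op implementation). */
static inline uintptr_t mem_pf_once(uintptr_t base, uintptr_t bound)
{
    uintptr_t next;
    __asm__ volatile (".insn r 0x0f, 0x1, 0x0d, %0, %1, %2"
                      : "=r"(next)
                      : "r"(base), "r"(bound)
                      : "memory");
    return next;
}

/* Walk a large region in whatever chunks the hardware can prefetch at
 * once, processing each prefetched prefix before asking for more. */
void process_region(uint8_t *buf, size_t len,
                    void (*process_chunk)(uint8_t *, size_t))
{
    uintptr_t p = (uintptr_t)buf, end = p + len;

    while (p < end) {
        uintptr_t next = mem_pf_once(p, end - 1);   /* inclusive bound */
        if (next <= p)
            next = end;     /* no-op or no progress: just do the rest */
        if (next > end)
            next = end;     /* non-destructive ops may round up to a line */
        process_chunk((uint8_t *)p, (size_t)(next - p));
        p = next;
    }
}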

> [snip]
>
>>> A memory region can have the same number or frequency
>>> of accesses ("heaviness" can refer to "weight" or "density") but
>>> different use lifetimes.
>>>
>>> If this information is intended to communicate temporal locality
>>> (and not some benefit measure of caching), then prefetch once
>>> might be merged as the lowest temporal locality.
>>>
>>> (Utility, persistence, and criticality are different measures that
>>> software may wish to communicate to the memory system.)
>>>
>> Please elaborate on this. I would like to be sure that we are using the
>> same terms here.
>>
>
> Utility is roughly the number of accesses serviced/latency penalty
> avoided due to the prefetch. Persistence refers to the useful lifetime
> of the storage granule/prefetched region. Criticality refers to the
> urgency, e.g., a prefetch operation might be ready early enough
> that hardware can prefer bandwidth or energy efficiency over
> latency without hurting performance.
>

Then the prefetch levels are intended to indicate relative utility.
Persistence is more limited: zero is no prefetch at all, once is the
MEM.PF.ONCE and MEM.PF.STREAM instructions, many times is
MEM.PF.PF[0123]. Criticality is not well-represented in this proposal,
aside from a slight implication that streaming prefetch prefers
bandwidth over latency.

How well are utility and criticality typically correlated? For that
matter, how many levels of each should be distinguished?

I almost want to describe criticality as "yesterday", "now", and
"later". But "yesterday" can be represented by an actual access, so
that leaves "now" and "later" as prefetch options.

> [snip]
>
>>> It may be useful to make a distinction between prefetch for write
>>> where reads are not expected but the region is not guaranteed to
>>> be overwritten. An implementation might support general avoidance
>>> of read-for-ownership (e.g., finer-grained validity indicators) but
>>> still benefit from a write prefetch to establish ownership.
>>>
>>>
>> In other words, something in-between MEM.PF.EXCL (which prefetches the
>> current data in main memory) and MEM.REWRITE (which destroys the current
>> data in main memory)?
>>
>
> Yes.
>

[This has branched off into "write-combining hints".]

>>>> MEM.PF.ONCE ("prefetch once")
>>>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001101}
>>>> Prefetch as much of the region as possible, but expect the prefetched
>>>> data to be used at most once in any order.
>>>>
>>> "used once" may be defined at byte level or at cache block level.
>>>
>> It was intended to be at the level of "something", initially defined as
>> the width of the first access made to the prefetched region.
>>
>
> Defining the granularity of expiration of usefulness by the first
> access seems somewhat complex. Tracking "has been used"
> would seem to require a bit for the smallest possible granule,
> which seems unlikely to be provided (given that caches rarely
> track present or dirty at even 32-bit granularity).
>

Well, the smallest granule that the hardware cares about, which may be
an entire cacheline in practice, but application code does not know the
actual cacheline size.  As long as accesses exhibit both spatial and
temporal locality, there is a good chance that marking the accessed
lines as "first to evict" but not actually dropping them until new
cachelines need to be loaded will work in practice.

> [snip]
>
>> I now wonder if MEM.PF.ONCE and MEM.PF.PF[0123] might be most useful in
>> combination, where MEM.PF.ONCE specifies a region and the second
>> prefetch specifies a granularity within that region, although this would
>> make the overall interface less regular.
>>
>
> This could also be an argument for not limiting such instructions to
> 32-bit encodings.
>

It is an argument, but I do not think that fine-grained prefetch has
enough demand to be the first to break beyond 32-bit instructions.

> [snip]
>
>>> It may be useful to include a stride for stream prefetching.
>>>
>> Could the stride be inferred from the subsequent access pattern? If
>> words at X, X+24, and X+48 are subsequently read, skipping the
>> intermediate locations, the prefetcher could infer a stride of 24 and
>> simply remember (1) the last location actually accessed and (2) the next
>> expected access location. The reason to remember the last actual access
> is to still meet the "minimize cache pollution" goal when a stride is
>> mispredicted.
>>
>
> As others noted, this would delay prefetching or waste bandwidth
> by assuming unit stride.
>

You are correct. The idea is to assume unit stride for the first
"prefetch group" and then adapt to the observed actual access pattern.
Bandwidth is only wasted if the stride exceeds a cacheline however,
since most implementations are expected to prefetch in units of
cachelines and the region is contiguous.

> [snip]
>
>>> With a unified Ln cache, would the instruction prefetching "overflow"
>>> into the Ln cache? (Similar questions apply to data.)
>>>
>> By "overflow" do you mean that once L1 cache is full, additional
>> cachelines would be prefetched into L2 only? I had expected to leave
>> that implementation-defined, since the proposal tries to avoid tying
>> itself to any particular cache structure (or even to actually using
>> caches at all, although I am not aware of any other useful
>> implementations).
>>
>
> This concerns the hardware fulfilling the intent of the software.
> If accesses to the prefetched region are "random", software might
> prefer a large prefetch region to be fetched into L2 rather than
> evicting most of L1.
>

Is there a useful way for software to indicate this intent or should
hardware simply recognize that larger prefetch requests should target
L2? Perhaps lower prefetch levels should prefetch only into outer
caches (a current implementation option)?

> [snip]
>
>>> "as much as possible" and "minimal delay" interact in a more complex
>>> memory system. Would software prefer more of the region be cached
>>> (at some latency or bandwidth penalty) or just have the excess ignored?
>>>
>> The key requirement is "no traffic to main memory". Is the "minimal
>> delay" confusing in this context?
>>
>
> Pinning to an off-chip L4 cache would avoid memory traffic and
> provide substantial capacity (which may be what is desired) but
> the latency (and bandwidth) would be worse than L1 latency (and
> bandwidth).
>

For a region that fits in L4 but no farther in, "minimal delay" would be
the latency of L4, since that is the only place the entire region can
fit. I think a clarification that pinned cachelines must remain cached,
but can be moved within the cache subsystem may be needed. Do you agree?

>>> (While these cache operations are presumably intended for more
>>> microarchitecture-specific tuning, some software may be developed
>>> quickly to work well on one microarchitecture with an expectation
>>> that the software would work decently for a general class of
>>> implementations.)
>>>
>>> Pinning also seems to be a specification of temporal locality:
>>> "perpetual" locality.
>>>
>> More like malloc(3)/free(3), but yes, pinning is "do not evict this
>> until I say so". The catch is that multi-tasking systems might not
>> actually be able to maintain that, so pins can be lost on
>> context-switch. For user mode, this should not be a significant
>> concern, since the pin can be redone on each iteration of an outer loop,
>> which is why pinning/unpinning is idempotent.
>>
>
> Having to repin even with only moderate frequency would remove
> some of the performance advantage. (This could be optimized with
> an approximate conservative filter; even a simple filter could
> quickly convert repinning to a nop if the pinning remained entirely
> intact. However, that adds significant hardware complexity.)
>

Hardware knows the last region pinned (2 XLEN-bit words or 2 XLEN-bit
words per cache ASID-partition) and knows if any pins have been broken
(one bit or one bit per cache ASID-partition; set it when CACHE.PIN is
executed, clear it when any pins are broken). Repinning the same region
would be trivial to recognize. Upon resuming the task that lost its
pins, some cachelines may be loaded normally before the region is
repinned. Pinning simply updates valid cachelines to "valid, pinned"
and loads cachelines not previously present. Remote invalidation can
"poke holes" in a pinned region, but repinning should still work
normally and the "pins intact" bit can be set again.
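In pseudo-hardware terms, something like this toy model (per cache
partition; the field and function names are made up):

#include <stdbool.h>
#include <stdint.h>

/* Per-partition state: the last pinned region and a "pins intact" bit. */
struct pin_state {
    uintptr_t base, bound;   /* last region passed to CACHE.PIN */
    bool      intact;        /* cleared whenever any pin is broken */
};

/* Invoked for CACHE.PIN: repinning the same, still-intact region is a
 * fast nop; otherwise record the region and do the real pinning work. */
static bool cache_pin_request(struct pin_state *s,
                              uintptr_t base, uintptr_t bound)
{
    if (s->intact && base == s->base && bound == s->bound)
        return false;                 /* nothing to do */
    s->base = base;
    s->bound = bound;
    s->intact = true;                 /* set when CACHE.PIN executes ... */
    return true;                      /* ... real pin/load work required */
}

/* Invoked whenever any pinned line is evicted, invalidated remotely, or
 * lost on a context switch. */
static void pins_broken(struct pin_state *s)
{
    s->intact = false;
}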

> [snip]
>
>> The concern here is that implementations where memory accesses *must* be
>> cached can avoid pinning so many cachelines that non-pinned memory
>> becomes inaccessible. There is no requirement that such a safety
>> interlock be provided, and implementations are permitted (but
>> discouraged) to allow cache pinning to be used as a footgun.
>>
>
> It seems that sometimes software would want to pin more cache
> than is "safe". (By the way, presumably uncacheable memory is
> still accessible. Memory for which no space is available for caching
> could be treated as uncached/uncacheable memory.)
>

Perhaps I am mistaken, but I understand that some current processors
cannot do that -- even "uncacheable" accesses must go through the cache,
but are simply immediately invalidated or written back.

Software could also want to pin more cache than exists, but that is
obviously not possible. Setting the limit for pinned cachelines
slightly lower than its physical hard limit is an implementation option
and can allow avoiding that entire can of worms.

> [snip]
>
>> Breaking pins on remote write was added to address complaints that
>> expecting such memory to remain pinned effectively made
>> invalidation-based coherency protocols unusable. The behaviors
>> described are both "may" options precisely because, while such writes
>> must work and must maintain coherency, exactly how writes to cachelines
>> pinned elsewhere are handled is intended to be implementation-defined.
>> Writes to pinned memory by the same hart that pinned it are expected;
>> the problems occur when other harts write to memory pinned on "this" hart.
>>
>
> (Actually the problem occurs when the pinned cache blocks are not
> shared between the pinning hart and the writing hart.)
>
> Implementation defined behavior requires a method of discovery. It
> also seems desirable to have guidelines on families of implementation
> to facilitate portability within a family.
>

The choice of implementation-defined behavior is software-invisible:
all valid options maintain coherency and user cache pins can be dropped
without warning anyway (such as by swapping out pages, although user
programs should use mlock(2) on anything pinned). Those behaviors are
described as an existence proof that cache pins are implementable.

> [snip Icache pinning]
>
>>> Why M-mode-only?
>>>
>> The I-cache pins are M-mode-only because that is the only mode where
>> context-switch can be guaranteed to not occur. These were added to
>> allow using the I-cache as temporary program memory on implementations
>> that normally execute from flash but cannot read from flash while a
>> flash write is in progress. The issue was seen on one of the HiFive
>> boards that was also an M-mode-only implementation.
>>
>
> How is this different from data pinning?
>

[This is branched off into "S-mode I-cache pins".]

>>> If one is supporting such ranged fetches, it seems that support for
>>> memory-copy could be trivially provided.
>>>
>> Maybe. There are scope-creep concerns and there are currently no ranged
>> writes in the proposal. (MEM.REWRITE is not itself a write.)
>>
>
> On the other hand, recognizing that features are closely related
> in implementation seems important.
>

Yes, but I am uncertain how a memory-copy opcode fits in here.
MEM.REWRITE is intended for optimizing memcpy(3) and memset(3), however.

>>> TLB prefetching also seems worth considering.
>>>
>> Any suggestions?
>>
>
> Nothing specific. Locking TLB entries is not uncommonly
> supported (the original MIPS TLB could set a minimum on
> the random index generated for replacement, locking low-numbered
> entries; Itanium defined "Translation Registers" which
> could be taken from the translation cache). PTE and
> paging structure prefetching is a natural extension of
> data prefetching.
>

[This also branched into "TLB pinning".]

Is there a use for TLB prefetch/pinning separate from data
prefetch/pinning? Or is this purely a matter of reach, since TLBs can
map much larger regions than caches can store? Should prefetch
instructions be able to (asynchronously) continue TLB prefetching of the
region after reaching cache capacity?

>>> In addition, the above does not seem to consider cache
>>> hierarchies and cache sharing (even temporal sharing through
>>> software context switches). While most of the uses of such
>>> operations would have tighter control over thread allocation,
>>> some uses might expect graceful/gradual decay.
>>>
>> I am making an effort to keep these operations relatively abstract even
>> though that limits the amount of detail that can be specified.
>> Generally, embedded systems are expected to have that sort of tight
>> control and large systems (such as a RISC-V PC) are expected to use
>> ASID-partitioned caches (effectively an independent cache for each ASID)
>> for reasons of performance and security, since Spectre-like attacks
>> enable cache side-channels without the "sender's" cooperation.
>>
[Was a response intended here? The message ended with this quote block.]


-- Jacob

Luke Kenneth Casson Leighton

unread,
Jun 26, 2018, 3:40:30 AM6/26/18
to Jacob Bachmeyer, Paul Miranda, RISC-V ISA Dev, Paul A. Clayton
On Tue, Jun 26, 2018 at 3:42 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

>> if the short-duration task instead used a small subset of the TLB,
>> then despite the short-duration's task being a bit slower, it's quite
>> likely that the longer-duration task would be even worse-affected,
>> particularly if there are significant numbers of smaller TLB entries
>> (heavily-fragmented memory workloads) rather than fewer huge ones [for
>> whatever reason].
>>
>
>
> Either the TLB is also partitioned by ASID, in which case this does not
> matter because the two tasks will have effectively independent TLBs, or TLB
> pinning will not help because pins are broken on context switch to prevent
> one task from denying service to another by tying up a significant part of
> the cache or TLB.

ok so it's not a good idea to attempt, you need to switch the entire
TLB out anyway. i was under the impression that even with
different... argh what's that prefix for TLBs that makes them
separate... tags is it called? i was under the impression that even
with different "tags" it might be possible to leave some entries
untouched. evidently not.

> Failing to break pins on context switch also creates some nasty side
> channels, so pins must be limited by ASID. ASID-partitioning is the only
> way to close cache side-channels generally. Spectre-like attacks can cause
> even innocent processes to "send" on a cache side-channel. The only
> solution is to close the side-channels, and ASID-partitioning has the added
> advantage of being able to pack more cache into the same area by
> interleaving the partitions, since only one partition will be active at a
> time, the power dissipation is that of a single partition but spread over
> the area of multiple partitions, reducing overall thermal power density.

that sounds like a reasonable analysis to me.

>> monero's mining algorithm in particular i understand to have been
>> *deliberately* designed to hit [really large amounts of] memory
>> particularly hard with random accesses, as a deliberate way to ensure
>> that people don't design custom ASICs for it.
>>
>
>
> Does monero use scrypt? That was designed for hashing passwords to resist
> GPU-based cracking.

i don't know if it uses scrypt (and it would likely not be useful for
it to do so): the algorithm is designed specifically so that it's ok
to implement on GPUs but *NOT* in a custom ASIC. i.e. it
*specifically* requires significant numbers of data / table lookups
across really really large amounts of memory (currently 4GB, to be
increased to 8GB in the event that someone actually does try creating
a custom ASIC to mine monero).

by contrast bitcoin relies exclusively on SHA256 which can be
massively parallelised (making it a serious runaway race consuming
vast amounts of power and resources, planet-wide).

so due to the massive deliberate random-access memory pattern monero
is a fair candidate for use-case analysis.

l.

Albert Cahalan

unread,
Jun 26, 2018, 5:30:18 AM6/26/18
to jcb6...@gmail.com, RISC-V ISA Dev
On 6/22/18, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

> Previous discussions suggested that explicit cache-control instructions
> could be useful, but RISC-V has some constraints here that other
> architectures lack, namely that caching must be transparent to the user
> ISA.

You can implement hardware with 32-byte cache lines, but expose it to
software as if it were 512-byte cache lines. Adjust the numbers as you like.

The page size of 4096 bytes is already known. That makes a fine choice
for a software-visible cache line. Make that the case on all hardware, even
if the underlying implementation uses a smaller cache line size. Another fine
choice is 512 bytes, the traditional size of a disk block and thus
relevant for DMA.
The biggest real cache line I've heard of was 128 bytes. Going up an extra power of two for
some breathing room gives 256 bytes. That too is a perfectly fine size.

In other words: just pick something.

Paul A. Clayton

unread,
Jun 26, 2018, 11:47:15 AM6/26/18
to jcb6...@gmail.com, RISC-V ISA Dev
On 6/26/18, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> Paul A. Clayton wrote:
>> On 6/25/18, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>>
>>> Paul A. Clayton wrote:
>>>
>>>> On 6/22/18, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
[snip]
>> By not faulting the asynchronous form is not strictly a directive (unless
>> the range exceeded capacity in such a way that either skipped addresses
>> would have been overwritten regardless of the skipping or the prefetch
>> would stop when capacity was reached).
>>
>
> Prefetches always stop when capacity is reached; synchronous prefetches
> report where they stopped.

I was not thinking clearly, emphasizing in my mind "as much of the
chosen region as possible" and ignoring "must terminate and produce
the faulting address" (although the later might be tightened/clarified
to indicate that the range is loaded sequentially until a permission,
validity, invalid data (uncorrectable ECC error) fault (are there other
possible faults?)), particularly considering "end of capacity" as a
fault.

Since the operation as described gives equal weight to all data
in the region and gives no preferred ordering of access, loading
the end (or any arbitrary subset) would meet the goal *if* it was
intended as a single request and not as a conditional partial load
(where unloaded parts may be loaded later). If the access is
"random" within the region, then a single request might be normal
as software might not be able to use blocking to adjust the use
pattern to fit the available capacity.

> Synchronous prefetches do not fault either -- prefetches are allowed to
> run "off the end" of valid memory and into unmapped space; if this were
> not so, "LOAD x0" would be a prefetch instruction, but it is not.

They fault in the sense that they stop operation and return a
value related to the failure, allowing continuation if the access
pattern can be blocked.

> Then the prefetch levels are intended to indicate relative utility.
> Persistence is more limited: zero is no prefetch at all, once is the
> MEM.PF.ONCE and MEM.PF.STREAM instructions, many times is
> MEM.PF.PF[0123]. Criticality is not well-represented in this proposal,
> aside from a slight implication that streaming prefetch prefers
> bandwidth over latency.

So it mostly communicates to hardware how much performance
would be lost by dropping the prefetch? However, if the prefetch is a
directive (especially if limited to a uniform latency cache), how
would hardware be expected to use this information? It cannot
drop the request even at level 0 (e.g., based on bandwidth cost)
because it is specified as a directive. Hardware might have a
"has been used" bit (to better support "use-once" and speculative
hardware prefetching utility determination) and after first use
set replacement information according to expected resuse distance.

There are at least two considerations: priority of prefetch (which
should include being able to drop it entirely, but this is not allowed
in the current specification) and replacement of prefetched data.

> How well are utility and criticality typically correlated? For that
> matter, how many levels of each should be distinguished?

For large regions, utility can be high while criticality could be low
(assuming random access, in which case the first access may
be at the end of the prefetch stream).

> I almost want to describe criticality as "yesterday", "now", and
> "later". But "yesterday" can be represented by an actual access, so
> that leaves "now" and "later" as prefetch options.

One might want to provide "yesterday" as an asynchronous
prefetch; if only part of the expected latency can be hidden by work,
one really would have preferred an earlier prefetch (i.e., "yesterday").
N loads, even to x0, are also likely to be more expensive than a
single prefetch instruction for some values of N.

> Well, the smallest granule that the hardware cares about, which may be
> an entire cacheline in practice, but application code does not know the
> actual cacheline size. As long as accesses exhibit both spacial and
> temporal locality, there is a good chance that marking the accessed
> lines as "first to evict" but not actually dropping them until new
> cachelines need to be loaded will work in practice.

One might generally have significant temporal locality for use once
within a cache block, but not for the entire prefetch region (though
this would imply a significant degree of non-random access assuming
there is no knowledge of the block size; a "cache oblivious" algorithm
might try to localize accesses to use the largest likely block size, but
arbitrary blocking sizes are not always practical, especially for "random"
access use once).

>> This could also be an argument for not limiting such instructions to
>> 32-bit encodings.
>>
>
> It is an argument, but I do not think that fine-grained prefetch has
> enough demand to be the first to break beyond 32-bit instructions.

Part of the point of the RISC-V encoding is to encourage variable-length
encoding (VLE), with its ability to support extension. A 64-bit encoding
would be no worse than requiring two 32-bit instructions.

[snip]

> Is there a useful way for software to indicate this intent or should
> hardware simply recognize that larger prefetch requests should target
> L2? Perhaps lower prefetch levels should prefetch only into outer
> caches (a current implementation option)?

Size is a significant piece of information, but other information
might be worth communicating.

I do not have the time and energy to work out a good suggestion.
I am not even sure what the intended uses for random-access
(non-pinning) ranges are.

> For a region that fits in L4 but no farther in, "minimal delay" would be
> the latency of L4, since that is the only place the entire region can
> fit. I think a clarification that pinned cachelines must remain cached,
> but can be moved within the cache subsystem may be needed. Do you agree?

I suspect use may be more complex. If there is sequential
access with reuse, the first portion might be preferentially fetched
to L1 and hardware could be aware of when prefetches from L2 etc.
should be initiated. Hardware could even retain awareness of the
portion of the region that did not fit and prefetch that to L1 in a
more timely manner.

[snip]
>> It seems that sometimes software would want to pin more cache
>> than is "safe". (By the way, presumably uncacheable memory is
>> still accessible. Memory for which no space is available for caching
>> could be treated as uncached/uncacheable memory.)
>>
>
> Perhaps I am mistaken, but I understand that some current processors
> cannot do that -- even "uncacheable" accesses must go through the cache,
> but are simply immediately invalidated or written back.

That seems unlikely.

[snip]
>> Implementation defined behavior requires a method of discovery. It
>> also seems desirable to have guidelines on families of implementation
>> to facilitate portability within a family.
>>
>
> The choice of implementation-defined behavior is software-invisible:
> all valid options maintain coherency and user cache pins can be dropped
> without warning anyway (such as by swapping out pages, although user
> programs should use mlock(2) on anything pinned). Those behaviors are
> described as an existence proof that cache pins are implementable.

They are architecturally invisible but not performance invisible.
Since prefetches (without pinning) are performance-targeting
operations, such effects may be significant to software.

[snip]

> Is there a use for TLB prefetch/pinning separate from data
> prefetch/pinning? Or is this purely a matter of reach, since TLBs can
> map much larger regions than caches can store? Should prefetch
> instructions be able to (asynchronously) continue TLB prefetching of the
> region after reaching cache capacity?

If one has a large region that is randomly accessed in a
given execution phase, prefetching address translation
information may be useful where data prefetching would
not be (because most of the range might not be accessed
in that phase).

I do not think I can contribute anything to this proposal;
I just had a few thoughts that sprang quickly to mind.


>>>> In addition, the above does not seem to consider cache
>>>> hierarchies and cache sharing (even temporal sharing through
>>>> software context switches). While most of the uses of such
>>>> operations would have tighter control over thread allocation,
>>>> some uses might expect graceful/gradual decay.
>>>>
>>> I am making an effort to keep these operations relatively abstract even
>>> though that limits the amount of detail that can be specified.
>>> Generally, embedded systems are expected to have that sort of tight
>>> control and large systems (such as a RISC-V PC) are expected to use
>>> ASID-partitioned caches (effectively an independent cache for each ASID)
>>> for reasons of performance and security, since Spectre-like attacks
>>> enable cache side-channels without the "sender's" cooperation.
>>>
> [Was a response intended here? The message ended with this quote block.]

No.

Jacob Bachmeyer

unread,
Jun 26, 2018, 10:36:47 PM6/26/18
to Paul A. Clayton, RISC-V ISA Dev
Paul A. Clayton wrote:
> On 6/26/18, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
>> Paul A. Clayton wrote:
>>
>>> On 6/25/18, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>>>
>>>> Paul A. Clayton wrote:
>>>>
>>>>> On 6/22/18, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>>>>>
> [snip]
>
>>> By not faulting the asynchronous form is not strictly a directive (unless
>>> the range exceeded capacity in such a way that either skipped addresses
>>> would have been overwritten regardless of the skipping or the prefetch
>>> would stop when capacity was reached).
>>>
>> Prefetches always stop when capacity is reached; synchronous prefetches
>> report where they stopped.
>>
>
> I was not thinking clearly, emphasizing in my mind "as much of the
> chosen region as possible" and ignoring "must terminate and produce
> the faulting address" (although the later might be tightened/clarified
> to indicate that the range is loaded sequentially until a permission,
> validity, invalid data (uncorrectable ECC error) fault (are there other
> possible faults?)), particularly considering "end of capacity" as a
> fault.
>

Uncorrectable ECC error is expected to be an NMI in RISC-V. This raises
an interesting point: should a prefetch that encounters an
uncorrectable ECC error raise the associated NMI or simply drop the
unusable data and pretend that it did not happen?

> Since the operation as described gives equal weight to all data
> in the region and gives no preferred ordering of access, loading
> the end (or any arbitrary subset) would meet the goal *if* it was
> intended as a single request and not as a conditional partial load
> (where unloaded parts may be loaded later). If the access is
> "random" within the region, then a single request might be normal
> as software might not be able to use blocking to adjust the use
> pattern to fit the available capacity.
>

The intent is that synchronous prefetches load some contiguous prefix of
the requested region, while asynchronous prefetches can load an
arbitrary subset of the requested region, skipping over pages that are
not present.

>> Synchronous prefetches do not fault either -- prefetches are allowed to
>> run "off the end" of valid memory and into unmapped space; if this were
>> not so, "LOAD x0" would be a prefetch instruction, but it is not.
>>
>
> They fault in the sense that they stop operation and return a
> value related to the failure, allowing continuation if the access
> pattern can be blocked.
>

They do not fault in the sense of raising an exception and taking a trap.

>> Then the prefetch levels are intended to indicate relative utility.
>> Persistence is more limited: zero is no prefetch at all, once is the
>> MEM.PF.ONCE and MEM.PF.STREAM instructions, many times is
>> MEM.PF.PF[0123]. Criticality is not well-represented in this proposal,
>> aside from a slight implication that streaming prefetch prefers
>> bandwidth over latency.
>>
>
> So it is mostly communicates to hardware how much performance
> would be lost by dropping the prefetch? However, if the prefetch is a
> directive (especially if limited to a uniform latency cache), how
> would hardware be expected to use this information? It cannot
> drop the request even at level 0 (e.g., based on bandwidth cost)
> because it is specified as a directive. Hardware might have a
> "has been used" bit (to better support "use-once" and speculative
> hardware prefetching utility determination) and after first use
> set replacement information according to expected resuse distance.
>
> There are at least two considerations: priority of prefetch (which
> should include being able to drop it entirely, but this is not allowed
> in the current specification) and replacement of prefetched data.
>

The intention is that prefetches must be placed into the queue, but the
queue may be "leaky" -- that detail is unspecified. The original
concept was that prefetch levels indicated expected frequency of use on
some fuzzy scale and implementations could map them to prefetching into
various cache levels. I now know that that model is a bit ...
simplistic, although I believe that x86 uses a similar approach.

>> How well are utility and criticality typically correlated? For that
>> matter, how many levels of each should be distinguished?
>>
>
> For large regions, utility can be high while criticality could be low
> (assuming random access, in which case the first access may
> be at the end of the prefetch stream).
>

So prefetches should be defined for various combinations of utility and
criticality? Is there a simple algorithm that allows hardware to
untangle those into a prefetch queue ordering and can this instead be
moved to "compile-time", flattening that 2D space into a 1D "conflated
prefetch level"?

>> I almost want to describe criticality as "yesterday", "now", and
>> "later". But "yesterday" can be represented by an actual access, so
>> that leaves "now" and "later" as prefetch options.
>>
>
> One might want to provide "yesterday" as an asynchronous
> prefetch; if only part of the expected latency can be hidden by work,
> one really would have preferred an earlier prefetch (i.e., "yesterday").
> N loads even to x0 is also likely to be more expensive than a
> single prefetch instruction for some values of N.
>

Yes. I had the idea of having effectively two prefetch priorities, with
actual accesses at a higher priority than any prefetch.


-- Jacob

Paul Miranda

unread,
Jun 28, 2018, 3:49:04 PM6/28/18
to RISC-V ISA Dev
While the gold standard in separating threads is to partition the TLB and cache by ASID (and that was my first thought as well when I first read the spec), I don't think it's strictly required by the RISC-V ISA, which allows for a lighter-weight implementation that does not necessarily provide this hard partitioning. The only useful way I have thought of applying this is to keep selected ASID==0 TLB and cache entries pinned (or simply not blindly flushed by an SFENCE.VMA with a nonzero ASID), and everything else is subject to flushing.
There are definitely applications that care more about latency and cost than protecting data against side channel attacks. 
Even if you do want to enforce ASID partitioning, I can imagine implementations that don't have a 1-to-1 mapping of hardware partitions to ASID values, so long as context switch timing doesn't vary with data values. (Probably easier said than done, let alone proven correct!)


On Monday, June 25, 2018 at 9:42:49 PM UTC-5, Jacob Bachmeyer wrote:
Note that cache pins are broken upon context switch unless the cache is
ASID-partitioned -- each task must appear to have the entire cache
available.  User tasks that use D-cache pins are expected to re-pin
their working buffer frequently.  Of course, if the working buffer is
"marching" through the address space, the problem solves itself as the
scratchpad advances (unpin old, pin new) through the address space.
...
 

Jacob Bachmeyer

unread,
Jun 28, 2018, 7:24:46 PM6/28/18
to Paul Miranda, RISC-V ISA Dev
Paul Miranda wrote:
> While the gold standard in separating threads is to partition TLB and
> Cache by ASID (and that was my first thought as well when I first read
> the spec), I don't think it's strictly required by the RISCV-ISA,
> allowing for a lighter-weight implementation that does not necessarily
> provide this hard partitioning. The only useful way I have thought of
> applying this is to keep selected ASID==0 TLB and cache entries pinned
> (or simply not blindly flushed by a SFENCE.VMA with nonzero ASID), and
> everything else is subject to flushing.
> There are definitely applications that care more about latency and
> cost than protecting data against side channel attacks.
> Even if you do want to enforce ASID partitioning, I can imagine
> implementations that don't have a 1-to-1 mapping of hardware
> partitions to ASID values, so long as context switch timing doesn't
> vary with data values. (probably easier said than done provably correct!)

ASIDs effectively partition the TLB; that is their purpose. The only
difference between a hard-partitioned TLB and a plain ASID-capable TLB
is that the hard-partitioned TLB reserves each slot for some particular
ASID, while the plain TLB stores an ASID in every slot and can assign
the same slot to different ASIDs at different times. This could
actually make the hard-partitioned TLB *less* complex in a sense, since
the ASID CAM columns are unneeded. RISC-V implementations are allowed
to support subsets of the ISA-defined ASID space, so 1-to-1 mapping of
ASID to hardware cache/TLB partitions is not unreasonable to expect,
although neither is it mandatory.
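
For concreteness, a rough data-structure sketch of the difference (my
own illustration, not spec text):

    #include <stdint.h>

    /* Plain ASID-capable TLB: every slot carries its own ASID tag
     * (the "ASID CAM column"). */
    struct tlb_slot_plain {
        uint16_t asid;
        uint64_t vpn, ppn;
        uint8_t  perms;
    };

    /* Hard-partitioned TLB: one ASID per partition; the slots themselves
     * need no ASID field, which is the sense in which it can be simpler. */
    struct tlb_partition {
        uint16_t asid;
        struct {
            uint64_t vpn, ppn;
            uint8_t  perms;
        } slots[64];
    };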

Caches are a different matter, but for some cache topologies
(particularly simple VIPT caches that are limited by the page size),
partitioning can allow the system to have larger caches than any
individual task can directly use, in addition to its power-density
reduction benefits.

Partitioned caches are never mandated in RISC-V, but implementations
that cannot provide fully-independent per-task caches must take steps to
isolate tasks that will affect some of the proposed features. In
particular, pinned cachelines must be unpinned if they would affect
another task, since leaving them pinned could itself open a side channel
due to the apparent reduction in cache size and its performance
effects. ASID-partitioned caches are a performance feature, since
non-partitioned caches must be flushed on every context switch. Further
partitioning the cache by privilege level is both a performance and
security feature, but again is never mandated. I expect that small
embedded systems will resort to flush-on-context-switch or even accept
the insecurity of non-isolated caches, which can be managed if the
complete software set is known, as it often is in embedded systems.
Larger, general-purpose, systems will want hard-partitioned caches and
TLBs for both security and performance.


-- Jacob

Paul Miranda

unread,
Jun 28, 2018, 11:12:26 PM6/28/18
to Jacob Bachmeyer, RISC-V ISA Dev
"non-partitioned caches must be flushed on every context switch."

I have looked for a statement either way in the RISC-V privileged spec and never found one.

It is clear that the TLB must be flushed on an ASID or PPN change, and there is the explicit SFENCE.VMA to indicate when.
Similarly there is FENCE.I for flushing instruction cache.
Architecturally I can't find any statement of when or how a data cache should be flushed.  (The how can be covered by the proposed instructions quite well, but I think the when is open to different usage scenarios.)
I have been thinking about the small embedded core case often, and would advocate limited data cache flushing.

Jacob Bachmeyer

unread,
Jun 28, 2018, 11:46:30 PM6/28/18
to Paul Miranda, RISC-V ISA Dev
The RISC-V ISA spec is silent on the matter, but failing to flush opens
side channels. Some systems may be able to tolerate these side channels
(example: embedded systems with known software) while others will have
big problems from them.

Some cache architectures may also require caches to be flushed or other
measures taken to prevent cached data from one task appearing in
another. Some (embedded) implementations may be able to tolerate such
leakage, however, so the RISC-V spec does not explicitly prohibit it.

My personal experience is with (1) PC-class hardware and (2) very small
microcontrollers (AVR, PIC32) so the proposal may have some blind spots
as it is oriented more towards large systems. Suggestions for
improvement on this issue are welcome, of course.


-- Jacob

Bill Huffman

unread,
Jul 6, 2018, 3:41:13 PM7/6/18
to RISC-V ISA Dev, jcb6...@gmail.com
Thank you, Jacob, for the work of putting together this proposal. RISC-V absolutely needs cache controls of some sort and I very much like the use of regions for the reasons you've articulated.

I have several separate comments I'll put under headings below.  I'm new to the discussion, so please forgive me if I'm rehashing any old issues.

========== asynchronous operations ==========================
In the proposal, setting rd to x0 indicates (potential) asynchronous operation.  The statement is that this can be used when "the instruction has no directly visible effects."  I'm guessing this means when the instruction has no required functional results (which probably means it's done for performance reasons).  A MEM.DISCARD before a non-coherent I/O block write, for example, cannot be done asynchronously because the result is functionally required and because the thread would have no way to know when the operation was complete. These two are tied together by the use of x0 as the destination register, I think.

If this is the thinking behind the asynchronous operations, I suggest a little more along this line be said in the spec.  I find myself wondering as well what happens to an asynchronous operation when another cacheop (sync or async) is executed.  Is the old one dropped?  Is the new one stalled until the existing one is completed?  Does anything happen on an interrupt?  What if the interrupt executes a new cacheop (sync or async)?  A full context switch?  Debug?

Perhaps all of these could be implementation defined, but I suspect some requirements need to be stated.  We might need an instruction to force the asynchronous machine to stop, for example.  We certainly need an interrupt to be able to occur in the middle of a synchronous operation and there needs to be a way for any cacheops in the interrupt handler to have priority over any ongoing cacheop.  Probably a synchronous cacheop stops in the middle on interrupt with rd pointing to the remaining region.  Perhaps an asynchronous cacheop is killed on interrupt, or is killed if the interrupt does a cacheop.
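
For concreteness, here is how I picture software driving a synchronous
cacheop to completion across partial completions (a sketch only;
cache_flush_region() is a hypothetical wrapper around the proposed
CACHE.FLUSH, returning the "first address after the highest address
affected" value from rd):

    #include <stdint.h>

    /* Hypothetical wrapper around the proposed CACHE.FLUSH. */
    extern uintptr_t cache_flush_region(uintptr_t base, uintptr_t bound);

    /* Drive a synchronous flush over an inclusive region, re-issuing
     * after any partial completion (e.g., after an interrupt). */
    static void flush_region(uintptr_t base, uintptr_t bound)
    {
        while (base <= bound)
            base = cache_flush_region(base, bound);
    }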

I also find myself wondering why MEM.DISCARD and MEM.REWRITE are specified to be synchronous only.  It's clear why they need a warning to the implementer that writing into the region after executing asynchronous versions *must* appear to write after the operations are complete.  But this could be accomplished in an implementation by a variety of mechanisms, such as doing the operations synchronously anyway, ending the operation if a write hit the remaining region, or stalling the write until the danger was past.

========= whole cache operations =========================
The proposal doesn't appear to include any plan for operations on the entire cache.  The most prominent needs here are probably initialization and, prior to disabling or changing size, flushing the entire cache.  It would be convenient to be able to use the same logic for these kinds of operations.  Has there been any consideration of how this might be included?

These can be M-mode only and don't need operands.  They might be considered to belong only under the SYSTEM major opcode, but I wanted to ask what thinking had been done.

========== pinned cache regions and cache-coherence =========
There has been discussion of what to do in a coherent system when a pinned region of the cache is written by another hart.  It seems like it would work to allow a pinned line to be valid or not.  The idea is that it is the address that is pinned, but the validity (and data) can still be taken away by the coherence protocol.  I think this would allow pinning and coherence to peacefully coexist as orthogonal concepts. This might also allow the pinning process to be separated into the "pin" itself and a prefetch that is expected to fill the data for the pinned range.  A scratchpad in a coherent system would then have "sectored cache" characteristics with many valid bits but one or a few tags.
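
To make the picture concrete, a rough structural sketch of such a
pinned, sectored line (my assumption about geometry, not proposal text):

    #include <stdbool.h>
    #include <stdint.h>

    #define SECTORS_PER_LINE 8              /* assumed geometry */

    /* The address (tag) stays pinned; the coherence protocol may still
     * take validity and data away sector by sector. */
    struct pinned_sectored_line {
        uint64_t tag;
        bool     pinned;
        bool     valid[SECTORS_PER_LINE];
        uint8_t  data[SECTORS_PER_LINE][8];
    };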

This would solve some of the open issues with pinning, but I still find myself wondering about several issues.  What happens when a pin cannot be accomplished?  This could be because there are not enough cache ways, or because register pairs are used to hold pinned ranges and there are no more register pairs.  What happens when switching threads and executing more pin operations?  What guarantees are there that a region will remain pinned?  Any?  Why does a scratchpad implementation write back on an unpin operation where a cache implementation does not need to?  And why does it write back "if possible?"  If there's uncertainty, how does a scratchpad implementation work in a known fashion?

Also, is there the possibility of a performance related pin operation for the I-Cache in addition to the absolute, M-mode pin discussed?

========== memory obsolescence issues ========================
On the "foreign data" question wrt MEM.REWRITE, I was impressed that someone worked out how just being allowed to read the line being changed was not enough.  Thank you.  But it still bothers me to have to do all the zeroing.  It's not the gates that bother me.  It's the cycles in a machine with a memory access much narrower than the size of the cache line, where several accesses are required to the data portion of the cache to zero a line while only one read and one write are required to the tag otherwise.

I think you are suggesting that a line that is a MEM.REWRITE target and is already in the cache can remain unchanged.  I think this is an important property to keep because in a hart with a large cache which runs a single thread for a long time, it will often be the case that the target line of the MEM.REWRITE is already in the cache in an exclusive state.  Without this, for performance reasons, users (or memcpy/memset) might need to try to determine whether much of the data will already be in cache before deciding whether or not to execute a MEM.REWRITE.

I also wonder whether it might be OK, in an M-mode only implementation, to not zero any lines, ever, on MEM.REWRITE.

I wonder why destructive operations are nops when they specify one byte (same register in rs1 and rs2).  It seems like they should do the non-destructive version the way partial cachelines at the ends of a region would.

On the question of whether to turn MEM.REWRITE into MEM.CLEAR instead, I would suggest not, for the reasons above.  But we might consider adding MEM.CLEAR as a separate operation, since all the hardware for that will already be there.  On the other hand, if MEM.REWRITE is implemented as a nop or possibly the zeroing isn't done in an M-mode only implementation, this might be an issue.  Perhaps rd==rs1 could signal that the clear has not been done and needs to be done by another mechanism.
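
A sketch of what I have in mind for software (purely illustrative;
mem_clear_region() is a hypothetical wrapper around such a MEM.CLEAR,
and the rd==rs1 convention is the one suggested above):

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical wrapper around MEM.CLEAR: returns rd as in the
     * proposal (first address past the affected region), or rs1 if the
     * hardware declined to perform the clear. */
    extern uintptr_t mem_clear_region(uintptr_t base, uintptr_t bound);

    static void clear_region(uintptr_t base, uintptr_t bound /* inclusive */)
    {
        uintptr_t next = mem_clear_region(base, bound);
        if (next == base) {                 /* clear not done: fall back */
            memset((void *)base, 0, bound - base + 1);
            return;
        }
        while (next <= bound)               /* drive partial completions */
            next = mem_clear_region(next, bound);
    }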

========== returning rs1 ======================================
The statement is made that these instructions can be implemented as nops and return rs1.  I don't fully understand this provision.  Is this to be done when there is no cache hardware (or it's disabled)?  If so, shouldn't the instruction return success (rd <- rs2+1) rather than what seems like a kind of failure, or at least has to be separately tested for?  If there is an enabled cache and the instructions don't work, shouldn't they take an illegal/unimplemented instruction exception?

The "hint" description seems to be involved here and seems like it either needs a more exact description or I think it should be somewhat different.  I don't understand the statement that prefetches cannot be nops.  Flush and several others probably can't be nops, but prefetches can be.  I think MEM.REWRITE can be a nop.

In general, it seems "downgrades" (flush, discard, etc.) must have their defined behavior while "upgrades" (prefetch, rewrite, etc.) can be nops (returning rs2+1) in an implementation that doesn't wish to make them available.

======== rs2 inclusive or exclusive ====================
The definition given for the upper bound is inclusive rather than exclusive.  This has two advantages I can think of: if the top of the region is the top of memory, there's no wrap to zero issue, and it's possible to use the same register twice for a one line cache operation.

Maybe you've considered this earlier, but there are some reasons to have rs2 be exclusive.  (In providing similar instructions in the past, we used a size, but the partial completion capability here needs to specify a bound.)

** It is consistent with the "exclusive," if you will, return in rd of the first byte address that has not been operated on.

** It is more straight-forward to compute (as rs1 plus size).

** Upon correct completion, rd <- rs2 instead of rs2+1

** Testing for completion is simpler (rd==rs2).

If this representation were used, we might set rs2 to x0 to signify operating on the smallest unit containing rs1.  Or we might simply use rs2 <- rs1+1 if providing a one-byte cacheop is not especially important.

For the wrap from high memory to zero case, maybe the highest memory address can't be used.  Or maybe the wrap doesn't matter.  And, in addition, the wrap already exists for the return value in rd.  It may be confusing to have the wrap to zero in one case and not the other.

To me, the advantages of exclusive outweigh the advantages of inclusive.  Maybe you can help more with why inclusive, or maybe you can consider exclusive.
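
For illustration, with the exclusive form the software loop and the
completion test become a plain pointer comparison (cacheop_region() is
again a hypothetical wrapper around one of the region instructions):

    #include <stddef.h>
    #include <stdint.h>

    extern uintptr_t cacheop_region(uintptr_t base, uintptr_t limit /* exclusive */);

    static void do_region(uintptr_t base, size_t size)
    {
        uintptr_t limit = base + size;      /* rs2 = rs1 + size: easy to compute */
        while (base != limit)               /* complete exactly when rd == rs2 */
            base = cacheop_region(base, limit);
    }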

========== cache hierarchies ===================================
It seems to me that there ought to be some provision, or at least a thought of how it might be added, to control how far "in" to a cache hierarchy (or how close to the processor) a prefetch should operate. There might also be a need for controlling how far "out" in a cache hierarchy a flush should operate (depending on which non-coherent, or partially coherent, masters need to see it).

========== required capabilities ================================
I think the proposal should state somewhere exactly what capability (R, W, or X, I assume) is required of the process before it can make whatever change to any given cache line.  For example, it might need read privilege to do non-destructive operations and write privilege to do destructive operations.
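
Something like the following rule, say (an assumption on my part, not in
the proposal today):

    #include <stdbool.h>

    enum region_op { OP_PREFETCH, OP_FLUSH, OP_DISCARD, OP_REWRITE };

    /* Suggested rule: destructive operations need write permission on the
     * page; non-destructive operations need only read permission. */
    static bool region_op_permitted(enum region_op op,
                                    bool can_read, bool can_write)
    {
        switch (op) {
        case OP_DISCARD:
        case OP_REWRITE:
            return can_write;
        default:
            return can_read;
        }
    }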

     Bill

Jose Renau

unread,
Jul 6, 2018, 4:17:40 PM7/6/18
to Jacob Bachmeyer, Paul A. Clayton, RISC-V ISA Dev
To make it more deterministic, I would not raise NMI from ECC errors in prefetch requests.

Otherwise, depending on the system load, a single-threaded application may or may not raise an NMI.


Luke Kenneth Casson Leighton

unread,
Jul 6, 2018, 4:22:07 PM7/6/18
to Bill Huffman, RISC-V ISA Dev, Jacob Bachmeyer
On Fri, Jul 6, 2018 at 8:41 PM, Bill Huffman <huf...@cadence.com> wrote:

> On the question of whether to turn MEM.REWRITE into MEM.CLEAR instead, I
> would suggest not, for the reasons above. But we might consider adding
> MEM.CLEAR as a separate operation, since all the hardware for that will
> already be there. On the other hand, if MEM.REWRITE is implemented as a nop
> or possibly the zeroing isn't done in an M-mode only implementation, this
> might be an issue. Perhaps rd==rs1 could signal that the clear has not been
> done and needs to be done by another mechanism.

whenever there has been the possibility of polarisation due to one
implementor preferring (or requiring) one scheme and another requiring
a mutually-exclusive alternative, to create a win-win situation i've
always advocated that implementors be required to support both. this
seems to be what you are suggesting, bill, which is encouraging.

however... allowing implementors to choose *not* to implement parts
of a standard has ramifications: the X25 standard made that mistake
decades ago, by making hardware-control-lines optional and allowing a
"fall-back" position in software... consequently everyone *had* to
implement the software mechanism, thus making the hardware control
lines totally redundant. given that X25 did not have an XMIT clock it
made the standard expensive to deploy vs RS232: external clock box
with external PSU vs a $1 cable.

my feeling is, therefore, that it would be better to make it
mandatory, *but* that implementors are advised that they can choose
*not* to actually put in the hardware but instead *must* throw an
exception... which can be caught.... and in the trap handler the clear
may be explicitly done by any mechanism that the implementor chooses.

this saves on instructions (and cycles) for implementors that choose
to implement the clear in hardware, but without software libraries
being forced to support the "fall-back" position... *just* in case
widely and publicly distributed binaries happen to run on
widely-disparate systems.

in essence, if clear is not made mandatory at the hardware level, the
requirement to action the clear at the *software* level becomes a
de-facto and indirect part of the standard, even if that was not
intentional.

and if that's going to be the case it's much better to be done via a
trap than be left in the program. of course... there is a caveat
here: traps cause context-switching. context-switching may have
unintended side-effects on the cache... so it's not as clear-cut in my
mind as the logical reasoning above would imply.

l.

Bill Huffman

unread,
Jul 6, 2018, 4:58:08 PM7/6/18
to RISC-V ISA Dev, huf...@cadence.com, jcb6...@gmail.com
Yes, Luke, I was a little fuzzy there.  I think the standard must avoid uncertainty.  In this case
I can think of the following ways to do that:

* Of course we can decide not to define MEM.CLEAR.

* We can mandate that MEM.CLEAR clear memory.

* We can mandate that MEM.CLEAR either clear memory or raise an exception.

* We can include in MEM.CLEAR a set of responses in rd that the program is required to deal
with.  The partial completion responses already require re-executing the instruction.  Maybe
other responses are possible as well.

But things cannot simply happen or not happen!

On the other hand, things such as prefetch, which I view as having no functional results, can
just happen or not.  The current statement that they must happen surprises me.

      Bill

Jacob Bachmeyer

unread,
Jul 6, 2018, 8:29:49 PM7/6/18