Proposal: Explicit cache-control instructions (draft 2 after feedback)

Jacob Bachmeyer

Jun 16, 2017, 8:42:26 PM
to isa...@groups.riscv.org
Recent discussions have suggested that explicit cache-control
instructions could be useful, but RISC-V has some constraints here that
other architectures lack, namely that caching must be transparent to the
user ISA.

I propose a new minor opcode REGION := 3'b001 within the existing
MISC-MEM major opcode. Instructions in REGION are R-type and use rs1 to
indicate a base address, rs2 to indicate an upper bound address, and
produce a result in rd that is the first address after the highest
address affected by the operation. If rd is x0, the instruction has no
directly visible effects and can be executed entirely asynchronously as
an implementation option.

Non-destructive operations permit an implementation to expand a
requested region on both ends to meet hardware granularity for the
operation. An application can infer alignment from the produced value
if it is a concern. As a practical matter, cacheline lengths are
expected to be declared in the processor configuration ROM.
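
As an illustration, here is a minimal C sketch of how an implementation
might expand a requested region to its line granularity and compute the
produced value; the 64-byte line length, the inclusive upper bound, and
the function name are assumptions for the example, not part of the proposal.

#include <stdint.h>
#include <stdio.h>

#define CACHELINE 64u  /* assumed line length; the real value would come
                          from the processor configuration ROM */

/* Model of a non-destructive REGION operation: expand [base, bound] outward
 * to cacheline granularity and return the first address after the highest
 * address affected (the value the instruction would produce in rd). */
static uintptr_t region_expand(uintptr_t base, uintptr_t bound)
{
    uintptr_t first = base & ~(uintptr_t)(CACHELINE - 1);        /* round base down */
    uintptr_t after = (bound | (uintptr_t)(CACHELINE - 1)) + 1;  /* round bound up, then one past */
    (void)first;  /* an implementation would operate on lines in [first, after) */
    return after;
}

int main(void)
{
    /* Software can infer the hardware granularity from the produced value. */
    printf("affected region ends at %#lx\n",
           (unsigned long)region_expand(0x1010, 0x10f0));
    return 0;
}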

Destructive operations are a thornier issue, and are resolved by
requiring any partial cachelines (at most 2 -- first and last) to be
prefetched instead of performing the requested operation on those
cachelines. Implementations may perform the destructive operation on
the parts of these cachelines included in the region, or may simply
prefetch them.
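
A companion sketch of the destructive-operation rule, under the same assumed
64-byte granularity: at most the first and last lines can be partial, and
those are prefetched rather than destroyed. The per-line helpers are
placeholders, not proposed operations.

#include <stdint.h>

#define CACHELINE 64u  /* assumed granularity */

/* Placeholder per-line actions an implementation might take. */
static void line_prefetch(uintptr_t line) { (void)line; }
static void line_discard(uintptr_t line)  { (void)line; }

/* Apply a destructive operation to [base, bound] (bound inclusive): lines
 * only partially covered by the region are prefetched instead. */
static void region_destructive(uintptr_t base, uintptr_t bound)
{
    uintptr_t first = base  & ~(uintptr_t)(CACHELINE - 1);
    uintptr_t last  = bound & ~(uintptr_t)(CACHELINE - 1);

    for (uintptr_t line = first; line <= last; line += CACHELINE) {
        int partial_head = (line == first) && (base  != line);
        int partial_tail = (line == last)  && (bound != line + CACHELINE - 1);
        if (partial_head || partial_tail)
            line_prefetch(line);  /* partial line: prefetch rather than destroy */
        else
            line_discard(line);   /* line wholly inside the region */
    }
}

int main(void)
{
    region_destructive(0x1010, 0x10f0);  /* first and last lines here are partial */
    return 0;
}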

If the upper and lower bounds are given in the same register, a
non-destructive operation affects the smallest region it can affect that
includes the lower bound; destructive operations are no-ops in this case.
Otherwise, the upper bound must be greater than the lower bound and the
contrary case is reserved. (Issue for discussion: what happens if the
reserved case is executed?)

Instructions in MISC-MEM/REGION may be implemented as no-ops if an
implementation lacks the corresponding hardware. The value produced in
this case is implementation-defined. (Zero or the base address are good
choices.)
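
If the suggested convention is followed, software could probe for a
functional implementation from the produced value, as in this hedged C
sketch; the wrapper below is a hypothetical stand-in so the example
compiles, not a defined intrinsic.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical wrapper for one MISC-MEM/REGION operation; a real build
 * would use inline assembly once an encoding exists. This stand-in behaves
 * like a non-implemented operation and returns the base address. */
static uintptr_t cache_prefetch0(uintptr_t base, uintptr_t bound)
{
    (void)bound;
    return base;
}

/* Heuristic probe: a no-op implementation is expected to produce zero or
 * the base address, while a functional one produces the first address
 * after the (line-expanded) affected region. */
static int region_ops_functional(void)
{
    static uint64_t probe[16];
    uintptr_t base = (uintptr_t)probe;
    uintptr_t end  = cache_prefetch0(base, base + sizeof probe - 1);
    return end != 0 && end != base;
}

int main(void)
{
    printf("REGION ops functional: %d\n", region_ops_functional());
    return 0;
}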

The new MISC-MEM/REGION space will have room for 128 opcodes, one of
which is the existing FENCE.I. I initially propose:

===Fences===

[function 7'b0000000 is the existing FENCE.I instruction]

[function 7'b0000001 reserved]

FENCE.RD ("range data fence")
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000010}
Perform a conservative fence affecting only data accesses to the chosen
region. This instruction always has visible effects on memory
consistency and is therefore synchronous.

FENCE.RI ("range instruction fence")
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000011}
Perform equivalent of FENCE.I but affecting only instruction fetches
from the chosen region. This instruction always has visible effects on
memory consistency and is therefore synchronous.

===Non-destructive cache control===

====Prefetch====

All prefetch instructions ignore page faults and other access faults.
In general use, applications should use rd == x0 for prefetching,
although this is not required. If a fault occurs during a synchronous
prefetch (rd != x0), the operation must terminate and produce the
faulting address. A fault occurring during an asynchronous prefetch (rd
== x0) may cause the prefetching to stop or the implementation may
attempt to continue prefetching past the faulting location.
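
A usage sketch of the synchronous form (rd != x0) in C: the produced
address tells software where prefetching stopped, so it can skip a faulting
page and continue. The wrapper is again a hypothetical stand-in rather than
an existing intrinsic, and the page size is an assumption.

#include <stdint.h>

#define PAGE_SIZE 4096u  /* assumed page size */

/* Hypothetical stand-in for the proposed CACHE.PREFETCH0 instruction: it
 * returns the first address after the highest address prefetched, or the
 * faulting address if the synchronous operation terminated early. This
 * model simply pretends the whole region was prefetched. */
static uintptr_t cache_prefetch0(uintptr_t base, uintptr_t bound)
{
    (void)base;
    return bound + 1;
}

/* Prefetch [base, bound], skipping over any pages that fault. */
static void prefetch_region(uintptr_t base, uintptr_t bound)
{
    uintptr_t cur = base;
    while (cur <= bound) {
        uintptr_t next = cache_prefetch0(cur, bound);
        if (next > bound)
            break;               /* whole region reached */
        if (next < cur)
            break;               /* non-implemented (e.g. produced zero): give up */
        /* next is the faulting address: resume at the following page. */
        cur = (next | (uintptr_t)(PAGE_SIZE - 1)) + 1;
    }
}

int main(void)
{
    static char buf[3 * PAGE_SIZE];
    prefetch_region((uintptr_t)buf, (uintptr_t)buf + sizeof buf - 1);
    return 0;
}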

CACHE.PREFETCH0 - CACHE.PREFETCH3
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000100}
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000101}
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000110}
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000111}
Load as much of the chosen region as possible into the data cache, with
varying levels of expected temporal access locality.

CACHE.PREFETCH.ONCE
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001000}
Prefetch as much of the region as possible, but expect the prefetched
data to be used at most once.

CACHE.PREFETCH.I
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001001}
Load as much of the chosen region as possible into the instruction cache.


====Cacheline pinning====

??? Issue for discussion: should a page fault while pinning cachelines
cause a trap to be taken?

CACHE.PIN
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001010}
Load as much of the chosen region as possible into the data cache and
keep it there until unpinned. Pinning a cacheline is idempotent.

CACHE.UNPIN
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001011}
Explicitly release a pin set with CACHE.PIN. Pinned cachelines are also
implicitly released if the memory protection and virtual address mapping
is changed. (Specifically, a write to the current satp CSR or an
SFENCE.VM will unpin all cachelines as a side effect, unless the
implementation partitions its cache by ASID. Even with ASID-partitioned
caches, changing the root page number associated with an ASID unpins all
cachelines belonging to that ASID.) Unpinning a cacheline does not
immediately remove it from the cache.
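
An intended-use sketch in C: pin a lookup table so it stays resident while
a larger buffer is streamed through the cache. The cache_pin/cache_unpin
wrappers are hypothetical stand-ins for the proposed instructions.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the proposed CACHE.PIN / CACHE.UNPIN
 * instructions, so the sketch compiles without the extension. */
static uintptr_t cache_pin(uintptr_t base, uintptr_t bound)   { (void)base; return bound + 1; }
static uintptr_t cache_unpin(uintptr_t base, uintptr_t bound) { (void)base; return bound + 1; }

static uint8_t lut[256];  /* lookup table to keep resident in the data cache */

static void translate(uint8_t *data, size_t len)
{
    uintptr_t lo = (uintptr_t)lut;
    uintptr_t hi = lo + sizeof lut - 1;

    cache_pin(lo, hi);               /* hold the whole table in the data cache */
    for (size_t i = 0; i < len; i++)
        data[i] = lut[data[i]];      /* hot loop benefits from the pinned table */
    cache_unpin(lo, hi);             /* release the pin; lines may remain cached */
}

int main(void)
{
    uint8_t buf[1024] = {0};
    translate(buf, sizeof buf);
    return 0;
}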

And two M-mode-only privileged instructions:

CACHE.PIN.I
{opcode, funct3, funct7, MODE} = {$MISC-MEM, $REGION, 7'b1001010, 2'b11}
Load as much of the chosen region as possible into the instruction cache
and keep it there. Pinning a cacheline is idempotent.

CACHE.UNPIN.I
{opcode, funct3, funct7, MODE} = {$MISC-MEM, $REGION, 7'b1001011, 2'b11}
Release instruction cachelines pinned with CACHE.PIN.I. Pins are
idempotent. One unpin instruction will unpin all affected cachelines
completely, regardless of how many times they were pinned.

====Flush====

CACHE.WRITE
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001100}
Write back any dirty cachelines in the requested region, leaving them
valid and clean. An ordinary or ranged fence can invalidate the
cachelines afterwards if needed.

[function 7'b0001101 reserved for write-and-invalidate if using fences
proves unconvincing]


===Destructive cache control===

The destructive operations are performance optimizations when the
current contents of a region are unimportant. Both are no-ops if an
implementation does not have caches.

CACHE.DISCARD
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001110}
Drop the requested region from the cache without writing dirty
cachelines to memory. This declares the requested region to be "don't
care" and its contents are undefined after the operation completes. If
the region contains partial cachelines, those cachelines are written and
invalidated, but cachelines entirely within the region are simply
invalidated.
NOTE WELL: This is *not* an "undo" operation for memory writes -- an
implementation is permitted to aggressively writeback dirty cachelines,
or even to omit caches entirely.

CACHE.PREZERO
{opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001111}
Allocate cachelines for the requested region, marking them dirty, but do
not fetch data from memory, instead initializing the contents to zero.
If the requested region contains partial cachelines, those cachelines
*are* fetched from memory.
TLB fills occur normally as for writes to the requested region. Page
faults in the middle of the region cause this operation to stop and
produce the faulting address. A page fault at the beginning of the
operation causes a trap to be taken.
NOTE WELL: This is *not* memset(3) -- a cacheline already present will
*not* be affected.
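
A usage sketch of the intent in C: prepare a destination that is about to
be completely overwritten, so the cache need not fetch its old contents
first. The cache_prezero wrapper is a hypothetical stand-in, and whether
any lines are actually zeroed remains implementation-dependent.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the proposed CACHE.PREZERO instruction. */
static uintptr_t cache_prezero(uintptr_t base, uintptr_t bound) { (void)base; return bound + 1; }

/* Copy src over dst when every byte of dst will be rewritten: PREZERO lets
 * the cache allocate dst's lines without reading their old contents. */
static void copy_overwrite(uint8_t *dst, const uint8_t *src, size_t len)
{
    if (len == 0)
        return;
    cache_prezero((uintptr_t)dst, (uintptr_t)dst + len - 1);
    memcpy(dst, src, len);   /* the entire region is overwritten afterwards */
}

int main(void)
{
    static uint8_t a[4096], b[4096];
    copy_overwrite(b, a, sizeof b);
    return 0;
}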


=== ===

Thoughts?

Thanks to:
[draft 1]
Bruce Hoult for citing a problem with the HiFive board that inspired
the I-cache pins.
[draft 2]
Stefan O'Rear for suggesting the produced value should point to the
first address after the affected region.
Alex Elsayed for pointing out serious problems with expanding the
region for a destructive operation and suggesting that "backwards"
bounds be left reserved.
Guy Lemieux for pointing out that pinning was insufficiently specified.
Andrew Waterman for suggesting that MISC-MEM/REGION could be encoded
around the existing FENCE.I instruction.
Allen Baum for pointing out the incomplete handling of page faults.


-- Jacob

Stefan O'Rear

Jun 16, 2017, 11:17:37 PM
to Jacob Bachmeyer, RISC-V ISA Dev
On Fri, Jun 16, 2017 at 5:42 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> Recent discussions have suggested that explicit cache-control instructions
> could be useful, but RISC-V has some constraints here that other
> architectures lack, namely that caching must be transparent to the user ISA.

If there is to be a standard extension for interoperable
software-managed caching, I would rather see it developed and
championed by representatives of two or more teams developing systems
that require it.

-s

Michael Clark

Jun 17, 2017, 12:00:24 AM
to jcb6...@gmail.com, isa...@groups.riscv.org

> On 17 Jun 2017, at 12:42 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
> Recent discussions have suggested that explicit cache-control instructions could be useful, but RISC-V has some constraints here that other architectures lack, namely that caching must be transparent to the user ISA.
>
> I propose a new minor opcode REGION := 3'b001 within the existing MISC-MEM major opcode. Instructions in REGION are R-type and use rs1 to indicate a base address, rs2 to indicate an upper bound address, and produce a result in rd that is the first address after the highest address affected by the operation. If rd is x0, the instruction has no directly visible effects and can be executed entirely asynchronously as an implementation option.
>
> Non-destructive operations permit an implementations to expand a requested region on both ends to meet hardware granularity for the operation. An application can infer alignment from the produced value if it is a concern. As a practical matter, cacheline lengths are expected to be declared in the processor configuration ROM.
>
> Destructive operations are a thornier issue, and are resolved by requiring any partial cachelines (at most 2 -- first and last) to be prefetched instead of performing the requested operation on those cachelines. Implementations may perform the destructive operation on the parts of these cachelines included in the region, or may simply prefetch them.
>
> If the upper and lower bounds are the same register, the smallest region that can be affected that includes the lower bound is affected if the operation is non-destructive; destructive operations are no-ops. Otherwise, the upper bound must be greater than the lower bound and the contrary case is reserved. (Issue for discussion: what happens if the reserved case is executed?)
>
> Instructions in MISC-MEM/REGION may be implemented as no-ops if an implementation lacks the corresponding hardware. The value produced in this case is implementation-defined. (Zero or the base address are good choices.)
>
> The new MISC-MEM/REGION space will have room for 128 opcodes, one of which is the existing FENCE.I. I initially propose:
>
> ===Fences===
>
> [function 7'b0000000 is the existing FENCE.I instruction]
>
> [function 7'b0000001 reserved]
>
> FENCE.RD ("range data fence")
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000010}
> Perform a conservative fence affecting only data accesses to the chosen region. This instruction always has visible effects on memory consistency and is therefore synchronous.

The text should mention more semantics, but not implementation details. You refer to fence as a flush operation later in the text. I’d prefer we were more “explicit”

In an implementation with data caches, the data cache state for the range specified by rs1 and rd should be ‘invalid’.

Given this is explicit cache management I prefer CACHE.FLUSH and to move it alongside the other explicit cache control instructions. While it is a form of FENCE, the goal here is explicit cache control.

Perhaps CACHE.FLUSH or CACHE.FENCE (to use the CACHE prefix consistently for this set of instructions).

This is a FLUSH operation? e.g. Write and invalidate.

> FENCE.RI ("range instruction fence")
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000011}
> Perform equivalent of FENCE.I but affecting only instruction fetches from the chosen region. This instruction always has visible effects on memory consistency and is therefore synchronous.

Perhaps CACHE.FENCE.I

> ===Non-destructive cache control===
>
> ====Prefetch====
>
> All prefetch instructions ignore page faults and other access faults. In general use, applications should use rd == x0 for prefetching, although this is not required. If a fault occurs during a synchronous prefetch (rd != x0), the operation must terminate and produce the faulting address. A fault occurring during an asynchronous prefetch (rd == x0) may cause the prefetching to stop or the implementation may attempt to continue prefetching past the faulting location.
>
> CACHE.PREFETCH0 - CACHE.PREFETCH3
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000100}
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000101}
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000110}
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000111}
> Load as much of the chosen region as possible into the data cache, with varying levels of expected temporal access locality.

The semantics of temporal access need to be described in commentary, e.g. the temporal access level may in an implementation correspond to a cache tier: code executing CACHE.PREFETCH1 expects prefetched memory to be available in the L1 cache, CACHE.PREFETCH2 expects prefetched memory to be available in L2, etc. Suggest the numbers have symmetry with well-known cache tiers, as this is all I can think of for temporal correspondence; however, the commentary should note that this is one possible implementation and that the instructions may be no-ops. Otherwise, add commentary on the potential implementation-defined semantics of the temporal tiers and what they might correspond to in an implementation, i.e. 1 is higher temporal priority than 2, and subsequent accesses are expected to have less latency.

Perhaps CACHE.PREFETCH1 - CACHE.PREFETCH4 (for symmetry with cache level nomenclature)

> CACHE.PREFETCH.ONCE
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001000}
> Prefetch as much of the region as possible, but expect the prefetched data to be used at most once.
>
> CACHE.PREFETCH.I
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001001}
> Load as much of the chosen region as possible into the instruction cache.
>
>
> ====Cacheline pinning====
>
> ??? Issue for discussion: should a page fault while pinning cachelines cause a trap to be taken?
>
> CACHE.PIN
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001010}
> Load as much of the chosen region as possible into the data cache and keep it there until unpinned. Pinning a cacheline is idempotent.
>
> CACHE.UNPIN
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001011}
> Explicitly release a pin set with CACHE.PIN. Pinned cachelines are also implicitly released if the memory protection and virtual address mapping is changed. (Specifically, a write to the current satp CSR or an SFENCE.VM will unpin all cachelines as a side effect, unless the implementation partitions its cache by ASID. Even with ASID-partitioned caches, changing the root page number associated with an ASID unpins all cachelines belonging to that ASID.) Unpinning a cacheline does not immediately remove it from the cache.

I’ve always been fond of the idea of being able to pin data into L1 cache i.e. a bit on one or more of the ways.

This would be useful for cryptographic algorithms that want constant time behaviour and can control their cache working set size.

> And two M-mode-only privileged instructions:
>
> CACHE.PIN.I
> {opcode, funct3, funct7, MODE} = {$MISC-MEM, $REGION, 7'b1001010, 2'b11}
> Load as much of the chosen region as possible into the instruction cache and keep it there. Pinning a cacheline is idempotent.
>
> CACHE.UNPIN.I
> {opcode, funct3, funct7, MODE} = {$MISC-MEM, $REGION, 7'b1001011, 2'b11}
> Release instruction cachelines pinned with CACHE.PIN.I. Pins are idempotent. One unpin instruction will unpin all affected cachelines completely, regardless of how many times they were pinned.
>
> ====Flush====
>
> CACHE.WRITE
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001100}
> Write any cachelines in the requested region. An ordinary or ranged fence can invalidate the cachelines if needed.

In an implementation with data caches, the data cache state for the range specified by rs1 and rd should be ‘valid’ and ‘clean’ after this transaction.

This is a WRITEBACK operation? e.g. Write and mark clean.

> [function 7'b0001101 reserved for write-and-invalidate if using fences proves unconvincing]
>
>
> ===Destructive cache control===
>
> The destructive operations are performance optimizations when the current contents of a region are unimportant. Both are no-ops if an implementation does not have caches.
>
> CACHE.DISCARD
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001110}
> Drop the requested region from the cache without writing dirty cachelines to memory. This declares the requested region to be "don't care" and its contents are undefined after the operation completes. If the region contains partial cachelines, those cachelines are written and invalidated, but cachelines entirely within the region are simply invalidated.
> NOTE WELL: This is *not* an “undo" operation for memory writes -- an implementation is permitted to aggressively writeback dirty cachelines, or even to omit caches entirely.

Consider CACHE.INVALIDATE

In an implementation with data caches, the data cache state for the range specified by rs1 and rd should be ‘invalid’ after this transaction.

> CACHE.PREZERO
> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001111}
> Allocate cachelines for the requested region, marking them dirty, but do not fetch data from memory, instead initializing the contents to zero. If the requested region contains partial cachelines, those cachelines *are* fetched from memory.
> TLB fills occur normally as for writes to the requested region. Page faults in the middle of the region cause this operation to stop and produce the faulting address. A page fault at the beginning of the operation causes a trap to be taken.
> NOTE WELL: This is *not* memset(3) -- a cacheline already present will *not* be affected.
>
>
> === ===
>
> Thoughts?

Permissions bits. Perhaps a “cachectrl” CSR with 2 bits for each operation, indicating the lowest mode that can use the instruction, then an OS can decide whether it gives cache control to U mode.

> Thanks to:
> [draft 1]
> Bruce Hoult for citing a problem with the HiFive board that inspired the I-cache pins.
> [draft 2]
> Stefan O'Rear for suggesting the produced value should point to the first address after the affected region.
> Alex Elsayed for pointing out serious problems with expanding the region for a destructive operation and suggesting that "backwards" bounds be left reserved.
> Guy Lemieux for pointing out that pinning was insufficiently specified.
> Andrew Waterman for suggesting that MISC-MEM/REGION could be encoded around the existing FENCE.I instruction.
> Allen Baum for pointing out the incomplete handling of page faults.
>
>
> -- Jacob
>

Guy Lemieux

Jun 17, 2017, 1:34:03 AM
to jcb6...@gmail.com, isa...@groups.riscv.org
FENCE.I and FENCE.RI can share the same opcode. The only difference is that FENCE.I operates on the whole memory range. This can be conveniently encoded as rs1=r0 and rs2=r0, or as rs1=0 and rs2=-1.


FENCE.RD cannot replace FLUSH. a fence does not necessarily have to flush a data cache if it has a cache coherence protocol. a FLUSH must always write back and mark invalid. please add a FLUSH/CACHE.FLUSH instruction.

Equivalent from my proposal:
CACHE.WRITE = WRITEBACK
CACHE.DISCARD = INVALIDATE

I agree that when memory range does not perfectly align with cache lines, INVALIDATE should also write back up to 2 dirty lines: rs1 if dirty and unaligned, and rs2 if dirty and unaligned. This will preserve correctness for applications. However it will make implementations complex because the WRITEBACK of a partial line is nontrivial and not directly supported by protocols such as AXI for bursts. The writes have to be done byte by byte, half word by halfword, or word by word, depending on the misalignment. I think INVALIDATE is the only transaction with this requirement to do something about misalignment for correctness, so we may wish to think hard about this.

I hesitate on the PREZERO proposal because DMA hardware could do the same thing. It's different than the other cache operations -- this is a "nice to have that has other implementations", whereas the others have almost no other way to be constructed. I advise caution on this because cacheless systems can't simply treat it as a NOP like the other cache instructions -- it has important side effects.

Stefan wants at least 2 different groups to advocate for this proposal -- please consider me/VectorBlox the second party. We are currently deciding exactly how to do cache control. However, I have not been thinking of prefetching or pinning, and so I'd like a third advocate to step forward here.

Sincerely,
Guy



--
Embedded Supercomputing
http://www.vectorblox.com

Michael Clark

Jun 17, 2017, 3:01:15 AM
to Guy Lemieux, jcb6...@gmail.com, isa...@groups.riscv.org

> On 17 Jun 2017, at 5:33 PM, Guy Lemieux <glem...@vectorblox.com> wrote:
>
> FENCE.I and FENCE.RI can share the same opcode. The only difference is that FENCE.I operated on the whole memory range. this can be conveniently encoded as rs1=r0 and rs2=r0, or as rs1=0 and rs2=-1.
>
>
> FENCE.RD cannot replace FLUSH. a fence does not necessarily have to flush a data cache if it has a cache coherence protocol. a FLUSH must always write back and mark invalid. please add a FLUSH/CACHE.FLUSH instruction.
>
> Equivalent from my proposal:
> CACHE.WRITE = WRITEBACK
> CACHE.DISCARD = INVALIDATE

I was confused by that too, so we in fact have:

- FENCE.RD
- FENCE.RI
- CACHE.WRITE
- CACHE.FLUSH
- CACHE.DISCARD

> I agree that when memory range does not perfectly align with cache lines, INVALIDATE should also write back up to 2 dirty lines: rs1 if dirty and unaligned, and rs2 if dirty and unaligned. This will preserve correctness for applications. However it will make implementations complex because the WRITEBACK of a partial line is nontrivial and not directly supported by protocols such as AXI for bursts. The writes have to be done byte by byte, half word by halfword, or word by word, depending on the misalignment. I think INVALIDATE is the only transaction with this requirement to do something about misalignment for correctness, so we may wish to think hard about this.
>
> I hesitate on the PREZERO proposal because DMA hardware could do the same thing. It's different than the other cache operations -- this is a "nice to have that has other implementations", whereas the others have almost no other way to be constructed. I advise caution on this because cacheless systems can’t simply treat it as a NOP like the other cache instructions -- it has important side effects.

PREZERO would be a very useful feature.

While DMA hardware /may/ be able to do this, does it in practice? OS kernels have to zero all pages returned to the user and in practice they will resort to using vector stores or, as they do now, packed SIMD (SSE) for performance reasons. All allocated pages (user or kernel) have to be zeroed and I’m not aware of kernels that use DMA to do this. memset and bzero likely dominate quite a few profiles.

A load store unit in a single issue CPU is going to take 8 sequential 64-bit stores to zero a 64-bit cache line, but I suspect a cache could mux or broadcast a zero line in one cache transaction, e.g. if the width between the L1 and L2 or the memory system is cache line width.

PREZERO is worth considering. Perhaps some of the instructions can be optional and revealed via read-only feature bits in a “cachectl” CSR. The permission issue also needs to be solved as some OSes may wish to use sysctls or capabilities to allow userspace access to cache invalidation functions. For various reasons, on some platforms it may be restricted to Supervisors. I believe a CSR for the feature will be required, much like the masking of performance counters with mcounteren.

Michael Clark

Jun 17, 2017, 3:04:51 AM
to Guy Lemieux, jcb6...@gmail.com, isa...@groups.riscv.org

On 17 Jun 2017, at 7:01 PM, Michael Clark <michae...@mac.com> wrote:

A load store unit in a single issue CPU is going to take 8 sequential 64-bit stores to zero a 64-bit cache line, but I suspect a cache could mux or broadcast a zero line in one cache transaction, e.g. if the width between the L1 and L2 or the memory system is cache line width.

64-byte cache line. Getting my bits and bytes mixed up.

Jacob Bachmeyer

Jun 18, 2017, 8:30:14 PM
to Michael Clark, isa...@groups.riscv.org
Michael Clark wrote:
>> On 17 Jun 2017, at 12:42 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>>
>> Recent discussions have suggested that explicit cache-control instructions could be useful, but RISC-V has some constraints here that other architectures lack, namely that caching must be transparent to the user ISA.
>>
>> I propose a new minor opcode REGION := 3'b001 within the existing MISC-MEM major opcode. Instructions in REGION are R-type and use rs1 to indicate a base address, rs2 to indicate an upper bound address, and produce a result in rd that is the first address after the highest address affected by the operation. If rd is x0, the instruction has no directly visible effects and can be executed entirely asynchronously as an implementation option.
>>
>> Non-destructive operations permit an implementations to expand a requested region on both ends to meet hardware granularity for the operation. An application can infer alignment from the produced value if it is a concern. As a practical matter, cacheline lengths are expected to be declared in the processor configuration ROM.
>>
>> Destructive operations are a thornier issue, and are resolved by requiring any partial cachelines (at most 2 -- first and last) to be prefetched instead of performing the requested operation on those cachelines. Implementations may perform the destructive operation on the parts of these cachelines included in the region, or may simply prefetch them.
>>
>> If the upper and lower bounds are the same register, the smallest region that can be affected that includes the lower bound is affected if the operation is non-destructive; destructive operations are no-ops. Otherwise, the upper bound must be greater than the lower bound and the contrary case is reserved. (Issue for discussion: what happens if the reserved case is executed?)
>>
>> Instructions in MISC-MEM/REGION may be implemented as no-ops if an implementation lacks the corresponding hardware. The value produced in this case is implementation-defined. (Zero or the base address are good choices.)
>>
>> The new MISC-MEM/REGION space will have room for 128 opcodes, one of which is the existing FENCE.I. I initially propose:
>>
>> ===Fences===
>>
>> [function 7'b0000000 is the existing FENCE.I instruction]
>>
>> [function 7'b0000001 reserved]
>>
>> FENCE.RD ("range data fence")
>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000010}
>> Perform a conservative fence affecting only data accesses to the chosen region. This instruction always has visible effects on memory consistency and is therefore synchronous.
>>
>
> The text should mention more semantics, but not implementation details. You refer to fence as a flush operation later in the text. I’d prefer we were more “explicit”
>
> In an implementation with data caches, the data cache state for the range specified by rs1 and rd should be ‘invalid’.
>
> Given this is explicit cache management I prefer CACHE.FLUSH and to move it alongside the other explicit cache control instructions. While it is a form of FENCE, the goal here is explicit cache control.
>
> Perhaps CACHE.FLUSH or CACHE.FENCE (to use the CACHE prefix consistently for this set of instructions).
>
> This is a FLUSH operation? e.g. Write and invalidate.
>

Fences *can* be implemented as flushes, and a cache flush is effectively
a fence, but "fence" and "flush" are distinct operations.

>> FENCE.RI ("range instruction fence")
>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000011}
>> Perform equivalent of FENCE.I but affecting only instruction fetches from the chosen region. This instruction always has visible effects on memory consistency and is therefore synchronous.
>>
>
> Perhaps CACHE.FENCE.I
>

The catch is that fences are fences, not cache-control. The ranged
fences are effectively FENCE and FENCE.I that only affect part of the
address space instead of all memory.

>> ===Non-destructive cache control===
>>
>> ====Prefetch====
>>
>> All prefetch instructions ignore page faults and other access faults. In general use, applications should use rd == x0 for prefetching, although this is not required. If a fault occurs during a synchronous prefetch (rd != x0), the operation must terminate and produce the faulting address. A fault occurring during an asynchronous prefetch (rd == x0) may cause the prefetching to stop or the implementation may attempt to continue prefetching past the faulting location.
>>
>> CACHE.PREFETCH0 - CACHE.PREFETCH3
>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000100}
>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000101}
>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000110}
>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0000111}
>> Load as much of the chosen region as possible into the data cache, with varying levels of expected temporal access locality.
>>
>
> The semantics of temporal access need to be described in commentary, e.g. the temporal access level may in an implementation correspond to a cache tier: code executing CACHE.PREFETCH1 expects prefetched memory to be available in the L1 cache, CACHE.PREFETCH2 expects prefetched memory to be available in L2, etc. Suggest the numbers have symmetry with well-known cache tiers, as this is all I can think of for temporal correspondence; however, the commentary should note that this is one possible implementation and that the instructions may be no-ops. Otherwise, add commentary on the potential implementation-defined semantics of the temporal tiers and what they might correspond to in an implementation, i.e. 1 is higher temporal priority than 2, and subsequent accesses are expected to have less latency.
>
> Perhaps CACHE.PREFETCH1 - CACHE.PREFETCH4 (for symmetry with cache level nomenclature)
>

This was kept ambiguous because I am unsure of the best way to arrange
these.

>> CACHE.PREFETCH.ONCE
>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001000}
>> Prefetch as much of the region as possible, but expect the prefetched data to be used at most once.
>>
>> CACHE.PREFETCH.I
>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001001}
>> Load as much of the chosen region as possible into the instruction cache.
>>
>>
>> ====Cacheline pinning====
>>
>> ??? Issue for discussion: should a page fault while pinning cachelines cause a trap to be taken?
>>
>> CACHE.PIN
>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001010}
>> Load as much of the chosen region as possible into the data cache and keep it there until unpinned. Pinning a cacheline is idempotent.
>>
>> CACHE.UNPIN
>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001011}
>> Explicitly release a pin set with CACHE.PIN. Pinned cachelines are also implicitly released if the memory protection and virtual address mapping is changed. (Specifically, a write to the current satp CSR or an SFENCE.VM will unpin all cachelines as a side effect, unless the implementation partitions its cache by ASID. Even with ASID-partitioned caches, changing the root page number associated with an ASID unpins all cachelines belonging to that ASID.) Unpinning a cacheline does not immediately remove it from the cache.
>>
>
> I’ve always been fond of the idea of being able to pin data into L1 cache i.e. a bit on one or more of the ways.
>
> This would be useful for cryptographic algorithms that want constant time behaviour and can control their cache working set size.
>

Cache pinning is intended to allow programs to keep, for example, a
complete lookup table in the cache while processing a larger amount of data.

Interactions between synchronous prefetch and pinning are interesting to
consider. In particular, if data is explicitly prefetched into L2 cache
and then pinned, should the pin be applied to the L2 cache, or should
the data be moved into the L1 cache? Probably best to leave this as an
implementation option.

>> And two M-mode-only privileged instructions:
>>
>> CACHE.PIN.I
>> {opcode, funct3, funct7, MODE} = {$MISC-MEM, $REGION, 7'b1001010, 2'b11}
>> Load as much of the chosen region as possible into the instruction cache and keep it there. Pinning a cacheline is idempotent.
>>
>> CACHE.UNPIN.I
>> {opcode, funct3, funct7, MODE} = {$MISC-MEM, $REGION, 7'b1001011, 2'b11}
>> Release instruction cachelines pinned with CACHE.PIN.I. Pins are idempotent. One unpin instruction will unpin all affected cachelines completely, regardless of how many times they were pinned.
>>
>> ====Flush====
>>
>> CACHE.WRITE
>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001100}
>> Write any cachelines in the requested region. An ordinary or ranged fence can invalidate the cachelines if needed.
>>
>
> In an implementation with data caches, the data cache state for the range specified by rs1 and rd should be ‘valid’ and ‘clean’ after this transaction.
>
> This is a WRITEBACK operation? e.g. Write and mark clean.
>

Correct, however "valid" and "clean" lines can also be dropped from the
cache at any time.

>> [function 7'b0001101 reserved for write-and-invalidate if using fences proves unconvincing]
>>
>>
>> ===Destructive cache control===
>>
>> The destructive operations are performance optimizations when the current contents of a region are unimportant. Both are no-ops if an implementation does not have caches.
>>
>> CACHE.DISCARD
>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001110}
>> Drop the requested region from the cache without writing dirty cachelines to memory. This declares the requested region to be "don't care" and its contents are undefined after the operation completes. If the region contains partial cachelines, those cachelines are written and invalidated, but cachelines entirely within the region are simply invalidated.
>> NOTE WELL: This is *not* an “undo" operation for memory writes -- an implementation is permitted to aggressively writeback dirty cachelines, or even to omit caches entirely.
>>
>
> Consider CACHE.INVALIDATE
>
> In an implementation with data caches, the data cache state for the range specified by rs1 and rd should be ‘invalid’ after this transaction.
>

I suggested naming it "discard" because it discards data and
"invalidate" is harder to type. :)

>> CACHE.PREZERO
>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001111}
>> Allocate cachelines for the requested region, marking them dirty, but do not fetch data from memory, instead initializing the contents to zero. If the requested region contains partial cachelines, those cachelines *are* fetched from memory.
>> TLB fills occur normally as for writes to the requested region. Page faults in the middle of the region cause this operation to stop and produce the faulting address. A page fault at the beginning of the operation causes a trap to be taken.
>> NOTE WELL: This is *not* memset(3) -- a cacheline already present will *not* be affected.
>>
>>
>> === ===
>>
>> Thoughts?
>>
>
> Permissions bits. Perhaps a “cachectrl” CSR with 2 bits for each operation, indicating the lowest mode that can use the instruction, then an OS can decide whether it gives cache control to U mode.
>

An interesting idea, but I believe that cache operations should be
inherently unprivileged (excepting I-cache line pinning, due to its
expected use case) and should not give a user program any more control
than it already can have by carefully choosing its memory access patterns.


-- Jacob

Jacob Bachmeyer

Jun 18, 2017, 8:52:26 PM
to Guy Lemieux, isa...@groups.riscv.org
Guy Lemieux wrote:
> FENCE.I and FENCE.RI can share the same opcode. The only difference is that FENCE.I operated on the whole memory range. this can be conveniently encoded as rs1=r0 and rs2=r0, or as rs1=0 and rs2=-1.
>
>
> FENCE.RD cannot replace FLUSH. a fence does not necessarily have to flush a data cache if it has a cache coherence protocol. a FLUSH must always write back and mark invalid. please add a FLUSH/CACHE.FLUSH instruction.
>

FENCE.RD does not replace FLUSH, but I suspect that a
CACHE.WRITE/FENCE.RD sequence could be equivalent to CACHE.FLUSH. The
CACHE.WRITE forces writeback and the FENCE.RD requires that a subsequent
read return a new result if the memory location has been written from
another source, effectively forcing the cacheline to be invalidated if
it matters. Do I misunderstand fence operation semantics? Is having this
as a single instruction important?

> Equivalent from my proposal:
> CACHE.WRITE = WRITEBACK
> CACHE.DISCARD = INVALIDATE
>

Those are intended equivalences. It is good to see that you agree they
are equivalent.

> I agree that when memory range does not perfectly align with cache lines, INVALIDATE should also write back up to 2 dirty lines: rs1 if dirty and unaligned, and rs2 if dirty and unaligned. This will preserve correctness for applications. However it will make implementations complex because the WRITEBACK of a partial line is nontrivial and not directly supported by protocols such as AXI for bursts. The writes have to be done byte by byte, half word by halfword, or word by word, depending on the misalignment. I think INVALIDATE is the only transaction with this requirement to do something about misalignment for correctness, so we may wish to think hard about this.
>

There is no writeback of a partial line -- only complete lines can be
written back. The wording here probably needs to be improved; the intent
is that a cacheline partially included in the region will be written
back in its entirety. The CACHE.DISCARD instruction writes the entire
cacheline if any part of it is *not* in the region; only cachelines
entirely within the region are actually discarded. This is correct
because cachelines can also be written back at any time, so the writes
by CACHE.DISCARD can be said to have occurred just before CACHE.DISCARD
was executed.

> I hesitate on the PREZERO proposal because DMA hardware could do the same thing. It's different than the other cache operations -- this is a "nice to have that has other implementations", whereas the others have almost no other way to be constructed. I advise caution on this because cacheless systems can't simply treat it as a NOP like the other cache instructions -- it has important side effects.
>

The proposed CACHE.PREZERO instruction is explicitly *not* memset(3) for
this reason. Its side effects are explicitly *not* guaranteed to happen.
(Indeed, it does nothing at all if the region is already in the cache.)
All CACHE.PREZERO does is ensure that the region is in cache and ready
for write. CACHE.PREZERO is effectively a prefetch operation that can
avoid actually reading memory. Cacheless systems *can* treat it as a
NOP; how best to clarify this?

On that note, should a "prefetch-for-write" operation be added, or one
of the existing prefetch levels be changed to also acquire any cache
coherency locks that are needed before a cacheline can be updated? Or
does the RISC-V memory model explicitly not guarantee coherency on that
level, such that "prefetch-for-write" is useless in RISC-V?

> Stefan wants at least 2 different groups to advocate for this proposal -- please consider me/VectorBlox the second party. We are currently deciding exactly how to do cache control. However, I have not been thinking of prefetching or pinning, and so I'd like a third advocate to step forward here.
>

Agreed. I would also like to see a second implementor who wants/needs
this; I am just pushing discussion along here.


-- Jacob


Jacob Bachmeyer

Jun 18, 2017, 9:07:34 PM
to Michael Clark, Guy Lemieux, isa...@groups.riscv.org
Michael Clark wrote:
>> On 17 Jun 2017, at 5:33 PM, Guy Lemieux <glem...@vectorblox.com> wrote:
>>
>> FENCE.I and FENCE.RI can share the same opcode. The only difference is that FENCE.I operated on the whole memory range. this can be conveniently encoded as rs1=r0 and rs2=r0, or as rs1=0 and rs2=-1.
>>
>>
>> FENCE.RD cannot replace FLUSH. a fence does not necessarily have to flush a data cache if it has a cache coherence protocol. a FLUSH must always write back and mark invalid. please add a FLUSH/CACHE.FLUSH instruction.
>>
>> Equivalent from my proposal:
>> CACHE.WRITE = WRITEBACK
>> CACHE.DISCARD = INVALIDATE
>>
>
> I was confused by that too, so we in fact have:
>
> - FENCE.RD
> - FENCE.RI
>

These first two are *not* really explicit cache-control; they are
variants of FENCE and FENCE.I that only affect a region rather than the
entire address space. They are included because they use the same
instruction format.

> - CACHE.WRITE
> - CACHE.FLUSH
> - CACHE.DISCARD
>

These are explicit cache-control, however I am currently suggesting that
CACHE.FLUSH can be provided by CACHE.WRITE followed by FENCE.RD. I am
looking for a good reason that CACHE.FLUSH actually needs its own opcode
rather than using CACHE.WRITE/FENCE.RD. (This is the reason that draft
2 reserves a function code for CACHE.FLUSH; I am not certain of my
position on this, but I want the list archive to have a good argument
against it.)

>> I agree that when memory range does not perfectly align with cache lines, INVALIDATE should also write back up to 2 dirty lines: rs1 if dirty and unaligned, and rs2 if dirty and unaligned. This will preserve correctness for applications. However it will make implementations complex because the WRITEBACK of a partial line is nontrivial and not directly supported by protocols such as AXI for bursts. The writes have to be done byte by byte, half word by halfword, or word by word, depending on the misalignment. I think INVALIDATE is the only transaction with this requirement to do something about misalignment for correctness, so we may wish to think hard about this.
>>
>> I hesitate on the PREZERO proposal because DMA hardware could do the same thing. It's different than the other cache operations -- this is a "nice to have that has other implementations", whereas the others have almost no other way to be constructed. I advise caution on this because cacheless systems can’t simply treat it as a NOP like the other cache instructions -- it has important side effects.
>>
>
> PREZERO would be a very useful feature.
>
> While DMA hardware /may/ be able to do this, does it in practice? OS kernels have to zero all pages returned to the user and in practice they will resort to using vector stores, or as they do now; packed SIMD (SSE) for performance reasons. All allocated pages (user or kernel) have to be zeroed and I’m not aware of kernels that use DMA to do this. memset and bzero likely dominate quite a few profiles.
>
> A load store unit in a single issue CPU is going to take 8 sequential 64-bit stores to zero a 64-bit cache line, but I suspect a cache could mux or broadcast a zero line in one cache transaction, e.g. if the width between the L1 and L2 or the memory system is cache line width.
>
> PREZERO is worth considering. Perhaps some of the instructions can be optional and revealed via read-only feature bits in a “cachectl” CSR. The permission issue also needs to be solved as some OSes may wish to use sysctls or capabilities to allow userspace access to cache invalidation functions. For various reasons, on some platforms it may be restricted to Supervisors. I believe a CSR for the feature will be required, much like the masking of performance counters with mcounteren.
>

The purpose of CACHE.PREZERO is to "prefetch" cachelines when the actual
current value is "don't care" because the entire region is about to be
overwritten. Any cachelines already in the cache are unaffected.
However, supplying zero instead of doing a memory read avoids the bus
traffic to actually read memory if the region is not already in the
cache -- bus traffic that is useless if the CPU is about to overwrite
the region.

This operation probably needs a better name. Any ideas?

All of the cache-control instructions can be "non-implemented" by simply
always transferring rs1 to rd, indicating that nothing was done.


-- Jacob

Guy Lemieux

Jun 18, 2017, 9:49:00 PM
to jcb6...@gmail.com, Michael Clark, isa...@groups.riscv.org
On Jun 18, 2017, at 8:07 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
" I am looking for a good reason that CACHE.FLUSH actually needs its own opcode rather than using CACHE.WRITE/FENCE.RD."

sorry to repeat myself, but FLUSH is not equivalent to WRITEBACK+FENCE.RD. please add FLUSH.

In either a coherent or non-coherent system:

FLUSH: in one pass, each data cache line is examined. dirty lines are written back. all lines are marked invalid.

WRITEBACK: in one pass, each data cache line is examined. dirty lines are written back and marked valid.

In a coherent system:

FENCE: ensure write buffers are flushed. Does not alter data cache.
FENCE.I: pipeline after instruction is flushed.

In a non-coherent system:

FENCE: in one pass, each data cache line is examined. dirty lines are written back and can be marked valid or invalid (implementation dependent). write buffers are flushed.
FENCE.I: do a FENCE. in addition, in one pass, each instruction cache line is marked invalid. pipeline after instruction is flushed.

Even if they were equivalent, doing it as two instructions would be much slower (making non-coherent systems even slower). I don't think this is something you want to try macro-op fusion on.
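
To make the distinction concrete, a small C model of the two passes over a
non-coherent data cache, following the definitions above: both passes write
back dirty lines, but WRITEBACK leaves lines valid and clean while FLUSH
leaves them invalid. The structures are illustrative only, not an
implementation.

#include <stddef.h>

enum line_state { INVALID, VALID_CLEAN, VALID_DIRTY };

struct cache_line { enum line_state state; /* tag and data omitted */ };

static void write_back(struct cache_line *l) { l->state = VALID_CLEAN; /* data goes to memory */ }

/* WRITEBACK pass: dirty lines are written back and marked valid, clean. */
static void cache_writeback(struct cache_line *lines, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (lines[i].state == VALID_DIRTY)
            write_back(&lines[i]);
}

/* FLUSH pass: dirty lines are written back; all lines end up invalid. */
static void cache_flush(struct cache_line *lines, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (lines[i].state == VALID_DIRTY)
            write_back(&lines[i]);
        lines[i].state = INVALID;
    }
}

int main(void)
{
    struct cache_line cache[256] = { { VALID_DIRTY } };
    cache_writeback(cache, 256);  /* cache[0] is now VALID_CLEAN */
    cache_flush(cache, 256);      /* every line is now INVALID */
    return 0;
}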

"This operation probably needs a better name. Any ideas?"

PREZERO is a bad name, and I think it is a bad idea the way it has been proposed. If a partial cache line is to be prezeroed, then the rest of the cache line must be fetched, or the cache needs to support per-byte dirty bits. Also this only applies to a WRITEBACK policy.

A no-alloc-on-write cache policy (which could be applied at a page granularity as a PMA) would achieve much of the same purpose. There are two variants: one which does a write-through on a write miss, and one which uses per-byte dirty bits and evicts on a write miss but doesn't read the memory block; it just allows the write.

Guy



--
Embedded Supercomputing
http://www.vectorblox.com


Jacob Bachmeyer

Jun 18, 2017, 10:58:16 PM
to Guy Lemieux, Michael Clark, isa...@groups.riscv.org
Guy Lemieux wrote:
> On Jun 18, 2017, at 8:07 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> " I am looking for a good reason that CACHE.FLUSH actually needs its own opcode rather than using CACHE.WRITE/FENCE.RD."
>
> sorry to repeat myself, but FLUSH is not equivalent to WRITEBACK+FENCE.RD. please add FLUSH.
>
> In either a coherent or non-coherent system:
>
> FLUSH: in one pass, each data cache line is examined. dirty lines are written back. all lines are marked invalid.
>
> WRITEBACK: in one pass, each data cache line is examined. dirty lines are written back and marked valid.
>
> In a coherent system:
>
> FENCE: ensure write buffers are flushed. Does not alter data cache.
> FENCE.I: pipeline after instruction is flushed.
>
> In a non-coherent system:
>
> FENCE: in one pass, each data cache line is examined. dirty lines are written back and can be marked valid or invalid (implementation dependent). write buffers are flushed.
> FENCE.I: do a FENCE. in addition, in one pass, each instruction cache line is marked invalid. pipeline after instruction is flushed.
>
> Even if they were equivalent, doing it as two instructions would be much slower (making non-coherent systems even slower). I don't think this is something you want to try macro-op fusion on.
>

This looks like a step in the right direction towards a good argument.
However: WRITEBACK writes all dirty lines and marks them "valid,
clean". A read FENCE ensures that any changes to memory from other
sources will be visible; in the simple case, this invalidates all
"clean" cachelines. Immediately after WRITEBACK, there are no dirty
lines in the cache. Is a second pass to invalidate the (now-clean)
lines that expensive?

> "This operation probably needs a better name. Any ideas?"
>
> PREZERO is a bad name, and I think it is a bad idea the way it has been proposed. If a partial cache line is to be prezeroed, then the rest of the cache line must be fetched, or the cache needs to support per-byte dirty bits. Also this only applies to a WRITEBACK policy.
>
> A no-alloc-on-write cache policy (which could be applied at a page granularity as a PMA) would achieve much of the same purpose. There are two variants: one which does a write through on a write miss, and one which uses per-byte dirty bits and evicts in a write miss but doesn't read the memory block it just allows the write.
>

The catch is that PMAs are supposed to be baked into hardware, not
configuration items, and are unrelated to paging in RISC-V. (PMAs are
hardware characteristics; PMP is controlled by M-mode; paging is
controlled by S-mode.) Nonetheless, cache policies are not specified in
the RISC-V ISA, so instructions must avoid depending on them and if
PREZERO is only useful with writeback caches, it needs to be rethought.

Is there value in having a "relax coherency" instruction that hints a
complete overwrite of a memory region? This was the intended purpose of
PREZERO.



-- Jacob

Michael Clark

Jun 18, 2017, 11:10:13 PM
to jcb6...@gmail.com, Guy Lemieux, isa...@groups.riscv.org

On 19 Jun 2017, at 2:58 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

The catch is that PMAs are supposed to be baked into hardware, not configuration items, and are unrelated to paging in RISC-V.  (PMAs are hardware characteristics; PMP is controlled by M-mode; paging is controlled by S-mode.)  Nonetheless, cache policies are not specified in the RISC-V ISA, so instructions must avoid depending on them and if PREZERO is only useful with writeback caches, it needs to be rethought.

That’s completely untrue.

Baking PMAs into hardware is /one/ implementation approach. Perhaps the simplest implementation approach. We should not exclude sophisticated memory controllers.

There are several current architectures that allow dynamic control over no evict, write combine, write through (uncached), write back, shadowing ROM, cache as scratch, etc. An implementation may use PMAs to signal static properties if the address range properties are baked into hardware, but if the system has a relatively sophisticated cache system and memory controller, many of the PMAs should be able to be set dynamically. Read the coreboot source for some examples… I can provide you pointers…

Guy Lemieux

Jun 18, 2017, 11:29:44 PM
to jcb6...@gmail.com, Michael Clark, isa...@groups.riscv.org
the second pass must iterate through every cache line, tag-compare them, and write the tags invalid. done one per cycle, it can take a few hundred cycles.

regarding PREZERO, when a complete overwrite of memory is expected, you have to consider coherent and non-coherent systems. it gets very messy because coherent systems have many ways of being built and there may not be a way to support this behavior directly, and non-coherent systems probably don't have any mechanism to remove other cached copies anyway, so it will be a high-overhead operation.

I don't think PREZERO should be a cache operation -- too tricky to specify something that is useful that doesn't fall apart on many different implementations. the only advantage is avoiding a useless read before writing, but you can get that effect with a write through policy (dynamically selected) and/or write combining buffer.

I recommend using a vector store or a DMA unit to pre-clear memory.

A memset or memcpy instruction may be cool, but is departing from the RISC design philosophy. That shouldn't stop the debate -- but one must treat it with caution. The vector unit that we have designed at VectorBlox has a DMA master that can do memset and memcpy.

Guy
--
Embedded Supercomputing
http://www.vectorblox.com


Jacob Bachmeyer

Jun 19, 2017, 12:26:26 AM
to Michael Clark, Guy Lemieux, isa...@groups.riscv.org
Michael Clark wrote:
>
>> On 19 Jun 2017, at 2:58 PM, Jacob Bachmeyer <jcb6...@gmail.com
PMAs are Physical Memory Attributes, that is, /attributes/ of /physical/
/memory/. PMAs *are* baked into hardware -- ROM is ROM, MMIO is MMIO,
and RAM is RAM. The availability of sophisticated memory controllers
that allow hotplugging of physical memory does not change this basic
fact, although it *does* mean that PMAs can change at runtime. (RAM
shadowing ROM is ... ROM! Cache as scratchpad RAM is ... RAM! These are
*not* distinct PMAs.) Cache policies are *not* attributes of physical
memory; please do not confuse them with actual PMAs.


-- Jacob

Jacob Bachmeyer

Jun 19, 2017, 12:40:17 AM
to Guy Lemieux, Michael Clark, isa...@groups.riscv.org
Guy Lemieux wrote:
> the second pass must iterate through every cache line, and tag compare them write tag invalid. done one per cycle it can take a few hundred cycles.
>

Parallel processing within each cache line is infeasible? (Every line
could have its own logic that can do "valid, clean" -> "invalid" on a
read fence if the system is non-coherent, although a read fence becomes
"interesting" if a cacheline is "valid, dirty" and whether the
corresponding memory has been written in the interim is unknown. Does
this mean that fences end up requiring coherent caches?) Is FLUSH any
different in effect from WRITEBACK in coherent systems? If cache
coherency is maintained by hardware, is invalidation ever needed? Or
does choosing a line to evict take time, and flushing the cache
therefore provides performance benefits?

> regarding PREZERO, when a complete overwrite of memory is expected, you have to consider coherent and non-coherent systems. it gets very messy because coherent systems have many ways of being built and there may not be a way to support this behavior directly, and non-coherent systems probably don't have any mechanism to remove other cached copies anyways so it will be a high overhead operation.
>
> I don't think PREZERO should be a cache operation -- too tricky to specify something that is useful that doesn't fall apart on many different implementations. the only advantage is avoiding a useless read before writing, but you can get that effect with a write through policy (dynamically selected) and/or write combining buffer.
>

That advantage is its purpose. Can we find other ways to achieve that?
Possibly ways that do not actually require support from the ISA? Or that
become mere recommendations for software to follow? (For example,
"always copy or set memory in sequential words/doublewords/quadwords
(32/64/128-bit); this allows hardware with write combining buffers to
elide a cacheline fetch if the entire line is overwritten"?)
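
A sketch of that software recommendation in C, assuming a 64-byte line:
store to memory in full, sequential line-sized chunks so a write-combining
implementation can see that the entire line is overwritten and elide the
fill. The line length and alignment are assumptions for the example.

#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64u  /* assumed line length */

/* Zero a line-aligned buffer whose size is a multiple of the line length,
 * using sequential 64-bit stores (eight per 64-byte line); hardware with a
 * write-combining buffer may then skip fetching the overwritten lines. */
static void zero_lines(uint64_t *dst, size_t bytes)
{
    for (size_t i = 0; i < bytes / sizeof *dst; i++)
        dst[i] = 0;
}

int main(void)
{
    static _Alignas(64) uint64_t buf[512];  /* 4 KiB, aligned to a line boundary */
    zero_lines(buf, sizeof buf);
    return 0;
}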

Good answers to this concern will be enough to drop PREZERO from draft 3.

> I recommend using a vector store or a DMA unit to pre-clear memory.
>
> A memset or memcpy instruction may be cool, but is departing from the RISC design philosophy. That shouldn't stop the debate -- but one must treat it with caution. The vector unit that we have designed at VectorBlox has a DMA master that can do memset and memcpy.
>

PREZERO was very specifically *not* memset but was supposed to be an
optimization for memset and memcpy. Those functions could use PREZERO to
more quickly prepare the destination buffer.


-- Jacob


chuanhua.chang

unread,
Jul 17, 2017, 10:31:38 PM7/17/17
to RISC-V ISA Dev, jcb6...@gmail.com
Andes Technology would like to have explicit cache-control instructions in a standard extension as well.

Chuanhua

Jacob Bachmeyer

unread,
Jul 17, 2017, 11:57:22 PM7/17/17
to chuanhua.chang, RISC-V ISA Dev
chuanhua.chang wrote:
> Andes Technology would like to have explicit cache-control
> instructions in a standard extension as well.
Can you comment on the proposal I offered and the discussion so far? I
will write a draft 3 if there is interest in one.

-- Jacob

Guy Lemieux

unread,
Jul 18, 2017, 3:56:20 AM7/18/17
to Jacob Bachmeyer, Michael Clark, RISC-V ISA Dev
...this thread went dormant for a while but it's important and it has
come up again....

my thoughts below.


On Sun, Jun 18, 2017 at 9:40 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> Guy Lemieux wrote:
>> the second pass must iterate through every cache line, tag-compare
>> them, and write the tag invalid. Done one per cycle, it can take a few hundred cycles.
>
> Parallel processing within each cache line is infeasible? (Every line could
> have its own logic that can do "valid, clean" -> "invalid" on a read fence
> if the system is non-coherent,

Yes, this is infeasible. Low-cost systems, ie ones which will not have
coherence, will not likely budget the area required to do parallel
state transitions on all cache lines. A budget system will put the
state into an SRAM, which can only be addressed one line per cycle. If
you want to do multiple cache lines per cycle, you must increase the
width of this SRAM or build it out of dedicated flip-flops. The width
for just one cache line is the full tag and all state bits, eg about
20-24 bits wide.

> although a read fence becomes "interesting"
> if a cacheline is "valid, dirty" and whether the corresponding memory has
> been written in the interim is unknown.

If the system is noncoherent, then it will blindly write back the
cache line irrespective of whether the corresponding memory block was
written by another master (cpu, io device).

> Does this mean that fences end up
> requiring coherent caches?)

No.

> Is FLUSH any different in effect from WRITEBACK
> in coherent systems?

a WRITEBACK instruction from the CPU will leave cache lines in the
valid clean state. if the line was dirty, it will have been written
back.

a FLUSH instruction from the CPU will leave cache lines in the invalid
state. if the line was dirty, it will have been written back.

both of these would operate identically (above description) in a
coherent or non-coherent system. however, in a coherent system, the
process of writing back the dirty line may have other transactions
associated with it (such as invalidating or updating other copies in
other caches or in main memory).


> If cache coherency is maintained by hardware, is
> invalidation ever needed?

cache coherent hardware performs invalidations all the time without
the CPU knowing it. it is the main mechanism for getting rid of shared
copies and ensuring only one entity can have dirty (or write-able)
data.

the instructions I am proposing are required to manage a cache which
does not have the coherence hardware.



> Or does choosing a line to evict take time, and
> flushing the cache therefore provides performance benefits?

choosing a line to evict typically takes zero time.

flushing a cache line is not a performance thing, it is a system
correctness thing. if you are about to have a device (eg, disk) write
new data to memory via DMA, then you must ensure no copies of the data
are in the processor cache (either dirty or clean). this means you
want to perform a FLUSH. however, a flush will trigger a writeback of
dirty data in the cache, and that will get clobbered by the disk DMA
anyways, so an INVALIDATE would be better which simply discards any
data in the cache rather than writing back any dirty data.
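
To make that ordering concrete, a rough sketch of the driver-side sequence on
a non-coherent system; cache_invalidate_range() and dma_start_read() are
placeholder names, not instructions from the proposal:

#include <stddef.h>

extern void cache_invalidate_range(void *start, size_t len); /* discard, no writeback */
extern void dma_start_read(void *dst, size_t len);           /* device fills dst via DMA */

void prepare_dma_rx(void *buf, size_t len)
{
    /* Drop any cached copies (clean or dirty) of the buffer. The incoming DMA
     * data would clobber a writeback anyway, so discarding beats a FLUSH. */
    cache_invalidate_range(buf, len);
    dma_start_read(buf, len);
    /* After the DMA completes, CPU reads miss and fetch the new data. */
}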

>> regarding PREZERO, when a complete overwrite of memory is expected, you
>> have to consider coherent and non-coherent systems. it gets very messy
>> because coherent systems have many ways of being built and there may not be
>> a way to support this behavior directly, and non-coherent systems probably
>> don't have any mechanism to remove other cached copies anyways so it will be
>> a high overhead operation.
>>
>> I don't think PREZERO should be a cache operation -- too tricky to specify
>> something that is useful that doesn't fall apart on many different
>> implementations. the only advantage is avoiding a useless read before
>> writing, but you can get that effect with a write through policy
>> (dynamically selected) and/or write combining buffer
>
> That advantage is its purpose. Can we find other ways to achieve that?
> Possibly ways that do not actually require support from the ISA?

yes: use a DMA device that can do a memcopy, initializing the source
buffer with the desired data.

or use a vector store instruction to write blocks of data.

PREZERO is a complex memory operation, not a cache management
instruction. it attempts to overlay memory functionality into the
cache. it is messy and assumes a lot about how the cache is
implemented. it is purely a performance issue, but it is not essential
for managing a processor with non-coherent caches. the instructions I
have been advocating are essential and provide a functional role that
is not just a performance thing -- it is a function that cannot be
done any other way.

> Or that
> become mere recommendations for software to follow? (For example, "always
> copy or set memory in sequential words/doublewords/quadwords
> (32/64/128-bit); this allows hardware with write combining buffers to elide
> a cacheline fetch if the entire line is overwritten"?)

I don't understand this suggestion.

> Good answers to this concern will be enough to drop PREZERO from draft 3.

Summary:

PREZERO is purely a performance thing. It can already be done with
existing instructions, and can be built out of a DMA engine. It is
complex to implement and may be difficult to add to caches. It should
be discarded from this proposal, since it is not essential for cache
management. It can be considered part of a new proposal that is for
cache-accelerated operations, along with PREFETCH and PIN, but I
wouldn't want to see its complexity become required in non-coherent
cached systems (which are trying to remain simple).

FLUSH and WRITEBACK provide functionality that is not available in any
other way. They are required in a non-coherent cache system, and can
be useful even in a cache coherent system.

An INVALIDATE instruction adds performance, but no real new
functionality. It comes almost for free with very little
implementation difference from FLUSH (it is a simpler variant of
FLUSH). Because of this simple implementation, I advocate for its
inclusion at the level of FLUSH and WRITEBACK.

FENCE.RI and FENCE.RD are ranged versions of FENCE.I and FENCE,
respectively. I think both of these are good ideas, and strongly
support their inclusion.

It is possible for FENCE.I and FENCE.RI to share the same opcode, eg
by encoding rs1 == r0 and rs2 == r0.

FENCE.RD cannot replace FLUSH. a fence does not necessarily have to
flush a data cache if it has a cache coherence protocol. a FLUSH must
always write back and mark invalid. please add a FLUSH/CACHE.FLUSH
instruction.

Of the above, only INVALIDATE is a "destructive" operation. The range
specifier will be precise, so rounding the start or ending addresses
to align with cache line boundaries will be benign for FLUSH and
WRITEBACK, but it will have dire consequences with INVALIDATE. The
easy thing to do is to writeback the first cache line and the last
cache line (if they are dirty) before invalidating them, ie behave
like FLUSH on those two cache lines but behave like INVALIDATE in the
middle.
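
A sketch of those semantics in terms of hypothetical per-line primitives; this
only illustrates which lines get which treatment, not how hardware would
iterate, and LINE, flush_line() and invalidate_line() are made up for the
example:

#include <stdint.h>

#define LINE 32u  /* example line size; a real implementation uses its own */

extern void flush_line(uintptr_t addr);       /* writeback if dirty, then invalidate */
extern void invalidate_line(uintptr_t addr);  /* discard without writeback */

void invalidate_range(uintptr_t start, uintptr_t end)
{
    uintptr_t first = start & ~(uintptr_t)(LINE - 1);
    uintptr_t last  = end   & ~(uintptr_t)(LINE - 1);

    flush_line(first);                 /* possibly partial line: be conservative */
    if (last != first)
        flush_line(last);              /* possibly partial line: be conservative */
    for (uintptr_t a = first + LINE; a < last; a += LINE)
        invalidate_line(a);            /* fully covered lines: safe to discard */
}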

Jacob has recommended that when rd == x0 the operation may run
asynchronously, and when rd == other register then the instruction
will return with the last address that was operated upon. I have the
following concerns:

if rd == x0, then you have no way of knowing when the operation is
done (except to automagically stall on the next instruction that
accesses the cache).

it was suggested that hardware can either auto-iterate, or it can
operate on a single cache line and return an incremented version of
rs1 (aligned to start of the next cache line) in rd. unfortunately, I
cannot think of a way to make this work. the problem is that the
hardware must iterate over the cacheline space from
cacheline(rs1)..cacheline(rs2), not the memory address space
(rs1..rs2). if hardware returns an address (rs1+increment) then it
will be forced to iterate through rs1..rs2 and it will take a very
long time for large memory regions. if hardware returns a cache line
index, cacheline(rs1+increment), then you cannot simply use that value
to update the rs1 argument and continue. if anyone sees a way this can
be made efficient, where the cache instruction operates on a single
line and returns a useful progress indicator, please help :)

Guy

chuanhua.chang

unread,
Jul 18, 2017, 6:24:07 AM7/18/17
to RISC-V ISA Dev, chuanhu...@gmail.com, jcb6...@gmail.com


The cache-control instructions we are thinking about are summarized in the following table:

 

VA based | icache                | dcache
---------+-----------------------+----------------------------------
         | invalidate (+ unlock) | invalidate (+ unlock)
         |                       | writeback
         |                       | writeback & invalidate (+ unlock)
         | lock                  | lock
         | unlock                | unlock

 

The PIN or lock operation should have a return status to indicate its success, so that this operation will not lock out all ways of a multi-way cache. And a simple implementation can decide not to support cache locking and always return “fail” for the lock operation.

 

Guy’s following comment is a good idea for the “INVALIDATE” operation.

 

“Of the above, only INVALIDATE is a "destructive" operation. The range
specifier will be precise, so rounding the start or ending addresses
to align with cache line boundaries will be benign for FLUSH and
WRITEBACK, but it will have dire consequences with INVALIDATE. The
easy thing to do is to writeback the first cache line and the last
cache line (if they are dirty) before invalidating them, ie behave
like FLUSH on those two cache lines but behave like INVALIDATE in the
middle.”

 

The PREZERO and PREFETCH operations are not our current focus. Can they be in a separate extension or be optional in the same extension?



-- Chuanhua

Guy Lemieux

unread,
Jul 18, 2017, 2:01:48 PM7/18/17
to chuanhua.chang, RISC-V ISA Dev, Jacob Bachmeyer
Thanks for the suggestion Chuanhua.

What does "VA based" mean? Virtual Address based?

You have 1 operation for icache (invalidate) and 3 operations for dcache (invalidate, writeback, writeback+invalidate).

These correspond to ones I have been advocating, except with the name writeback+invalidate = flush.

The instructions advocated by Jacob and myself are all range-based. Typical ISAs operate on a single cache line, and require software to know the cache line size, which leads to software bugs (infamously reported earlier, a big.LITTLE ARM system had different cache line sizes in the cores and software forgot to check when it migrated between cores leading to a bug).

Also, I have not taken care to distinguish between icache, dcache, or both targets. Presumably, invalidate on any address in the dcache would also need to invalidate the icache. Likewise for flush (writeback+invalidate), which does an invalidate on the icache. Writeback would only operate on dcache. Do you see a reason there must be explicit targets (icache, dcache, or even both) in the ISA and not let hardware manage this implicitly?

Finally, your Lock proposal is similar to Jacob's Pin. Generally, however, I don't see why locking is necessary or attractive when a TCM/scratchpad is superior in most cases. To add locking/pinning of cache lines, someone has to think through all of the cases of processors with coherent caches, non-coherent caches, multithreaded (sharing), VMs/hypervisors, etc. Also, FPGAs almost always use direct-mapped caches, in which case either obeying the hint or ignoring it may have negative performance consequences, so it becomes difficult to decide which to do. Locking almost always has negative performance consequences which must be guarded against, and almost always uses more power than a TCM/scratchpad.

In contrast, wouldn't a TCM/scratchpad RAM almost always solve all of these problems? It can be done on a per-system basis (where needed) without changing the ISA, and without forcing all implementations to carry the baggage of the extra instruction decode, cache coherence state, etc. One thing that may be desired is an easy way to query a system whether it contains a TCM/scratchpad, but this should be done at the OS level, eg in a device tree or such.

Thanks,
Guy




Bruce Hoult

unread,
Jul 18, 2017, 6:34:30 PM7/18/17
to Guy Lemieux, chuanhua.chang, RISC-V ISA Dev, Jacob Bachmeyer
On Tue, Jul 18, 2017 at 9:01 PM, Guy Lemieux <glem...@vectorblox.com> wrote:
> Thanks for the suggestion Chuanhua.
>
> What does "VA based" mean? Virtual Address based?
>
> You have 1 operation for icache (invalidate) and 3 operations for dcache (invalidate, writeback, writeback+invalidate).
>
> These correspond to ones I have been advocating, except with the name writeback+invalidate = flush.
>
> The instructions advocated by Jacob and myself are all range-based. Typical ISAs operate on a single cache line, and require software to know the cache line size, which leads to software bugs (infamously reported earlier, a big.LITTLE ARM system had different cache line sizes in the cores and software forgot to check when it migrated between cores leading to a bug).

It's not a question of forgetting to check. You can't realistically check! An app doesn't get any notification of when it is migrated from one core to another. Even if the app polls for the current core's cache line size immediately before using the cache control instruction there is still a possibility of being migrated between any two instructions.

I'm not in favour of implicit looping solutions either i.e. "invalidate everything between lo and hi". It seems far better to me to have an instruction that only promises to invalidate (for example) *something* starting at lo, and hi, and returns how much work it did (e.g. an updated value for lo).

It is then software's responsibility to check if lo is still lower than hi, and loop to do more work if so. 

Some implementations might choose to do everything in one instruction (perhaps only if not interrupted), but I'd expect many or most to only do one cache line at a time.

The important point is that whatever CPU runs the cache invalidate instruction at that moment knows its own cache line size, and thus updates lo appropriately. So it doesn't matter if the process gets migrated in the middle of the loop.
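
As a minimal sketch of that loop (cache_inval_some() is a made-up stand-in
for the instruction described here, which invalidates at least one line's
worth starting at lo, bounded by hi, and returns the updated lo):

#include <stdint.h>

extern uintptr_t cache_inval_some(uintptr_t lo, uintptr_t hi);

void invalidate_range_loop(uintptr_t lo, uintptr_t hi)
{
    /* Each iteration uses the line size of whichever core executes it,
     * so migration between iterations does not break the loop. */
    while (lo < hi)
        lo = cache_inval_some(lo, hi);
}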

Guy Lemieux

unread,
Jul 18, 2017, 6:42:55 PM7/18/17
to Bruce Hoult, chuanhua.chang, RISC-V ISA Dev, Jacob Bachmeyer
> there is still a
> possibility of being migrated between any two instructions.

can you create a critical section that can't be interrupted/migrated?

> I'm not in favour of implicit looping solutions either i.e. "invalidate
> everything between lo and hi". It seems far better to me to have an
> instruction that only promises to invalidate (for example) *something*
> starting at lo, and hi, and returns how much work it did (e.g. an updated
> value for lo).
>
> It is then software's responsibility to check if lo is still lower than hi,
> and loop to do more work if so.


This sounds easy but there are subtle details you seem to be
overlooking. I detailed these in my last post. Briefly, the iteration
space must be cacheline(start_addr) to cacheline(end_addr), but
software thinks it is start_addr to end_addr. You can't return the
cacheline() value after incrementing, because you can't pass that back
in as an operand; you must pass in start_addr+delta, not
cacheline(start_addr+delta), for region-based operations. Most CPUs do
single-line invalidates, indexed by cache line not by region, but then
you need to know the line size, and the invalidation is done for all
tag values (it ignores the tag), not for only a specific region.

> Some implementations might choose to do everything in one instruction
> (perhaps only if not interrupted), but I'd expect many or most to only do
> one cache line at a time.

Have you implemented this before in hardware? It's not any more
difficult than doing a single cache line at a time, yet it's a lot
faster and easier to get right at the CPU level. The main issue is
whether we want to accept interrupts in a long-running instruction,
and then how to resume.

> The important point is that whatever CPU runs the cache invalidate
> instruction at that moment knows its own cache line size, and thus updates
> lo appropriately. So it doesn't matter if the process gets migrated in the
> middle of the loop

If software is migrated to a different core, there is a good chance it
will be in a different cache. This means it either (a) needs to start
over from scratch, or (b) the system needs to have coherent caches.



Guy

Sean Halle

unread,
Jul 18, 2017, 6:56:13 PM7/18/17
to Bruce Hoult, Guy Lemieux, chuanhua.chang, RISC-V ISA Dev, Jacob Bachmeyer

Hi, I have been following this a bit... it looks like you're making progress. I was wondering what your thinking is about base ISA vs extension? If they are all in the base ISA and the compiler targets them, then if we implement them all as NOPs, we will end up with binaries out there that fail, yes? What is the thinking around scenarios like that?

Thanks,

Sean

 


Michael Clark

unread,
Jul 18, 2017, 6:58:46 PM7/18/17
to Bruce Hoult, Guy Lemieux, chuanhua.chang, RISC-V ISA Dev, Jacob Bachmeyer

On 19 Jul 2017, at 10:34 AM, Bruce Hoult <br...@hoult.org> wrote:

> I'm not in favour of implicit looping solutions either i.e. "invalidate everything between lo and hi". It seems far better to me to have an instruction that only promises to invalidate (for example) *something* starting at lo, and hi, and returns how much work it did (e.g. an updated value for lo).

I think we are all in agreement in this respect after sorear first suggested it.

The ISA precedent is the V extension setvl instruction, which is passed the requested vector length and returns the available vector length for use as an operand to a subsequent add or sub in a strip-mine loop.

The bigger question, assuming that behaviour is a given, is whether the incremental cache management instructions increment lo by the cache line size or by some implementation-defined value that is the unit of work achievable in one instruction. It would be nice to constrain it to the cache line size so that the instructions can serve a dual purpose, i.e. executing one instruction is much easier than plumbing DTB all the way through to some user code that needs to be cache-line-size aware for some optimised array codes. It also constrains the unit of work and makes the instruction execution time predictable (in the good sense), i.e. constant.

Alex Marshall

unread,
Jul 18, 2017, 7:58:14 PM7/18/17
to Sean Halle, Bruce Hoult, Guy Lemieux, chuanhua.chang, RISC-V ISA Dev, Jacob Bachmeyer

You should be able to implement it as a move from the “high” source register to the destination register, I believe. Hopefully not that much of a burden? That said, we may be past the point where this could go in the base ISA, just because it can’t be implemented as a NOP (I don’t recall how RISC-V deals with undefined opcodes, if it’s always a trap it might be ok…).

 

Thanks,

Alex

 


Jacob Bachmeyer

unread,
Jul 18, 2017, 8:00:58 PM7/18/17
to chuanhua.chang, RISC-V ISA Dev
chuanhua.chang wrote:
>
>
> The cache-control instructions we are thinking about are summarized in
> the following table:
>
>
>
> VA based | icache                | dcache
> ---------+-----------------------+----------------------------------
>          | invalidate (+ unlock) | invalidate (+ unlock)
>          |                       | writeback
>          |                       | writeback & invalidate (+ unlock)
>          | lock                  | lock
>          | unlock                | unlock
>

For icache, CACHE.PIN.I/CACHE.UNPIN.I are lock/unlock and are limited to
M-mode for the same reasons that delivery of IPIs is limited to M-mode.
I believe that FENCE.I and FENCE.RI have "icache invalidate" semantics.
For dcache, I likewise believe that FENCE and FENCE.RD can provide
"invalidate" semantics (a counterexample requires a situation where
"fence" semantics are preserved that is distinguishable from cache
invalidation). CACHE.DISCARD provides simple invalidation when the
current cached data is no longer needed, such as preparing an I/O buffer
for a DMA read from external hardware. You make a good point that
destructive cache control should unpin cachelines; this will affect
CACHE.DISCARD in draft 3. This also might be a good argument for a
CACHE.FLUSH ("writeback-and-invalidate") instruction, as requiring
CACHE.WRITE+CACHE.UNPIN+FENCE.RD is a bit long for that operation. The
other dcache operations are simple: writeback is CACHE.WRITE, lock is
CACHE.PIN, and unlock is CACHE.UNPIN.

> The PIN or lock operation should have a return status to indicate its
> success, so that this operation will not lock out all ways of a
> multi-way cache. And a simple implementation can decide not to support
> cache locking and always return “fail” for the lock operation.
>

All REGION instructions produce a result that is the first address after
the affected region. If the operation was not performed, the base
address can be returned to indicate that a zero-length region was
actually affected.
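
Under that convention, software could check for success along these lines
(cache_pin() is a hypothetical wrapper around the proposed CACHE.PIN, not an
instruction name from the proposal):

#include <stdint.h>
#include <stdbool.h>

/* Returns the first address after the region actually pinned; the base
 * address means a zero-length region, i.e. nothing was pinned. */
extern uintptr_t cache_pin(uintptr_t base, uintptr_t bound);

bool try_pin(uintptr_t base, uintptr_t bound)
{
    uintptr_t done = cache_pin(base, bound);
    return done > base;   /* note: success may still be partial (done < bound) */
}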

> Guy’s following comment is a good idea for the “INVALIDATE” operation.
>
>
> “Of the above, only INVALIDATE is a "destructive" operation. The range
> specifier will be precise, so rounding the start or ending addresses
> to align with cache line boundaries will be benign for FLUSH and
> WRITEBACK, but it will have dire consequences with INVALIDATE. The
> easy thing to do is to writeback the first cache line and the last
> cache line (if they are dirty) before invalidating them, ie behave
> like FLUSH on those two cache lines but behave like INVALIDATE in the
> middle.”
>

This is exactly the semantics for my proposed CACHE.DISCARD.

> The PREZERO and PREFETCH operations are not our current focus. Can
> they be in a separate extension or be optional in the same extension?
>

All of the operations are optional-to-implement; simply returning the
base address or zero is permitted.


-- Jacob

Jacob Bachmeyer

unread,
Jul 18, 2017, 8:15:59 PM7/18/17
to Guy Lemieux, chuanhua.chang, RISC-V ISA Dev
Guy Lemieux wrote:
> Thanks for the suggestion Chuanhua.
>
> What does "VA based" mean? Virtual Address based?

That is what I have assumed.

> You have 1 operation for icache (invalidate) and 3 operations for
> dcache (invalidate, writeback, writeback+invalidate).
>
> These correspond to ones I have been advocating, except with the name
> writeback+invalidate = flush.

The I-cache is read-only, so writeback and flush make no sense for it.

> The instructions advocated by Jacob and myself are all range-based.
> Typical ISAs operate on a single cache line, and require software to
> know the cache line size, which leads to software bugs (infamously
> reported earlier, a big.LITTLE ARM system had different cache line
> sizes in the cores and software forgot to check when it migrated
> between cores leading to a bug).
>
> Also, I have not taken care to distinguish between icache, dcache, or
> both targets. Presumably, invalidate on any address in the dcache
> would also need to invalidate the icache. Likewise for flush
> (writeback+invalidate), which does an invalidate on the icache.
> Writeback would only operate on dcache. Do you see a reason there must
> be explicit targets (icache, dcache, or even both) in the ISA and not
> let hardware manage this implicitly?

RISC-V requires an explicit FENCE.I to ensure coherency between data
access and instruction fetch already, so I believe that making this a
hardware burden was already considered and rejected.

> Finally, your Lock proposal is similar to Jacob's Pin. Generally,
> however, I don't see why locking is necessary or attractive when a
> TCM/scratchpad is superior in most cases. To add locking/pinning of
> cache lines, someone has to think through all of the cases of
> processors with coherent caches, non-coherent caches, multithreaded
> (sharing), VMs/hypervisors, etc. Also, FPGAs almost always use
> direct-mapped caches, in which case both obeying the hint or ignoring
> it may have negative performance consequences, so it becomes difficult
> to decide which to do. Locking almost always has a negative
> performance consequences which must be guarded against, and almost
> always use more power than a TCM/scratchpad.

The primary reason for data cacheline pinning is to reduce timing side
channels in general and improve performance in the particular case of
using a small lookup table to process a large amount of data.
Implementations are allowed to "non-implement" pinning and simply return
the base address.

Instruction cacheline pinning is proposed to address an issue that was
raised with the HiFive board that I suspect other implementations may
also have, where writes to flash preclude concurrently running from
flash, but there is an instruction cache available that will make this
work, if the flash-write code is cached before the process starts.

> In contrast, wouldn't a TCM/scratchpad RAM almost always solve all of
> these problems? It can be done on a per-system basis (where needed)
> without changing the ISA, and without forcing all implementations to
> carry the baggage of the extra instruction decode, cache coherence
> state, etc. One thing that may be desired is an easy way to query a
> system whether it contains a TCM/scratchpad, but this should be done
> at the OS level, eg in a device tree or such.

The extra instruction decode can be as little as recognizing a
REGION-NOOP (transfer rs1 -> rd; "ADDI rd, rs1, 0") if the feature isn't
actually implemented. The problem I see with relying on a scratchpad
RAM is how to expose that scratchpad to user programs? Scratchpads may
make more sense in embedded systems, but even there pinning can be
useful, so I have kept the "pin" operations thus far.


-- Jacob

Jacob Bachmeyer

unread,
Jul 18, 2017, 8:28:41 PM7/18/17
to Michael Clark, Bruce Hoult, Guy Lemieux, chuanhua.chang, RISC-V ISA Dev
Michael Clark wrote:
>
>> On 19 Jul 2017, at 10:34 AM, Bruce Hoult <br...@hoult.org
This constraint is simply wrong -- an implementation should be allowed
to perform as much of the work as it can, as quickly as it can.

The result of a REGION operation is the first address after the affected
region. This means that the result will be aligned at whatever hardware
granularity applies, if the operation is implemented at all.
"CACHE.PREFETCH? t1, t2, t2" will load t1 with the address of the first
hardware granularity boundary after the address in t2. Cacheline size
can be inquired with a "CACHE.PREFETCH? t1, t2, t2 ; CACHE.PREFETCH? t2,
t1, t1; SUB t1, t2, t1" sequence. Note that a major design goal of
these instructions is to make them *independent* of the cacheline size.
Correct programs should *never* have to care about the actual cacheline
size when they run. (I will admit extreme performance optimization as an
exception to this rule, but note that such programs will still produce
correct results, just more slowly.) Note also that cacheline size may
vary with address. I am uncertain whether any hardware would want to do
this, but the option does exist with the REGION instructions.
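
Modeled in C (cache_prefetch() is a hypothetical stand-in for one of the
CACHE.PREFETCH instructions, returning the first address after the affected
region), the inquiry sequence above becomes:

#include <stdint.h>

extern uintptr_t cache_prefetch(uintptr_t base, uintptr_t bound);

/* Assumes both touched lines share the same granularity. */
uintptr_t granularity_at(uintptr_t addr)
{
    uintptr_t next  = cache_prefetch(addr, addr);   /* CACHE.PREFETCH t1, t2, t2 */
    uintptr_t after = cache_prefetch(next, next);   /* CACHE.PREFETCH t2, t1, t1 */
    return after - next;                            /* SUB t1, t2, t1 */
}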


-- Jacob

Guy Lemieux

unread,
Jul 18, 2017, 8:32:35 PM7/18/17
to Jacob Bachmeyer, chuanhua.chang, RISC-V ISA Dev
> I believe that FENCE.I and FENCE.RI have "icache invalidate" semantics. For
> dcache, I likewise believe that FENCE and FENCE.RD can provide "invalidate"
> semantics (a counterexample requires a situation where "fence" semantics are
> preserved that is distinguishable from cache invalidation).

FENCE and its variants do not require an invalidation to be performed.
Do not use these instructions to invalidate cache lines. In
particular, cache coherent systems will not need to perform an
invalidation at all, and will keep all data in the cache. On
non-coherent caches, it just happens that invalidation is the easiest
way to implement FENCE. Thus, FENCE has different possible side
effects, depending upon how the cache system is built.

Therefore, FENCE cannot be used to replace an INVALIDATE.

An INVALIDATE operation is optional, as it only accelerates FLUSH
operations by indicating that a writeback of dirty data is not
required. It cannot be replaced by a FENCE. INVALIDATE will be useful
by systems with both coherent and non-coherent caches.

> CACHE.DISCARD
> provides simple invalidation when the current cached data is no longer
> needed, such as preparing an I/O buffer for a DMA read from external
> hardware. You make a good point that destructive cache control should unpin
> cachelines; this will affect CACHE.DISCARD in draft 3.

> This also might be a
> good argument for a CACHE.FLUSH ("writeback-and-invalidate") instruction, as
> requiring CACHE.WRITE+CACHE.UNPIN+FENCE.RD is a bit long for that operation.

You cannot rely upon a FENCE to perform invalidate, so that
instruction combination is not a valid solution.

I've proposed 3 instructions: WRITEBACK, INVALIDATE, and FLUSH.

At minimum, you need just the FLUSH operation. Or, you can build a
system with just 2 instructions (WRITEBACK and FLUSH, or WRITEBACK and
INVALIDATE). But I recommend all 3.

Guy

Guy Lemieux

unread,
Jul 18, 2017, 9:33:45 PM7/18/17
to Jacob Bachmeyer, chuanhua.chang, RISC-V ISA Dev
On Tue, Jul 18, 2017 at 5:15 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> The I-cache is read-only, so writeback and flush make no sense for it.

On an icache, WRITEBACK is essentially a NOP and FLUSH does an INVALIDATE.

The point is that software should not know about separate icaches and
dcaches. The caches may be unified.

The 3 cache instructions (writeback, flush, invalidate) are all
region-based. These instructions should not specify whether to operate
on dcache or icache, because that would tie them to a specific
hardware implementation. Instead, by specifying an operation on a
region, the microarchitecture must automatically know whether to
inspect an icache or a dcache and do the right thing. For example, with
an INVALIDATE operation, if the memory region happens to include
instructions then it must flush the icache, and if it happens to
include data then it must flush the dcache --> in any case, it must
iterate over both caches to check for the presence of any cache tags
that match the memory region of interest, and invalidate them.

When software does an INVALIDATE, it is because the underlying memory
region is about to be modified (Eg by external DMA device). If that
memory region is in the icache, then the icache must be flushed.
Software should not say "only flush this from icache" or "only flush
this from dcache"; it must be removed from both caches.

WRITEBACK does not make sense for an icache, so hardware won't have to
check the icache during such an operation. The programmer doesn't have
to specify which cache to writeback; the hardware knows.

Similarly, FLUSH will do both WB+INVAL on dcache, and only INVAL on icache.


>> Do you see a reason there must be explicit targets (icache, dcache,
>> or even both) in the ISA and not let hardware manage this implicitly?
>
> RISC-V requires an explicit FENCE.I to ensure coherency between data access
> and instruction fetch already, so I believe that making this a hardware
> burden was already considered and rejected.

I don't understand your response.

My question is: why should cache manipulation instructions have to
specify whether they are to operate on the icache or dcache? What does
FENCE have to do with this?



> The primary reason for data cacheline pinning is to reduce timing side
> channels in general and improve performance in the particular case of using
> a small lookup table to process a large amount of data. Implementations are
> allowed to "non-implement" pinning and simply return the base address.

I understand the motivation for pinning.

Pinning on one microarchitecture may improve performance, but pinning
on another microarchitecture may not improve performance. For example,
it may help on one system with two-way set associative caches, and
greatly harm on another system with direct-mapped caches. It is not a
good, portable mechanism to improve performance.

Scratchpads are portable and work.


> Instruction cacheline pinning is proposed to address an issue that was
> raised with the HiFive board that I suspect other implementations may also
> have

I'd like to hear from SiFive about this issue and how the icache
pinning will help.

I'd also like to hear from others such Andes directly on why they
think pinning is necessary.

> where writes to flash preclude concurrently running from flash, but
> there is an instruction cache available that will make this work, if the
> flash-write code is cached before the process starts.



>> In contrast, wouldn't a TCM/scratchpad RAM almost always solve all of
>> these problems? It can be done on a per-system basis (where needed) without
>> changing the ISA, and without forcing all implementations to carry the
>> baggage of the extra instruction decode, cache coherence state, etc. One
>> thing that may be desired is an easy way to query a system whether it
>> contains a TCM/scratchpad, but this should be done at the OS level, eg in a
>> device tree or such.
>
>
> The extra instruction decode can be as little as recognizing a REGION-NOOP
> (transfer rs1 -> rd; "ADDI rd, rs1, 0") if the feature isn't actually
> implemented.

Mapping instructions to NOPs may seem trivial, but it adds to the
logic of the system.... in an FPGA, especially tiny ones like Lattice
iCE parts, it is important to be as lean as possible. Adding useless
instructions, and saying hardware can ignore them by adding hardware
to translate them into NOPs, is like making the gas tank on your car
really really big and heavy but never filling it up because you can't
afford to.

> The problem I see with relying on a scratchpad RAM is how to
> expose that scratchpad to user programs? Scratchpads may make more sense in
> embedded systems, but even there pinning can be useful, so I have kept the
> "pin" operations thus far.

How do you expose DRAM to user programs? How do you expose Flash ROM
to user programs?

Pin operations are well-intended, just like the "register" keyword in
C. However, their use can do more harm than good and can limit
software portability. Nobody writes software with the "register"
keyword any more.

Guy

Guy Lemieux

unread,
Jul 18, 2017, 9:48:43 PM7/18/17
to Jacob Bachmeyer, Michael Clark, Bruce Hoult, chuanhua.chang, RISC-V ISA Dev
> This constraint is simply wrong -- an implementation should be allowed to
> perform as much of the work as it can, as quickly as it can.
>
> The result of a REGION operation is the first address after the affected
> region. This means that the result will be aligned at whatever hardware
> granularity applies, if the operation is implemented at all.
> "CACHE.PREFETCH? t1, t2, t2" will load t1 with the address of the first
> hardware granularity boundary after the address in t2. Cacheline size can
> be inquired with a "CACHE.PREFETCH? t1, t2, t2 ; CACHE.PREFETCH? t2, t1, t1;
> SUB t1, t2, t1" sequence. Note that a major design goal of these
> instructions is to make them *independent* of the cacheline size. Correct
> programs should *never* have to care about the actual cacheline size when
> they run. (I will admit exteme performance optimization as an exception to
> this rule, but note that such programs will still produce correct results,
> just more slowly.) Note also that cacheline size may vary with address. I
> am uncertain whether any hardware would want to do this, but the option does
> exist with the REGION instructions.

Let's do the math for a specific example...

For a REGION that spans 1GB, how many clock cycles should it take to
perform a cache operation on an 8kB cache with 32B cache lines?

How many software loop iterations will be required?

How quickly can hardware do it using a hardware iterating mechanism?

start_addr = 0
end_addr = (1<<30) - 1

1GB = 1024 x 1024 x 1024 Bytes
/32B = 32M (2^25) memory blocks

8kB cache = 8 x 1024 Bytes
/32B = 256 cache lines

Software to invalidate 1GB, 32B at a time in a 32b system:

for( uintptr_t p = start_addr; p <= end_addr; p += 32 ) {
    invalidate_by_addr( (void *)p );   // one operation per 32B block
}

Hardware to invalidate 1GB, 32B at a time:

// 8kB cache = 1<<13 bytes, so 8 index bits at bit positions 5..12
uint32_t cache_line_index( uint32_t addr ) {
    return (addr & 0x00001FE0) >> 5 ; // extract index bits from address
}

uint32_t cl_start = cache_line_index( start_addr );
uint32_t cl_end = cache_line_index( end_addr );
for( uint32_t i = cl_start; i <= cl_end; i++ ) {
    if( start_addr <= tag_to_addr(i, cache[i].tag) &&
        tag_to_addr(i, cache[i].tag) <= end_addr ) {
        // note: I've over-simplified a bit here with the tag_to_addr() function
        invalidate_by_index( i );
    }
}

Would you rather have the looping done in hardware or in software?


Guy

Jacob Bachmeyer

unread,
Jul 18, 2017, 9:48:54 PM7/18/17
to Guy Lemieux, Michael Clark, RISC-V ISA Dev
Guy Lemieux wrote:
> ...this thread went dormant for a while but it's important and it has
> come up again....
>
> my thoughts below.
>
>
> On Sun, Jun 18, 2017 at 9:40 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
>> Guy Lemieux wrote:
>>
>>> the second pass must iterate through every cache line, tag-compare
>>> them, and write the tag invalid. Done one per cycle, it can take a few hundred cycles.
>>>
>> Parallel processing within each cache line is infeasible? (Every line could
>> have its own logic that can do "valid, clean" -> "invalid" on a read fence
>> if the system is non-coherent,
>>
>
> Yes, this is infeasible. Low-cost systems, ie ones which will not have
> coherence, will not likely budget the area required to do parallel
> state transitions on all cache lines. A budget system will put the
> state into an SRAM, which can only be addressed one line per cycle. If
> you want to do multiple cache lines per cycle, you must increase the
> width of this SRAM or build it out of dedicated flip-flops. The width
> for just one cache line is the full tag and all state bits, eg about
> 20-24 bits wide.
>

Perhaps I am misremembering, but I thought that cache tags always have
to be in CAM? Of course, CAM can supply an address decode function and
supporting data could be in SRAM columns. I see your point, since
parallel invalidate would require dedicated flip-flops and control logic
amidst the CAM/SRAM array. Is macro-op fusion for
CACHE.WRITE+CACHE.DISCARD feasible?

I am looking at this from a high-performance perspective, so your view
is helpful.

>> although a read fence becomes "interesting"
>> if a cacheline is "valid, dirty" and whether the corresponding memory has
>> been written in the interim is unknown.
>>
>
> If the system is noncoherent, then it will blindly write back the
> cache line irrespective of whether the corresponding memory block was
> written by another master (cpu, io device).
>
>
>> Does this mean that fences end up
>> requiring coherent caches?)
>>
>
> No.
>

Then it looks to me that non-coherent systems might be unable to provide
the stated semantics for FENCE; this is a concern.

>> Is FLUSH any different in effect from WRITEBACK
>> in coherent systems?
>>
>
> a WRITEBACK instruction from the CPU will leave cache lines in the
> valid clean state. if the line was dirty, it will have been written
> back.
>
> a FLUSH instruction from the CPU will leave cache lines in the invalid
> state. if the line was dirty, it will have been written back.
>
> both of these would operate identically (above description) in a
> coherent or non-coherent system. however, in a coherent system, the
> process of writing back the dirty line may have other transactions
> associated with it (such as invalidating or updating other copies in
> other caches or in main memory).
>

Unless I am mistaken, writeback in a coherent system can never result in
another cacheline being invalidated or updated -- any other copies had
to have been invalidated or updated when "this" copy was updated.
Otherwise, another master could produce a conflicting update to its copy
-- and the system is not cache-coherent.

>> If cache coherency is maintained by hardware, is
>> invalidation ever needed?
>>
>
> cache coherent hardware performs invalidations all the time without
> the CPU knowing it. it is the main mechanism for getting rid of shared
> copies and ensuring only one entity can have dirty (or write-able)
> data.
>

Then software invalidation is never required in a coherent system?

> the instructions I am proposing are required to manage a cache which
> does not have the coherence hardware.
>

This seems to the central issue: in a non-coherent system, I believe
that data fences require flushing caches. How else can you ensure that
all previous stores are visible to other masters in the system without a
coherency protocol?

>> Or does choosing a line to evict take time, and
>> flushing the cache therefore provides performance benefits?
>>
>
> choosing a line to evict typically takes zero time.
>
> flushing a cache line is not to a performance thing, it is a system
> correctness thing. if you are about to have a device (eg, disk) write
> new data to memory via DMA, then you must ensure no copies of the data
> are in the processor cache (either dirty or clean). this means you
> want to perform a FLUSH. however, a flush will trigger a writeback of
> dirty data in the cache, and that will get clobbered by the disk DMA
> anyways, so an INVALIDATE would be better which simply discards any
> data in the cache rather than writing back any dirty data.
>

INVALIDATE is the proposed CACHE.DISCARD.

>>> regarding PREZERO, when a complete overwrite of memory is expected, you
>>> have to consider coherent and non-coherent systems. it gets very messy
>>> because coherent systems have many ways of being built and there may not be
>>> a way to support this behavior directly, and non-coherent systems probably
>>> don't have any mechanism to remove other cached copies anyways so it will be
>>> a high overhead operation.
>>>
>>> I don't think PREZERO should be a cache operation -- too tricky to specify
>>> something that is useful that doesn't fall apart on many different
>>> implementations. the only advantage is avoiding a useless read before
>>> writing, but you can get that effect with a write through policy
>>> (dynamically selected) and/or write combining buffer
>>>
>> That advantage is its purpose. Can we find other ways to achieve that?
>> Possibly ways that do not actually require support from the ISA?
>>
>
> yes: use a DMA device that can do a memcopy, initializing the source
> buffer with the desired data.
>
> or use a vector store instruction to write blocks of data.
>
> PREZERO is a complex memory operation, not a cache management
> instruction. it attempts to overlay memory functionality into the
> cache. it is messy and assumes a lot about how the cache is
> implemented. it is purely a performance issue, but it is not essential
> for managing a processor with non-coherent caches. the instructions I
> have been advocated are essential and provide a functional role that
> is not just a performance thing -- it is a function that cannot be
> done any other way.
>

CACHE.PREZERO is also a dual to CACHE.DISCARD: CACHE.DISCARD "clears
the way" for data entering the processor, while CACHE.PREZERO (to be
renamed in draft 3) "clears the way" for data leaving the processor.
While the system-correctness issues (in non-coherent systems) are
asymmetric, the performance benefits in coherent systems are similar.

>> Or that
>> become mere recommendations for software to follow? (For example, "always
>> copy or set memory in sequential words/doublewords/quadwords
>> (32/64/128-bit); this allows hardware with write combining buffers to elide
>> a cacheline fetch if the entire line is overwritten"?)
>>
>
> I don't understand this suggestion.
>

It was another effort at meeting the use-case for CACHE.PREZERO. Put
simply, if you have a write-combining buffer, and software always
follows a known and simple pattern when clobbering an entire block,
hardware could recognize that pattern and implicitly perform a
CACHE.PREZERO-like operation line-by-line.

>> Good answers to this concern will be enough to drop PREZERO from draft 3.
>>
>
> Summary:
>
> PREZERO is purely a performance thing. It can already be done with
> existing instructions, and can be built out of a DMA engine. It is
> complex to implement and may be difficult to add to caches. It should
> be discarded from this proposal, since it is not essential for cache
> management. It can be considered part of a new proposal that is for
> cache-accelerated operations, along with PREFETCH and PIN, but I
> wouldn't want to see its complexity become required in non-coherent
> cached systems (which are trying to remain simple).
>

Another important goal is keeping the total number of extensions down,
so splitting "cache control for optimization" and "cache control for
non-coherent systems" seems ill-advised to me.

Nor is its complexity required -- all REGION instructions can be
"non-implemented" by returning the base address; decode as "ADDI rd,
rs1, 0".

> FLUSH and WRITEBACK provide functionality that is not available in any
> other way. They are required in a non-coherent cache system, and can
> be useful even in a cache coherent system.
>

WRITEBACK is not available any other way, but CACHE.WRITE+CACHE.DISCARD
is FLUSH ("writeback+invalidate").

> An INVALIDATE instruction adds performance, but no real new
> functionality. It comes almost for free with very little
> implementation differences from FLUSH (it is a simpler variant of
> FLUSH). Because of this simple implementation, I advocate for its
> inclusion at the level of FLUSH and WRITEBACK.
>

I see WRITEBACK (CACHE.WRITE) and INVALIDATE (CACHE.DISCARD) as
primitives, with FLUSH as a derived function.

> FENCE.RI and FENCE.RD are ranged versions of FENCE.I and FENCE,
> respectively. I think both of these are good ideas, and strongly
> support their inclusion.
>

FENCE and FENCE.RD are not quite the same due to encoding constraints.
FENCE can order a subset of operations across the entire address space,
while FENCE.RD orders all operations across a subset of the address space.

> It is possible for FENCE.I and FENCE.RI to share the same opcode, eg
> by encoding rs1 == r0 and rs2 == r0.
>

That would make "FENCE.RI x0, x0, x0" which performs a ranged fence on
the smallest hardware granularity at address 0 impossible to encode. On
the other hand, it would free a function code that could be used for
PREFETCH-FOR-UPDATE, so I am uncertain.

> FENCE.RD cannot replace FLUSH. a fence does not necessarily have to
> flush a data cache if it has a cache coherence protocol. a FLUSH must
> always write back and mark invalid. please add a FLUSH/CACHE.FLUSH
> instruction.
>

This is the problem I have: if caches are coherent, what
software-visible difference exists between WRITEBACK+FENCE and FLUSH?
If caches are non-coherent, I believe that FENCE *is* FLUSH. Is this
not so in non-coherent systems?

> Of the above, only INVALIDATE is a "destructive" operation. The range
> specifier will be precise, so rounding the start or ending addresses
> to align with cache line boundaries will be benign for FLUSH and
> WRITEBACK, but it will have dire consequences with INVALIDATE. The
> easy thing to do is to writeback the first cache line and the last
> cache line (if they are dirty) before invalidating them, ie behave
> like FLUSH on those to cache lines but behave like INVALIDATE in the
> middle.
>

This is the proposed behavior for CACHE.DISCARD.

> Jacob has recommended that when rd == x0 the operation may run
> asynchronously, and when rd == other register then the instruction
> will return with the last address that was operated upon. I have the
> following concerns:
>
> if rd == x0, then you have no way of knowing when the operation is
> done (except to automagically stall on the next instruction that
> accesses the cache).
>

This automagic stall is the stall-on-cache-miss that would happen
anyway. Asynchronous operations are defined generally in REGION, but
are mostly intended for asynchronous prefetch, although CACHE.UNPIN,
CACHE.DISCARD, and CACHE.PREZERO (will be renamed) could also be
usefully asynchronous, provided that hardware can ensure that the
semantics hold.

> it was suggested that hardware can either auto-iterate, or it can
> operate on a single cache line and return an incremented version of
> rs1 (aligned to start of the next cache line) in rd. unfortunately, I
> cannot think of a way to make this work. the problem is that the
> hardware must iterate over the cacheline space from
> cacheline(rs1)..cacheline(rs2), not the memory address space
> (rs1..rs2). if hardware returns an address (rs1+increment) then it
> will be forced to iterate through rs1..rs2 and it will take a very
> long time for large memory regions. if hardware returns a cache line
> index, cacheline(rs1+increment), then you cannot simply use that value
> to update the rs1 argument and continue. if anyone sees a way this can
> be made efficient, where the cache instruction operates on a single
> line and returns a useful progress indicator, please help :)
>

Easy: hardware aligns rs1 to a cacheline boundary (by masking low-order
bits) and increments *that* value. The result would be the base address
of the next cacheline; software observes that some, but not all, of the
requested work was done and iterates until hardware returns a value
greater than rs2. Software abandons the operation if hardware returns
rs1 -- indicating that a region has been reached where the requested
operation is unavailable.
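
A sketch of the resulting software loop (cache_op() is a hypothetical wrapper
for any REGION instruction, following the result convention described above):

#include <stdint.h>

/* Operates on some prefix of [base, bound] and returns the first address
 * after the affected region; returns base itself if nothing was affected. */
extern uintptr_t cache_op(uintptr_t base, uintptr_t bound);

void cache_op_range(uintptr_t base, uintptr_t bound)
{
    while (base <= bound) {
        uintptr_t next = cache_op(base, bound);
        if (next == base)
            break;        /* zero-length region: operation unavailable here */
        base = next;      /* next is aligned to this hart's own granularity */
    }
}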



-- Jacob

Jacob Bachmeyer

unread,
Jul 18, 2017, 9:52:51 PM7/18/17
to Alex Marshall, Sean Halle, Bruce Hoult, Guy Lemieux, chuanhua.chang, RISC-V ISA Dev
Alex Marshall wrote:
>
> You should be able to implement it as a move from the “high” source
> register to the destination register, I believe. Hopefully not that
> much of a burden? That said, we may be past the point where this could
> go in the base ISA, just because it can’t be implemented as a NOP (I
> don’t recall how RISC-V deals with undefined opcodes, if it’s always a
> trap it might be ok…).
>

A move from the "low" source register to the destination ("ADDI rd, rs1,
0") is the defined way to indicate "not-implemented at that address" in
my proposal. This reports that a zero-length region was affected.

Undefined opcodes always trap in RISC-V.


-- Jacob

Jacob Bachmeyer

unread,
Jul 18, 2017, 10:01:23 PM7/18/17
to Guy Lemieux, Michael Clark, Bruce Hoult, chuanhua.chang, RISC-V ISA Dev
Obviously hardware: software must iterate over the address space, while
hardware can iterate over the (much smaller) set of actual cachelines.

There is a further simplification possible here for non-destructive
operations. Taking WRITEBACK as an example, CACHE.WRITE can be seen as
permission to produce a burst of memory traffic. If executed on a
region larger than the cache, an implementation is permitted to simply
writeback the entire cache and return the upper bound, since WRITEBACK
has certainly been applied to the entire region if the entire cache has
been written back.


-- Jacob

Allen Baum

unread,
Jul 18, 2017, 10:41:56 PM7/18/17
to Bruce Hoult, Guy Lemieux, chuanhua.chang, RISC-V ISA Dev, Jacob Bachmeyer
How about a primitive that, instead of upper & lower limits, has a lower limit and a destination reg that gets updated with lower + cacheline size (or even just the cacheline size, in which case it could return other status as well)? No state machine needed, no problem with thread migration.

-Allen

On Jul 18, 2017, at 3:34 PM, Bruce Hoult <br...@hoult.org> wrote:




It's not a question of forgetting to check. You can't realistically check! An app doesn't get any notification of when it is migrated from one core to another. Even if the app polls for the current core's cache line size immediately before using the cache control instruction there is still a possibility of being migrated between any two instructions.

I'm not in favour of implicit looping solutions either, i.e. "invalidate everything between lo and hi". It seems far better to me to have an instruction that only promises to invalidate (for example) *something* between lo and hi, and returns how much work it did (e.g. an updated value for lo).

It is then software's responsibility to check if lo is still lower than hi, and loop to do more work if so. 

Some implementations might choose to do everything in one instruction (perhaps only if not interrupted), but I'd expect many or most to only do one cache line at a time.

The important point is that whatever CPU runs the cache invalidate instruction at that moment knows its own cache line size, and thus updates lo appropriately. So it doesn't matter if the process gets migrated in the middle of the loop.


Jacob Bachmeyer

unread,
Jul 18, 2017, 10:58:43 PM7/18/17
to Guy Lemieux, chuanhua.chang, RISC-V ISA Dev
Guy Lemieux wrote:
> On Tue, Jul 18, 2017 at 5:15 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
>> The I-cache is read-only, so writeback and flush make no sense for it.
>>
>
> On an icache, WRITEBACK is essentially a NOP and FLUSH does an INVALIDATE.
>
> The point is that software should not know about separate icaches and
> dcaches. The caches may be unified.
>

A unified cache may be treated as a data cache.

> The 3 cache instructions (writeback, flush, invalidate) are all
> region-based. These instructions should not specify whether to operate
> on dcache or icache, because that would tie them to a specific
> hardware implementation. Instead, by specifying an operation on a
> region, the microarchitecture must automatically know whether to
> inspect an icache or a dcache and do the right thing. For example, an
> INVALIDATE operation, if the memory region happens to include
> instructions then it must flush the icache, and if it happens to
> include data then it must flush the dcache --> in any case, it must
> iterate over both caches to check for the presence of any cache tags
> that match the memory region of interest, and invalidate them.
>
> When software does an INVALIDATE, it is because the underlying memory
> region is about to be modified (Eg by external DMA device). If that
> memory region is in the icache, then the icache must be flushed.
> Software should not say "only flush this from icache" or "only flush
> this from dcache"; it must be removed from both caches.
>
> WRITEBACK does not make sense for an icache, so hardware won't have to
> check the icache during such an operation. The programmer doesn't have
> to specify which cache to writeback; the hardware knows.
>
> Similarly, FLUSH will do both WB+INVAL on dcache, and only INVAL on icache.
>

On icache, WRITEBACK and FLUSH are both simply INVALIDATE. The user ISA
spec specifically says (in commentary) that a simple implementation can
flush the instruction cache when FENCE.I is executed. Since INVALIDATE
is the only meaningful basic control operation on the instruction cache,
and FENCE.I/FENCE.RI essentially have the equivalent effect, there is no
need for new instructions for basic operations on the instruction cache.

Prefetching into the instruction cache is a distinct operation because
it expresses a distinct intent: this memory contains code to execute
rather than data to examine. Likewise, pinning instruction cachelines
is a privileged operation for the same reasons that delivering an IPI is
a privileged operation.

>>> Do you see a reason there must be explicit targets (icache, dcache,
>>> or even both) in the ISA and not let hardware manage this implicitly?
>>>
>> RISC-V requires an explicit FENCE.I to ensure coherency between data access
>> and instruction fetch already, so I believe that making this a hardware
>> burden was already considered and rejected.
>>
>
> I don't understand your response.
>
> My question is: why should cache manipulation instructions have to
> specify whether they are to operate on the icache or dcache? What does
> FENCE have to do with this?
>

Most of the cache-control instructions are D-cache-only, because most of
the interesting operations apply only to the data cache. FENCE is
relevant because RISC-V seems to use a quasi-Harvard model where
instructions and data share an address space, but there are distinct
architectural paths to memory for data access and instruction fetch. To
ensure that instruction fetch will see recent data stores, FENCE.I must
be executed. This is in the base ISA.
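
For a concrete (if contrived) illustration, here is roughly what the
self-modifying-code case looks like on a bare-metal RISC-V hart. The
buffer, its writable-and-executable mapping, the single-hart assumption,
and the object-to-function-pointer cast are all assumptions of this
sketch; the word stored is the encoding of a plain "ret":

#include <stdint.h>

void run_generated_code(uint32_t *buf)
{
    buf[0] = 0x00008067u;   /* jalr x0, 0(ra), i.e. "ret" -- written via the data path */
    __asm__ volatile ("fence.i" ::: "memory");  /* make instruction fetch see the store */
    ((void (*)(void))buf)();                    /* only now is it safe to execute from buf */
}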

>> The primary reason for data cacheline pinning is to reduce timing side
>> channels in general and improve performance in the particular case of using
>> a small lookup table to process a large amount of data. Implementations are
>> allowed to "non-implement" pinning and simply return the base address.
>>
>
> I understand the motivation for pinning.
>
> Pinning on one microarchitecture may improve performance, but pinning
> on another microarchitecture may not improve performance. For example,
> it may help on one system with two-way set associative caches, and
> greatly harm on another system with direct-mapped caches. It is not a
> good, portable mechanism to improve performance.
>
> Scratchpads are portable and work.
>

If an implementation can never benefit from pinning, that implementation
has the option to "non-implement" pinning and always return the base
address.

>>> In contrast, wouldn't a TCM/scratchpad RAM almost always solve all of
>>> these problems? It can be done on a per-system basis (where needed) without
>>> changing the ISA, and without forcing all implementations to carry the
>>> baggage of the extra instruction decode, cache coherence state, etc. One
>>> thing that may be desired is an easy way to query a system whether it
>>> contains a TCM/scratchpad, but this should be done at the OS level, eg in a
>>> device tree or such.
>>>
>> The extra instruction decode can be as little as recognizing a REGION-NOOP
>> (transfer rs1 -> rd; "ADDI rd, rs1, 0") if the feature isn't actually
>> implemented.
>>
>
> Mapping instructions to NOPs may seem trivial, but it adds to the
> logic of the system.... in an FPGA, especially tiny ones like Lattice
> iCE parts, it is important to be as lean as possible. Adding useless
> instructions, and saying hardware can ignore them by adding hardware
> to translate them into NOPs, is like making the gas tank on your car
> really really big and heavy but never filling it up because you can't
> afford to.
>

Once you can recognize the MISC-MEM/REGION space, interpreting
unimplemented operations as quasi-NOPs is easy. Or you can leave it
unimplemented, take an illegal instruction trap, and emulate the
quasi-NOP. There is no requirement that hardware support every
instruction in RISC-V -- only that the defined instructions work
somehow. In an FPGA, where the entire environment is likely to be
non-standard, you do not even need that -- you can ensure that the
unsupported opcodes never appear in your program.
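
As a rough sketch of the trap-and-emulate path (the decode constants
follow the proposal and the base encoding; the regs[] array and the
surrounding M-mode trap glue that supplies the faulting instruction word
and advances mepc are assumptions, not anything specified here):

#include <stdint.h>

#define OPC_MISC_MEM  0x0Fu    /* major opcode MISC-MEM */
#define FUNCT3_REGION 0x1u     /* proposed REGION minor opcode, 3'b001 */

/* Returns 1 if insn was a MISC-MEM/REGION op emulated as the quasi-NOP
 * "ADDI rd, rs1, 0"; the caller then advances mepc by 4. */
int emulate_region_quasi_nop(uint32_t insn, uintptr_t regs[32])
{
    uint32_t opcode = insn & 0x7Fu;
    uint32_t funct3 = (insn >> 12) & 0x7u;
    uint32_t funct7 = (insn >> 25) & 0x7Fu;
    uint32_t rd     = (insn >> 7)  & 0x1Fu;
    uint32_t rs1    = (insn >> 15) & 0x1Fu;

    if (opcode != OPC_MISC_MEM || funct3 != FUNCT3_REGION)
        return 0;                 /* not ours; let the handler report it */
    if (funct7 == 0)
        return 0;                 /* function 0 is the existing FENCE.I; don't NOP it */
    if (rd != 0)
        regs[rd] = regs[rs1];     /* report a zero-length affected region */
    return 1;
}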

>> The problem I see with relying on a scratchpad RAM is how to
>> expose that scratchpad to user programs? Scratchpads may make more sense in
>> embedded systems, but even there pinning can be useful, so I have kept the
>> "pin" operations thus far.
>>
>
> How do you expose DRAM to user programs?
The primary data segment.
> How do you expose Flash ROM to user programs?
>
The program text segment.

I understand your point, but a scratchpad RAM is a special region of
memory, with properties different from main memory. How to communicate
that to user programs?

> Pin operations are well-intended, just like the "register" keyword in
> C. However, their use can be more destructive than good and limit
> software portability. Nobody writes software with the "register"
> keyword any more.

That is because modern compilers can infer "register" and usually do a
better job of it than most programmers. I doubt hardware will be able
to handle caching similarly well anytime soon.


-- Jacob

Jacob Bachmeyer

unread,
Jul 18, 2017, 11:08:31 PM7/18/17
to Guy Lemieux, chuanhua.chang, RISC-V ISA Dev
Guy Lemieux wrote:
>> I believe that FENCE.I and FENCE.RI have "icache invalidate" semantics. For
>> dcache, I likewise believe that FENCE and FENCE.RD can provide "invalidate"
>> semantics (a counterexample requires a situation where "fence" semantics are
>> preserved that is distinguishable from cache invalidation).
>>
>
> FENCE and its variants do not require an invalidation to be performed.
> Do not use these instructions to invalidate cache lines. In
> particular, cache coherent systems will not need to perform an
> invalidation at all, and will keep all data in the cache. On
> non-coherent caches, it just happens that invalidation is the easiest
> way to implement FENCE. Thus, FENCE has different possible side
> effects, depending upon how the cache system is built.
>
> Therefore, FENCE cannot be used to replace an INVALIDATE.
>
> An INVALIDATE operation is optional, as it only accelerates FLUSH
> operations by indicating that a writeback of dirty data is not
> required. It cannot be replaced by a FENCE. INVALIDATE will be useful
> by systems with both coherent and non-coherent caches.
>

How are these distinguishable to software? In a coherent system,
hardware will invalidate the cachelines when needed, so there is no need
for a software INVALIDATE operation at all -- it is purely a performance
optimization. In a non-coherent system, FENCE requires
writeback-and-invalidate. What software-visible difference (aside from
performance) exists between FENCE and FLUSH? Do partially-coherent
systems need a distinct FLUSH?

>> CACHE.DISCARD
>> provides simple invalidation when the current cached data is no longer
>> needed, such as preparing an I/O buffer for a DMA read from external
>> hardware. You make a good point that destructive cache control should unpin
>> cachelines; this will affect CACHE.DISCARD in draft 3.
>>
>> This also might be a
>> good argument for a CACHE.FLUSH ("writeback-and-invalidate") instruction, as
>> requiring CACHE.WRITE+CACHE.UNPIN+FENCE.RD is a bit long for that operation.
>>
>
> You cannot rely upon a FENCE to perform invalidate, so that
> instruction combination is not a valid solution.
>
> I've proposed 3 instructions: WRITEBACK INVALIDATE, and FLUSH.
>
> At minimum, you need just the a FLUSH operation. Or, you can build a
> system with just 2 instructions (WRITEBACK and FLUSH, or WRITEBACK and
> INVALIDATE). But I recommend all 3.
>

So writeback-and-invalidate is CACHE.WRITE+CACHE.DISCARD, with
CACHE.DISCARD performing the implicit unpin operation. So we are back
to two instructions for writeback-and-invalidate. Why should this be a
single instruction? (Note that I am not strongly opposed to adding
CACHE.FLUSH as its own opcode; I just want a good reason why it is worth
a function code.)



-- Jacob

Guy Lemieux

unread,
Jul 18, 2017, 11:25:03 PM7/18/17
to Jacob Bachmeyer, chuanhua.chang, RISC-V ISA Dev
Yes, there are varying degrees of being coherent.

Software should work correctly on as many systems as possible without
making system-specific sections of code.

Software should perform an INV when needed, e.g. in preparing DMA data
buffers. If the IO device is coherent then the CPU can treat INV as a
NOP. If the IO device is noncoherent, then the INV instruction must do
its thing. In this latter case, the other processors may be coherent
or non-coherent, and software doesn't care; all it needs to know is
the IO device is noncoherent. Since the DMA will be modifying memory
that is potentially in the dcache and icache, the INV must remove data
from both caches. The programmer should not have to specify icache or
dcache or ucache or whatever -- it should just work.
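
A minimal sketch of that DMA-read preparation, using hypothetical
wrappers (cache_discard() standing in for whatever ranged invalidate
ends up in the ISA, dma_start_read_into() for the device driver; the
names and signatures are illustrative only):

#include <stddef.h>
#include <stdint.h>

extern void cache_discard(void *lo, void *hi);           /* hypothetical ranged invalidate */
extern void dma_start_read_into(void *buf, size_t len);  /* hypothetical device driver call */

void prepare_dma_read(void *buf, size_t len)
{
    /* The device is about to overwrite buf, so drop any stale cached
     * copies first; later loads must see the DMA'd data, not old lines. */
    cache_discard(buf, (uint8_t *)buf + len);
    dma_start_read_into(buf, len);
}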

Software should perform a FENCE when the CPU has written to a code
page. In a noncoherent system, this may trigger a writeback of the
data cache and an invalidate of the instruction cache; it isn't clear
that it needs to also invalidate the data cache. In a coherent system,
this may simply ensure write buffers are flushed. However, the RISC-V
spec DOES NOT REQUIRE that a flush and invalidate be done on a fence
in a noncoherent system; it merely states that is one way to achieve
the desired result.

These two different use cases require two different operations.

FENCE != INVALIDATE.


> So we are back to two instructions
> for writeback-and-invalidate. Why should this be a single instruction?
> (Note that I am not strongly opposed to adding CACHE.FLUSH as its own
> opcode; I just want a good reason why it is worth a function code.)

I think we're talking in circles.

There are 3 basic instructions, but you can get away with specifying
just 1 or just 2 if you want to be "minimal", but 3 is best. I've
given all the arguments already and don't wish to repeat for the sake
of other readers.

Guy

Jacob Bachmeyer

unread,
Jul 18, 2017, 11:45:58 PM7/18/17
to Guy Lemieux, chuanhua.chang, RISC-V ISA Dev
After rereading some of your previous messages, CACHE.FLUSH will be in
draft 3. The convincing detail was small implementations that need to
iterate over cachelines for both writeback and for invalidate, and the
likely infeasibility of otherwise combining those operations. The
function code was reserved anyway.


-- Jacob

Allen J. Baum

unread,
Jul 19, 2017, 1:10:13 AM7/19/17
to jcb6...@gmail.com, Michael Clark, Bruce Hoult, Guy Lemieux, chuanhua.chang, RISC-V ISA Dev
At 7:28 PM -0500 7/18/17, Jacob Bachmeyer wrote:
>Michael Clark wrote:
>>
>>>On 19 Jul 2017, at 10:34 AM, Bruce Hoult <br...@hoult.org <mailto:br...@hoult.org>> wrote:
>>>
>>>I'm not in favour of implicit looping solutions either i.e. "invalidate everything between lo and hi". It seems far better to me to have an instruction that only promises to invalidate (for example) *something* starting at lo, and hi, and returns how much work it did (e.g. an updated value for lo).
>>
>>I think we are all in agreement in this respect after sorear first suggested it.

I hadn't noticed that he suggested it.
It removes my objections.
Does the proposed instruction format have rs1, rs2 and rd available?



--
**************************************************
* Allen Baum tel. (908)BIT-BAUM *
* 248-2286 *
**************************************************

Allen J. Baum

unread,
Jul 19, 2017, 1:25:27 AM7/19/17
to Guy Lemieux, Jacob Bachmeyer, Michael Clark, Bruce Hoult, chuanhua.chang, RISC-V ISA Dev
At 6:48 PM -0700 7/18/17, Guy Lemieux wrote:
>
>Let's do the math for a specific example...
>
>For a REGION that spans 1GB, how many clock cycles should take to
>perform a cache operation on an 8kB cache with 32B cache lines?
>
>How many software loop iterations will be required?
>
>How quickly can hardware do it using a hardware iterating mechanism?
>
>.....
>
>Would you rather have the looping done in hardware or in software?

You need to look at the real world.
How often will the performance of this be constrained by memory BW?
How often will it be constrained by TLB misses?
How often do you actually need to invalidate a 1GB region?
Multiply that all out.
If that number doesn't make a 1% difference in system performance, then adding a state machine is a mistake.
State machines sound easy - but the corner cases (and there will be waaaay more than you expect, believe me (or not)) will kill you.

That's one reason I like the feedback of returning the next address after the one that was invalidated - I don't have to implement a state machine.
Yes, a SW implementation using a loop can have corner cases, too - but I don't have to tape out another chip to do it.


I think I missed something from way back: is this instruction limited to M-mode? M+S? M+S+U?

Allen J. Baum

unread,
Jul 19, 2017, 1:34:30 AM7/19/17
to jcb6...@gmail.com, Guy Lemieux, chuanhua.chang, RISC-V ISA Dev
At 7:15 PM -0500 7/18/17, Jacob Bachmeyer wrote:
>.....
>Instruction cacheline pinning is proposed to address an issue that was raised with the HiFive board that I suspect other implementations may also have, where writes to flash preclude concurrently running from flash, but there is an instruction cache available that will make this work, if the flash-write code is cached before the process starts.

As a data point, Intel processors will load chunks of code into cache and execute there because at boot they would otherwise be looping on (very) slow serial EPROM, and boot would take forever. In that particular case, I can't recall if they actually lock anything or are just very, very careful to make sure the code can't miss (by knowing cache size and wayness, and allocating addresses carefully).

>
>>In contrast, wouldn't a TCM/scratchpad RAM almost always solve all of these problems? It can be done on a per-system basis (where needed) without changing the ISA, and without forcing all implementations to carry the baggage of the extra instruction decode, cache coherence state, etc. One thing that may be desired is an easy way to query a system whether it contains a TCM/scratchpad, but this should be done at the OS level, eg in a device tree or such.

For a small code, a scratchpad can be a very significant amount of area.
I don't see why it should be slower; it's basically saying that if there is an eviction, choose someone else. That's off the critical path. Muxing in scratchpad data may actually slow down access to the cache, however, since you've just added a mux and a bunch of logic that has to turn off cache accesses (and if you can't do that fast enough, you haven't saved any power).

Michael Clark

unread,
Jul 19, 2017, 2:26:16 AM7/19/17
to Allen J. Baum, jcb6...@gmail.com, Guy Lemieux, chuanhua.chang, RISC-V ISA Dev
On 19 Jul 2017, at 5:34 PM, Allen J. Baum <allen...@esperantotech.com> wrote:

At 7:15 PM -0500 7/18/17, Jacob Bachmeyer wrote:
.....
Instruction cacheline pinning is proposed to address an issue that was raised with the HiFive board that I suspect other implementations may also have, where writes to flash preclude concurrently running from flash, but there is an instruction cache available that will make this work, if the flash-write code is cached before the process starts.

As a data point, Intel processors will load chunks of code into cache and execute there because at boot they would otherwise be looping on (very) slow serial EProm, and boot would take forever. In that particular case, I can’t recall if they actually lock anything or are just very, very careful to make sure the code can't mss (by knowing cache size and wayness, and allocating addresses carefully)

Yes. They disable eviction using the NEM (No Evict Mode) model specific register.

Here is the ChromiumOS coreboot code to do so:


I like the idea of being able to enable read allocate (pma_cache_alloc_read), fault the range into cache by touching lines that are in ROM then disable read_allocate so the range is locked in cache. Not enabling eviction (pma_cache_eviction) is equivalent to no evict mode. Not enabling cache read or write is equivalent to uncached access. Later disabling read_allocate prevents bringing in new lines from the backing store. These bits already exist in the cache controller, it’s just a matter of whether they are exposed to the firmware.


Some of the PMA properties I’ve listed may be global in some implementations, but sophisticated memory controllers may be able to enable them dynamically on ranges, some implementations may bake them in to the config although dynamic is ideal; as of course if one is virtualising, one may label memory as IO, and various other use cases. Sophisticated memory controllers can already set most of these cache policies on ranges or at least globally. It is usually runtime where write back policy is enabled on a per range basis if the device supports PCI device plugging. Write combining is enabled at runtime on an IO regions e.g. video framebuffer. I’ve used framebuffer with and without write combining (dynamically set in the MTRR registers) on the video memory region and it makes a huge difference in speed. Intel supports quite a few dynamic properties via the MTRR table (this is global to all cores, not per thread). MTRRs have 80 baked in ranges for “legacy” device emulation along with a variable amount of flexible user defined entries (read up on the IA32_MTRRCAP VCNT field). VCNT is 8 bits so Intel supports up to 256 MTRR entries. MTRR are usually configured by BIOS but Intel also has PAT which allows runtime configuration on a per range basis of: uncacheable, write combining, write through, write protected, write back and uncached.

I defined write around via the absence of an explicit cache write policy. i.e. uncached writes = write around.

In contrast, wouldn't a TCM/scratchpad RAM almost always solve all of these problems? It can be done on a per-system basis (where needed) without changing the ISA, and without forcing all implementations to carry the baggage of the extra instruction decode, cache coherence state, etc. One thing that may be desired is an easy way to query a system whether it contains a TCM/scratchpad, but this should be done at the OS level, eg in a device tree or such.

For a small code, a scratchpad can be a very significant amount of area.
I don't see why it should be slower; it basically saying that if there is an eviction, choose someone else. That's off the critical path. Muxing in scratchpad data may actually slow down access to the cache, however, since you've just added a mux and a bunch of logic that has to turn off cache accesses (and if you can't do that fast enough, you haven't saved any power)


--
**************************************************
* Allen Baum              tel. (908)BIT-BAUM     *
*                                   248-2286     *     
**************************************************


chuanhua.chang

unread,
Jul 19, 2017, 4:13:40 AM7/19/17
to RISC-V ISA Dev, chuanhu...@gmail.com, jcb6...@gmail.com


On Wednesday, July 19, 2017 at 2:01:48 AM UTC+8, glemieux wrote:


What does "VA based" mean? Virtual Address based?

 
“VA” means virtual address. It is used to distinguish from a “cacheline (index, way)” based addressing approach which exposes the cache structure to a programmer.

 

Also, I have not taken care to distinguish between icache, dcache, or both targets. Presumably, invalidate on any address in the dcache would also need to invalidate the icache. Likewise for flush (writeback+invalidate), which does does an invalidate on the icache. Writeback would only operate on dcache. Do you see a reason there must be explicit targets (icache, dcache, or even both) in the ISA and not let hardware manage this implicitly?

Separating cache-control instructions into icache and dcache simplifies hardware design. And this approach will not burden the performance of dcache control instructions with icache access. Accessing icache from the execution stage of the pipeline usually takes longer than accessing dcache. And I think for the majority of use cases, a programmer knows what he (she) wants to work on, either “code” or “data”. So if a programmer wants to control dcache data, there is no need to disturb icache instruction fetching with unnecessary accesses, consuming unnecessary power. Also, even in a cache-coherent system, most implementations have a non-coherent icache. So, to make self-modifying code work, we need to invalidate icache, not dcache. I can see the disadvantage of exposing cacheline size to a programmer. But I cannot see any disadvantage of exposing the icache and dcache distinction to a programmer.

 

Finally, your Lock proposal is similar to Jacob's Pin. Generally, however, I don't see why locking is necessary or attractive when a TCM/scratchpad is superior in most cases. To add locking/pinning of cache lines, someone has to think through all of the cases of processors with coherent caches, non-coherent caches, multithreaded (sharing), VMs/hypervisors, etc. Also, FPGAs almost always use direct-mapped caches, in which case both obeying the hint or ignoring it may have negative performance consequences, so it becomes difficult to decide which to do. Locking almost always has a negative performance consequences which must be guarded against, and almost always use more power than a TCM/scratchpad.

In contrast, wouldn't a TCM/scratchpad RAM almost always solve all of these problems? It can be done on a per-system basis (where needed) without changing the ISA, and without forcing all implementations to carry the baggage of the extra instruction decode, cache coherence state, etc. One thing that may be desired is an easy way to query a system whether it contains a TCM/scratchpad, but this should be done at the OS level, eg in a device tree or such.


Thanks Allen to offer this answer:

 

“For a small code, a scratchpad can be a very significant amount of area. I don't see why it should be slower; it basically saying that if there is an eviction, choose someone else. That's off the critical path. Muxing in scratchpad data may actually slow down access to the cache, however, since you've just added a mux and a bunch of logic that has to turn off cache accesses (and if you can't do that fast enough, you haven't saved any power)”

 

To have a competitive commercial product, our customers use different approaches for their designs. Not everyone wants to have both cache and TCM on their chip. In some use cases, the performance of locked code is more important than the non-locked code. The cases vary a lot. Having a cache locking instruction is a good tool for our customers to tune performance. Sure, locking too much will definitely impact performance negatively. But this is an advanced expert feature and users have to use it with care.


-- Chuanhua

Guy Lemieux

unread,
Jul 19, 2017, 1:09:48 PM7/19/17
to chuanhua.chang, RISC-V ISA Dev, Jacob Bachmeyer
On Wed, Jul 19, 2017 at 1:13 AM, chuanhua.chang
<chuanhu...@gmail.com> wrote:
> “VA” means virtual address. It is used to distinguish from a “cacheline
> (index, way)” based addressing approach which exposes the cache structure to
> a programmer.

Understood -- was just confirming this was the case.

I fully agree.

>> Do you see a reason there must be explicit targets (icache, dcache,
>> or even both) in the ISA and not let hardware manage this implicitly?
>
>
> Separating cache-control instructions into icache and dcache simplifies
> hardware design.

It can simplify design if you have a "classical" separation of icache
and dcache.

But you are assuming a certain microarchitectural structure, rather
than an alternative, such as a shared dual-ported cache.

> And this approach will not burden the performance of dcache
> control instructions with icache access. Accessing icache from the execution
> stage of the pipeline usually takes longer time than accessing dcache.

During a dcache invalidate operation, with range, the icache is
sitting idle anyways. There is no "accessing" of the icache from the
dcache.

With a shared "invalidate" instruction that operates on both icache
and dcache, the two cache structures can use the same time to iterate
through their cache lines in parallel, and use range-based comparators
to invalidate any memory blocks within range. Or the two structures
can go sequentially, sharing the cache line counter and range
comparators, but taking slightly longer to iterate through the two
caches. If a unified cache is used, it is a single structure and you
can potentially iterate through the dual-ported cache twice as fast
from both ports in parallel.

By having separate instructions, invalidate.d and invalidate.i, you
are relying upon the programmer to know something about the underlying
system and to choose the right instruction. This goes against the
RISC-V design philosophy.

I contend the overhead of a combined instruction is minimal, and worth
it from a programmer convenience perspective.

> And I
> think for majority of use cases, a programmer knows what he (she) wants to
> work on, either “code” or “data”.

The same argument could be made about cache line size, and yet bugs pop up :)

> So if a programmer wants to control dcache
> data, there is no need to disturb icache instruction fetching with
> unnecessary accesses and consuming unnecessary power.

This is not a frequent operation, so power overhead of accessing the
icache is minimal.

There is no icache fetching, because the whole CPU is stalled.

> Also, even in a
> cache-coherent system, most implementations have a non-coherent icache. So,
> to make self-modifying code work, we need to invalidate icache, not dcache.

Self-modifying code usually writes to an instruction region using a store
instruction, which sends the instructions out to main memory through
the dcache. The instruction to make this work is FENCE.I, not
INVAL.D or INVAL.I. We don't need INVAL.I to support self-modifying
code -- it won't flush instructions already fetched in the pipeline,
for example, but FENCE.I will do that.

> I can see the disadvantage of exposing cacheline size to a programmer. But I
> cannot see any disadvantage of exposing the icache and dcache distinction to
> a programmer.

I'm not sure it's a "disadvantage" to expose, but it does go against
the RISC-V philosophy of trying to choose an ISA that does not closely
tie itself to a microarchitecture.

Our discussion on this thread looks at adding a (hopefully minimal)
set of cache manipulation instructions that are required in CPUs that
have caches (particularly those with non-coherent caches, which will
be the most frequent use case). So, we can assume caches, but we
should not assume whether they are coherent (even though the most
common use for these instructions will be for systems with noncoherent
caches) and we should not assume a specific cache structure.

But back to your question: what is the disadvantage of exposing this
in the ISA? Let me create some hypothetical examples and see if they are
of concern.

A programmer may assume that an INVAL.D will only invalidate a dcache,
but a CPU implementation with a unified cache will also invalidate
instructions in the same region, leading to performance degradation on
a unified cache system but not on a split cache system. Not a big
deal, but an inconvenience and an unnecessary performance hit that is
unexpected. (With a unified INVAL instruction, the performance hit
would be seen by both unified and split cache systems, but the
programmer would be able to immediately identify this and fix it by
splitting the INVAL to separate regions that exclude the code region
of interest.)

Some embedded systems may want to INVAL.I a region of memory for a
hotpluggable flash ROM (think: game cartridge, or even a user
personality for a photocopier with user-installed apps), to load the
new code. In this case, they must also remember to INVAL.D the dcache.
A unified INVAL instruction does both without introducing bugs that
depend upon which previous ROM was installed.

(I know I'm grasping here; I don't have a very strong "disadvantage"
case, but I can see quirky cases like the above that mostly deal with
correctness issues due to wrong assumptions of a programmer.)

> Thanks Allen to offer this answer:
> “For a small code, a scratchpad can be a very significant amount of area. I
> don't see why it should be slower; it basically saying that if there is an
> eviction, choose someone else. That's off the critical path. Muxing in
> scratchpad data may actually slow down access to the cache, however, since
> you've just added a mux and a bunch of logic that has to turn off cache
> accesses (and if you can't do that fast enough, you haven't saved any
> power)”

I disagree with Allen almost completely.

Locking individual cache lines seems to be a "free" way to get a
scratchpad. There are other ways, such as splitting a cache in half,
with half scratch and half cache. This is not the same as locking half
the lines, because you actually modify the cache index function so
that all of memory maps into half of the cache. Thus, you do not get
the extreme negative performance of locking where certain regions of
RAM simply have no place to fit in the cache because the line is
locked.
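
A toy model of that index-function change (the line size and set count
below are illustrative, not tied to any implementation):

#include <stdint.h>

#define LINE_BITS 6u      /* 64-byte cachelines, for example */
#define NUM_SETS  128u    /* sets in the full cache, for example */

static unsigned set_index_full(uintptr_t addr)
{
    return (unsigned)((addr >> LINE_BITS) % NUM_SETS);       /* normal indexing */
}

static unsigned set_index_split(uintptr_t addr)
{
    /* With half the cache carved out as scratchpad, every address still
     * has a home -- it just maps into half as many sets -- unlike line
     * locking, where addresses that collide with a locked line always miss. */
    return (unsigned)((addr >> LINE_BITS) % (NUM_SETS / 2u));
}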

The arguments about critical path muxing etc. are constraints that chip
designers and FPGA designers know how to deal with, and are not a major
concern.

Locking cache lines costs an extreme amount of power:

1) Every cache holds far more data bits in width than a scratchpad;
often a cache access will fetch 256b of data and 30b of tag/state. On
an icache, when executing sequentially, you save a bit because of row
buffers. But on random data fetches, or jumpy instruction code, you
burn extreme power. In contrast, a scratchpad can be made very power
efficient and doesn't have all of the ported-ness issues of the cache.
Even still, you could segment a cache into half scratch / half cache.

2) When a cache line is locked, memory addresses that map to that line
miss more often. In a direct-mapped cache, they always miss. Fetching
outside of the cache in these cases uses extreme power.

I think locking is an "easy fix" to a problem that isn't of high
importance, has other "more interesting, more portable" solutions, and
will lock RISC-V into a legacy way of thinking.

> To have a competitive commercial product, our customers use different
> approaches for their designs. Not everyone wants to have both cache and TCM
> on their chip.

And not all vendors want to use a split icache/dcache. As an FPGA
designer, unified caches are very attractive to us because our RAM
blocks are naturally dual-ported anyways.

> In some use cases, the performance of locked code is more
> important than the non-locked code. The cases vary a lot. Having a cache
> locking instruction is a good tool for our customers to tune performance.

This is fine if you want your customers to re-tune their code every
time the cache structure changes. Or maybe you want to lock your
customers into your particular implementation and cache structure,
because their performance will be severely degraded on alternative
systems?

> Sure, locking too much will definitely impact performance negatively. But
> this is an advanced expert feature and users have to use it with care.

Yes it is an expert feature. I'm not saying that you should not create
cache locking operations for your own CPUs at Andes. I'm saying that,
at this point, I think they are a bad idea to lock into a
general-purpose ISA to be shared by a community of CPU implementors
because it presupposes a certain way of doing things that not all
vendors want to support. Sure, it's easy for other vendors to turn
locking instructions into NOPs, but then those vendors need to create
their own alternative to locking, and then Andes would need to turn
those other instructions into NOPs.

My final thought is that, unless everyone speaks up and says we need
to do cache locking in a particular way, it is a bad decision to
do this early because it locks too much history into
the ISA.

Thanks,
Guy

Alex Elsayed

unread,
Jul 19, 2017, 3:32:01 PM7/19/17
to isa...@groups.riscv.org
On Wednesday, 19 July 2017 10:09:04 PDT Guy Lemieux wrote:
> On Wed, Jul 19, 2017 at 1:13 AM, chuanhua.chang
>
> <chuanhu...@gmail.com> wrote:
>
> > Thanks Allen to offer this answer:
> > “For a small code, a scratchpad can be a very significant amount of area.
> > I
> > don't see why it should be slower; it basically saying that if there is an
> > eviction, choose someone else. That's off the critical path. Muxing in
> > scratchpad data may actually slow down access to the cache, however, since
> > you've just added a mux and a bunch of logic that has to turn off cache
> > accesses (and if you can't do that fast enough, you haven't saved any
> > power)”
>
> I disagree with Allen almost completely.
>
> Locking individual cache lines seems to be a "free" way to get a
> scratchpad. There are other ways, such as splitting a cache in half,
> with half scratch and half cache. This is not the same as locking half
> the lines, because you actually modify the cache index function so
> that all of memory maps into half of the cache. Thus, you do not get
> the extreme negative performance of locking where certain regions of
> RAM simply have no place to fit in the cache because the line is
> locked.

On the topic of "other ways" of getting a scratchpad:

https://www2.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-131.html

Andrew Waterman

unread,
Jul 19, 2017, 4:02:13 PM7/19/17
to Alex Elsayed, RISC-V ISA Dev
Recent SiFive/Rocket cores also support dynamically reprovisioning the
I$ as a scratchpad (both for data & instructions, but only fast for
instruction fetch), at the granularity of a cache line. We view it as
more robust and more useful than pinning. The design was inspired by,
but is pretty different than, Henry's work on VLS that Alex mentioned.

Michael Clark

unread,
Jul 19, 2017, 4:03:46 PM7/19/17
to Guy Lemieux, chuanhua.chang, RISC-V ISA Dev, Jacob Bachmeyer
A partitioning problem.

The forward-looking trend is towards dynamically reconfigurable caches. If you have an 8-way set-associative cache and can partition it with 1/8, 2/8 or 4/8 pinned, the cost is only at the backend, on evict/allocate. The 8 ways are a sunk cost. I think dictating an architecture that assumes cache+scratchpad and precludes cache partitioning is the direction that limits implementation choices. If pinning operations are NOPs then there is no cost for implementations that choose scratchpad.

There is a lot of forward-looking research into reconfigurable shared L2 (*) that can boost performance over legacy cache architecture. By nature the lines need more sophisticated address decoding than scratchpad SRAM, i.e. a CAM, along with the ability to pin bitlines to specific cores. Dynamically partitioned CAM is the future, not the past. 46% speed increase and 36% reduction in power usage (from going off-chip to DRAM).



Bruce Hoult

unread,
Jul 19, 2017, 4:16:21 PM7/19/17
to Andrew Waterman, Alex Elsayed, RISC-V ISA Dev
What address does the code then have?

The nice thing about icache preload (with or without pinning) is that the code doesn't have to be specially linked (or made PIC) to run at one address while it's physically present in the executable/flash at another address.


Andrew Waterman

unread,
Jul 19, 2017, 4:20:39 PM7/19/17
to Bruce Hoult, Alex Elsayed, RISC-V ISA Dev
On Wed, Jul 19, 2017 at 1:16 PM, Bruce Hoult <br...@hoult.org> wrote:
> What address does the code then have?
>
> The nice thing about icache preload (with or without pinning) is that the
> code doesn't have to be specially linked (or made PIC) to run at one address
> while it's physically present in the executable/flash at another address.

Of course, the code must be at a different address if it's in
scratchpad vs. cacheable memory. But this is a mere matter of
software :-)

Allen J. Baum

unread,
Jul 19, 2017, 5:34:24 PM7/19/17
to Guy Lemieux, chuanhua.chang, RISC-V ISA Dev, Jacob Bachmeyer
At 10:09 AM -0700 7/19/17, Guy Lemieux wrote:

>This is not a frequent operation, so power overhead of accessing the
>icache is minimal.

This sort of kills the argument that having this be high-performance with hardware sequencing is necessary.

>I disagree with Allen almost completely.

Huh. No one has ever disagreed with me before :-)

>Locking individual cache lines seems to be a "free" way to get a
>scratchpad. There are other ways, such as splitting a cache in half,
>with half scratch and half cache. This is not the same as locking half
>the lines, because you actually modify the cache index function so
>that all of memory maps into half of the cache. Thus, you do not get
>the extreme negative performance of locking where certain regions of
>RAM simply have no place to fit in the cache because the line is
>locked.

I think this analysis is so wrong that I believe we have some fundamental differences in our starting assumptions.
I'm assuming a 2-way (at least) cache, and if I can read between the lines above, you're thinking of a direct-mapped cache.

In that we may agree: I don't think I would try to implement any kind of cache line locking in a direct-mapped cache (but I could be talked into it if the circumstances were pretty constrained).

If you assume more than one way, then you can lock line-by-line or way-by-way (as long as you're careful to avoid locking all the ways in a set).
For a 2-way set associative cache, all of the address space is mapped into one of the ways (making it direct mapped), and whatever address range you've locked into the other way is also direct mapped (pretty much just like a scratchpad).
The lines you've locked have to be carefully set up - but that's done once.
They could be a fixed address range, which limits flexibility and could speed up cache access or slow it down (not a casual remark - I've looked at it) depending on how you implement it.
It could save power or take more power, depending on how you implement it.

But, yes you can have bugs in SW that does that. Those are fixable. HW bugs are a bit tougher.

>The arguments about critical path muxing etc are constraints that chip
>designers and FPGA designers know how to deal with and is not a major
>concern.

Um, how many chips have you taped out?
A critical path is a critical path.
You fix it by throwing power and area and money at it - oh, and tapeout schedule. Or you give up and everything runs a little slower.
If cache access has already been tuned to within an inch of its life, you've just made someone's job very difficult indeed - and that's one of the places you look for a critical path.
If cache access isn't the critical path, then you're golden.
But a "no-evict-mode" doesn't cost time or power - it adds no logic into the cache access critical path - as mentioned before (for the multi-way cache architecture) it only affects who is evicted, which is off the critical path.

On the other hand: if all this is being implemented on an FPGA, it's already slower and using more power than a custom implementation, so that is lost in the noise, I suspect. I wouldn't make ISA decisions based on that.

>
>Locking cache lines costs an extreme amount of power:
>
>1) every cache holds far more data bits in width than a scratchpad;
>often a cache access will fetch 256b of data and 30b of tag/state.

or 64B = 512b with 30b of tag/state, another possible differing assumption, though I don't think it matters.

>on an icache, when executing sequentially, you save a bit because of row
>buffers. but on random data fetches, or jumpy instruction code, you
>burn extreme power. in contrast, a scratchpad can be made very power
>efficient and doesn't have all of the ported-ness issues of the cache.
>even still, you could segment a cache into half scratch / half cache.

You're looking at it from a 6-foot level, rather than a 10,000-foot level.
That extra array costs extra power, even if it is more efficient.
And if it doesn't because it's turned off? Then your cache is half the size, and you're wasting even more power because the number of misses increases, and going off chip to DRAM is way more power hungry than those 30 tag bits.
Numbers: for a 2-way 8KB L1 D-cache, one way is 4KB = 64 lines; 64 lines of 30 tag/state bits is about 2kb = 256B.

>2) when a cache line is locked, memory addresses that map to that line
>miss more often. in a direct-mapped cache, they always miss. fetching
>outside of the cache in these cases uses extreme power.

We agree on that one. It hurts when you do that. Don't do that.
Conversely, if you don't lock - your cache is twice the size with all the power and performance benefits that gives you.
So on an apples-to-apples comparison, the configurations are:
1a. NKB Cache + NKB Scratchpad
1b. NKB Cache*2 + NKB Scratchpad, or maybe
vs.
2a. NKB Cache + NKB Scratchpad when you need it, or
2b. NKB Cache*2

Questions you have to ask yourself:
Is the scratchpad addressed by virtual address?
If so, a fixed or configurable virtual address?
If so, is it possible to inhibit searching the cache in time to save power?
>
>I think locking is an "easy fix" to a problem that isn't of high
>importance, has other "more interesting, more portable" solutions, and
>will lock RISC-V into a legacy way of thinking.
>
>> To have a competitive commercial product, our customers use different
>> approaches for their designs. Not everyone wants to have both cache and TCM
>> on their chip.
>
>And not all vendors want to use a split icache/dcache. As an FPGA
>designer, unified caches are very attractive to us because our RAM
>blocks are naturally dual-ported anyways.

Whew - you're complaining about the high cost of power of something on the order of 256B, and yet you're willing to use dual-ported SRAMs???

> > In some use cases, the performance of locked code is more
>> important than the non-locked code. The cases vary a lot. Having a cache
>> locking instruction is a good tool for our customers to tune performance.
>
>This is fine if you want your customers to re-tune their code every
>time the cache structure changes. Or maybe you want to lock your
>customers into your particular implementation and cache structure,
>because their performance will be severely degrated on alternative
>systems?

That argument cuts both ways.
The only reason this would be an issue would be if cache sizes got smaller.
But that's no worse a problem than a scratchpad getting smaller - they'd have to re-tune in that case as well.
They may want to re-tune if the cache or scratchpad gets larger to take advantage of performance gains - no difference there, no scratchpad advantage there.

> > Sure, locking too much will definitely impact performance negatively. But
> > this is an advanced expert feature and users have to use it with care.
>
>Yes it is an expert feature. I'm not saying that you should not create
>cache locking operations for your own CPUs at Andes. I'm saying that,
>at this point, I think they are a bad idea to lock into a
>general-purpose ISA to be shared by a community of CPU implementors
>because it presupposes a certain way of doing things that not all
>vendors want to support. Sure, it's easy for other vendors to turn
>locking instructions into NOPs, but then those vendors need to create
>their own alternative to locking, and then Andes would need to turn
>those other instructions into NOPs.
>
>My final thought is that, unless everyone speaks up and says we need
>to do cache locking in a particular way, that it is a bad decision to
>do early at this point because it locks in too much of history into
>the ISA.

Cool - something else we might agree on.
Do we lock individual cache lines? Individual ways? Only one specific way?
I can't say any of the above is the "right" way to do it.

Now I'll have to read the VLS paper....

Jacob Bachmeyer

unread,
Jul 19, 2017, 7:37:40 PM7/19/17
to Allen J. Baum, Guy Lemieux, Michael Clark, Bruce Hoult, chuanhua.chang, RISC-V ISA Dev
Allen J. Baum wrote:
> At 6:48 PM -0700 7/18/17, Guy Lemieux wrote:
>
>> Let's do the math for a specific example...
>>
>> For a REGION that spans 1GB, how many clock cycles should take to
>> perform a cache operation on an 8kB cache with 32B cache lines?
>>
>> How many software loop iterations will be required?
>>
>> How quickly can hardware do it using a hardware iterating mechanism?
>>
>> .....
>>
>> Would you rather have the looping done in hardware or in software?
>>
>
> You need to look at the real world.
> How often will be performance of this be constrained by memory BW?
> How often will it be constained by TLB misses?
> How often do you actually need to invalidate a 1GB region?
> Multiply that all out.
> If that number doesn't make a 1% difference in system perofrmance, then adding a state machine is a mistake.
> State machines sound easy - but the corner cases (and there will be waaaay more than you expect, believe me (or not)) will kill youi.
>
> That's one reason I like the feedback if returning the next address after the one that was invalidated - I don't have to implement a state machien.
> Yes, a SW implementation using a loop can have corner cas, too - but I don't have to tape out another chip to do it.
>
>
> I think I missed something from way back: is this instruction limited to M-mode? M+S? M+S+U?
>

Most cache instructions are available to all modes in this proposal;
instruction cache pinning is limited to M-mode.


-- Jacob

Jacob Bachmeyer

unread,
Jul 19, 2017, 7:38:50 PM7/19/17
to Allen J. Baum, Michael Clark, Bruce Hoult, Guy Lemieux, chuanhua.chang, RISC-V ISA Dev
Allen J. Baum wrote:
> At 7:28 PM -0500 7/18/17, Jacob Bachmeyer wrote:
>
>> Michael Clark wrote:
>>
>>>> On 19 Jul 2017, at 10:34 AM, Bruce Hoult <br...@hoult.org <mailto:br...@hoult.org>> wrote:
>>>>
>>>> I'm not in favour of implicit looping solutions either i.e. "invalidate everything between lo and hi". It seems far better to me to have an instruction that only promises to invalidate (for example) *something* starting at lo, and hi, and returns how much work it did (e.g. an updated value for lo).
>>>>
>>> I think we are all in agreement in this respect after sorear first suggested it.
>>>
>
> I hadn't notice that he suggested it.
> It removes my objections
> Does the proposed instruction format have rs1, rs2 and rd available?
>

All proposed MISC-MEM/REGION instructions are R-type, so yes.


-- Jacob

Jacob Bachmeyer

unread,
Jul 19, 2017, 8:02:19 PM7/19/17
to Guy Lemieux, chuanhua.chang, RISC-V ISA Dev
Guy Lemieux wrote:
> On Wed, Jul 19, 2017 at 1:13 AM, chuanhua.chang
> <chuanhu...@gmail.com> wrote:
>
[...]
>> So if a programmer wants to control dcache
>> data, there is no need to disturb icache instruction fetching with
>> unnecessary accesses and consuming unnecessary power.
>>
>
> This is not a frequent operation, so power overhead of accessing the
> icache is minimal.
>
> There is no icache fetching, because the whole CPU is stalled.
>

Not necessarily the case, since cache operations can be asynchronous.

>> Also, even in a
>> cache-coherent system, most implementations have a non-coherent icache. So,
>> to make self-modifying code work, we need to invalidate icache, not dcache.
>>
>
> Self-modifying code usually writes to instruction region using a store
> instruction, which sends the instructions out to main memory through
> the dcache. The instruction to make this work this is FENCE.I, not
> INVAL.D or INVAL.I. We don't need INVAL.I to support self-modifying
> code -- it won't flush instructions already fetched in the pipeline,
> for example, but FENCE.I will do that.
>

For this reason, I do not currently propose an I-cache invalidate
instruction. In what situation do INVAL.I and FENCE.I produce different
software-visible results, ignoring timing?

>> I can see the disadvantage of exposing cacheline size to a programmer. But I
>> cannot see any disadvantage of exposing the icache and dcache distinction to
>> a programmer.
>>
>
> I'm not sure it's a "disadvantage" to expose, but it does go against
> the RISC-V philosophy of trying to choose an ISA that does not closely
> tie itself to a microarchitecture.
>

I disagree because I see the RISC-V memory model as already making that
distinction. The existence of FENCE.I indicates a memory model with
distinct data access and instruction fetch paths to memory.
As I understand it, RISC-V ties instruction fetch very firmly to data
access (instructions and data in the same address space) but allows the
instruction fetcher to see old memory contents until FENCE.I is
executed.  So hot-swapping game cartridges (which no cartridge-based
console I have seen actually supported) would require a single
CACHE.DISCARD for the region where the cartridge is mapped, followed by
FENCE.I to purge the instruction fetch buffers and I-cache.
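
Sketching that sequence with hypothetical helpers (cache_discard()
wrapping CACHE.DISCARD, and CART_BASE/CART_SIZE standing in for wherever
and however large the cartridge window happens to be; all three names
are illustrative only):

#include <stdint.h>

extern void cache_discard(void *lo, void *hi);   /* hypothetical CACHE.DISCARD wrapper */

#define CART_BASE ((uint8_t *)0x40000000u)       /* illustrative mapping only */
#define CART_SIZE (16u << 20)                    /* illustrative 16 MiB window */

void cartridge_swapped(void)
{
    cache_discard(CART_BASE, CART_BASE + CART_SIZE);  /* drop stale cartridge contents */
    __asm__ volatile ("fence.i" ::: "memory");        /* then purge fetch buffers / I-cache view */
}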

>> Thanks Allen to offer this answer:
>> “For a small code, a scratchpad can be a very significant amount of area. I
>> don't see why it should be slower; it basically saying that if there is an
>> eviction, choose someone else. That's off the critical path. Muxing in
>> scratchpad data may actually slow down access to the cache, however, since
>> you've just added a mux and a bunch of logic that has to turn off cache
>> accesses (and if you can't do that fast enough, you haven't saved any
>> power)”
>>
>
> I disagree with Allen almost completely.
>
> Locking individual cache lines seems to be a "free" way to get a
> scratchpad. There are other ways, such as splitting a cache in half,
> with half scratch and half cache. This is not the same as locking half
> the lines, because you actually modify the cache index function so
> that all of memory maps into half of the cache. Thus, you do not get
> the extreme negative performance of locking where certain regions of
> RAM simply have no place to fit in the cache because the line is
> locked.
>

This is part of the reason that pinning cachelines can fail --
implementations are expected to only permit pinning in cases where the
cache remains effective. In other words, with an N-way cache, at most
N-1 ways can contain pinned cachelines.

If certain regions of RAM would have no place in the cache afterwards,
the implementation is expected to refuse the request.
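
A toy model of that acceptance check (the per-set bookkeeping and the
set/way counts are illustrative; a real controller would track this in
whatever form its replacement logic already uses):

#include <stdbool.h>

#define NUM_SETS 64
#define NUM_WAYS 4

static unsigned pinned_ways[NUM_SETS];   /* ways already pinned in each set */

/* CACHE.PIN acceptance test for one set touched by the request: refuse
 * if granting it would leave no unpinned way, so the cache stays effective. */
static bool pin_ok(unsigned set, unsigned new_lines_in_set)
{
    return pinned_ways[set] + new_lines_in_set <= NUM_WAYS - 1;
}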

> The arguments about critical path muxing etc are constraints that chip
> designers and FPGA designers know how to deal with and is not a major
> concern.
>
> Locking cache lines costs an extreme amount of power:
>
> 1) every cache holds far more data bits in width than a scratchpad;
> often a cache access will fetch 256b of data and 30b of tag/state. on
> an icache, when executing sequentially, you save a bit because of row
> buffers. but on random data fetches, or jumpy instruction code, you
> burn extreme power. in contrast, a scratchpad can be made very power
> efficient and doesn't have all of the ported-ness issues of the cache.
> even still, you could segment a cache into half scratch / half cache.
>
> 2) when a cache line is locked, memory addresses that map to that line
> miss more often. in a direct-mapped cache, they always miss. fetching
> outside of the cache in these cases uses extreme power.
>

An implementation with direct-mapped caches is expected to
"non-implement" the pinning instructions, i.e., treat them as no-ops, as
the proposal already permits when the corresponding hardware is absent.

>> Sure, locking too much will definitely impact performance negatively. But
>> this is an advanced expert feature and users have to use it with care.
>>
>
> Yes it is an expert feature. I'm not saying that you should not create
> cache locking operations for your own CPUs at Andes. I'm saying that,
> at this point, I think they are a bad idea to lock into a
> general-purpose ISA to be shared by a community of CPU implementors
> because it presupposes a certain way of doing things that not all
> vendors want to support. Sure, it's easy for other vendors to turn
> locking instructions into NOPs, but then those vendors need to create
> their own alternative to locking, and then Andes would need to turn
> those other instructions into NOPs.
>
> My final thought is that, unless everyone speaks up and says we need
> to do cache locking in a particular way, it is a bad decision to make
> this early, because it locks too much history into the ISA.
>

The proposed cache pinning operations are intended to be a generic
interface to permit portions of a multi-way or fully-associative cache
to be used temporarily as a scratchpad. I expect that this will be
useful for a variety of systems. An implementation is permitted to
"pin" more data than requested, so your suggestion of splitting the
cache into half-scratchpad and half-cache would be a valid
implementation of CACHE.PIN (apply split, copy main memory to
scratchpad, map scratchpad) and CACHE.UNPIN (flush scratchpad to main
memory, unmap scratchpad, reunify cache).
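
Purely as a sketch of that equivalence -- the helper functions, the use of
memcpy, and the assumption that the pinned region fits in the half-cache
scratchpad are all inventions of this illustration, not part of the
proposal -- a half-split implementation might behave like this:

#include <stdint.h>
#include <string.h>

/* Hypothetical cache-controller hooks; a real implementation would drive
 * hardware state machines rather than call C functions. */
extern void cache_split_half(void);              /* halve the index function    */
extern void cache_reunify(void);                 /* restore full-cache indexing */
extern void *scratchpad_aperture(void);          /* direct view of the scratchpad */
extern void scratchpad_overlay(uintptr_t pa, size_t len);   /* map over main memory */
extern void scratchpad_unoverlay(uintptr_t pa, size_t len); /* remove the overlay   */

/* CACHE.PIN over [base, bound): apply the split, copy main memory into the
 * scratchpad, then map the scratchpad over the pinned region. */
void cache_pin(uintptr_t base, uintptr_t bound)
{
    size_t len = bound - base;     /* assumed to fit the half-cache scratchpad */
    cache_split_half();
    memcpy(scratchpad_aperture(), (void *)base, len);
    scratchpad_overlay(base, len);
}

/* CACHE.UNPIN: flush the scratchpad back to main memory, unmap it, and
 * reunify the cache.  The overlay is removed before the write-back here so
 * that the stores actually reach main memory rather than the scratchpad. */
void cache_unpin(uintptr_t base, uintptr_t bound)
{
    size_t len = bound - base;
    scratchpad_unoverlay(base, len);
    memcpy((void *)base, scratchpad_aperture(), len);
    cache_reunify();
}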



-- Jacob

Jacob Bachmeyer

unread,
Jul 19, 2017, 8:05:39 PM7/19/17
to Andrew Waterman, Bruce Hoult, Alex Elsayed, RISC-V ISA Dev
Andrew Waterman wrote:
> On Wed, Jul 19, 2017 at 1:16 PM, Bruce Hoult <br...@hoult.org> wrote:
>
>> What address does the code then have?
>>
>> The nice thing about icache preload (with or without pinning) is that the
>> code doesn't have to be specially linked (or made PIC) to run at one address
>> while it's physically present in the executable/flash at another address.
>>
>
> Of course, the code must be at a different address if it's in
> scratchpad vs. cacheable memory. But this is a mere matter of
> software :-)
>

I mentioned in another response the possibility of using CACHE.PIN and
CACHE.UNPIN to control dynamic reprovisioning of part of a cache as a
scratchpad. The major difference seems to be that using CACHE.PIN to
provision the scratchpad would require the ability to map the scratchpad
so that it overlays main memory. Is this feasible for Rocket?


-- Jacob

Andrew Waterman

unread,
Jul 19, 2017, 8:29:22 PM7/19/17
to Jacob Bachmeyer, Bruce Hoult, Alex Elsayed, RISC-V ISA Dev
For a variety of reasons, we found it undesirable to have the
scratchpads overlay main memory. (E.g., there might not be any main
memory; and we don't want to dynamically change which node is the
cache-coherence home for a given memory address when the
scratchpad is reconfigured. These concerns aren't fundamental or
insurmountable, but we didn't want to deal with them.)

Since we didn't view the reconfigurable instruction scratchpads as a
potentially standard ISA feature, we didn't extend the ISA to support
it; instead, the scratchpad configurations are MMIO devices. So, it's
different than, but not incompatible with, your and others' proposals.
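
For contrast, here is a sketch of what the MMIO-style approach looks like
from software; the register block, its address, and its fields are invented
for illustration and are not Rocket's actual interface:

#include <stdint.h>

/* Hypothetical scratchpad configuration block, memory-mapped at an address
 * that would come from the platform's device tree (value here is made up). */
#define SPAD_CFG_BASE   0x02010000UL

struct spad_cfg {
    volatile uint64_t base;     /* physical base address of the scratchpad */
    volatile uint64_t size;     /* size in bytes                           */
    volatile uint64_t enable;   /* nonzero: cache ways handed to the spad  */
};

/* Carve part of the cache out as a scratchpad by programming the device;
 * no new instructions are involved -- ordinary stores do the work. */
static void scratchpad_configure(uint64_t base, uint64_t size)
{
    struct spad_cfg *cfg = (struct spad_cfg *)SPAD_CFG_BASE;
    cfg->base   = base;
    cfg->size   = size;
    cfg->enable = 1;
}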

>
>
> -- Jacob

Allen J. Baum

unread,
Jul 19, 2017, 9:01:04 PM7/19/17
to jcb6...@gmail.com, Guy Lemieux, Michael Clark, Bruce Hoult, chuanhua.chang, RISC-V ISA Dev
At 6:37 PM -0500 7/19/17, Jacob Bachmeyer wrote:
>>Allen J. Baum wrote:
>>>....
>>
>>Does the proposed instruction format have rs1, rs2 and rd available?
>>
>
>All proposed MISC-MEM/REGION instructions are R-type, so yes.

Very good

>>I think I missed something from way back: is this instruction limited to M-mode? M+S? M+S+U?
>>
>
>Most cache instructions are available to all modes in this proposal; instruction cache pinning is limited to M-mode.

Good - works for me.