Thanks, Jacob, for shepherding the discussion on this topic and
generating this draft.
I'd like to propose some renaming and reorganizing, although mostly
keeping the same spirit, I hope.
In this message, I'll leave the FENCE and CACHE.PIN/UNPIN instructions
untouched. For the others, I first suggest a different set of names, as
follows:
MEM.PFn          ->  MEM.PREP.R     (reads)
MEM.PF.EXCL      ->  MEM.PREP.RW    (reads/writes)
MEM.PF.ONCE      ->  MEM.PREP.INCR  (reads at increasing addresses)
MEM.PF.TEXT      ->  MEM.PREP.I     (instruction execution)
MEM.REWRITE      ->  MEM.PREP.W     (writes) or
                     MEM.PREP.INCW  (writes at increasing addresses)
CACHE.WRITEBACK  ->  CACHE.CLEAN
CACHE.FLUSH      ->  CACHE.FLUSH    (name unchanged)
MEM.DISCARD      ->  CACHE.INV
The "prep" instructions (MEM.PREP.*) are essentially hints for what
the software expects to do, giving the hardware the option to prepare
accordingly---"prep" being short for _prepare_. Note that, when rd
isn't x0, these instructions aren't true RISC-V hints, because the
hardware at a minimum must still write rd, though it need not do
anything else. I'm currently proposing collapsing the four MEM.PFn
instructions down to one, although I'm open to further discussion about
that. I'm also proposing a major overhaul of MEM.REWRITE, reconceiving
it as two different MEM.PREP instructions that can optionally be used in
conjunction with CACHE.INV.
The explicit cache control instructions (CACHE.*) are fairly standard.
These cannot be trivialized in the same way as the first group, unless
the entire memory system is known to be coherent (including any device
DMA).
I've broken up Jacob's "data obsolescence" section that has MEM.REWRITE
and MEM.DISCARD together, with the consequence that some of the text
there would no longer be correct.
--------------------
Memory access hints
I'm proposing to collapse Jacob's four levels of MEM.PFn prefetch
instructions into a single MEM.PREP.R, because I don't see how software
will know which of the four levels to use in any given circumstance. If
a consensus could be developed for heuristics to guide this decision for
programmers, tools, and hardware, then I could perhaps see the value of
having multiple levels. In the absence of such guidance, I see at best
a chicken-and-egg problem between software and hardware, where nobody
agrees or understands exactly what the different levels should imply.
In my view, the Foundation shouldn't attempt to standardize multiple
prefetch levels unless it's prepared to better answer this question.
It's still always possible for individual implementations to have their
own custom instructions for different levels of prefetch, if they see
value in having their own answer to the question.
I propose some minor tweaks to how the affected region is specified. My
version is as follows: Assume A and B are the unsigned integer values
of operands rs1 and rs2. If A < B, then the region covers addresses A
through B - 1, inclusive. If B <= A, the region is empty. (However,
note that, because these MEM.PREP instructions act only as hints, an
implementation can adjust the bounds of the affected region to suit its
purposes without officially breaking any rules.)
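To make the bounds rule concrete, here is a minimal C sketch of the
intended interpretation of the two operands. The helper name and struct
are mine, purely for illustration:

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustration of the proposed region rule: A = rs1, B = rs2,
       both treated as unsigned integers.  If A < B, the region is
       addresses A through B - 1; if B <= A, the region is empty.   */
    typedef struct {
        uintptr_t start;   /* first address in the region           */
        uintptr_t end;     /* one past the last address (exclusive) */
        bool      empty;
    } region_t;

    static region_t decode_region(uintptr_t A, uintptr_t B)
    {
        region_t r = { A, B, (B <= A) };
        return r;
    }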
Except for MEM.PREP.INCR and MEM.PREP.INCW, I'm proposing to remove
the option to have anything other than x0 for rd. It's not clear to me
that we can realistically expect software to make use of the result that
Jacob defines, and removing it is a simplification. If everyone feels
I'm mistaken and the loss of this feature is unacceptable, it could be
restored.
If implemented at all, the MEM.PREP instructions do not trap for any
reason, so they cannot cause page faults or access faults. (This is no
different than what Jacob already had.)
Here I've attempted to summarize the intention proposed for each
instruction:
MEM.PREP.I
Software expects to execute many instructions in the region.
Hardware might respond by attempting to prefetch the code into the
instruction cache.
MEM.PREP.R
Software expects to read many bytes in the region, but not to
write many bytes in the region.
Hardware might respond by attempting to acquire a shared copy of
the data (prefetch).
MEM.PREP.RW
Software expects both to read many bytes and to write many bytes
in the region.
Hardware might respond by attempting to acquire a copy of the
data along with (temporary) exclusive rights to write the data.
For some implementations, the effect of MEM.PREP.RW might be no
different than MEM.PREP.R.
MEM.PREP.INCR
Software expects to read many bytes in the region, starting first
with lower addresses and progressing over time into increasingly
higher addresses, though not necessarily in strictly sequential
order. If software writes to the region, it expects usually to
read the same or nearby bytes before writing.
Hardware might respond by applying a mechanism for sequential
prefetch-ahead. For some implementations, this mechanism might
be ineffective unless the region is accessed solely by reads at
sequential, contiguous locations.
MEM.PREP.W
Software expects to write many bytes in the region. If software
reads from the region, it expects usually to first write those
same bytes before reading.
Hardware might respond by adjusting whether a write to a
previously missing cache line within the region will cause the
remainder of the line to be brought in from main memory.
MEM.PREP.INCW
Software expects to write many bytes in the region, starting first
with lower addresses and progressing over time into increasingly
higher addresses, though not necessarily in strictly sequential
order. If software reads from the region, it expects usually to
first write those same bytes before reading.
Hardware might respond by applying a mechanism for sequential
write-behind. For some implementations, this mechanism might
be ineffective unless the region is accessed solely by writes at
sequential, contiguous locations.
For the non-INC instructions (MEM.PREP.I, MEM.PREP.R, MEM.PREP.RW, and
MEM.PREP.W), if the size of the region specified is comparable to or
larger than the entire cache size at some level of the memory hierarchy,
the implementation would probably do best to ignore the prep instruction
for that cache, though it might still profit from applying the hint to
larger caches at lower levels of the memory hierarchy.
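To make the intended usage of these hints concrete, here is a small C
sketch of software issuing the read hint before scanning a buffer. The
wrapper name mem_prep_r is a placeholder of my own invention for
whatever toolchain interface eventually exposes MEM.PREP.R; its empty
body is itself a legal (minimal) implementation of the hint:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical stand-in for the proposed MEM.PREP.R instruction:
       rs1 = start address, rs2 = end address, rd = x0.  A real
       toolchain would emit the instruction; doing nothing at all is
       also a legal implementation, since it is only a hint.         */
    static inline void mem_prep_r(const void *start, const void *end)
    {
        (void)start;
        (void)end;
    }

    uint64_t sum_bytes(const uint8_t *buf, size_t n)
    {
        /* Hint: we expect to read, but not write, buf[0 .. n-1]. */
        mem_prep_r(buf, buf + n);

        uint64_t sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += buf[i];
        return sum;
    }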
In my proposal, MEM.PREP.INCR and MEM.PREP.INCW are unique in allowing
destination rd to be something other than x0. If a MEM.PREP.INCR/INCW
has a non-x0 destination, the implementation writes register rd with an
address X subject to these rules, where A and B are the values of rs1
and rs2 defined earlier: If B <= A (region is empty), then X = B.
Else, if A < B (region is not empty), the value X must be such that
A < X <= B. If the value X written to rd is less than B (which can
happen only if the region wasn't empty), then software is encouraged to
execute MEM.PREP.INCR/INCW again after it believes it is done accessing
the subregion between addresses A and X - 1 inclusive. For this new
MEM.PREP.INCR/INCW, the value of rs1 should be X and the value of rs2
should be B as before. The process may then repeat with the hardware
supplying a new X.
Software is not required to participate in this iterative sequence of
MEM.PREP.INCR/INCW instructions, as it can always simply give rd as x0.
However, read-ahead or write-behind might not be as effective without
this iteration.
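Here is a rough C sketch of the iterative pattern, assuming a
hypothetical wrapper mem_prep_incr that returns the address X written
to rd (the trivial "return B" body shown is itself a legal minimal
implementation):

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical stand-in for MEM.PREP.INCR with a non-x0 rd.
       For a non-empty region, hardware returns X with A < X <= B;
       for an empty region it returns B.  Returning B immediately
       is the minimal legal behavior.                               */
    static inline uintptr_t mem_prep_incr(uintptr_t A, uintptr_t B)
    {
        return B;
    }

    /* Read through [base, base + n) in increasing order, re-issuing
       the hint each time the previously hinted subregion is done.   */
    void scan_increasing(const uint8_t *base, size_t n,
                         void (*process)(const uint8_t *, size_t))
    {
        uintptr_t A = (uintptr_t)base;
        uintptr_t B = A + n;

        while (A < B) {
            uintptr_t X = mem_prep_incr(A, B);  /* hint for A .. B-1 */
            process((const uint8_t *)A, X - A); /* work on A .. X-1  */
            A = X;                              /* then continue     */
        }
    }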
The minimal hardware implementation of the non-INC instructions
(MEM.PREP.I, MEM.PREP.R, MEM.PREP.RW, and MEM.PREP.W) would be simply to
ignore the instructions as no-ops. For MEM.PREP.INCR and MEM.PREP.INCW,
the minimum is to copy rs2 (B) to rd (X) and do nothing else. Since
the non-INC instructions require rd to be x0, these minimal behaviors
can be combined, so that the only thing done for all valid MEM.PREP.*
instructions is to copy rs2 to rd (which may be x0).
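For clarity, that minimal behavior can be modeled in a few lines of C
(x[] standing in for the integer register file, with x0 hardwired to
zero; this is only a model of the architectural effect, not a suggested
implementation):

    #include <stdint.h>

    /* Minimal architectural effect of any valid MEM.PREP.* instruction:
       write rd with the value of rs2 (writes to x0 are discarded).
       For the non-INC forms rd is required to be x0, so this reduces
       to a no-op; for INCR/INCW it yields X = B.                      */
    static void exec_mem_prep_minimal(unsigned rd, unsigned rs2,
                                      uintptr_t x[32])
    {
        if (rd != 0)
            x[rd] = x[rs2];
    }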
--------------------
Explicit cache control
The three cache control instructions I propose are:
CACHE.CLEAN (was CACHE.WRITEBACK)
CACHE.FLUSH
CACHE.INV (was MEM.DISCARD)
I expect these will be familiar to anyone who has used similar
instructions on other processors, except possibly for the name "clean"
instead of "writeback" or "wb". I'm not proposing changing the
fundamental semantics of the instructions, although I think the
description of CACHE.INV can be simplified a bit.
It's important to remember when talking about caches that writebacks
of dirty data from the cache are allowed to occur automatically at
any time, for any reason, or even for no apparent reason whatsoever.
Likewise, a cache line that isn't dirty can be automatically invalidated
at any time, with or without any apparent reason. Therefore, our
description of CACHE.INV doesn't need to give the implementation
explicit license to flush whole cache lines to handle partial lines at
the start and end of the specified region, as it would already have the
authority to perform such flushes at will.
Here's how I might rewrite the description for CACHE.INV:
CACHE.INV [was MEM.DISCARD]
Comment:
If the region does not align with cache line boundaries and the
cache is incapable of invalidating or flushing only a partial cache
line, the implementation may need to flush the whole cache lines
overlapping the start and end of the region, including bytes next to
but outside the region.
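As an aside, software that wants to be sure an invalidation cannot
touch bytes it doesn't own might restrict its request to the
line-aligned interior of its buffer. A small C sketch of that
computation, assuming (purely for illustration) a 64-byte line size
known to software:

    #include <stdint.h>

    #define LINE 64u   /* assumed cache line size, illustration only */

    /* Compute the largest line-aligned subregion contained in [A, B).
       Invalidating only this interior avoids lines that also hold
       bytes outside the caller's buffer.                             */
    static void aligned_interior(uintptr_t A, uintptr_t B,
                                 uintptr_t *lo, uintptr_t *hi)
    {
        *lo = (A + LINE - 1) & ~(uintptr_t)(LINE - 1);  /* round A up   */
        *hi = B & ~(uintptr_t)(LINE - 1);               /* round B down */
        if (*hi < *lo)
            *hi = *lo;    /* no whole line fits; treat as empty region */
    }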
The region to be cleaned/flushed/invalidated is specified the same as
I wrote above for the MEM.PREP instructions: Assuming A and B are the
unsigned integer values of operands rs1 and rs2, then if A < B, the
region covers addresses A through B - 1, inclusive, and, conversely, if
B <= A, the region is empty. The hardware does not guarantee to perform
the operation on the entire region at once, but instead reports its
progress in destination rd. The value written to rd is an address X
conforming to these rules: If B <= A (region is empty), then X = B.
Else, if A < B (region is not empty), the value X must satisfy
A <= X <= B. If the result value X = B, then the operation is complete
for the entire region. If instead X < B (which can only happen if
the region wasn't empty), then software must execute the same CACHE
instruction again with rs1 set to X and with rs2 set to the same B as
before.
As long as no other memory instructions are executed between successive
CACHE instructions, an implementation must guarantee that the original cache
operation will complete in a finite number of iterations of this
algorithm. It is possible for an implementation to make progress on a
cache operation yet repeatedly return the original A as the result X,
until eventually returning B.
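In C-like terms, the software side of this protocol is a simple loop.
The wrapper name cache_inv below is hypothetical, standing in for the
proposed CACHE.INV instruction and returning the address X written to
rd; note that a real version of this loop would likely be written in
assembly, since the forward-progress guarantee above depends on no
other memory instructions being executed between iterations:

    #include <stdint.h>

    /* Hypothetical stand-in for CACHE.INV (the same pattern applies to
       CACHE.CLEAN and CACHE.FLUSH): given A = rs1 and B = rs2, perform
       some or all of the operation and return X with A <= X <= B.  On
       a fully coherent system, "return B" is a legal implementation.  */
    static inline uintptr_t cache_inv(uintptr_t A, uintptr_t B)
    {
        return B;
    }

    /* Invalidate all of [A, B), iterating until X reaches B. */
    static void cache_inv_region(uintptr_t A, uintptr_t B)
    {
        while (A < B)
            A = cache_inv(A, B);   /* resume from the reported progress */
    }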
A CACHE instruction may cause a page fault for any page overlapping the
specified region.
When an implementation's entire memory system is known to be coherent,
including any device DMA, then the CACHE instructions may be implemented
in a minimal way simply by copying operand rs2 to destination rd. On
the other hand, if the memory system is not entirely coherent, the CACHE
instructions cannot be implemented trivially, because their full effects
are needed for software to compensate for a lack of hardware-enforced
coherence.
--------------------
Concerning instructions for data obsolescence
Jacob defines two instructions for declaring data obsolescence:
MEM.DISCARD and MEM.REWRITE. I've renamed MEM.DISCARD to CACHE.INV,
and otherwise tweaked its behavior in only minor ways. Concerning
MEM.REWRITE, it's my belief that the proposed uses for this instruction
can each be satisfied by one of the following:
MEM.PREP.W
MEM.PREP.INCW
CACHE.INV followed by MEM.PREP.W
CACHE.INV followed by MEM.PREP.INCW
For instance, I believe Jacob's stack frame example could be rewritten
as:
A function prologue could use CACHE.INV and MEM.PREP.W to allocate
a stack frame, while a function epilogue could use CACHE.INV to
discard the frame's now-dead contents so that dirty lines need not be
written back.
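To illustrate only the shape of that sequence, here is a C sketch using
hypothetical wrappers for the proposed instructions. In reality a
compiler would emit these in the actual prologue and epilogue, and the
sketch deliberately glosses over the cache-line-alignment caveat
discussed under CACHE.INV:

    #include <stdint.h>

    /* Hypothetical stand-ins for the proposed instructions; the bodies
       shown (copy B / do nothing) are legal minimal implementations.   */
    static inline uintptr_t cache_inv(uintptr_t A, uintptr_t B) { return B; }
    static inline void mem_prep_w(uintptr_t A, uintptr_t B) { (void)A; (void)B; }

    #define FRAME_BYTES 256u

    void example_function(void)
    {
        uint8_t frame[FRAME_BYTES];        /* stand-in for a stack frame */
        uintptr_t lo = (uintptr_t)frame;
        uintptr_t hi = lo + sizeof frame;

        /* "Prologue": the old contents of this memory are obsolete. */
        for (uintptr_t a = lo; a < hi; )
            a = cache_inv(a, hi);          /* iterate until complete     */
        mem_prep_w(lo, hi);                /* hint: we will write it all */

        /* ... body of the function uses the frame ... */
        for (unsigned i = 0; i < FRAME_BYTES; i++)
            frame[i] = (uint8_t)i;

        /* "Epilogue": the frame is dead; avoid needless writebacks. */
        for (uintptr_t a = lo; a < hi; )
            a = cache_inv(a, hi);
    }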