Paul A. Clayton wrote:
> It may be useful to make a distinction between prefetch for write
> where reads are not expected but the region is not guaranteed to
> be overwritten. An implementation might support general avoidance
> of read-for-ownership (e.g., finer-grained validity indicators) but
> still benefit from a write prefetch to establish ownership.
>
In other words, something in-between MEM.PF.EXCL (which prefetches the
current data in main memory) and MEM.REWRITE (which destroys the current
data in main memory)?
>> MEM.PF.STREAM ("prefetch stream")
>> {opcode, funct3, funct7} = {$MISC-MEM, $REGION, 7'b0001110}
>> Initiate streaming prefetch of the region, expecting the prefetched
>> data to be used at most once and in sequential order, while minimizing
>> cache pollution. This operation may activate a prefetch unit and
>> prefetch the region incrementally if rd is x0. Software is expected to
>> access the region sequentially, starting at the base address.
>>
>
> It may be useful to include a stride for stream prefetching.
>
Could the stride be inferred from the subsequent access pattern? If
words at X, X+24, and X+48 are subsequently read, skipping the
intermediate locations, the prefetcher could infer a stride of 24 and
simply remember (1) the last location actually accessed and (2) the next
expected access location. The reason to remember the last actual access
is to still meet the "minimize cache pollution" goal when a stride is
mispredicted.
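A minimal sketch in C of the state such a prefetcher might keep (all names here are invented for illustration; nothing in this block is part of the proposal):

```c
#include <stdint.h>

/* The prefetcher remembers (1) the last location actually accessed and
 * (2) the next expected location.  On a mispredict, the stride is
 * re-derived from the last *actual* access, so a wrong guess does not
 * keep polluting the cache. */
struct stream_state {
    uintptr_t last_access;   /* last location actually accessed */
    uintptr_t next_expected; /* predicted next access */
    intptr_t  stride;        /* currently inferred stride */
};

static void observe_access(struct stream_state *s, uintptr_t addr)
{
    if (addr != s->next_expected) {
        /* mispredict (or first access): infer a new stride from the
         * last real access */
        s->stride = (intptr_t)(addr - s->last_access);
    }
    s->last_access = addr;
    s->next_expected = addr + s->stride;
}
```

With the accesses from the example (X, X+24, X+48), the second access re-derives a stride of 24 and the third confirms it.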
>> And two M-mode-only privileged instructions:
>
> Why M-mode-only?
>
The I-cache pins are M-mode-only because that is the only mode where a
context switch can be guaranteed not to occur. These were added to
allow using the I-cache as temporary program memory on implementations
that normally execute from flash but cannot read from flash while a
flash write is in progress. The issue was seen on one of the HiFive
boards that was also an M-mode-only implementation.
> TLB prefetching also seems worth considering.
>
Any suggestions?
I can see uses for U-mode pinning, but the problem goes back to the original motivation for I-cache pins: holding some (small) amount of code in the I-cache to handle a known period where the main program store is not accessible. Interrupts must be disabled for this to work and that means that I-cache pinning must be restricted to M-mode, since no other mode can truly disable interrupts. Frequent use of MEM.PF.TEXT can allow less-privileged modes to "quasi-pin" certain code but cannot provide the same guarantee.
The difference stems from what happens when a cache pin is broken. For a data cache pin, the relevant information is simply reloaded when needed and re-pinned on the next iteration of some loop. For an instruction cache pin, an interrupt may result in a branch to program text that is temporarily inaccessible until the interrupted pinned code completes, which will never happen because the interrupt occurred. The result is a temporal contradiction (the interrupt handler cannot be fetched until the pinned code completes; the pinned code cannot continue until the interrupt is handled) that either deadlocks the system (reads block) or crashes it (reads return garbage).
> TLB prefetching also seems worth considering.
> Any suggestions?
Definitely seems useful... just another flavor of MEM.PF, I think, but with no worries about getting ownership.
Perhaps MEM.PF.MAP?
TLB pinning could also be useful, IMO.
Cache pins provide an otherwise-unavailable ability to use (part of) the cache as a scratchpad. What new ability do we get from pinning TLB entries?
One thing still missing from RISC-V (unless I myself missed something along the way) is a streaming-store hint. Right now I believe the only way to avoid cache pollution is through PMAs, but that is not a very fine-grained tool (if it exists at all) in all systems.
For the case of a packed streaming-store (every octet will be overwritten), there is MEM.REWRITE, but that is also a "prefetch constant" and allocates cachelines. Could a combination of MEM.PF.STREAM and MEM.WRHINT address this generally? Or is a MEM.SPARSEWRITE a better option? How to (conceptually) distinguish MEM.SPARSEWRITE and MEM.WRHINT?
For that matter, is a general rule that "reads are prefetched while writes are hinted" a good dividing line?
Should MEM.PF.STREAM be more of a modifier to another prefetch or hint instead of its own prefetch instruction?
-- Jacob
Note that cache pins are broken upon context switch unless the cache is
ASID-partitioned -- each task must appear to have the entire cache
available. User tasks that use D-cache pins are expected to re-pin
their working buffer frequently. Of course, if the working buffer is
"marching" through the address space, the problem solves itself as the
scratchpad advances (unpin old, pin new).
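The marching pattern might look like the following in C. The mem_pin/mem_unpin names (shown as no-op stand-ins) and the window size are invented for illustration; real code would use whatever pin operations an implementation actually exposes:

```c
#include <stddef.h>

#define WINDOW 4096  /* assumed pinned-window size; illustrative only */

/* Invented stand-ins for the pin/unpin operations. */
static void mem_pin(void *p, size_t n)   { (void)p; (void)n; }
static void mem_unpin(void *p, size_t n) { (void)p; (void)n; }

/* March a pinned scratchpad window through a large buffer:
 * unpin the old window, pin the new one, as described above. */
static size_t march_through(unsigned char *buf, size_t len,
                            void (*work)(unsigned char *, size_t))
{
    size_t windows = 0;
    for (size_t off = 0; off < len; off += WINDOW) {
        size_t n = (len - off < WINDOW) ? len - off : WINDOW;
        mem_pin(buf + off, n);   /* pin the new window */
        work(buf + off, n);      /* use the pinned region as scratchpad */
        mem_unpin(buf + off, n); /* unpin before advancing */
        windows++;
    }
    return windows;
}
```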
...
"non-partitioned caches must be flushed on every context switch."
--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/5B32F83B.2090501%40gmail.com.
I'd like to go up a level and imagine that we've defined a function that
does a cacheop, say CACHE.FLUSH, over an address range. Given that
various things in OS-land deal with base/size combinations, I'll assume
that and suggest that we want to implement:
void flush_dcache_range(void *ptr, unsigned size);
I will make the following assumptions about the function:
* ptr==0 is allowed; the address range starts at zero
* size==0 always means do nothing
* the last byte of address space can be included: ptr+size
  may wrap to zero
* otherwise ptr+size is not allowed to wrap around, but this
  is not checked
* size is XLEN bits wide and so cannot quite represent the
  entire address space; at least one byte must be skipped
* very large sizes are not handled in a fully portable
  way. They may trap and flush the entire cache, for
  example. If they operate naively, they will be
  extremely slow.
I will also assume that CACHE.FLUSH ensures forward progress. That is,
as long as rs2 > rs1, rd > rs1 (except for wrap around to zero).
Below is the code I expect to see with and without two conditions. I
also want to show two possible code sequences for each case. The first
is a short code sequence that always works. The second is a bit longer
and also always works, but is optimized for the case where the
micro-architecture has CACHE.FLUSH affecting only a single line and/or
lacks branch prediction. The two conditions are:
1) When the range has completed, rd is one byte past the end of
the region requested (rather than returning one byte past
the end of the last cache line affected).
2) rs2 is exclusive (rather than inclusive) and rs2==0 means the
range ends with the last byte of memory.
With the two conditions, the code that might be expected is:
flush_dcache_range: // a0=ptr, a1=size
ADD t0, a0, a1
loop: CACHE.FLUSH a0, a0, t0
BNE a0, t0, loop
RET
If rs2==rs1 is a nop, then the following unrolled sequence is
near-optimal when CACHE.FLUSH affects only a single line and there is
no branch prediction.
flush_dcache_range: // a0=ptr, a1=size
ADD t0, a0, a1
loop: CACHE.FLUSH a0, a0, t0
CACHE.FLUSH a0, a0, t0
CACHE.FLUSH a0, a0, t0
CACHE.FLUSH a0, a0, t0
BNE a0, t0, loop
RET
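The loops above can be checked against a small C model of the assumed semantics. The 64-byte line size is an assumption, and the wrap-to-zero case is deliberately not modeled:

```c
#include <stdint.h>

#define LINE 64  /* assumed cache-line size; illustrative only */

/* Software model of CACHE.FLUSH under the two conditions above:
 * rs2 is exclusive, and on the final line rd is clamped to one byte
 * past the requested region rather than one past the last cache line
 * affected.  Forward progress holds: rd > rs1 whenever rs2 > rs1. */
static uintptr_t cache_flush_model(uintptr_t rs1, uintptr_t rs2)
{
    uintptr_t next = (rs1 / LINE + 1) * LINE; /* one past rs1's line */
    return next < rs2 ? next : rs2;           /* clamp to the range end */
}

/* The flush_dcache_range loop, transcribed from the assembly. */
static unsigned flush_iterations(uintptr_t ptr, uintptr_t size)
{
    uintptr_t a0 = ptr, t0 = ptr + size;      /* ADD t0, a0, a1 */
    unsigned iters = 0;
    while (a0 != t0) {                        /* BNE a0, t0, loop */
        a0 = cache_flush_model(a0, t0);       /* CACHE.FLUSH a0, a0, t0 */
        iters++;
    }
    return iters;
}
```

Note that size==0 falls out naturally: t0 equals a0 and the loop body never executes, matching the assumptions listed earlier.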
If I understand correctly, you're agreeing with the two concepts I'd like to see (rd is related to inputs not to cache lines and rs2 is exclusive) and suggesting a way to take away the ugly part I have because wrap-to-zero needs to be allowed. The suggestion is to use x0 to demarcate the wrap. I think everything else follows from that.
I certainly see the value of not
having wrap-to-zero. But it seems to me that having two
functions requires a caller to test the required endpoint
dynamically and call one function or the other. That makes the
ugliness show itself at a higher level.
Putting the ugliness at an intermediate level might mean one function call with an internal test and two loops using the instruction level semantics you've suggested for one that ends at the end of memory. But I would rather see the wrap-to-zero than two loops.
I would rather cover any ugliness at the lowest reasonable level. Thus one routine with begin and size (without built-in wrap involved with end). And instructions which allow a single loop.
Bill
--
/*
 * MM Cache Management
 * ===================
 *
 * The arch/arm64/mm/cache.S implements these methods.
 *
 * Start addresses are inclusive and end addresses are exclusive; start
 * addresses should be rounded down, end addresses up.
 *
 * See Documentation/cachetlb.txt for more information. Please note that
 * the implementation assumes non-aliasing VIPT D-cache and (aliasing)
 * VIPT I-cache.
 *
 * flush_cache_mm(mm)
 *
 *	Clean and invalidate all user space cache entries
 *	before a change of page tables.
 *
 * flush_icache_range(start, end)
 *
 *	Ensure coherency between the I-cache and the D-cache in the
 *	region described by start, end.
 *	- start - virtual start address
 *	- end - virtual end address
 *
 * invalidate_icache_range(start, end)
 *
 *	Invalidate the I-cache in the region described by start, end.
 *	- start - virtual start address
 *	- end - virtual end address
 *
 * __flush_cache_user_range(start, end)
 *
 *	Ensure coherency between the I-cache and the D-cache in the
 *	region described by start, end.
 *	- start - virtual start address
 *	- end - virtual end address
 *
 * __flush_dcache_area(kaddr, size)
 *
 *	Ensure that the data held in page is written back.
 *	- kaddr - page address
 *	- size - region size
 */
extern void flush_icache_range(unsigned long start, unsigned long end);
extern int invalidate_icache_range(unsigned long start, unsigned long end);
extern void __flush_dcache_area(void *addr, size_t len);
extern void __inval_dcache_area(void *addr, size_t len);
extern void __clean_dcache_area_poc(void *addr, size_t len);
extern void __clean_dcache_area_pop(void *addr, size_t len);
extern void __clean_dcache_area_pou(void *addr, size_t len);
extern long __flush_cache_user_range(unsigned long start, unsigned long end);
extern void sync_icache_aliases(void *kaddr, unsigned long len);
I haven't followed this thread in great detail, but it appears some people are trying to make the software as elegant as possible by saving a few instructions. The overriding goal should be to keep the hardware simple. Using (start, length) is a nice idea, but requires an extra adder in the hardware.

As for exclusive vs. inclusive, the biggest advantage of inclusive is that it can represent the entire address range in software without using extra address bits to wrap around. E.g., in a 64K address space, an inclusive range runs from 0x0000 to 0xffff, which makes intuitive sense from a hardware perspective. The exclusive range would be 0x0000 to 0x10000, which needs 17 bits and would wrap to 0x0000 in software. In hardware, we can always add an extra bit to the address calculations if it makes sense, but the software and ISA layers don't have access to one. The downfall of inclusive, of course, is that returning "one past" the last address affected would return 0x0000, the same as if nothing had been done. This only happens when the entire address range is specified.
I believe the recommendation from Jacob was to have software always split the full address range in half so there is no ambiguity in the result. This is an unfortunate but necessary compromise; as you can see, there are many potential variations, but each has a downfall one way or the other.

I'm a big advocate of keeping hardware simple. Yet there is a huge performance penalty when large address ranges are specified for cache ops if the ISA does just one cache line at a time and returns an incremented address. Hence, even in small processors, I advocate fully handling the address range in hardware. Only the hardware knows the precise cache structure, and it can keep things optimized by iterating through the cache lines only once (possibly handling multiple sets in parallel every cycle) while applying the full address range as a filter.
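The width problem in the 64K example can be seen directly with 16-bit arithmetic (an illustrative model, not proposed code):

```c
#include <stdint.h>

/* With 16-bit "registers", the inclusive end of the full 64K space
 * (0xffff) is representable, while the exclusive end (0x10000) needs a
 * 17th bit and wraps to 0x0000, indistinguishable from an empty range. */
static uint16_t exclusive_end(uint16_t base, uint32_t length)
{
    return (uint16_t)(base + length); /* truncates to 16 bits */
}
```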