> Consequently the order and design priorities have changed a bit. It
> now looks like this:
>
> 1) Create a prefetch buffer TileLink2 widget and do something
> equivalent to tagged next line prefetching.
> 2) Add a stream prefetcher.
> 3) Add BO, ASP, and VLDP to see if they can be made to work in an L1
> prefetch buffer.
> 4) Begin snooping the L1 data cache reads and make a stride
> prefetcher; try a PC-localized VLDP too.
> 5) Add ISB, Domino, and possibly TCP.
> 6) Explore various further refinements available in the literature.
>
> At some future time (i.e. not for the current project), we want to
> work out how to handle virtual address prefetches. (Possibly by
> multi-porting the TLB). That would enable using IMP to prefetch for
> short vector gather instructions in a loop. It would also allow us to
> revisit older irregular prefetching designs to see if they work well
> enough to be used instead of (more expensive) modern designs like
> ISB. Furthermore, once we add victim buffer support, it will be
> possible to implement SMS and compare it to the other options.
Another option, if you have a TLB hierarchy, would be for the prefetcher
to use idle cycles on an outer TLB. This assumes that the main
processor will have most of its TLB searches produce hits in an inner
TLB, leaving idle timeslots on outer TLBs that the prefetcher can use.
-- Jacob
Max Hayden Chiz wrote:
On Saturday, February 17, 2018 at 10:48:40 PM UTC-6, Jacob Bachmeyer wrote:
A simple stride-advance prefetcher does not need to care about cache hits, only about the stream of memory operations generated by the CPU. It knows that there was a prefetch hit for a pattern when it sees a memory access using an address that it had predicted and can then prefetch the next address in that sequence on the next opportunity.
The complexity of prefetchers that can be implemented inside a load/store unit is very limited, and I suspect that stride-advance-per-register (three memory accesses using x15 have hit A, A+9, A+18 -> prefetch A+27; fourth access using x15 hits A+27 -> prefetch A+36; etc.) is close to the reasonable upper end.

Tracking all 31 registers this way is probably too much, but tracking the N most-recently-used registers might be useful, and choosing an appropriate value for N is also an interesting research question. N should be at least two, to allow the prefetcher to handle memcpy(3), but a loop that iterates over more than two arrays would use more slots, as would a loop that performs other memory accesses while iterating over arrays. What access patterns will actual code match? This is a question that can be answered with ISA simulation and a modified spike, so it could be a bit of "low-hanging fruit" for a research group on a small budget.

Since this type of prefetch relies on the register number used in an instruction, it must be implemented inside the processor.
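To make the bookkeeping concrete, here is a minimal software sketch of such a per-register stride-advance table, e.g. as it might be prototyped in a modified spike. The slot count, confidence threshold, and observe() interface are illustrative assumptions, not an existing spike API:

#include <cstdint>
#include <cstddef>

struct StrideSlot {
    bool     valid = false;
    uint8_t  reg = 0;        // base register number (x1..x31)
    uint64_t last_addr = 0;  // most recent address seen via this register
    int64_t  stride = 0;     // last observed delta
    uint8_t  confirms = 0;   // consecutive accesses matching the stride
};

constexpr size_t N_SLOTS = 2;  // N >= 2 so a memcpy()-style loop fits
StrideSlot slots[N_SLOTS];     // slot 0 is most recently used

// Call on every load/store; returns a prefetch address, or 0 for none.
uint64_t observe(uint8_t base_reg, uint64_t addr) {
    size_t i = 0;
    while (i < N_SLOTS && !(slots[i].valid && slots[i].reg == base_reg))
        ++i;
    StrideSlot s = (i < N_SLOTS) ? slots[i]
                                 : StrideSlot{true, base_reg, addr, 0, 0};
    if (i < N_SLOTS) {
        int64_t delta = (int64_t)(addr - s.last_addr);
        if (delta != 0 && delta == s.stride) {
            if (s.confirms < 3) ++s.confirms;
        } else {
            s.stride = delta;   // new or changed pattern: start over
            s.confirms = 1;
        }
        s.last_addr = addr;
    }
    // Move to the MRU position, evicting the LRU slot if it was absent.
    if (i == N_SLOTS) i = N_SLOTS - 1;
    for (; i > 0; --i) slots[i] = slots[i - 1];
    slots[0] = s;
    // Two matching deltas (three accesses: A, A+d, A+2d) give a pattern.
    return (s.confirms >= 2) ? s.last_addr + s.stride : 0;
}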
The more complex prefetch algorithms you have mentioned will require their own prefetch units outside of the main pipeline.
So by having a prefetch buffer (or multiple levels of prefetch buffer), we can manage our own metadata and not have to worry about the complexities of the cache's control path timing. I initially didn't like this, but I think that Dr. Waterman may be on to something as there seem to be some promising advantages to this approach. For example, it makes us independent of the cache hierarchy, so we can have a per-core L2 prefetch buffer even on chips that only have an L1 and a chip-wide LLC.
Avoiding side-channels with a set-associative cache effectively forces the prefetcher to have its own cache that is only searched when the inner primary caches miss. (As many ways as there are "prefetch channels" must be allocated for prefetching in the cache to avoid interactions -- giving the prefetcher its own cache quickly becomes simpler.) While prefetching cannot be entirely independent of the cache hierarchy (after all, prefetch hits must be satisfied from the prefetch buffers and should be loaded into caches), keeping prefetch as separate from the main caches as possible may help with both overall complexity and side-channel prevention.
The key questions that determine the feasibility of this are:
(1: availability of data prefetch slots in the pipeline itself) What
fraction of dynamic RISC-V instructions in common workloads access
memory?
Put another way, the only contact with the processor's critical path should be to "tap" certain internal buses and feed extra copies to the prefetch units. The delay of these "bus taps" contributes to prefetch latency, but otherwise does not affect the processor.
Linear run-ahead prefetch can be implemented *in* a set-associative L1 I-cache: whenever a set matches a tag, probe the next set for the same tag and initiate fetch if this second probe misses. If the second probe wraps around to set 0, increment the tag for the second probe. This does not require a second CAM search port, since adjacent cachelines fall into adjacent sets and sets are assumed to be independent. Linear run-ahead instruction prefetch therefore requires *no* additional state and will hit whenever a basic block spans a cacheline boundary or a taken branch targets the next cacheline.

(Additional questions: (5a) How often do basic blocks in real programs span cacheline boundaries, for various cacheline sizes? (5b) (related to 2b) How often do branches target the next cacheline?)

Because linear run-ahead prefetch is independent of the actual instructions executed, it does not introduce a further side channel and need not be isolated from the cache proper. (Fetching from cacheline N loads N into the cache; also loading N+1 gives no additional information, since fetching from N+1 then causes N+2 to be loaded into the cache.)
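As a minimal sketch of the second-probe arithmetic (64-byte lines and 64 sets are illustrative parameters; the cache lookup itself is elided), the probe is just a set increment with a wrap-around carry into the tag:

#include <cstdint>

constexpr uint64_t LINE_BITS = 6;           // log2(64-byte cacheline)
constexpr uint64_t SET_BITS  = 6;           // log2(64 sets)
constexpr uint64_t N_SETS    = 1ull << SET_BITS;

struct Probe { uint64_t set, tag; };

// Given the address of a fetch that just hit, compute the second probe.
Probe next_line_probe(uint64_t addr) {
    uint64_t set = (addr >> LINE_BITS) & (N_SETS - 1);
    uint64_t tag = addr >> (LINE_BITS + SET_BITS);
    uint64_t next_set = (set + 1) & (N_SETS - 1);
    // Wrapping past the last set means the next line carries tag + 1.
    uint64_t next_tag = (next_set == 0) ? tag + 1 : tag;
    return {next_set, next_tag};  // if this probe misses, fetch the line
}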
The static branch prefetch I suggested observes the memory side of the L1 I-cache, but is otherwise independent of instruction fetch: whenever a cacheline is fetched into the L1 I-cache, scan it for aligned BRANCH and JAL opcodes and insert those target addresses into the prefetch queue.
This scan is performed on a copy of the data added to the L1 I-cache, so it does not delay the critical path towards instruction decode.
Static branch prefetch requires a prefetch queue and uses timeslots left idle when both instruction fetch itself and linear run-ahead prefetch hit in the L1 I-cache. The prefetch queue should also observe fetches from the L1 I-cache memory side and drop any pending prefetches for cachelines that are actually requested by the L1 I-cache. Since static branch prefetching assumes that every branch observed will be taken, its hit rate will probably be low, so its actual utility could be a good research topic.
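For illustration, a refill-side scanner along these lines might look as follows. The scan_line() interface and queue type are assumptions; the B-type and J-type immediate decodings follow the RISC-V base ISA. Per the above, a real queue would also drop entries for lines the I-cache later demands:

#include <cstdint>
#include <cstddef>
#include <vector>

static int64_t sext(uint64_t v, unsigned bits) {  // sign-extend low bits
    uint64_t m = 1ull << (bits - 1);
    return (int64_t)((v ^ m) - m);
}

static int64_t b_imm(uint32_t insn) {             // B-type immediate
    uint64_t imm = ((uint64_t)(insn >> 31) << 12)   // imm[12]
                 | (((insn >> 25) & 0x3f) << 5)     // imm[10:5]
                 | (((insn >> 8)  & 0x0f) << 1)     // imm[4:1]
                 | (((insn >> 7)  & 0x01) << 11);   // imm[11]
    return sext(imm, 13);
}

static int64_t j_imm(uint32_t insn) {             // J-type immediate
    uint64_t imm = ((uint64_t)(insn >> 31) << 20)   // imm[20]
                 | (((insn >> 21) & 0x3ff) << 1)    // imm[10:1]
                 | (((insn >> 20) & 0x001) << 11)   // imm[11]
                 | (((insn >> 12) & 0x0ff) << 12);  // imm[19:12]
    return sext(imm, 21);
}

// line_pc: address of the refilled line; words: its 32-bit words.
void scan_line(uint64_t line_pc, const uint32_t *words, size_t n,
               std::vector<uint64_t> &prefetch_queue) {
    for (size_t i = 0; i < n; ++i) {
        uint32_t insn = words[i];
        if ((insn & 3) != 3) continue;      // skip compressed encodings
        uint64_t pc = line_pc + 4 * i;
        switch (insn & 0x7f) {
        case 0x63: prefetch_queue.push_back(pc + b_imm(insn)); break; // BRANCH
        case 0x6f: prefetch_queue.push_back(pc + j_imm(insn)); break; // JAL
        }
    }
}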
On Sun, Feb 18, 2018 at 11:12 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> The key questions that determine the feasibility of this are:
>
> (1: availability of data prefetch slots in the pipeline itself) What
> fraction of dynamic RISC-V instructions in common workloads access
> memory?

I'm hoping that the infrastructure we set up for testing the data cache can get us this kind of information as a tangential benefit. (I'm also hoping it will give us a better handle on where the performance bottlenecks are in general. I'm planning on tackling I-cache prefetch after this, unless the performance data shows something else being substantially more urgent.)

Naively, this only works for Rocket and not for BOOM, because the SPEC workloads are about 30% memory instructions. However, I think they are very unevenly distributed, so it's not as if 2-way BOOM has only 9% of cycles free. Whether this uneven distribution makes things worse or better for our use case is hard to say without targeted research looking at the conditional probabilities of instruction sequences and their relation to cache misses.
Max Hayden Chiz wrote:
> Are you (Jacob) or anyone else lurking in this thread planning/wanting
> to contribute code or otherwise be involved in the implementation?
> Once we get a design and a work plan, I'm planning on taking further
> discussion off-list so that the list only gets hit with emails about
> stuff that we need further input on.
If you do go off-list, please include me; I may not be able to help much
with code, but am interested.
I also believe, for both philosophical
and practical reasons, that keeping the discussion on the hw-dev list
(or perhaps on a public rocket-dev list somewhere) will be beneficial to
the RISC-V project in the long term. (Having these discussions in
public list archives will be a good thing, in my view.)
Hi,

This is Bowen Huang from the Frontier System Division of ICT (Institute of Computing Technology, Chinese Academy of Sciences).

I have to say I'm extremely excited to see that the RISC-V folks have an ambitious plan to pursue advanced prefetcher design. The Frontier System Division of ICT is now working towards a multicore Rocket tapeout in a 40nm process. Moreover, we started an advanced LLC prefetcher research project with a top mobile SoC team last year, and we have already evaluated multiple prefetcher design choices for their next-gen smartphone. Therefore I believe we are fully capable of taking charge of this project, and our interests are perfectly aligned.

After scanning through the previous emails on this issue, I think the information listed below may be helpful:

0) We tested 23 benchmarks from SPECint 2006 and SPECfp 2006; 10 of the 23 are prefetching-sensitive (>10% speedup if every load request hits). The other 13 benchmarks are prefetcher-insensitive; even an ideal prefetcher would give no more than 5% speedup.
1) During our evaluation, we found that DPC2 (the contest most of these advanced prefetchers come out of) was using an inaccurate simulator. We roughly calibrated that simulator against a real chip emulator (ARM Cortex-A55, codenamed Ananke) and re-evaluated all those tests; performance drops 70%~90% compared with what is claimed in the papers.

The biggest factor is the presence of an L1 cache prefetcher, which filters out many load requests and renders even the most advanced L2 (or LLC) prefetcher far less effective. However, an L1 prefetcher also suffers from the limited capacity of the L1 MSHRs and read/prefetch queues; adding prefetch-on-prefetch support at the L2/LLC increases performance by ~15% on a 3-issue, 128-entry-ROB core. This indicates that our opportunity lies in L1 and L2 prefetcher co-design.
2) All of the L2 prefetchers generate a significant number of requests that are only hit by an L1 writeback, and a writeback hit does not contribute to performance. This indicates that prefetch requests should be filtered by the coherence directory, or by something else that can track the contents of the upper-level caches.
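A minimal sketch of that filtering idea, assuming a directory that tracks which line addresses are resident in the upper-level cache (the Directory interface here is an illustration, not an existing Rocket or TileLink structure):

#include <cstdint>
#include <unordered_set>

struct Directory {
    std::unordered_set<uint64_t> l1_resident;  // line addresses held in L1
    bool present_in_l1(uint64_t line) const {
        return l1_resident.count(line) != 0;
    }
};

// Returns true if the L2 prefetch is worth issuing.
bool filter_prefetch(const Directory &dir, uint64_t line_addr) {
    // A prefetch that would only be "hit" by an L1 writeback is useless:
    // the line is already present (possibly dirty) in L1, so skip it.
    return !dir.present_in_l1(line_addr);
}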
3) Jacob is right about extending TLB support to the L2. We found a perfect example: milc from SPECfp 2006, which suffers a huge number of cache misses. We can speed this benchmark up by 13.8% in the physical address space; if we can get information from the virtual address space, the ideal speedup is measured to be ~60%, because the majority of cache misses are consecutive in their virtual page numbers.
4) So far we have tested AMPM, VLDP, BOP, SPP, next-line, and stream prefetchers in our roughly calibrated simulator. On average, BOP and SPP are better than the others, but none of these advanced prefetchers performs consistently better than the rest. This indicates that we should introduce a set-dueling mechanism into the prefetcher design, invoking different types of prefetcher for different scenarios; we have shown that this works, with negligible storage overhead.
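For illustration, a minimal DIP-style set-dueling selector between two candidate prefetchers might look like the following; the sample granularity and counter width are assumptions:

#include <cstdint>

constexpr uint32_t SAMPLE_MASK = 0x3f;  // one leader set per 64 sets
int16_t psel = 0;                       // 10-bit saturating selector

enum class Owner { A, B, Follower };

// Statically assign a few "leader" sets to each candidate prefetcher.
Owner owner_of(uint32_t set) {
    if ((set & SAMPLE_MASK) == 0) return Owner::A;
    if ((set & SAMPLE_MASK) == 1) return Owner::B;
    return Owner::Follower;
}

// Called on each demand miss: a miss in a leader set votes against
// the prefetcher that owns it.
void on_miss(uint32_t set) {
    switch (owner_of(set)) {
    case Owner::A: if (psel <  511) ++psel; break;  // A missed -> favor B
    case Owner::B: if (psel > -512) --psel; break;  // B missed -> favor A
    default: break;
    }
}

// Follower sets (the vast majority) use whichever prefetcher is winning.
bool use_prefetcher_b() { return psel > 0; }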
My working email is huang...@ict.ac.cn. We have several PhD and master's students who can contribute code, both software and hardware.

Is there any other team that wants to work on this? And if we take charge of this project, who should we report to, and how often should we report progress?
Thanks so much.
Stay in touch.

Bowen Huang
Frontier System Division
Research Center for Advanced Computer Systems
Institute of Computing Technology
Chinese Academy of Sciences
Beijing, China 100190