Data Cache Prefetching for Rocket (and Boom)


Max Hayden Chiz

Jan 22, 2018, 5:57:33 PM
to RISC-V HW Dev, Linh Nguyen Van, Cuong Pham Quoc
Someone asked about possibly taking this over from me as their master's project so I'm going to post the info I've accumulated to this thread so that it's at least public.

The first thing to note is that this might be easier to do via the lowRISC project (they are an implementation of RISC-V): unlike Rocket/Boom, they have an L2 cache in-tree, and L2 prefetchers are both more important and easier to implement. OTOH, I don't know if they are amenable to this or what their plans are for integrating TileLink2 and the new L2 cache when it becomes available. And ultimately we want this in-tree. (Because the literature leads me to expect that the performance improvement for BOOM on memory-intensive workloads would approach 100%. It should be even higher with Rocket due to its in-order design.)

The second thing to note is that you are either going to need the budget to use Amazon's EC2 FPGA instances (hopefully via FireSim) or a lot of FPGA boards. You can't benchmark prefetchers with CoreMark. Ideally you want to run (all of) SPEC, CloudSuite, and TPC's DSS and OLTP benchmarks. *And* you'll want to do it with different cache parameters (sizes, number of ways, and various prefetcher configurations). You'll probably need to do a fractional factorial experiment design to pick the configurations to run and generate a response surface so that you don't have to run every possible combination. Also, this is a lot more benchmarking than has been done to date with our chips, so it may take some work to even get all of this working. (Hopefully you'll contribute whatever work you do back to the FireSim people to make it automatic for future researchers; this project would probably be a good test case for their tooling and infrastructure.)

A third note is that much of the literature assumes that the L2 cache is part of the core (and that there's an L3 cache that isn't). Consequently, many simulators and designs will do things like use the PC of the load instruction or its virtual address even for an L2 prefetcher. We aren't going to be supporting that. (Or at least I don't think this is a good fit for our L2 cache design.) So, for us, anything that uses the PC or virtual addresses operates at L1 by default. We could have an L1 prefetcher issue the prefetches in a way that they only load into the L2 cache, though. If that risks saturating the bus, an advanced implementation trick would be to create a separate prefetcher communication channel between the L1 and L2 caches, so that you can have a high volume of prefetch requests in flight without interfering with demand loads.

A fourth note is that cache replacement policies for the L2 cache ideally need to be made prefetch aware. A good starting point is RRIP with PACman (https://people.csail.mit.edu/emer/papers/2011.12.micro.pacman.pdf). This is reasonably simple. There are more sophisticated ideas in the literature, but that's a research project of its own.
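
To make that concrete, here is a rough behavioral sketch (Python, not RTL) of SRRIP with a PACman-style insertion policy; the class layout and constants are my own illustration, not the exact design from the paper.

# Behavioral sketch of SRRIP with a PACman-style prefetch tweak: prefetch
# fills are inserted with the most distant re-reference prediction value
# (RRPV), so useless prefetches are evicted before demand-fetched data.

class SRRIPSet:
    def __init__(self, num_ways, rrpv_bits=2):
        self.max_rrpv = (1 << rrpv_bits) - 1        # 3 for 2-bit RRIP
        self.lines = [None] * num_ways              # tags (None = invalid way)
        self.rrpv = [self.max_rrpv] * num_ways

    def _victim(self):
        # Pick a line predicted to be re-referenced furthest in the future.
        while True:
            for way, r in enumerate(self.rrpv):
                if self.lines[way] is None or r == self.max_rrpv:
                    return way
            self.rrpv = [r + 1 for r in self.rrpv]  # age everyone and retry

    def access(self, tag, is_prefetch=False):
        """Returns True on a hit. Demand hits promote the line; prefetch
        fills (and, per PACman, prefetch hits) keep a distant RRPV."""
        if tag in self.lines:
            way = self.lines.index(tag)
            if not is_prefetch:
                self.rrpv[way] = 0                  # expect near re-reference
            return True
        way = self._victim()
        self.lines[way] = tag
        # Demand fill: "long" re-reference (max-1). Prefetch fill: "distant" (max).
        self.rrpv[way] = self.max_rrpv if is_prefetch else self.max_rrpv - 1
        return False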

As for the prefetchers themselves, all of them work essentially the same way. They are all state machines operating on some SRAM tables, possibly with CAM lookup. And they all depend on adding one bit of state per cache block to flag whether it is a prefetched cache line.

My original preference was to add the prefetch bits to the data cache in much the same way as the code adds ECC bits. But you could just derive a "PrefetchingCache" from the NBDataCache class and add your code there. (And there are several other ways to do it, e.g. via a trait.) I've been trying for some time now to hash out exactly how this should happen with Henry Cook at SiFive (because he seems to have commit access and works on this part of the code) but I've never gotten a response. (Indeed, my question to the list about who is in charge of the cache design has gone unanswered.)

There's a decent, short literature review at the beginning of Akanksha Jain's 2016 thesis _Exploiting Long-Term Behavior for Improved Memory System Performance_. A longer literature review is _A Survey of Recent Prefetching Techniques for Processor Caches_ by Sparsh Mittal. An older literature review that may be helpful is Stefan Berg's _Cache Prefetching: Technical Report UW-CSE 02-02-04_. You'll also want to consult the two data prefetching competitions (https://www.jilp.org/vol13/index.html, https://www.jilp.org/dpc/online/DPC-1%20Program.htm, and http://comparch-conf.gatech.edu/dpc2/final_program.html).

The current "baseline" for a modern prefetcher design is Pierre Michaud's _Best-Offset Hardware Prefetching_. That design is our preliminary target and anything we make that deviates from this needs to have either simpler hardware or higher performance. (I've got the parts of his code that implement the prefetchers in that paper if you have an easier time following the code than reading the paper itself.)

The first thing you want to get working is basic tagged next-line prefetching. Basically, on a demand miss, you prefetch the next line. And whenever a cache block marked with the prefetch bit is accessed, you clear that bit and prefetch its next line. (This can be generalized to "next n lines".) This has all of the basic elements that other prefetchers rely on.
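
A minimal behavioral sketch of that logic (Python; the names are just for illustration, and clearing the bit on eviction is ignored):

# Tagged next-line prefetching: prefetch line X+1 on a demand miss to X, and
# on the first demand hit to a line we prefetched, clear its "tag" bit and
# prefetch its successor. The tagged set stands in for the per-line bit.

class TaggedNextLinePrefetcher:
    def __init__(self, degree=1):
        self.degree = degree            # generalizes to "next n lines"
        self.tagged = set()             # lines resident because we prefetched them

    def access(self, line_addr, hit):
        """Call with every demand access (line address, hit/miss).
        Returns the list of lines to prefetch."""
        if hit and line_addr not in self.tagged:
            return []                             # ordinary hit: do nothing
        self.tagged.discard(line_addr)            # miss, or first use of a prefetch
        lines = [line_addr + i for i in range(1, self.degree + 1)]
        self.tagged.update(lines)                 # mark the new prefetches
        return lines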

Once that is working, you'll want to try a stream prefetcher and a stride prefetcher. These are older designs that are important reference points and would be usable on a resource-constrained processor. For the L1 stride prefetcher, the variation on Chen and Baer's RPT that is reported in Michaud's paper is probably fine (I think the only difference from the original is how it handles the two-bit confidence counter). The most recent stream prefetcher is the one from the 1st DPC (https://www.jilp.org/vol13/v13paper4.pdf).
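
For reference, the general shape of a PC-indexed reference prediction table with a two-bit confidence counter looks roughly like this (Python sketch; the exact state-machine transitions in Michaud's variant differ a little, and the table management here is simplified):

# PC-indexed stride prefetcher (RPT-style): each entry holds the last address
# and stride seen by a load PC plus a saturating 2-bit confidence counter;
# prefetch once the stride has repeated enough times.

class StridePrefetcher:
    def __init__(self, table_size=64, threshold=2):
        self.table = {}                 # pc -> [last_addr, stride, confidence]
        self.table_size = table_size
        self.threshold = threshold      # confidence needed before prefetching

    def access(self, pc, addr):
        """Train on a (load PC, address) pair; return an address to prefetch
        or None."""
        entry = self.table.get(pc)
        if entry is None:
            if len(self.table) >= self.table_size:
                self.table.pop(next(iter(self.table)))    # crude replacement
            self.table[pc] = [addr, 0, 0]
            return None
        last_addr, stride, conf = entry
        new_stride = addr - last_addr
        if new_stride == stride and stride != 0:
            conf = min(conf + 1, 3)                       # saturate at 3 (2 bits)
        else:
            conf = max(conf - 1, 0)
            if conf == 0:
                stride = new_stride                       # retrain on the new stride
        self.table[pc] = [addr, stride, conf]
        return addr + stride if conf >= self.threshold else None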

After that, if you are working on the L2 cache, you want to implement Michaud's Best Offset prefetcher and compare it to Akanksha Jain's Aggregate Stride Prefetcher (from her thesis). Jain's is a more complicated design, but it should have better performance, and the two have never been directly compared. You also want to implement VLDP (https://www.cs.utah.edu/~rajeev/pubs/micro15m.pdf). There is a lot of back and forth in the literature about whether an offset prefetcher or one like VLDP is better, so your publication would want to tackle this stuff head-on. How does having an OoO core impact VLDP's accuracy? Some authors say that it will suffer since misses won't be in program order; others say that OoO reordering doesn't impact the miss stream that much. (But to date, no one has actually attached it to such a core to figure it out.) Does its strategy of multi-degree prefetching really cause problems like Michaud argues? What are the actual implementation costs of each option? Etc.

A further (optional) refinement of the above designs would be to have the L2 designs implemented in the L1 cache (still using its miss stream) but issue prefetches to the L2 alone. This allows you to use virtual addresses and increase the accuracy and perhaps the performance. (It also allows you to filter out "irregular" misses from the training and possibly improve accuracy more.) OTOH, how well this works in a multi-core configuration is anyone's guess.

As for the L1 cache, the general belief in the modern literature is that it's less important because an OoO core can generally hide the latency of an L1 miss. OTOH, for an in-order core like Rocket, improving the L1 prefetch performance is critical. Historically, people used PC-based stride prefetchers because their high accuracy limited cache pollution. But I think that with a modern, large 4-way cache, you could probably get by with a less accurate but more aggressive design. One possibility is the stream prefetcher I suggested above. Another possibility is a PC-localized version of VLDP. Several authors have even suggested that it might work *better* as an L1 design. Between the use of virtual addresses, PC-localization, and the better data available at L1, it *might* work. (Though you might have to disable or tone down some of its more aggressive tendencies, such as the offset prediction table and starting prefetching on the basis of a single stride; also, NB, as-is, offset prefetchers don't work at L1 due to lack of accuracy, so anything that makes them work at L1 would be a novel design.)

Whether or not something like VLDP can be made to work at L1, the most important refinement, and one that you should definitely add, is Xiangyao Yu's _IMP: Indirect Memory Prefetcher_. It is able to detect indirect array references (such as gather instructions from a vector unit) and greatly improves coverage.

If you are ambitious, you could try implementing tag-correlated prefetching (TCP) for the L2 (http://ieeexplore.ieee.org/document/1183549/) as well.

You could also look at incorporating some of the feedback ideas from _Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers_. (This used a normal stream prefetcher, but the idea of adjusting the parameters for the prefetcher dynamically is a general one.) There are also other feedback-type ideas, e.g. _Timing Local Streams: Improving Timeliness in Data Prefetching_, _Performance Oriented Prefetching Enhancements Using Commit Stalls_, _SPAC: A Synergistic Prefetcher Aggressiveness Controller for Multi-core Systems_, _Reducing DRAM Latencies with an Integrated Memory Hierarchy Design_, and _Memory Prefetching Using Adaptive Stream Detection_ among many others.
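
To give the flavor of the feedback idea (the thresholds, interval, and degree range below are made up for illustration, not the values from the paper):

# Feedback-style throttling: count issued, useful, and late prefetches over
# an interval and nudge the prefetch degree up or down accordingly.

class PrefetchThrottle:
    def __init__(self, interval=8192):
        self.interval = interval
        self.degree = 2                            # current aggressiveness
        self.issued = self.useful = self.late = 0

    def record(self, issued=0, useful=0, late=0):
        self.issued += issued
        self.useful += useful
        self.late += late
        if self.issued < self.interval:
            return
        accuracy = self.useful / max(self.issued, 1)
        lateness = self.late / max(self.useful, 1)
        if accuracy > 0.75 and lateness > 0.25:
            self.degree = min(self.degree + 1, 4)  # accurate but late: push further ahead
        elif accuracy < 0.40:
            self.degree = max(self.degree - 1, 0)  # wasting bandwidth: back off
        self.issued = self.useful = self.late = 0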

If you could get all of the above working and do a response surface analysis, that would be outstanding. If you want to implement more stuff, read on.

The most aggressive L1 design is SMS (http://www.jilp.org/vol13/v13paper8.pdf). A 32KB version is ridiculous, but 8KB is not impossible if the benefits over the other options are significant. At some point we'll want to implement this and see. In particular, if VLDP or some other "advanced" prefetcher works at L1, we want to compare it to SMS (which to date is the only advanced prefetcher with reported L1 results).

AMPM won the first DPC, but it's very complicated and I don't think it is worth implementing because it is L2-only and is dominated by the newer designs like Best Offset and VLDP. OTOH, if you get some surprising results that are out of line with what the literature reports, then revisiting it might make sense. (Similarly, the Global History Buffer dominated research a while back, but because using it requires a linked-list traversal, it's not really ideal for a hardware implementation.)

The main design refinements to try at this point are for prefetching irregular streams (mainly linked data structures). This is a very active area of research where the best results change annually.

There's some stuff in the older literature that attempts to generalize the RPT to handle linked structures (cited from the older literature review I linked above) and with some refinements (maybe 2-delta strides in the state machine and something like VLDP for the offsets) you could probably get a reliable state machine for this that didn't use much area. The problem with all of these (older) designs is that they require that the amount of work per loop iteration exceed the memory latency. (Say you are following a linked list: in these designs you can't prefetch the next node until the memory returns the current one. So the loop you are in will have to have enough work to cover that latency so that the prefetcher can "get ahead" of the code.) That doesn't mean you shouldn't implement them (and indeed these ideas haven't really been properly evaluated and could work well for embedded chips), but I wouldn't prioritize them. (Though a few of them are simple enough that it would be trivial to add them and test them.)

Instead I would look at irregular prefetchers that try to memorize the data structure and turn it into a linear access. The leading candidate here is Akanksha Jain's irregular stream buffer (her thesis and also a paper she published). That design has to live at the L1 in our chip because it requires the PC (even though she calls it an L2 prefetcher). It also requires around 8KB of state and some interfacing with the TLB. But it is *very* accurate and offers about twice the performance of TCP (NB: if you implement TCP at L1 like ISB, you could also localize it based on the PC just like ISB does. But I'm not sure how much increased accuracy and performance that buys you.)

A possible dark-horse contender worth examining further would be the (L1) Domino prefetcher (http://ieeexplore.ieee.org/document/7820158/ and http://cs.ipm.ac.ir/~plotfi/papers/domino_hpca18.pdf), which is in between the two extremes of ISB and TCP. In any event, the research question would be whether the benefits of these more complex systems justify their higher costs. (That said, both ISB and Domino are fairly involved designs, so this might be a separate master's project of its own.)

There are also further refinements in the literature that you can explore. For example the stream chaining idea from _Stream Chaining: Exploiting Multiple Levels of Correlation in Data Prefetching_ was presented as working with the GHB, but could we modify it to work with other data structures? The L1 prefetcher from the 1st DPC also used the GHB, but its ideas are generalizable. There are ideas that use machine learning techniques like _Semantic Locality and Context-based Prefetching Using Reinforcement Learning_.

I hope this helps and gives you some ideas. Let me know if you take this up and keep me posted on how it goes.

--Max Chiz
New Orleans, LA



Stefan O'Rear

Jan 22, 2018, 9:26:57 PM
to Max Hayden Chiz, RISC-V HW Dev, Linh Nguyen Van, Cuong Pham Quoc
On Mon, Jan 22, 2018 at 2:57 PM, Max Hayden Chiz <max....@gmail.com> wrote:
> My original preference was to add the prefetch bits to the data cache in
> much the same way as the code adds ECC bits. But you could just derive a
> "PrefetchingCache" from the NBDataCache class and add your code there. (And
> there are several other ways to do it. E.g. via a trait.) I've been trying
> for some time now to hash out exactly how this should happen with Henry Cook
> at SiFive (because he seems to have commit access and work on this part of
> the code) but I've never gotten a response. (Indeed my question to the list
> of who was in charge of the cache design has gone unanswered.)

If you contacted Henry more than a week ago and haven't gotten
anything I'd add Wesley Terpstra and maybe Andrew Waterman to your
contact list. I spoke with Wesley a bit at the fifth workshop about
deadlock problems with the old L2 cache (still present in lowRISC) and
he would definitely have been involved in the TL2 and coherency
aspects of the new L2. I think it exists in some capacity since
documentation for its control registers has been released by SiFive
but I don't know what the plans are for it.

-s

Max Hayden Chiz

Jan 25, 2018, 1:16:44 PM
to RISC-V HW Dev, max....@gmail.com, ngvan...@gmail.com, cuon...@gmail.com
Okay, I spoke to Andrew Waterman about this. He doesn't know when the L2 will be available, but we talked extensively about the L1 data cache and prefetching.

He doesn't want the prefetcher to make any internal changes to the cache. So we agreed that for an initial implementation, we'll use a prefetch buffer design. This has pros and cons. It wasn't my first choice of designs, but it should be more than adequate for Rocket and it allows us to explore using "L2" prefetchers without having an L2 cache. If they end up not working with a prefetch buffer for the L1 cache, the code won't go to waste since it can easily be used to make a prefetch buffer for the eventual L2. (Down the road, we may use this additional storage to enable a simpler L1 cache by using it as a victim buffer or a working set filter, but that's not something you need to worry about.)

So our initial prefetcher design will be a TileLink2 widget that sits between the L1 cache and the rest of memory. It will handle all of its own state internally. Instead of putting prefetches into the cache, they will sit in a prefetch buffer in the prefetch unit. Consequently, a prefetch "hit" will look to the cache like a fast cache miss. You'll need to deal with coherency issues, but Waterman says this should be fairly simple. "[I]t should suffice for the B-channel to snoop the prefetch buffer and invalidate any entries." Furthermore, if we want, we can have the prefetcher snoop the L1 data cache access stream (in addition to the miss stream) by taking the PC and effective address as an additional input. We can't, however, utilize the TLB to translate virtual addresses, so that rules out some designs and means that we either need to stop prefetching at a page boundary or assume that the next virtual page is also adjacent in physical memory (a common occurrence).
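
To make the intended behavior concrete, here is a behavioral model (plain Python, not Chisel or TileLink code) of the widget: a small fully associative buffer beside the L1, buffer hits served as fast misses, B-channel snoops invalidating matching entries, and prefetches dropped when they would cross a page boundary. The sizes and names are placeholders, and "memory" stands for any object with a read_line() method.

# Behavioral model of the prefetch-buffer widget described above.

LINE_BYTES = 64
PAGE_BYTES = 4096

class PrefetchBuffer:
    def __init__(self, entries=16):
        self.entries = entries
        self.buf = {}                    # line address -> line data

    def demand_miss(self, line_addr, memory):
        """The L1 missed on line_addr. Returns (data, was_buffer_hit)."""
        if line_addr in self.buf:
            # Looks like a fast miss to the cache; hand the line over to the L1.
            return self.buf.pop(line_addr), True
        return memory.read_line(line_addr), False

    def prefetch(self, trigger_line, target_line, memory):
        """Prefetch target_line unless it sits in a different physical page
        than the access that triggered it (we can't translate addresses)."""
        if (trigger_line * LINE_BYTES) // PAGE_BYTES != (target_line * LINE_BYTES) // PAGE_BYTES:
            return                                   # would cross a page boundary
        if target_line in self.buf:
            return
        if len(self.buf) >= self.entries:
            self.buf.pop(next(iter(self.buf)))       # simple FIFO-ish eviction
        self.buf[target_line] = memory.read_line(target_line)

    def snoop_invalidate(self, line_addr):
        """B-channel probe: drop any matching entry to stay coherent."""
        self.buf.pop(line_addr, None)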

Consequently the order and design priorities have changed a bit. It now looks like this:

1) Create a prefetch buffer TileLink2 widget and do something equivalent to tagged next line prefetching.
2) Add a stream prefetcher.
3) Add BO, ASP, and VLDP to see if they can be made to work in an L1 prefetch buffer.
4) Begin snooping the L1 data cache reads and make a stride prefetcher; try a PC-localized VLDP too.
5) Add ISB, Domino, and possibly TCP.
6) Explore various further refinements available in the literature.

At some future time (i.e. not for the current project), we want to work out how to handle virtual address prefetches. (Possibly by multi-porting the TLB). That would enable using IMP to prefetch for short vector gather instructions in a loop. It would also allow us to revisit older irregular prefetching designs to see if they work well enough to be used instead of (more expensive) modern designs like ISB. Furthermore, once we add victim buffer support, it will be possible to implement SMS and compare it to the other options.

As for the current project, #1 & #2 are preliminary requirements, but after that, #3, #4, and #5 are all independent. So you can do some subset based on what interests you and how much time you have. To deal with the cache coherency stuff, I *think* you will have to make the buffer fully associative. But if you come up with a creative storage structure that works for the various prefetchers, feel free to use it. You may want to look at Yoav Etsion's working set filter design. He uses a fully associative buffer for a different purpose, but speeds up access and saves power by having an 8-entry direct-mapped access table. (But I wouldn't worry about such optimizations in your initial design.)

The ideas in #3 are the most important for performance if they can be made to work in a prefetch buffer scheme. If they work, you may also want to explore having them work in cooperation with a stream or stride prefetcher. In the literature this improves performance, but that was for processors with both L1 and L2 caches each with separate prefetchers.

#4 reduces inaccurate prefetches vs #2-3, but it isn't clear whether we will benefit. Because of the prefetch buffer, we are much less worried about cache pollution. OTOH, the L1-to-memory bus only has limited capacity, so they may still have some impact.

One question is how well #4 works unmodified on BOOM. Ideally you'd modify the PC/effective address snooping (i.e. the training step) to happen at retirement instead of at access time so that it happens in-order. But whether this is even necessary is an open question. Also, due to the hardware cost of maintaining the PC in the load-store unit, the performance benefits would need to be pronounced for BOOM to use either of these.

#5 has a potentially big impact, especially since we are delaying implementing IMP. But I worry that TCP will be substantially less effective in this configuration. Also, as I said before, the other two are complex designs and so may be a master's project of their own. (And if you focus purely on this, you probably want to add some of the older designs from the literature, like STMS.)

If you are unsure what to do, I'd focus on #1-3. This approach will work with both Rocket and Boom with no additional work and so you'll have an easier time testing them on a wide variety of configurations. Also, there isn't much research on prefetching for smaller processors and microcontrollers that only have an L1 cache. And prefetch buffers haven't been studied as a design point in a long while. So your project would be unique in multiple ways. I'd omit #4 unless you turn out to have the time. And I'd leave #5 for some future project unless it really interests you. (And if you run into unexpected problems, you can always drop #3 or parts of it.)

I hope this helps. Let me know what you decide to do for your master's project.

Jacob Bachmeyer

Jan 25, 2018, 9:24:30 PM
to Max Hayden Chiz, RISC-V HW Dev, ngvan...@gmail.com, cuon...@gmail.com
I say stop prefetching at a page boundary. Unless you can be absolutely certain that spuriously prefetched data cannot leak from the prefetch buffer, assuming the next page is adjacent opens a possible window for Spectre attacks.

> Consequently the order and design priorities have changed a bit. It
> now looks like this:
>
> 1) Create a prefetch buffer TileLink2 widget and do something
> equivalent to tagged next line prefetching.
> 2) Add a stream prefetcher.
> 3) Add BO, ASP, and VLDP to see if they can be made to work in an L1
> prefetch buffer.
> 4) Begin snooping the L1 data cache reads and make a stride
> prefetcher; try a PC-localized VLDP too.
> 5) Add ISB, Domino, and possibly TCP.
> 6) Explore various further refinements available in the literature.
>
> At some future time (i.e. not for the current project), we want to
> work out how to handle virtual address prefetches. (Possibly by
> multi-porting the TLB). That would enable using IMP to prefetch for
> short vector gather instructions in a loop. It would also allow us to
> revisit older irregular prefetching designs to see if they work well
> enough to be used instead of (more expensive) modern designs like
> ISB. Furthermore, once we add victim buffer support, it will be
> possible to implement SMS and compare it to the other options.

Another option, if you have a TLB hierarchy, would be for the prefetcher
to use idle cycles on an outer TLB. This assumes that the main
processor will have most of its TLB searches produce hits in an inner
TLB, leaving idle timeslots on outer TLBs that the prefetcher can use.


-- Jacob

Andrew Waterman

Jan 25, 2018, 11:27:37 PM
to Jacob Bachmeyer, Max Hayden Chiz, RISC-V HW Dev, ngvan...@gmail.com, cuon...@gmail.com
Also, the outer TLB is likely off the critical path, making it simpler to share without affecting cycle time.  And systems with a unified L2 TLB already need to deal with arbitration between I and D.





Max Hayden Chiz

Jan 26, 2018, 2:42:58 AM
to RISC-V HW Dev, max....@gmail.com, ngvan...@gmail.com, cuon...@gmail.com, jcb6...@gmail.com
My thinking is that for an initial attempt we want to allow for as much flexibility as possible to explore the design space. If we get a ton of performance from crossing page boundaries, then we know that dealing with virtual addresses, or using large pages, or otherwise handling the issue "properly" is important.

I haven't thought deeply about how Meltdown and Spectre work in this context, but my assumption was that a production-ready prefetcher would have to be flushed on a context switch just like the branch predictors to avoid a side-channel attack. Once we get this thing working, maybe someone more expert than I can come along and come up with a better option. (For all I know, it could be that memory layout randomization is sufficient to avoid any real problems or that we can rely on whatever guarantees the other parts of the system provide.) But again, for an initial research attempt, I think it's important to first get the core functionality to the point where we can evaluate the design. Once it has basic functionality and we kind of know the direction we are headed, then we can focus on the refinements we need to make it production-ready.


> Consequently the order and design priorities have changed a bit. It
> now looks like this:
>
> 1) Create a prefetch buffer TileLink2 widget and do something
> equivalent to tagged next line prefetching.
> 2) Add a stream prefetcher.
> 3) Add BO, ASP, and VLDP to see if they can be made to work in an L1
> prefetch buffer.
> 4) Begin snooping the L1 data cache reads and make a stride
> prefetcher; try a PC-localized VLDP too.
> 5) Add ISB, Domino, and possibly TCP.
> 6) Explore various further refinements available in the literature.
>
> At some future time (i.e. not for the current project), we want to
> work out how to handle virtual address prefetches. (Possibly by
> multi-porting the TLB). That would enable using IMP to prefetch for
> short vector gather instructions in a loop. It would also allow us to
> revisit older irregular prefetching designs to see if they work well
> enough to be used instead of (more expensive) modern designs like
> ISB. Furthermore, once we add victim buffer support, it will be
> possible to implement SMS and compare it to the other options.

> Another option, if you have a TLB hierarchy, would be for the prefetcher
> to use idle cycles on an outer TLB.  This assumes that the main
> processor will have most of its TLB searches produce hits in an inner
> TLB, leaving idle timeslots on outer TLBs that the prefetcher can use.

This is actually a really good idea.
 



Max Hayden Chiz

Feb 16, 2018, 5:22:42 PM
to RISC-V HW Dev, Max Hayden Chiz, Linh Nguyen Van, Cuong Pham Quoc, bhiman...@gmail.com
Since multiple people are interested in working in this area, I'm posting a proposed work breakdown along with some updates in light of Jacob Bachmeyer's TLB idea and some discussions I've had with various prefetching researchers.

Because Boom and Rocket are chip generators instead of specific processors, the wide range of microarchitectural possibilities is likely to result in different prefetch configurations being needed for different situations. Therefore, the goal of the project is to create a decent library of prefetching algorithms that can be used individually or collectively in various chips made using the code.

I've organized this into "preliminary" work that needs to be done but probably isn't publishable on its own, and projects that could result in publication. My thinking is that whoever gets there first does the necessary preliminaries and that anyone could start them now.

For each of the others, #1 is the main project and the others are follow-ups. We could have up to 3 people (or groups) proceeding mostly in parallel.

If anyone has any suggested improvements or thinks that some of this needs to be shuffled around, please let me know. I'd like to get everyone wanting to work on this on the same page. If we agree on this breakdown, we can let whoever is ready to get started first start on the preliminaries while we hash out who is doing what for their master's or other work.

I. Preliminaries:

1) Create an L1 D$ (data cache) prefetch buffer TileLink2 widget and do something equivalent to tagged next line prefetching

2) Add a stream prefetcher

3a) Begin snooping the L1 D$ reads (and PC) and make a stride prefetcher that works on physical addresses
3b) Interface with the L2 TLB in the page walker to allow for using virtual addresses in the prefetcher

4a) Add in some "debugging" performance counters to track prefetcher accuracy, coverage, etc. (see the sketch after this list)
4b) (Prob. working with the FireSim people) automate the benchmarking of the prefetchers for SPECint and SPECfp
4c) Expand the benchmarking to cloudsuite, TPC's DSS and OLTP benchmarks, and parsec (in that order)
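
For 4a above, the counters and derived metrics I have in mind are the standard ones; a small sketch (the names are invented):

# Per-prefetcher bookkeeping: "useful" means a demand access hit a prefetched
# line (or buffer entry) before it was evicted.

class PrefetchStats:
    def __init__(self):
        self.prefetches_issued = 0
        self.prefetch_hits = 0          # demand hits on prefetched lines (useful)
        self.late_prefetches = 0        # demand arrived while the prefetch was in flight
        self.demand_misses = 0          # misses not covered by any prefetch

    def accuracy(self):
        # Fraction of issued prefetches that were ever used.
        return self.prefetch_hits / max(self.prefetches_issued, 1)

    def coverage(self):
        # Fraction of would-be misses that the prefetcher eliminated.
        return self.prefetch_hits / max(self.prefetch_hits + self.demand_misses, 1)

    def timeliness(self):
        # Fraction of useful prefetches that arrived before the demand access.
        return 1.0 - self.late_prefetches / max(self.prefetch_hits, 1)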

II. Core "modern" prefeching algorithms  (requires preliminaries 1, 2, 4a, & 4b)

1) Implement BO, aggregate stride prefetching, VLDP, and AMPM

2a) Examine PC-localized VLDP and, for the others, using the stride prefetcher's state counter in reverse to filter out unreliable PCs (also needs preliminary 3a)
2b) Examine using virtual addresses (also needs preliminary 3b) 

III. Heuristic-based data structure prefetching (requires all preliminaries):

1) Implement IMP and Liu's linked data structure prefetcher

2) See if using these to filter out prefetches from the algorithms in II is effective (requires project II)

IV. Correlation based prefetching (requires all preliminaries except 3b)

1) Implement TCP, ISB, and Domino (ISB requires some interface modifications to the L2 TLB)

2a) Implement SMS (this probably requires modifying the cache interface to be able to observe clean evictions)
2b) See if SMS works better using TLB residency instead of cache residency

3) See if ISB's data structure can be modified to accommodate Domino

----

Note on project II:

The main research point in this project is to see how effective these prefetchers are in a buffer configuration and to resolve the debate about the relative merits of VLDP vs offset prefetching.

The concern is that the modern prefetchers may require unreasonable amounts of buffer storage. There is also the concern that they may not be timely. If most of the prefetches are arriving late, putting them into the buffer doesn't save you much over just putting them in the L2 cache or an L2 prefetch buffer. To make them timely, you may have to pipeline the prefetcher so that the tables are accessed immediately when a load/store accesses the L1 D$, but the prefetch is only issued on an L1 miss (or a first hit to a prefetched block), not on an MSHR hit (or other hit-on-miss). Seeing what is or isn't necessary here is the other major goal.
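
Concretely, the trigger rule I have in mind is just this (Python sketch; the flag names are invented, and the caller is responsible for clearing the prefetch bit on that first hit):

def should_trigger(is_hit, hit_was_prefetched_block, mshr_hit):
    # Train the tables on every access, but only trigger a prefetch here.
    if mshr_hit:
        return False                      # line already in flight; do nothing
    if not is_hit:
        return True                       # primary demand miss
    return hit_was_prefetched_block       # first use of a prefetched block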

I've included AMPM because we'll need something like its data structure to track the contents of the cache for filtering redundant prefetches. It was also specifically designed for our prefetching situation, whereas the others weren't. Contra what I said initially, its performance is within the margin of error for the others. You probably want to include, at least as options, some of the refinements from the "slim-AMPM" design in the second DPC. And you probably want to specifically test a similar combined configuration using VLDP instead of the PC/DC prefetcher they originally used.

It may be possible to improve AMPM's results by making its access map structure more accurate. E.g. Monitoring TileLink's C channel and updating it on dirty evictions. (We could, if necessary, alter the cache interface to also give us updates on clean evictions as well if these are significant sources of errors.) Instead of flushing it completely periodically, we could tie it more closely to the TLB and track cache lines from mapped pages, flushing the map whenever a page got evicted from the TLB.

Note on project III:

Although most of the research for irregular prefetching is directed at correlated prefetchers, both of these designs have reported results comparable to the best correlation prefetcher at much less cost. If their performance generalizes to a wider array of benchmarks, using them (and allowing for the virtual address prefetching they require) should be more cost effective for most configurations. Additionally, prefetching virtual addresses may eliminate TLB misses and allow us to avoid making a separate TLB prefetcher.

The linked data structure prefetcher is documented in Gang Liu's "Semantics-Aware, Timely Prefetching of Linked Data Structure" and in his thesis _Accurate, Timely Data Prefetching for Regular Stream, Linked Data Structure and Correlated Miss Pattern_. His thesis is also the original source of our stream prefetcher's design, but the paper he published for the DPC that I cited earlier is better. (NB: his correlated miss pattern design requires compressed main memory and so isn't something we can use at this juncture despite being an interesting idea.)

Note on project IV:

The main research point here is to examine ISB vs Domino across all benchmarks. Domino has only been examined in the cloudsuite context and ISB only in the context of SPEC. They use fundamentally different approaches and their designers disagree about much. It's possible that in the context of cloudsuite, you'll need to delay ISB's prefetches due to the large number of instructions between PCs reoccurring. (But whether you need to do this is a point of contention.)

General notes:

What works better will depend on a host of factors: whether you are in a multicore configuration, the cache configuration, the buffer size, etc. Most studies only examine one configuration. We'd like to look at as wide a range of configurations as possible so that people know what to use in various configurations. (For example, offset prefetchers which do well at degree 1 will probably be better for multi-core configurations than multi-degree prefetchers like AMPM due to more efficient use of the memory bus.) This is why we want to automate the benchmarking as much as possible. Testing large numbers of configurations will become very tedious otherwise.

For now we are focused on data prefetching. Cloudsuite and the TPC benchmarks have lots of I$ misses as well. We could use a next-2-line prefetcher or a stream prefetcher to help the I$ somewhat, but generally dealing with this requires something elaborate like Boomerang (cited in graduate research thread). We should note this in our benchmarks and do the best we can for now. Down the line once we have I$ prefetching we can rerun the benchmarks that have this problem and see how that impacts the D$ prefetcher results.

Since we aren't tied to the cache, there's nothing that stops us from having a per-core L2 prefetcher that prefetches for both the I$ and D$ or an L2 prefetcher that works chip-wide. In general most of the research on L2 prefetchers assumes a per-core prefetcher that listens to the L1 D$ miss stream and fetches into the L2 cache. Once we have the algorithms in place, we can examine other configurations easily.

Once we've got a library of algorithms and an idea of what works better with various configurations, we can work on more marginal refinements like adding feedback, filtering redundant prefetches, and supporting short streams. But for now, the goal is to evaluate the prefetchers to better understand what they would benefit from in this direction. Hopefully our benchmarking will be good enough to point us in the direction of what specifically needs investigating further.

Finally, for BOOM, the literature thinks that it's important to "train" some of the prefetchers in program order (e.g. by using committed instructions). If someone wants to work on modifying the BOOM pipeline to allow for this, that's fine. But otherwise I'm leaving it under the "future refinements" category. If one of the prefetchers experiences a huge accuracy drop in moving from Rocket to BOOM we'll know this is probably why.

--Max


Max Hayden Chiz

Feb 17, 2018, 2:31:48 PM
to RISC-V HW Dev, Max Hayden Chiz, Linh Nguyen Van, Cuong Pham Quoc, kritik bhimani
P.S. I put AMPM in project II and SMS in project IV because they fit the prerequisites and the theme of those sections. But we could move them to a separate project V on the basis that their data structures are similar to each other and fairly different from everything else. Thus they might be able to share code. This would also make it so that no project had more than 3 algorithms to implement. What do you guys think?

Jacob Bachmeyer

Feb 17, 2018, 11:48:40 PM
to Max Hayden Chiz, RISC-V HW Dev, Linh Nguyen Van, Cuong Pham Quoc, kritik bhimani
Max Hayden Chiz wrote:
> P.S. I put AMPM in project II and SMS in project IV because they fit
> the prerequisites and the theme of those sections. But we could move
> them to a separate project V on the basis that their data structures
> are similar to each other and fairly different from everything else.
> Thus they might be able to share code. This would also make it so that
> no project had more than 3 algorithms to implement. What do you guys
> think?
>
> On Fri, Feb 16, 2018 at 4:22 PM, Max Hayden Chiz <max....@gmail.com> wrote:
>
> Since multiple people are interested in working in this area, I'm
> posting a proposed work breakdown along with some updates in light
> of Jacob Bachmeyer's TLB idea and some discussions I've had with
> various prefetching researchers.
>

I have since thought about this a bit more and come to the conclusion
that further study could be helpful: since RISC-V is a load/store
architecture, most instructions do not access memory. Exactly what
fraction of the dynamic instructions access memory for various workloads
is an interesting question.


An in-order implementation can take advantage of RISC-V's load/store
design to perform data prefetches on pipeline timeslots that carry
non-memory instructions. This would allow a data prefetch unit
integrated into the main load/store unit to take full advantage of the
innermost TLBs and (if some fraction of cachelines is reserved for
prefetch use to avoid side-channels) the L1 D-cache. This is probably
best for simple stride-advance prefetchers that only fire if they
recognize a recent (virtual) memory-access pattern and prefetch at most
one predicted access for each pattern recognized. More complex data
prefetchers can be located outside the main processor and use idle
timeslots on the outer caches and TLBs, bringing data nearer, but not
into, the L1 D-cache. Since outer caches are usually (much) larger, the
cost of a useless prefetch (chance of evicting a useful cacheline) can
be reduced if more-aggressive/less-accurate prefetch algorithms feed
outer caches. This also allows more-complex prefetch units to be more
loosely-coupled to the processor core, another advantage.


Instruction prefetch falls into two main types: the linear run-ahead
fetch that predicts that the next cacheline after (pc) will be requested
for execute (and, being independent of the instruction stream, is safe
with respect to side channels) and a branch-prediction fetch that
attempts to bring branch targets near or into the I-cache. Note that
prediction prefetch does not have to actually use branch prediction
state if sufficient instruction fetch bandwidth is available; an
unconditional prefetch of possible branch targets into outer cache
levels may have benefits including the prevention of branch-prediction
side channels. In this model, the instruction fetch unit always
predicts linear advance and fetches entire cachelines into L1 I-cache.
The instruction decoder passes possible branch targets to a branch
prefetch queue which feeds a branch prefetcher that brings possible
branch targets into the L2 cache, using bus timeslots left idle due to
L1 I-cache and ITLB hits. The key insight here, as with my earlier TLB
suggestion, is that cache hits at an inner level leave idle timeslots at
an outer level that a prefetcher can use.

This works for conditional branches (whose targets are unconditionally
prefetched) and direct jumps, but leaves register-indirect jumps
unprefetched. Prefetching these is difficult, since the register value
might not be available immediately. This, however, is a general
indirect branch prediction problem. An in-order implementation might be
able to bypass a register number directly to writeback such that a write
to that register will be fed to the branch target queue with an offset,
but it is unlikely that any prefetch driven from a value already in the
pipeline will complete before the jump is reached, so this can only save
a few cycles of latency at most.

The key requirement for this to work is that instruction cachelines must
be wide enough for prefetches to complete before the next instruction
cacheline is needed. Similarly, the density of cross-cacheline branches
must be limited such that the branch prefetcher can actually keep up and
does not attempt to fill the entire L2 cache with speculative branch
prefetches. Since JAL and BRANCH are specific major opcodes and
baseline RVI instructions must be aligned on 32-bit boundaries, the L1
I-cache could search for direct branch targets while loading a
cacheline. RVC complicates this significantly, but simply omitting
prefetch for RVC branches and misaligned RVI branches is one way to
ignore the problem for a first implementation. If the L1 I-cache
locates possible branch targets, then the instruction decoder need not
do anything special and prefetching can start well before a branch
actually enters the pipeline. To prevent the prefetch process from
"running away" and evicting actually-useful cachelines, instruction
cachelines prefetched as possible branch targets can be placed into an
outer cache instead of the L1 I-cache itself, such that only cachelines
actually fetched for execute (and linear run-ahead) drive branch
prefetching.


The key questions that determine the feasibility of this are:

(1: availability of data prefetch slots in the pipeline itself) What fraction of dynamic RISC-V instructions in common workloads access memory?

(2: feasibility of I-cache-driven branch prefetch)
(a) How frequent are branches in static RISC-V code?
(b) What fraction of branches in static RISC-V code point to the same cacheline, or more precisely, how many "duplicate prefetches" would a naive implementation issue? (Obviously, real implementations would merge multiple prefetches of the same cacheline and avoid "prefetching" lines already in the cache. What fraction of "naive prefetches" would thus be quashed?)

(3: benefit from indirect branch prediction) How frequent are dynamic register-indirect jumps in RISC-V for common workloads?

(4: interaction with RVC) Does omitting prefetch for misaligned RVI branches (and possibly RVC branches not on a 32-bit boundary) lead to a useful option, permitting programmers or compilers to tacitly indicate "do not prefetch this branch; it is unlikely to be taken"? At what cost in RVC NOPs for alignment purposes?

I think that at least a minor paper can be written answering these
questions with data. Issue (4) will be more complex, requiring branch
profiling or other analysis to determine which branches should not be
prefetched. Only a few programs use GCC's __builtin_expect, although
much of the Linux kernel does use it, so there may be enough there for
an analysis.


-- Jacob

Max Hayden Chiz

Feb 18, 2018, 12:12:32 PM
to RISC-V HW Dev, max....@gmail.com, ngvan...@gmail.com, cuon...@gmail.com, bhiman...@gmail.com, jcb6...@gmail.com
I had initially planned on doing it this way (and in fact in Boom the skeleton code is even set up like this). But after discussion with Andrew Waterman, it was decided to go the prefetch buffer route instead and not fetch into the cache at all. The main issue is that the timing restrictions in the cache make it so that we can't modify the control path to allow for the interfacing that the prefetcher would need if it were part of the load/store unit. For example, accurate prefetchers require distinguishing between hits to regular cache lines and hits to prefetched ones. But adding those bits and the control logic to deal with this is currently off the table.

So by having a prefetch buffer (or multiple levels of prefetch buffer), we can manage our own metadata and not have to worry about the complexities of the cache's control path timing. I initially didn't like this, but I think that Dr. Waterman may be on to something, as there seem to be some promising advantages to this approach. For example, it makes us independent of the cache hierarchy, so we can have a per-core L2 prefetch buffer even on chips that only have an L1 and a chip-wide LLC.

I'll note this for the future when we get to instruction prefetching. The main issue with instruction prefetching in general is that we probably can't avoid touching the processor's critical path, so working out exactly how to integrate the code with Rocket/Boom will be a challenge.
 



Jacob Bachmeyer

Feb 19, 2018, 12:13:04 AM
to Max Hayden Chiz, RISC-V HW Dev, ngvan...@gmail.com, cuon...@gmail.com, bhiman...@gmail.com
A simple stride-advance prefetcher does not need to care about cache
hits, only about the stream of memory operations generated by the CPU.
It knows that there was a prefetch hit for a pattern when it sees a
memory access using an address that it had predicted and can then
prefetch the next address in that sequence on the next opportunity.

The complexity of prefetchers that can be implemented inside a
load/store unit is very limited, and I suspect that
stride-advance-per-register (three memory accesses using x15 have hit A,
A+9, A+18 -> prefetch A+27; fourth access using x15 hits A+27 ->
prefetch A+36; etc.) is close to the reasonable upper end. Tracking all
31 registers this way is probably too much, but tracking the N
most-recently-used registers might be useful and choosing an appropriate
value for N is also an interesting research question. N should be at
least two, to allow the prefetcher to handle memcpy(3), but a loop that
iterates over more than two arrays would use more slots, as would a loop
that performs other memory accesses while iterating over arrays. What
access patterns will actual code match? This is a question that can be
answered with ISA simulation and a modified spike, so this could be a
bit of "low-hanging fruit" for a research group on a small budget.
Since this type of prefetch relies on the register number used in an
instruction, it must be implemented inside the processor.
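
A behavioral sketch of what I mean (Python; N and the confirmation count are exactly the open parameters, so the values below are placeholders):

# Stride-advance-per-register: track last address, stride, and a match count
# for the N most recently used base registers; predict the next access once
# the stride repeats.

class RegisterStridePrefetcher:
    def __init__(self, tracked_regs=4, confirm=2):
        self.tracked_regs = tracked_regs     # "N" most recently used registers
        self.confirm = confirm               # repeats needed before prefetching
        self.entries = {}                    # reg -> [last_addr, stride, count]

    def access(self, reg, addr):
        """Call for each load/store with its base register number and the
        address it accessed; returns an address to prefetch or None."""
        entry = self.entries.pop(reg, None)  # pop + reinsert keeps LRU order
        if entry is None:
            if len(self.entries) >= self.tracked_regs:
                del self.entries[next(iter(self.entries))]   # drop the LRU register
            self.entries[reg] = [addr, 0, 0]
            return None
        last_addr, stride, count = entry
        new_stride = addr - last_addr
        if new_stride == stride and stride != 0:
            count += 1
        else:
            stride, count = new_stride, 1
        self.entries[reg] = [addr, stride, count]
        # Example from above: x15 hits A, A+9, A+18 -> prefetch A+27.
        return addr + stride if count >= self.confirm else None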

The more complex prefetch algorithms you have mentioned will require
their own prefetch units outside of the main pipeline.

> So by having a prefetch buffer (or multiple levels of prefetch
> buffer), we can manage our own metadata and not have to worry about
> the complexities of the cache's control path timing. I initially
> didn't like this, but I think that Dr. Waterman may be on to something
> as there seem to be some promising advantages to this approach. For
> example, it makes us independent of the cache hierarchy, so we can
> have a per-core L2 prefetch buffer even on chips that only have an L1
> and a chip-wide LLC.

Avoiding side-channels with a set-associative cache effectively forces
the prefetcher to have its own cache that is only searched when the
inner primary caches miss. (As many ways as there are "prefetch
channels" must be allocated for prefetching in the cache to avoid
interactions -- giving the prefetcher its own cache quickly becomes
simpler.) While prefetching cannot be entirely independent of the cache
hierarchy (after all, prefetch hits must be satisfied from the prefetch
buffers and should be loaded into caches), keeping prefetch as separate
from the main caches as possible may help with both overall complexity
and side channel prevention.
The concept that I have been working with for prefetching uses prefetch
units that each only observe a stream of actions from the processor and
generate additional fetch operations to fill idle bus timeslots. Put
another way, the only contact with the processor's critical path should
be to "tap" certain internal buses and feed extra copies to the prefetch
units. The delay of these "bus taps" contributes to prefetch latency,
but otherwise does not affect the processor.

Linear run-ahead prefetch can be implemented *in* a set-associative L1
I-cache: whenever a set matches a tag, probe the next set for the same
tag and initiate fetch if this second probe misses. If the second probe
wraps around to set 0, increment the tag for the second probe. This
does not require a second CAM search port, since adjacent cachelines
fall into adjacent sets and sets are assumed to be independent. Linear
run-ahead instruction prefetch therefore requires *no* additional state
and will hit whenever a basic block spans a cacheline boundary or a
taken branch targets the next cacheline. (Additional questions: (5a)
How often do basic blocks in real programs span cacheline boundaries for
various cacheline sizes? (5b) (related to 2b) How often do branches
target the next cacheline?) Because linear run-ahead prefetch is
independent of the actual instructions executed, it does not introduce a
further side channel and need not be isolated from the cache proper.
(Fetching from cacheline N loads N into the cache, also loading N+1
gives no additional information, since fetching from N+1 then causes N+2
to be loaded into the cache.)
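
A behavioral sketch of the probe (Python; the cache model is deliberately simplistic and the index/tag split assumes the usual line-granularity mapping):

# Linear run-ahead prefetch folded into a set-associative I-cache: on a fetch
# to (tag, set), probe the next set for the same tag (carrying into the tag on
# wrap-around); a probe miss triggers a fetch of the next cacheline.

class ICacheModel:
    def __init__(self, num_sets=64, num_ways=4):
        self.num_sets = num_sets
        self.num_ways = num_ways
        self.sets = [set() for _ in range(num_sets)]   # set index -> resident tags

    def lookup(self, tag, index):
        return tag in self.sets[index]

    def fill(self, tag, index):
        if len(self.sets[index]) >= self.num_ways:
            self.sets[index].pop()                     # simplistic eviction
        self.sets[index].add(tag)

    def fetch(self, line_addr):
        """Demand fetch of an instruction cacheline; returns the next line's
        address if the run-ahead probe misses, else None."""
        tag, index = divmod(line_addr, self.num_sets)
        if not self.lookup(tag, index):
            self.fill(tag, index)                      # ordinary demand fill
        # Run-ahead probe: adjacent lines map to adjacent sets.
        next_index = (index + 1) % self.num_sets
        next_tag = tag + 1 if next_index == 0 else tag
        if not self.lookup(next_tag, next_index):
            return line_addr + 1                       # prefetch the next cacheline
        return None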

The static branch prefetch I suggested observes the memory side of the
L1 I-cache, but is otherwise independent of instruction fetch: whenever
a cacheline is fetched into the L1 I-cache, scan it for aligned BRANCH
and JAL opcodes and insert those target addresses into the prefetch
queue. This scan is performed on a copy of the data added to the L1
I-cache, so it does not delay the critical path towards instruction
decode. Static branch prefetch requires a prefetch queue and uses
timeslots left idle when both instruction fetch itself and linear
run-ahead prefetch hit in the L1 I-cache. The prefetch queue should
also observe fetches from the L1 I-cache memory side and drop any
pending prefetches for cachelines that are actually requested by the L1
I-cache. Since static branch prefetching assumes that every branch
observed will be taken, its hit rate will probably be low, so its actual
utility could be a good research topic.
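
A sketch of the scan itself (Python; RVC and misaligned RVI branches are ignored as suggested above, and queue management, dedup against the cache, and prefetch issue are left out):

# Scan the aligned 32-bit words of a newly filled I-cache line for BRANCH and
# JAL instructions, decode their immediates, and return the target cachelines.

LINE_BYTES = 64

def sign_extend(value, bits):
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def branch_target_lines(line_base, words):
    """line_base: byte address of the cacheline; words: its 32-bit words."""
    targets = []
    for i, insn in enumerate(words):
        pc = line_base + 4 * i
        opcode = insn & 0x7F
        if opcode == 0x63:                             # BRANCH (B-type immediate)
            imm = (((insn >> 31) & 0x1) << 12) | (((insn >> 7) & 0x1) << 11) \
                | (((insn >> 25) & 0x3F) << 5) | (((insn >> 8) & 0xF) << 1)
            targets.append(pc + sign_extend(imm, 13))
        elif opcode == 0x6F:                           # JAL (J-type immediate)
            imm = (((insn >> 31) & 0x1) << 20) | (((insn >> 12) & 0xFF) << 12) \
                | (((insn >> 20) & 0x1) << 11) | (((insn >> 21) & 0x3FF) << 1)
            targets.append(pc + sign_extend(imm, 21))
    # Queue the distinct target cachelines, skipping targets in this same line.
    return {(t // LINE_BYTES) * LINE_BYTES for t in targets
            if t // LINE_BYTES != line_base // LINE_BYTES}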


-- Jacob

Max Hayden Chiz

Feb 19, 2018, 10:09:33 AM
to jcb6...@gmail.com, RISC-V HW Dev, Linh Nguyen Van, Cuong Pham Quoc, kritik bhimani
On Sun, Feb 18, 2018 at 11:12 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> Max Hayden Chiz wrote:
>> On Saturday, February 17, 2018 at 10:48:40 PM UTC-6, Jacob Bachmeyer wrote:
>
> A simple stride-advance prefetcher does not need to care about cache hits, only about the stream of memory operations generated by the CPU.  It knows that there was a prefetch hit for a pattern when it sees a memory access using an address that it had predicted and can then prefetch the next address in that sequence on the next opportunity.
>
> The complexity of prefetchers that can be implemented inside a load/store unit is very limited, and I suspect that stride-advance-per-register (three memory accesses using x15 have hit A, A+9, A+18 -> prefetch A+27; fourth access using x15 hits A+27 -> prefetch A+36; etc.) is close to the reasonable upper end.  Tracking all 31 registers this way is probably too much, but tracking the N most-recently-used registers might be useful and choosing an appropriate value for N is also an interesting research question.  N should be at least two, to allow the prefetcher to handle memcpy(3), but a loop that iterates over more than two arrays would use more slots, as would a loop that performs other memory accesses while iterating over arrays.  What access patterns will actual code match?  This is a question that can be answered with ISA simulation and a modified spike, so this could be a bit of "low-hanging fruit" for a research group on a small budget.  Since this type of prefetch relies on the register number used in an instruction, it must be implemented inside the processor.
>
> The more complex prefetch algorithms you have mentioned will require their own prefetch units outside of the main pipeline.

Okay. I misunderstood what you had in mind. This register-based idea is a good one and probably worth exploring. It's also worth noting that this method makes detecting major categories of indirect and linked data structure code far easier. Whether we have the bandwidth for using that information is another question though. (I wonder what the cost would be to pipe some limited version of that information, say a 1-bit indirect indicator, to the external prefetcher in the same way we are thinking of piping the PC of the load instruction.)
 


>> So by having a prefetch buffer (or multiple levels of prefetch buffer), we can manage our own metadata and not have to worry about the complexities of the cache's control path timing. I initially didn't like this, but I think that Dr. Waterman may be on to something as there seem to be some promising advantages to this approach. For example, it makes us independent of the cache hierarchy, so we can have a per-core L2 prefetch buffer even on chips that only have an L1 and a chip-wide LLC.
>
> Avoiding side-channels with a set-associative cache effectively forces the prefetcher to have its own cache that is only searched when the inner primary caches miss.  (As many ways as there are "prefetch channels" must be allocated for prefetching in the cache to avoid interactions -- giving the prefetcher its own cache quickly becomes simpler.)  While prefetching cannot be entirely independent of the cache hierarchy (after all, prefetch hits must be satisfied from the prefetch buffers and should be loaded into caches), keeping prefetch as separate from the main caches as possible may help with both overall complexity and side channel prevention.

Traditionally the concern has been chip area, wiring/layout complexity, and coherency difficulties. But I think you are probably right about benefits given modern design concerns. I had not considered the side-channel angle, but that's a very compelling reason in and of itself.

I'm still interested in how to deal with side-channel attacks on a prefetch buffer and whether we can find a better solution than just flushing on a context switch. (I suppose you could do what some branch predictors do and associate a process id with each prefetched line.)
 


> The key questions that determine the feasibility of this are:
> (1: availability of data prefetch slots in the pipeline itself) What fraction of dynamic RISC-V instructions in common workloads access memory?

I'm hoping that the infrastructure we set up for testing the data cache can get us this kind of information as a tangential benefit. (I'm also hoping it will give us a better handle on where the performance bottlenecks are in general. I'm planning on tackling Icache prefetch after this unless the performance data shows something else being substantially more urgent.)

Naively, this only works for Rocket and not for Boom b/c the SPEC workloads are about 30% memory instructions. However, I think they are very unevenly distributed. So it's not like 2-way BOOM only has 9% of cycles free. Whether this uneven distribution makes this worse or better for our usage case is hard to say without targeted research looking at the conditional probabilities of instruction sequences and their relations to cache misses.
I understand that, but I think you misunderstood me. That's my fault because I was very unclear.

For the workloads where Icache misses are a major problem, you also need to prefetch the BTB. Reasonable implementations (ones that do not require 100KB+ of additional state) load the BTB with pre-decode information when the lines are prefetched.

The BTB is on the critical path, but there are some straightforward ways around that design problem. Nonetheless, they still touch *code* that is currently part of our BTB. So the problem is redesigning the code so that you can have Rocket's current configuration and a prefetching one from the same code base.

Similarly even though your cache suggestion isn't on the critical path, the cache in general is. So again, it touches *code* that there's an aversion to changing.

IOW, it's not an issue with your idea or even a design issue at all. It's purely a software engineering problem.
 
> Put another way, the only contact with the processor's critical path should be to "tap" certain internal buses and feed extra copies to the prefetch units.  The delay of these "bus taps" contributes to prefetch latency, but otherwise does not affect the processor.

Linear run-ahead prefetch can be implemented *in* a set-associative L1 I-cache:  whenever a set matches a tag, probe the next set for the same tag and initiate fetch if this second probe misses.  If the second probe wraps around to set 0, increment the tag for the second probe.  This does not require a second CAM search port, since adjacent cachelines fall into adjacent sets and sets are assumed to be independent.  Linear run-ahead instruction prefetch therefore requires *no* additional state and will hit whenever a basic block spans a cacheline boundary or a taken branch targets the next cacheline.  (Additional questions:  (5a)  How often do basic blocks in real programs span cacheline boundaries for various cacheline sizes?

For processors similar to RISC-V the SPEC basic block is between 6 and 7 instructions. So generally speaking next-2-line prefetching is quite effective. That's why I suggested re-purposing a data cache version of that algorithm for the Icache as a temporary fix until we could come up with a dedicated solution.

That said, there's some skeleton code in the Icache that Kritik was looking at that could maybe be modified to do what you have in mind. However, I'm unsure about whether such a patch would be merged into the main code given the expressed concern with keeping the caches' control paths as simple as possible.
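
For reference, here is a rough behavioral model (Python) of the run-ahead probe rule you describe. The geometry, class name, and fill callback are all invented for illustration and don't correspond to Rocket's actual I-cache code:

    # Toy model of linear run-ahead prefetch in a set-associative I-cache:
    # on a hit, probe the *next* set for the same tag and prefetch on a probe miss.
    NUM_SETS = 64
    BLOCK_BYTES = 64

    class RunAheadICache:
        def __init__(self):
            # sets[i] holds the tags currently resident in set i
            # (associativity and replacement are ignored in this sketch)
            self.sets = [set() for _ in range(NUM_SETS)]

        def _index_tag(self, addr):
            block = addr // BLOCK_BYTES
            return block % NUM_SETS, block // NUM_SETS

        def fetch(self, addr, prefetch_fill):
            idx, tag = self._index_tag(addr)
            if tag not in self.sets[idx]:
                self.sets[idx].add(tag)          # demand fill on a miss
                return False
            # Hit: probe the next set for the same tag.  If the probe wraps
            # around to set 0, the next sequential block lives under tag+1.
            nidx = (idx + 1) % NUM_SETS
            ntag = tag + 1 if nidx == 0 else tag
            if ntag not in self.sets[nidx]:
                prefetch_fill((ntag * NUM_SETS + nidx) * BLOCK_BYTES)
            return True

A prefetch_fill callback that simply inserts the line into the target set is enough to play with hit rates on an address trace.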
 
  (5b)  (related to 2b)  How often do branches target the next cacheline?)  Because linear run-ahead prefetch is independent of the actual instructions executed, it does not introduce a further side channel and need not be isolated from the cache proper.  (Fetching from cacheline N loads N into the cache, also loading N+1 gives no additional information, since fetching from N+1 then causes N+2 to be loaded into the cache.)

I think that 5b depends on the type of code you are running. For SPEC-type workloads, next-2-line seems to be nearly perfect on similar processors. But for Cloudsuite, OLTP, and DSS workloads there are lots of branches to far-flung areas of the code. (And since these areas recur only after hundreds of thousands of instructions, you really don't want to cache them at all. This is part of the idea behind using the buffer as a working set filter that I mentioned.)
 

The static branch prefetch I suggested observes the memory side of the L1 I-cache, but is otherwise independent of instruction fetch:  whenever a cacheline is fetched into the L1 I-cache, scan it for aligned BRANCH and JAL opcodes and insert those target addresses into the prefetch queue.

This isn't that big of a deal design-wise, but as mentioned, it requires restructuring some of our existing code. So the process of getting it merged into the main source repository is more involved. We'll have to talk to whoever handles the commits for that part of the code and see.
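
Just to make the scan itself concrete for anyone following along, here's a rough Python sketch. It only handles aligned 32-bit RVI BRANCH and JAL encodings (RVC is ignored), and the function name and interface are made up for illustration:

    # Sketch: scan a fetched cacheline for aligned BRANCH/JAL instructions and
    # compute their targets so they can be pushed into a prefetch queue.
    import struct

    def _sign_extend(value, bits):
        mask = 1 << (bits - 1)
        return (value ^ mask) - mask

    def branch_targets(line_bytes, line_base_addr):
        targets = []
        for off in range(0, len(line_bytes), 4):        # aligned 32-bit slots only
            insn, = struct.unpack_from("<I", line_bytes, off)
            opcode = insn & 0x7F
            pc = line_base_addr + off
            if opcode == 0b1100011:                     # BRANCH (B-type immediate)
                imm = (((insn >> 31) & 0x1) << 12) | (((insn >> 7) & 0x1) << 11) \
                    | (((insn >> 25) & 0x3F) << 5) | (((insn >> 8) & 0xF) << 1)
                targets.append(pc + _sign_extend(imm, 13))
            elif opcode == 0b1101111:                   # JAL (J-type immediate)
                imm = (((insn >> 31) & 0x1) << 20) | (((insn >> 12) & 0xFF) << 12) \
                    | (((insn >> 20) & 0x1) << 11) | (((insn >> 21) & 0x3FF) << 1)
                targets.append(pc + _sign_extend(imm, 21))
        return targets        # candidate prefetch addresses for the queue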
 
This scan is performed on a copy of the data added to the L1 I-cache, so it does not delay the critical path towards instruction decode.

Right. I didn't mean to imply that it did. I just meant that there was a software engineering problem of how to get it integrated into the current code.

I'm confident it can be resolved for BOOM at least. The front-end it uses now could just be rewritten to always use prefetching and you'd probably end up using fewer pipeline stages and fewer resources as well.

But we'd like to keep Rocket and Boom drawing on the same "pieces" so that the code doesn't diverge too much. I'm currently thinking that we should have a front-end generator that can put the pieces together in various ways with some configuration flags, but we'll cross that bridge when we get there.
 
  Static branch prefetch requires a prefetch queue and uses timeslots left idle when both instruction fetch itself and linear run-ahead prefetch hit in the L1 I-cache.  The prefetch queue should also observe fetches from the L1 I-cache memory side and drop any pending prefetches for cachelines that are actually requested by the L1 I-cache.  Since static branch prefetching assumes that every branch observed will be taken, its hit rate will probably be low, so its actual utility could be a good research topic.

It's worth looking into and it would be straightforward to try since the code for your approach would be a special case of anything more elaborate. (Boomerang for example is similar to your idea but also uses the branch predictor to only prefetch for predicted taken branches.)

The worry would be that profile-based code layout optimizations generally try to string together lots of basic blocks with biased not-taken branches. So your trick to distinguish the two may become too expensive in practice. If that's a problem, since taken branches are the rare case, it could make more sense to only prefetch the *unaligned* branches, at least if we are willing to assume that people will use a profile-driven layout optimization for their high miss rate code. (Assuming that the profile driven optimizations even *work* on the code I'm concerned about. The research is not that convincing on this point.)

That said, this thread and the current project are about D-cache prefetching. We'll have to revisit this issue when we get to the I-cache. (I expect the I-cache will be next, so keep thinking on this idea.)

Max Hayden Chiz

unread,
Feb 19, 2018, 12:14:32 PM2/19/18
to jcb6...@gmail.com, RISC-V HW Dev, Linh Nguyen Van, Cuong Pham Quoc, kritik bhimani
Are you (Jacob) or anyone else lurking in this thread planning/wanting to contribute code or otherwise be involved in the implementation? Once we get a design and a work plan, I'm planning on taking further discussion off-list so that the list only gets hit with emails about stuff that we need further input on.

Also, I made an error in my previous email:

On Mon, Feb 19, 2018 at 9:09 AM, Max Hayden Chiz <max....@gmail.com> wrote:
On Sun, Feb 18, 2018 at 11:12 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:


    The key questions that determine the feasibility of this are:
    (1:  availability of data prefetch slots in the pipeline itself) What
    fraction of dynamic RISC-V instructions in common workloads access
    memory?

I'm hoping that the infrastructure we set up for testing the data cache can get us this kind of information as a tangential benefit. (I'm also hoping it will give us a better handle on where the performance bottlenecks are in general. I'm planning on tackling Icache prefetch after this unless the performance data shows something else being substantially more urgent.)

Naively, this only works for Rocket and not for Boom b/c the SPEC workloads are about 30% memory instructions. However, I think they are very unevenly distributed. So it's not like 2-way BOOM only has 9% of cycles free. Whether this uneven distribution makes this worse or better for our usage case is hard to say without targeted research looking at the conditional probabilities of instruction sequences and their relations to cache misses.

That should say that 2-way Boom only has ~50% cycles free (.7*.7 is not 9%). This might not be enough for many strided workloads (where the coverage exceeds 70%). But it might not be a problem in practice (you could use a larger prefetch distance instead of larger degree) or it could be a bigger deal than we think. Averages over the entire benchmark don't really tell us much about the behavior of the specific loops we are interested in. (It's also a worry that out of order handling of the instructions could mess up your heuristics. How big of an issue out of order is for such prefetchers is one of our research goals.)

The other question is whether this really helps us that much vs fetching into the prefetch buffer and using the L2 TLB (which I honestly didn't realize we had until it was pointed out). A 4KB page has 512 64-bit words. Even if the stride is so large that it only accesses 10% of them, you still only have to cross a page boundary (and thus touch the TLB) once every ~50 accesses. Even if it touches only 1% of words, the result is that you still have 5 accesses over which to deal with page boundary. So your approach would probably have to also improve timeliness or some other factor to be worth the added complexity. Fortunately, this should be easy enough to evaluate and test once we've got our preliminaries taken care of.
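
(A throwaway Python check of that arithmetic, assuming 4KB pages and 64-bit words:)

    PAGE_BYTES = 4096
    WORD_BYTES = 8                          # 64-bit words, 512 per page

    def accesses_per_page(stride_words):
        return (PAGE_BYTES // WORD_BYTES) // stride_words

    print(accesses_per_page(10))            # touch ~10% of words -> ~51 accesses per page
    print(accesses_per_page(100))           # touch ~1% of words  -> ~5 accesses per page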

The only prefetchers for which virtual address translation is potentially expensive are the linked and indirect ones. A naive implementation would access the TLB for every prefetch and a smart one may effectively require its own L1 TLB to be effective. That's part of what project III aims to figure out. If they do prove to be too expensive, maybe we can try a variation that is in-pipeline like you propose. (That said, even if they require a dedicated L1 TLB, they are probably still worthwhile given the performance ramifications and considering that the next-best option requires at least 8KB of on-chip state in the form of two 4KB fully associative memories.)

Jacob Bachmeyer

unread,
Feb 19, 2018, 10:48:26 PM2/19/18
to Max Hayden Chiz, RISC-V HW Dev, Linh Nguyen Van, Cuong Pham Quoc, kritik bhimani
As a simple example for array-stride prefetching, have a table of 30
rows, indexed by register number (x0 and the link register x1 are
ignored by the prefetch hardware) (indexing the table and covering all
registers allows the stride-prefetch state to use only SRAM; N MRU will
require CAM columns for register number; this trade-off differs between
FPGA and full-custom implementations) with three columns: MRA, D1, D2
("most recent address", "delta-1", "delta-2"). Each memory access
generates a {REG, ADDR} tuple. REG is used to select a row in the
table. If ADDR is equal to MRA, no further processing occurs and
nothing is predicted. If ADDR and MRA differ, the row is updated:
{MRA, D1, D2}' := {ADDR, ADDR - MRA, D1}. After the update, D1 and D2
are compared; if they are equal, an access to ADDR+k*D1 is predicted and
a prefetch initiated. The constant "k" is another parameter to adjust
depending on relative latencies in an implementation. If "k" is too
small, prefetch will not occur "far enough" ahead, but if "k" is too
large, prefetch will load data that is never used. The deltas may be
narrower than XLEN, if so, an out-of-range delta can be stored as zero,
and a prediction only made if the deltas match and are non-zero.

The idea is that very simple prefetching like this can provide a
baseline for comparison to more advanced algorithms. Also, array-stride
is very common in some workloads, so this could itself be useful. For a
prefetcher integrated into the load/store unit in an in-order pipeline,
bandwidth is simple: each instruction that does not access memory
provides a prefetch slot.
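
A rough Python model of the table just described, to make the update rule
explicit (the class name and default k are arbitrary, not a proposed
implementation):

    # Per-register stride table: rows of {MRA, D1, D2}, indexed by base register.
    class StrideTable:
        def __init__(self, k=4, num_regs=32):
            # one row per architectural register; x0 and x1 are ignored (30 rows)
            self.rows = {r: {"mra": None, "d1": 0, "d2": 0}
                         for r in range(2, num_regs)}
            self.k = k

        def observe(self, reg, addr):
            """Feed a {REG, ADDR} tuple; return a predicted prefetch address or None."""
            if reg not in self.rows:
                return None
            row = self.rows[reg]
            if row["mra"] is None or addr == row["mra"]:
                row["mra"] = addr            # first touch or repeated address: no prediction
                return None
            # {MRA, D1, D2}' := {ADDR, ADDR - MRA, D1}
            row["mra"], row["d1"], row["d2"] = addr, addr - row["mra"], row["d1"]
            if row["d1"] == row["d2"] and row["d1"] != 0:
                return addr + self.k * row["d1"]   # predict ADDR + k*D1
            return None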

> (I wonder what the cost would be to pipe some limited version of that
> information, say a 1-bit indirect indicator, to the external
> prefetcher in the same way we are thinking of piping it the PC of the
> load instruction.)

Since an inverter array can be used to isolate bus loading at the point
where the signal is tapped, the "flight time" of the signals that feed
the (non-integrated) prefetcher merely contributes to prefetch latency.
The analysis side of the prefetcher (which observes events sent from the
internals of the main processor) can use a clock extracted along with
the signals, while the action side (which generates prefetch operations
on the memory bus) can be synchronous to the memory bus clock. This
requires a dual-port memory in the middle that can implement a prefetch
queue that (1) accepts prefetch requests from the analysis side, (2)
presents prefetch requests to the action side as timeslot availability
permits, and (3, optional) can cancel a prefetch request if the analysis
side so determines and the action side has not yet seen the
to-be-canceled request.
Ideally, the prefetch logic should associate ASIDs with all prefetch
activity (and be partitioned along ASIDs); this also permits prefetching
to continue in the background while a different task is running. (Task
A generates several prefetch events before being preempted for task B.
Task B (including its prefetches) is not making full use of the memory
bus. Prefetches initiated during task A can continue to complete using
timeslots that task B has left idle. Of course, prefetches initiated
for task B would also preempt "leftover" prefetches pending from task
A. I expect that the prefetch queue would normally be empty before the
supervisor can complete the context switch, but tagging all prefetches
with an ASID allows another task's prefetches to continue.) If the
caches and prefetch buffers are partitioned by ASID, such background
prefetches are *not* wasted -- they utilize resources that are not
available to the current foreground task and the prefetched cachelines
will then be waiting when the previous task is resumed.
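
A purely functional sketch of that queue-in-the-middle in Python, with ASID
tags and demand-fetch cancellation (the clock-domain crossing and sizing are
deliberately ignored, and all names are invented):

    from collections import deque

    class PrefetchQueue:
        def __init__(self, depth=8):
            self.q = deque()
            self.depth = depth

        def push(self, asid, addr):             # analysis side
            if len(self.q) < self.depth:
                self.q.append((asid, addr))

        def cancel(self, addr):                 # a demand fetch already covers addr
            self.q = deque(e for e in self.q if e[1] != addr)

        def pop_for_idle_slot(self):            # action side, one per idle bus timeslot
            return self.q.popleft() if self.q else None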

> The key questions that determine the feasibility of this are:
> (1: availability of data prefetch slots in the pipeline
> itself) What
> fraction of dynamic RISC-V instructions in common
> workloads access
> memory?
>
>
> I'm hoping that the infrastructure we set up for testing the data
> cache can get us this kind of information as a tangential benefit.
> (I'm also hoping it will give us a better handle on where the
> performance bottlenecks are in general. I'm planning on tackling
> Icache prefetch after this unless the performance data shows something
> else being substantially more urgent.)
>
> Naively, this only works for Rocket and not for Boom b/c the SPEC
> workloads are about 30% memory instructions. However, I think they are
> very unevenly distributed. So it's not like 2-way BOOM only has 9% of
> cycles free. Whether this uneven distribution makes this worse or
> better for our usage case is hard to say without targeted research
> looking at the conditional probabilities of instruction sequences and
> their relations to cache misses.

Issues (1), (2), (3), and (4) are (directly, at least) "low-hanging
fruit" in the sense that they are questions purely about software -- a
modified spike can be used to answer them, without requiring access to
FPGA development resources, so these could be good starting points for a
research group on a low budget. Of course, anything spike can be
modified to track, Rocket and BOOM can also be modified to track, along
with other details such as interactions among the functional units,
which is an especially-interesting question for BOOM.

On a related note, the array-stride data prefetch that I have been
describing could be integrated into Rocket's load/store unit, but has
more interesting interactions with BOOM implementations with multiple
load/store units. One option could be to effectively partition memory
access by register among the load/store units, such that with two LSUs,
all accesses using an odd register are sent to LSU 1 and all accesses
with an even register are sent to LSU 0. Another option is to move the
stride-prefetcher out of the core, but memory re-ordering may reduce the
effectiveness of the prediction.
If the BTB is used only for indirect branches (for which the target is
unknown until the branch is executed), this is an interesting problem.
Otherwise, the static branch prefetching that I suggest could also be
used to load the BTB instead of directly feeding a prefetch queue.

> Similarly even though your cache suggestion isn't on the critical
> path, the cache in general is. So again, it touches *code* that
> there's an aversion to changing.
>
> IOW, it's not an issue with your idea or even a design issue at all.
> It's purely a software engineering problem.

More specifically, a process problem -- I have always understood that
the point of revision control and automated testing is to make changes
cheap: each change can be easily verified and a change that proves bad
can be backed out, but I admit having little experience in real-world
hardware design, so perhaps those aversions to change are reasonable.
It is simple: you have both a non-prefetching I-cache and a
linear-prefetching I-cache (and functional tests for an I-cache that can
verify both). Having both is desirable to allow for comparisons, along
with the possibility that some outer prefetch unit may be developed that
performs better *without* linear run-ahead prefetch.

>
> (5b) (related to 2b) How often do branches target the next
> cacheline?) Because linear run-ahead prefetch is independent of
> the actual instructions executed, it does not introduce a further
> side channel and need not be isolated from the cache proper.
> (Fetching from cacheline N loads N into the cache, also loading
> N+1 gives no additional information, since fetching from N+1 then
> causes N+2 to be loaded into the cache.)
>
>
> I think that 5b depends on the type of code you are running. For
> SPEC-type workloads, next-2-line seems to be nearly perfect on similar
> processors. But for Cloudsuite, OLTP, and DSS workloads there are lots
> of branches to far-flung areas of the code. (And since these areas
> recur only after hundreds of thousands of instructions, you really
> don't want to cache them at all. This is part of the idea behind using
> the buffer as a working set filter that I mentioned.)

How often branches target the next cacheline is also a way of asking how
many static branch prefetch targets will be quashed because linear
run-ahead has already made the same request -- or they might not be
quashed: static branch prefetch scans lines as they are loaded into the
L1 I-cache, while linear run-ahead fires when the L1 I-cache has a hit.
If cacheline N+1 contains a branch to cacheline N+2, static branch
prefetch would prefetch N+2 when N+1 is loaded into the cache by linear
run-ahead prefetch. While linear run-ahead prefetches directly into the
L1 I-cache, static branch prefetch either prefetches into the L2 cache
or its own prefetch buffer, which decreases the latency for a subsequent
L1 cache miss.

> The static branch prefetch I suggested observes the memory side of
> the L1 I-cache, but is otherwise independent of instruction
> fetch: whenever a cacheline is fetched into the L1 I-cache, scan
> it for aligned BRANCH and JAL opcodes and insert those target
> addresses into the prefetch queue.
>
>
> This isn't that big of deal design-wise, but as mentioned, it requires
> restructuring some of our existing code. So the process of getting it
> merged into the main source repository is more involved. We'll have to
> talk to whoever handles the commits for that part of the code and see.
>
>
> This scan is performed on a copy of the data added to the L1
> I-cache, so it does not delay the critical path towards
> instruction decode.
>
>
> Right. I didn't mean to imply that it did. I just meant that there was
> a software engineering problem of how to get it integrated into the
> current code.
>
> I'm confident it can be resolved for BOOM at least. The front-end it
> uses now could just be rewritten to always use prefetching and you'd
> probably end up using less pipeline stages and fewer resources as well.

The option to omit prefetching is probably important for comparisons if
nothing else.

> But we'd like to keep Rocket and Boom drawing on the same "pieces" so
> that the code doesn't diverge too much. I'm currently thinking that we
> should have a front-end generator that can put the pieces together in
> various ways with some configuration flags, but we'll cross that
> bridge when we get there.

Ideally, both should have a choice of prefetch or non-prefetch frontends.

> Static branch prefetch requires a prefetch queue and uses
> timeslots left idle when both instruction fetch itself and linear
> run-ahead prefetch hit in the L1 I-cache. The prefetch queue
> should also observe fetches from the L1 I-cache memory side and
> drop any pending prefetches for cachelines that are actually
> requested by the L1 I-cache. Since static branch prefetching
> assumes that every branch observed will be taken, its hit rate
> will probably be low, so its actual utility could be a good
> research topic.
>
>
> It's worth looking into and it would be straightforward to try since
> the code for your approach would be a special case of anything more
> elaborate. (Boomerang for example is similar to your idea but also
> uses the branch predictor to only prefetch for predicted taken branches.)

I like to think that the prefetch queues I keep mentioning would
themselves be a reusable hardware element and perhaps generally the
"in-the-middle" that connects a prefetch analysis element (running from
a clock tapped at the point where its data input was tapped) to a
prefetch action element (running from a memory bus clock). A prefetch
unit thus contains a prefetch analysis element (implementing some
prefetch algorithm), a (standard) prefetch queue, and a (standard)
prefetch action element (using entries in the prefetch queue to fill
idle memory bus timeslots).

> The worry would be that profile-based code layout optimizations
> generally try to string together lots of basic blocks with biased
> not-taken branches. So your trick to distinguish the two may become
> too expensive in practice. If that's a problem, since taken branches
> are the rare case, it could make more sense to only prefetch the
> *unaligned* branches, at least if we are willing to assume that people
> will use a profile-driven layout optimization for their high miss rate
> code. (Assuming that the profile driven optimizations even *work* on
> the code I'm concerned about. The research is not that convincing on
> this point.)

A long run of biased not-taken ("do not prefetch") branches can simply
be encoded as a run of RVI branches, all misaligned. Similarly, a long
run of branches, all of which are likely to be taken at some point
("prefetch all") can be encoded as a run of RVI branches, all aligned.
It is only when branches are interleaved, *some* of which should be
prefetched and none of which are within range for RVC branches, that the
RVC NOPs are likely to become a problem.

> That said, this thread and the current project are about D-cache
> prefetching. We'll have to revisit this issue when we get to the
> I-cache. (I expect the I-cache will be next, so keep thinking on this
> idea.)

If nothing else, it is in the public list archives with a datestamp; it
*might* just burn a "but with RISC-V" patent troll some day. :-)


-- Jacob

Jacob Bachmeyer

unread,
Feb 19, 2018, 11:10:57 PM2/19/18
to Max Hayden Chiz, RISC-V HW Dev, Linh Nguyen Van, Cuong Pham Quoc, kritik bhimani
Max Hayden Chiz wrote:
> Are you (Jacob) or anyone else lurking in this thread planning/wanting
> to contribute code or otherwise be involved in the implementation?
> Once we get a design and a work plan, I'm planning on taking further
> discussion off-list so that the list only gets hit with emails about
> stuff that we need further input on.

If you do go off-list, please include me; I may not be able to help much
with code, but am interested. I also believe, for both philosophical
and practical reasons, that keeping the discussion on the hw-dev list
(or perhaps on a public rocket-dev list somewhere) will be beneficial to
the RISC-V project in the long term. (Having these discussions in
public list archives will be a good thing, in my view.)
I expect that, as long as loop iterations cannot be reordered (memory
operations cannot be reordered across backward direct branches),
keying prefetch by the register number used to form the address will
untangle the reordering. So keeping in-LSU stride prefetch effective
imposes a small ordering constraint: memory accesses using the same
base register must be executed in program order by a single LSU. If the
LSU can overlap multiple memory accesses, the prefetcher must observe as
the accesses are submitted, and the accesses must be submitted in
program order, although they need not complete in program order.

> The other question is whether this really helps us that much vs
> fetching into the prefetch buffer and using the L2 TLB (which I
> honestly didn't realize we had until it was pointed out). A 4KB page
> has 512 64-bit words. Even if the stride is so large that it only
> accesses 10% of them, you still only have to cross a page boundary
> (and thus touch the TLB) once every ~50 accesses. Even if it touches
> only 1% of words, the result is that you still have 5 accesses over
> which to deal with page boundary. So your approach would probably have
> to also improve timeliness or some other factor to be worth the added
> complexity. Fortunately, this should be easy enough to evaluate and
> test once we've got our preliminaries taken care of.

A prefetcher does not have to follow normal rules for TLB accesses, so a
stride prefetcher could generate TLB prefetch operations independently
of memory prefetches, particularly if the prefetcher tracks its internal
state in terms of cachelines, rather than addressing quanta (bytes in RISC-V
and all but a handful of obscure systems). (Which also saves bits in
the prefetch state table, and will not cause the prefetcher to fail if a
program uses a sub-cacheline stride, since multiple accesses to the same
address do not change the prefetcher state.) Extending the description
in my previous message, an additional "l" constant determines how far
ahead to prefetch TLB entries, such that prefetching ADDR+k*D1 also
probes the TLB for ADDR+l*D1.
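
Continuing the stride-table sketch from my previous message, the
cacheline-granular tracking and the extra lookahead constant "l" might look
roughly like this (the helper name and default l are arbitrary):

    BLOCK_BYTES = 64

    def observe_cacheline(table, reg, addr, l=16):
        """Train on cacheline numbers; return (data_prefetch_addr, tlb_probe_addr)."""
        line = addr // BLOCK_BYTES              # sub-cacheline strides collapse to 0
        predicted_line = table.observe(reg, line)   # line + k*D1, or None
        if predicted_line is None:
            return None, None
        delta = table.rows[reg]["d1"]
        tlb_line = line + l * delta             # probe the TLB further ahead (l > k)
        return predicted_line * BLOCK_BYTES, tlb_line * BLOCK_BYTES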

> The only prefetchers for which virtual address translation is
> potentially expensive are the linked and indirect ones. A naive
> implementation would access the TLB for every prefetch and a smart
> one may effectively require its own L1 TLB to be effective. That's
> part of what project III aims to figure out. If they do prove to be
> too expensive, maybe we can try a variation that is in-pipeline like
> you propose. (That said, even if they require a dedicated L1 TLB, they
> are probably still worthwhile given the performance ramifications and
> considering that the next-best option requires at least 8KB of on-chip
> state in the form of two 4KB fully associative memories.)

I expect that linked and indirect prefetchers would be too complex to
integrate into the pipeline, with a need for its own TLB being a good
indication that the prefetcher cannot fit into the LSU. Stride-advance
prefetch is very simple, but still may make more sense as a near-core
prefetch unit, particularly for multi-LSU BOOM.


-- Jacob

Max Hayden Chiz

unread,
Feb 20, 2018, 2:01:31 PM2/20/18
to RISC-V HW Dev, max....@gmail.com, ngvan...@gmail.com, cuon...@gmail.com, bhiman...@gmail.com, jcb6...@gmail.com
Re: the work plan, I'm going to give people another week or so to look over it, think about it, and get back to me. (Especially since it is lunar new year season in east Asia so some people on the list may be out of pocket.) Unless I hear an objection, whoever is ready to start coding first can start tackling the initial preliminaries and we'll hash out who is doing what project a bit further down the road. In the interim I suggest that people familiarize themselves with the relevant literature and with the TileLink2 protocol.

Again, this is based on the assumption that the preliminaries aren't going to lead to much in the way of a publication. Thus it won't impede the ability of the guys who emailed me about a master's project to do their research, and should actually make it easier. If that's a bad assumption for some reason, people need to tell me soon. (E.g. if you are planning to do a master's project on the actual benchmarking bits.)

More below:

On Monday, February 19, 2018 at 10:10:57 PM UTC-6, Jacob Bachmeyer wrote:
Max Hayden Chiz wrote:
> Are you (Jacob) or anyone else lurking in this thread planning/wanting
> to contribute code or otherwise be involved in the implementation?
> Once we get a design and a work plan, I'm planning on taking further
> discussion off-list so that the list only gets hit with emails about
> stuff that we need further input on.

If you do go off-list, please include me; I may not be able to help much
with code, but am interested.

People can contribute to this even if they don't have a hardware design background. There are plenty of software pieces you can help with.

Even if Chisel code isn't your thing, parts of the preliminaries are about setting up benchmarking suites on Linux instances, scripting the runs across instances in parallel, and other Linux sys-admin type stuff. And if you have a background in statistics and data plotting, helping to automate the testing and plotting of the results would be good. Furthermore, once we get the basics in place, there will be lots of opportunities to profile the benchmark code and figure out where and why the various prefetchers differ in order to improve them.
 
I also believe, for both philosophical 
and practical reasons, that keeping the discussion on the hw-dev list
(or perhaps on a public rocket-dev list somewhere) will be beneficial to
the RISC-V project in the long term.  (Having these discussions in
public list archives will be a good thing, in my view.)

I'm not opposed to keeping this on list. I'm just not sure that the list is the best place for this, and I don't want to fill up everyone's mailbox with stuff they don't want to see. But if no one else chimes in against the idea, we can keep much of the discussion on-list. (Obviously, if we have a Google chat to hash something out, that won't be on the list.)

Re: the issue of touching critical paths. It isn't as simple as you paint it. In particular, there is no automated way to test for regressions of the sort we are worried about. Furthermore, only people with access to expensive proprietary tools and tech libraries can currently even do it manually. There are people working on both of these problems, but until they progress further, contributing to certain parts of the code is likely to remain difficult.

Re: the linked and indirect prefetchers. These aren't actually that complicated. Indirect prefetching only uses ~700 bytes of state and in a single issue in-order processor with 1 TLB port, there are port conflicts only 1% of the time. So I think that this will work as I've outlined it in the work plan (simply using the L2 TLB to do the address translations naively). The real issue is not the TLB *bandwidth*, but whether the delay from the slower L2 TLB proves problematic. Because indirect and linked data structure prefetchers have limited look-ahead ability, timeliness is a potential issue.

Re: your data prefetching idea, it is outside the parameters that Dr. Waterman and I agreed on for a first-cut at implementing this feature and getting it integrated into the code base. (It's also quite similar to my initial plan that was rejected.) So for now, I'm going to recommend that it be put in the "further refinements" category which are all things that require further study or are expected to have a performance impact of 10-15% (which is what a good L1 prefetcher adds to a modern L2 one). The stuff that is currently put in a project has an expected performance impact in the 50-100% range and is stuff for which step #1 is ready-to-go.

That said, if someone wants to use a simulator or otherwise show that my extrapolation from available data is wrong, or if they just want to implement it directly once the preliminaries are done, we can put it into a project VI (since it won't be *that* much work to move our stride prefetcher to the LSU and have it indexed by register instead of PC).

Beyond what you've outlined above, we need to figure out how much OoO execution actually impacts stride-based prefetchers (something that our preliminaries will address). Much of the benefit of OoO execution comes from reordering loops, but whether that reordering impacts the memory reference stream is a different matter. It may be possible to follow Chris Celio's skeleton code and do something at the decode stage to feed the prefetcher in-order despite out of order execution. (Or we could have the training step be part of the retirement stage as the literature suggests.)

However, your idea *is* worth looking into as a future refinement. A plausible situation where it could have a bigger impact would be in a low-power processor that uses an L0 cache. To the extent that it can increase the coverage of the L0 cache with lines that will be hit multiple times, it saves energy. Furthermore, dealing with timeliness would not be as complicated since the latency to load the L0 from the L1 is 1 cycle in our design. And an L0 cache is small enough that worrying about prefetch distance/degree greater than 1 is pointless. The same considerations also favor this design in a *wide* processor that uses the L0 cache to avoid having to multi-port the L1. The main question is whether using the chip area for the prefetcher state gets you better results than just increasing the size of the L0. In contrast, whether it will work pulling from the prefetch buffer into the L1 (or from the rest of the memory subsystem) is going to be much more dependent on microarchitectural particulars. Hence my listing it as a further refinement.

Bowen Huang

unread,
Mar 2, 2018, 2:55:04 AM3/2/18
to RISC-V HW Dev, max....@gmail.com, ngvan...@gmail.com, cuon...@gmail.com, bhiman...@gmail.com, jcb6...@gmail.com
Hi

This is Bowen Huang from Frontier System Division of ICT (Institute of Computing Technology, Chinese Academy of Sciences).

Have to say I'm extremely excited to see that RISCV folks have an ambitious plan to pursue advanced prefetcher design. The Frontier System Division of ICT are now working towards a multicore rocket tapeout with 40nm process, moreover, we started an advanced LLC prefetcher research project with a top mobile SoC team last year, and now we have already evaluated multiple prefetcher design choices for their next-gen smartphone. Therefore I believe we are fully capable of taking charge of this project, and our interests are perfectly aligned. 

After scanning through previous emails in this issue, I think the info listed below may be helpful: 

0) We tested 23 benchmarks from SPECint 2006 and SPECfp 2006; 10 of the 23 are prefetching-sensitive (they show a >10% speedup if every load request hits). The other 13 benchmarks are prefetcher-insensitive; even an ideal prefetcher would give no more than a 5% speedup.

1) During our evaluation, we found that DPC2 (most advanced prefetchers come out of this contest) was using an inaccurate simulator. We roughly calibrated that simulator against a real chip emulator (codenamed Ananke, ARM Cortex-A55) and re-evaluated all those tests; the performance drops 70%~90% compared with what was claimed in the papers. The biggest impact factor is the presence of an L1 cache prefetcher, which filters out many load requests and renders even the most advanced L2 (or LLC) prefetcher far less effective. However, the L1 prefetcher also suffers from the limited capacity of the L1 MSHRs and read/prefetch queue; adding prefetch-on-prefetch support at L2/LLC will increase performance by ~15% on a 3-issue, 128-entry-ROB core. This indicates that our chance lies in L1 prefetcher and L2 prefetcher co-design.

2) All the L2 prefetchers can generate a significant number of requests that only hit on an L1 writeback, and it's well known that writeback hits don't contribute to performance. This indicates that prefetch requests should be filtered by the coherence directory or something else that can track the contents of the upper-level cache.

3) Jacob is right about his idea of extending TLB support to L2. We found a perfect example, milc from SPECfp 06, which suffers from a huge number of cache misses; we can speed up this benchmark by 13.8% in the physical address space, and if we can get info from the virtual address space, the ideal speedup is measured to be ~60%, because the majority of cache misses are consecutive in their virtual page numbers.

4) We have tested AMPM, VLDP, BOP, SPP, next line, and stream prefetchers in our roughly calibrated simulator so far. On average, BOP and SPP are better than the others, but none of these advanced prefetchers performs consistently better than the rest. This indicates we should introduce a set-dueling mechanism in prefetcher design, to invoke different types of prefetcher for different scenarios; we have shown that this works, and with negligible storage overhead.
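
For readers unfamiliar with the mechanism, a minimal sketch of set-dueling between two prefetcher policies, in the style of DIP-like leader sets (the set counts and counter width here are only illustrative, not values from the evaluation above):

    class SetDuel:
        def __init__(self, num_sets=2048, leaders_per_policy=32, psel_bits=10):
            self.psel = 1 << (psel_bits - 1)    # saturating selector, starts midway
            self.psel_max = (1 << psel_bits) - 1
            step = num_sets // leaders_per_policy
            # dedicate a few leader sets to each policy; everyone else follows PSEL
            self.leaders_a = set(range(0, num_sets, step))
            self.leaders_b = set((s + 1) % num_sets for s in self.leaders_a)

        def policy_for_set(self, set_idx):
            if set_idx in self.leaders_a:
                return "A"
            if set_idx in self.leaders_b:
                return "B"
            # follower sets use whichever leader policy is currently missing less
            return "A" if self.psel < (self.psel_max + 1) // 2 else "B"

        def on_miss(self, set_idx):
            if set_idx in self.leaders_a:       # a miss votes against that policy
                self.psel = min(self.psel + 1, self.psel_max)
            elif set_idx in self.leaders_b:
                self.psel = max(self.psel - 1, 0)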

My working email is huang...@ict.ac.cn
We have several PhD and master students that can contribute codes, both software and hardware.

Is there any other team that wants to work on this? And if we take charge of this project, who should we report to, and how often should we report progress?

Thanks so much.

Bowen Huang

Frontier System Division

Research Center for Advanced Computer Systems

Institute of Computing Technology

Chinese Academy of Sciences

Beijing, China 100190

Max Hayden Chiz

unread,
Mar 5, 2018, 3:43:04 PM3/5/18
to Bowen Huang, huang...@ict.ac.cn, RISC-V HW Dev, Linh Nguyen Van, Cuong Pham Quoc, kritik bhimani, Jacob Bachmeyer
Thank you for your interest in the prefetching project. We'll be happy to have you. As-is, I'm going to be working on it in my spare time. Several other people have expressed an interest in contributing code either simply as open source developers or as part of a master's project. The University of Texas at Austin is looking at helping as well and providing us with access to their FPGA cloud computing infrastructure for testing. (They are very interested in using this for their research and would like to work with the other developers to co-author a paper showcasing any novel results we find re: their existing work.)

I'm trying to keep everyone coordinated so that we don't have people doing duplicate work and so that whatever work gets done can be merged back in with Rocket.

As things stand, we have a plan for the L1 cache that involves using a prefetch buffer. As you probably know, Rocket doesn't currently have an L2 cache. So for the time being, we are going to try to get the advanced prefetchers working at L1 and failing that, we can fall back on a large secondary prefetch buffer.

When the L2 cache eventually becomes available, we can examine doing something similar to the configuration from the DA-AMPM paper. (i.e. prefetching into the cache in a way that doesn't require modifying its control path). TileLink2 has several ways of handling just this situation, so if we need to go that route, it should be straightforward to add that capability.

Re: your comments, they were very helpful because they confirmed my own assessment and gave me more confidence in the design I've sketched out.

I'll comment more below:


On Fri, Mar 2, 2018 at 1:55 AM, Bowen Huang <huan...@gmail.com> wrote:
Hi

This is Bowen Huang from Frontier System Division of ICT (Institute of Computing Technology, Chinese Academy of Sciences).

Have to say I'm extremely excited to see that RISCV folks have an ambitious plan to pursue advanced prefetcher design. The Frontier System Division of ICT are now working towards a multicore rocket tapeout with 40nm process, moreover, we started an advanced LLC prefetcher research project with a top mobile SoC team last year, and now we have already evaluated multiple prefetcher design choices for their next-gen smartphone. Therefore I believe we are fully capable of taking charge of this project, and our interests are perfectly aligned. 

After scanning through previous emails in this issue, I think the info listed below may be helpful: 

0) We tested 23 benchmarks from SPECint 2006 and SPECfp 2006, 10 of 23 are prefetching-sensitive (which has >10% speedup if every load req hits). The other 13 benchmarks are prefetcher-insensitive, even an ideal prefetcher would have no more than 5% speedup.

There is a paper by Aamer Jaleel, _Memory Characterization of Workloads Using Instrumentation-Driven Simulation_ where he goes into great detail about the nature of the spec workloads. (There are similar papers for Cloudsuite and others, but I don't think they are as comprehensive). Many researchers only report results for the 10 or so benchmarks that he identifies as being potentially sensitive to prefetching.

That said, for a true evaluation of a design, I think it is important to run the full benchmark suite both to get a realistic assessment of the overall performance impact and to make sure that the prefetcher is not *harming* performance on other workloads.

Furthermore, I think that it is important to test beyond just SPEC. Initially, this will probably mean adding Cloudsuite. But the other benchmarks I mentioned in my draft work plan should be evaluated eventually as well since they all have different characteristics. I'd honestly like an even wider range of test cases, but I think this is the best we'll be able to do for now.
 

1) During our evaluation, we found that DPC2 (most of advanced prefetchers come out of this contest) were using an inaccurate simulator,  we roughly calibrated that simulator against a real chip emulator(codenamed Ananke, ARM Cortex-A55) and re-evaluated all those tests, the performance drops 70%~90% compared with what they claimed in the papers.

I would be interested in reading more about this. There are discussions in various published papers about the limitations of that simulator (and others). For example the developers of VLDP complain that the DPC2 simulator doesn't show any benefits from multi-degree prefetching and so gives poorer results for VLDP than it should. Others note that it doesn't account for instruction reordering that can happen with out of order execution and so *favors* designs like VLDP and SPP.

Either way, your performance drop is much more than anyone else has reported. So I'd be interested in a detailed breakdown of the differences. (This would make a great paper.) It would probably be helpful to know to what extent these issues have been addressed in the newer version of that simulator, ChampSim. (Available at https://github.com/ChampSim/ChampSim) To the extent that you can improve the accuracy of that simulator, I think many people would be interested. (And once we have a working FPGA design, you should be able to verify that your changes are giving correct performance estimates.)

I haven't run any of my own simulations, but I did a very basic statistical analysis of the existing results and concluded that the variability between different simulators was too high to be a reliable performance indicator. At best they can detect relative differences on the order of 15% or so. But I wouldn't expect absolute performance and smaller relative differences to be reliably predicted because there are just too many small micro-architectural changes that can impact the results in unexpected ways.

This is one of the reasons why I want to have a library of different prefetchers available. It will allow us to give a more realistic test of these (and future) designs. And it will allow for users of Rocket and Boom to pick the right prefetcher configuration for their workload. (Since as you point out, no one prefetcher is ideal.)

My hope is that it will also allow for interesting combinations of parts of the existing designs in ways that are more efficient than just running all of the prefetchers together. (For example, using AMPM's data structure with aggregate stride's stride detection method.)
 
The biggest impact factor is the presence of L1 cache prefetcher, which filters out many load requests and renders even the most advanced L2 (or LLC) prefetcher far less effective.

This is a good example of where different simulators give very different results. Your simulator has a huge drop-off from adding an L1 prefetcher. Michaud's Best Offset paper (https://hal.inria.fr/hal-01254863/document) shows about a 15% *gain* from including the L1 vs just using Best Offset at the L2 alone.

This sort of thing is why I think that having a usable testing framework will really improve the state of research. If it becomes easy enough to just design a new prefetcher using the RISC-V infrastructure, we'll get much more reliable performance results.

 
However, L1 prefetcher also suffers from the limited capacity of L1 MSHR and read/prefetcher queue, adding prefetch-on-prefetch support at L2/LLC support will increase performance by ~15% on a 3-issue, 128-entry ROB core. This indicates that our chance lies in L1 prefetcher and L2 prefetcher co-design.

That's good to know. The current plan is that the prefetcher will be per-core and will handle both L1 and L2 in a unified manner.

2) All the L2 prefetcheres can generate significant num of requests that are only hit on a L1 writeback, and it's known to all that writeback hit doesn't contribute to performance. This indicates that prefetch request should be filtered by coherence directory or something that can track the content of upper level cache. 

This isn't part of our "first-cut" design, but it's on the list of important improvements once the basics are working. There are several other important improvements that promise similar benefits.


3) Jacob is right on his idea that extending TLB support to L2. we found a perfect example, milc from SPECfp 06, which suffers from huge num of cache misses, and we can speedup this benchmark by 13.8% in physical address space, if we can get info from virtual address space, the ideal speedup is measured to be ~60% because the majority of cache misses are consecutive in their virtual page nums.

It has always been my intention to test this configuration because it has widely varying results in the literature. Thankfully Jacob pointed out how to do it in the context of Rocket's design.
 

4) We have tested AMPM, VLDP, BOP, SPP, next line, stream prefetcher in our roughly calibrated simulator so far. In average, BOP and SPP are better than the others, but none of those advanced prefetchers can perform consistently better than the others,

Is there a reason you haven't tested Jain's aggregate stride prefetcher? While it is more complicated than Best Offset, it seems to offer some accuracy and coverage advantages.

Also, does your simulator account for the possible reordering of memory instructions? If so, how much of an impact does that have on SPP and VLDP?

 
 this indicates we should introduce set-dueling mechanism in prefetcher design, to invoke different types of prefetcher for different scenarios, and we have proved that this works, and with negligible storage overhead.

I'm not sure that we'll be able to use a set-dueling mechanism because that would require some modifications of the cache control path. (And in any event, we don't currently have an L2 cache to modify.) But I was thinking we could use something along the lines of a timeliness aware sandbox scheme to achieve the same effect.
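
To illustrate what I mean by a sandbox-style scheme, here is a very rough Python sketch loosely after Pugsley et al.'s Sandbox Prefetching; a plain set stands in for the Bloom filter, and the candidate interface is invented for illustration:

    class Sandbox:
        def __init__(self, candidates):
            self.candidates = candidates        # name -> predict(addr) -> prefetch addr
            self.shadow = {name: set() for name in candidates}
            self.score = {name: 0 for name in candidates}

        def on_access(self, addr):
            # Score each candidate by whether it *would* have covered this access,
            # then record what it would prefetch next.  A real implementation
            # bounds and periodically resets the shadow structures.
            for name, predict in self.candidates.items():
                if addr in self.shadow[name]:
                    self.score[name] += 1
                self.shadow[name].add(predict(addr))

        def best(self):
            return max(self.score, key=self.score.get)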

My hope however is that we can use our tests to achieve a better understanding of why different prefetchers work better or worse on certain workloads and come up with a more efficient hybrid since having multiple prefetchers can be costly in terms of processor state and chip area.
 

My working email is huang...@ict.ac.cn
We have several PhD and master students that can contribute codes, both software and hardware.

I was unsure which email you wanted me to use in this reply, so I used both. Let me know which one you want me to use going forward.
 

Is there any other team that want to work on this ? And if we take in charge of this project, who should we report to, and how often should we report the progress ?

If you are willing to contribute development resources, I'd love to have your help. Since much of the work is parallel, we can just divide up the pieces between everyone who is interested. And if you guys have the man power and the time to do the bulk of the work, that's fine.

What time-frame are you looking at for getting started coding and how much do you expect you'll be able to contribute? Are you interested in coding up all the prefetchers I've outlined in this thread, or are you going to focus on the one or two that you really want to test for the SoC company you are working with?

Also, as things stand, we can develop the code for this and do some basic tests, but testing it thoroughly is going to require waiting for FireSim to be released so that we can run the benchmarks in parallel on an FPGA cloud. Similarly, we'll need to wait for the L2 cache to fully test some configurations. FireSim should be out before the summer, but I don't know SiFive's plans for their L2 cache. I would hope that it comes out when they release the RTL for the HiFive Unleashed in June, but I don't know. (What are you using for the L2 cache in the multicore rocket you taped out?)

 

Thanks so much.

Thank you for your interest and offer of help. I look forward to hearing from you.

黄博文

unread,
Mar 6, 2018, 12:52:38 AM3/6/18
to Max Hayden Chiz, RISC-V HW Dev, Linh Nguyen Van, Cuong Pham Quoc, kritik bhimani, Jacob Bachmeyer
Thanks you, It’s good to hear that we can help on this project.

Please note that our research is about an LLC (L3 cache) prefetcher on a real mobile phone, and our simulator is roughly calibrated against the latest ARM design (Cortex-A55), so it may have some differences from the design of Rocket/BOOM.

0)Our evaluation results and the influence of L1 prefetcher to L2/LLC prefetcher

Our results indicate that if you disable the L1 prefetcher and enable only the L2/LLC prefetcher, you can see a speedup of over +40% on the prefetch-sensitive benchmarks, which is what most previous papers claimed. However, if you enable the L1 prefetcher, the speedup brought by the L2/LLC prefetcher is ~10%, because, as I said before, many load requests are filtered out by the L1 prefetcher. An example is libquantum from SPECint 06: our breakdown analysis shows that an ideal L2/LLC prefetcher working behind a 1024-entry IP-stride prefetcher at L1 would bring negligible speedup even if it could provide a cache hit for every demand load request, because most of the demand load requests are filtered out by the IP-stride prefetcher at L1. That's why I said that nearly all the prefetchers we tested show far less speedup than what was claimed in the papers. However, if this ideal L2/LLC prefetcher can provide a cache hit for every prefetch request issued by the L1 prefetcher, that is helpful; this is what the BOP paper reported, and we call it prefetch-on-prefetch. In reality the L1 prefetcher is always enabled, and according to our simulations an IP-stride prefetcher brings more benefit than a strong combination of L2/LLC prefetchers in most cases, so we suggest that all L2/LLC prefetchers be re-evaluated with an active L1 prefetcher.

Breakdown for 10 prefetch-sensitive benchmarks from SPEC06

Ideal prefetching for every demand load - provides +31.37% speedup
Ideal prefetching for every prefetch issued by L1 prefetcher - an additional 16.9%

Our results are: if we enable the L1 prefetcher (1024-entry IP stride) and do not allow prefetch-on-prefetch, VLDP/BOP/SPP fall into the range of +5% to +10% speedup. If we permit prefetch-on-prefetch, the speedup is +10% to +20%, but several benchmarks show performance regressions.

We are planning a full paper submission, so I can't share all the detailed data at this moment; I'm sorry about that. And we are still evolving the simulator due to the requirements of our industry research partner, so the public version of the data (several months from now) may differ somewhat from what I reported today.

1) About the simulator we are using

It’s a modified version of the latest Champsim simulator. We do not touch the core model but heavily modified the cache model to get closer to the real chip emulator result. Also, the DRAM model is totally abandoned because the latency is inaccurate, we connect the LPDDR4 model from Ramultor[1] with our modified Champsim simulator seamlessly.

We use 7 different traces captured from real mobile phone workloads (instant messaging, web browsing) to roughly calibrate the modified simulator. The L2 cache modeling accuracy of the original ChampSim is problematic due to its non-inclusive design; as far as we know, Intel, AMD, and ARM are all moving to exclusive cache designs for their latest products, and the non-inclusive model causes about a 27% miss-rate deviation compared with an exclusive design, which is another source of prefetcher performance exaggeration.

L2 cache miss rate deviation: original ChampSim is 27%; our calibrated ChampSim is <3%.
DRAM random access latency: original ChampSim is ~90 cycles; our calibrated ChampSim is ~230 cycles; a real ARM Cortex-A55 with LPDDR4 is about 270 cycles.


2) Misc. issues

2-a I’m sorry that I did not noticed that you have posted this idea in previous email, really sorry about that. And you pointed out that this idea is scattered across some papers, thanks again for this point, I did not notice that either.
2-b We don’t consider aggregate stride prefetcher due to the design constraint posed by ARM and our industry design partner. It’s nearly impossible to get modification license from ARM, and without this license, you can’t modify the entire core/L1/L2/L3, not even a single line of code, or you must have your own core/L1/L2/L3 implementation, like what Apple did in the iPhone. So we just include those prefetchers that can be implemented without modification to the cache itself, I think it’s also the case for rocket right now. Besides that, we also exclude all the prefetchers that require PC to train, there is no PC value can be obtained below L1 in industry implementation.
2-c You can use huan...@gmail.com; I'm always monitoring it. It's OK but sometimes unstable due to the network censorship in mainland China.
2-d We use the 2016 version of rocket L2 cache for our tapeout.

3) Our intention and status

As an academic research group, we are dissatisfied with the design constraints in industry implementations and will be excited to explore many alternative design choices; I believe this is perfectly aligned with what the RISC-V community would like to see.

We are busy preparing our paper submission at the moment, and I expect our coding manpower will be available in April. Our focus this month will primarily be ChampSim.


Max Hayden Chiz

unread,
Apr 29, 2018, 7:18:28 PM4/29/18
to RISC-V HW Dev
I'm sorry I dropped off the radar on this. I messed up my elbow and had to have surgery. I'm now snowed under trying to catch up with work. If someone else wants to just do this without me, that's great. I'll try to provide whatever knowledge or resources I have as I can. Otherwise I'll get back to this when I get to a stable situation with work.

Stay in touch.
