Enhancement: hardware stream prefetcher for L2


Panat Taranat

Dec 4, 2020, 10:27:58 AM
to black-parrot
We want to implement a hardware stream prefetcher at the L2 level. We also noticed a TODO for prefetch functionality in the writeback command handling of the CCE messaging module: https://github.com/black-parrot/black-parrot/blob/4e829d9a91786bbf99af03fc9d1837540189d4f7/bp_me/src/v/cce/bp_cce_msg.v#L760

These two don’t seem to be related, so we’d like to ask the Black Parrot team which one is currently more useful for you (the stream prefetcher or the TODO), and which one is reasonable for a class project. In either case, we’d like guidance on where to start and which files to look at. For example, how do we detect a miss in L1, so that L2 demand accesses can train the streams?

Mark Wyse

Dec 4, 2020, 11:08:39 AM
to Panat Taranat, black-parrot
Hi Panat,

There is a reserved message type for prefetches between the CCE and L2/memory; it is separate from the writeback command. It looks like that comment is at the wrong level; it should be directly above the e_bedrock_mem_wr case entry. Also, e_bedrock_mem_rd commands have no explicit case entry; they are handled by the default case in this case block: https://github.com/black-parrot/black-parrot/blob/4e829d9a91786bbf99af03fc9d1837540189d4f7/bp_me/src/v/cce/bp_cce_msg.v#L752

In a multicore system, memory commands sent by the CCE out to L2/memory to request a cache block occur when the requesting L1 cache misses AND the desired block is not present in any of the other L1 caches in the coherence system. Examining the stream of memory requests (e_bedrock_mem_rd) from the CCE may be a good proxy for all L1 misses since prefetching a block that is already in *some* L1 cache would cause the L2 to duplicate a block in the L1 level. If an L1 cache misses but the block is in another L1, the coherence system will transfer the block from that other L1 to the requestor in most cases.
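As a rough illustration of using the read commands as a miss proxy, here is a minimal behavioral sketch in Python (not RTL, and not BlackParrot code). It assumes the command stream can be modeled as (message type, byte address) pairs and that cache blocks are 64 bytes; both the tuple format and the helper name demand_read_blocks are hypothetical. Only the message type names come from the discussion above.

```python
from typing import Iterable, Iterator, Tuple

BLOCK_BYTES = 64  # assumption: 64-byte cache blocks


def demand_read_blocks(cmds: Iterable[Tuple[str, int]]) -> Iterator[int]:
    """Yield the block number of every read command in the stream.

    Each command is modeled as a (message type, byte address) pair.
    Writebacks and any other message types are ignored, since only
    reads stand in for L1 misses here.
    """
    for msg_type, addr in cmds:
        if msg_type == "e_bedrock_mem_rd":
            yield addr // BLOCK_BYTES


# Hypothetical trace: two reads to adjacent blocks with a writeback between.
trace = [
    ("e_bedrock_mem_rd", 0x8000_0000),
    ("e_bedrock_mem_wr", 0x8000_1000),
    ("e_bedrock_mem_rd", 0x8000_0040),
]
blocks = list(demand_read_blocks(trace))
```

The resulting sequence of block numbers is what a stream prefetcher would train on.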

I would also recommend looking at the bp_cce_fsm module before attempting to add prefetching to the microcoded CCE. The FSM CCE is a more straightforward design implemented entirely with hardware FSMs and no microcoded control.

Mark Wyse
pronouns: he/him
PhD Student
Paul G. Allen School of Computer Science & Engineering
University of Washington



Mark Wyse

Dec 4, 2020, 11:23:35 AM
to Panat Taranat, black-parrot
I realized I didn't actually answer your question, sorry. My opinion is that a hardware stream prefetcher that sits by the L2 cache would be more reasonable and generally useful.

As I said in my first email, the accesses sent to the L2 by the CCE (or UCE in a unicore config) should closely align with the true set of L1 cache misses. In a perfect system, every request to the L2 would be a demand access by definition (i.e., L1 misses and requests from L2). In the multicore configs, I believe our current coherence system aligns closely with the ideal, although it may be possible that an L2 access gets issued for a cache block that is already in an L1 cache other than the requestor, for example when the block is read-only without an owner.

This approach also allows the prefetcher to be applied to both unicore and multicore systems, because it does not depend on the CCE, which is not used in the unicore setup.
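To make the idea of a prefetcher sitting by the L2 more concrete, here is a minimal behavioral sketch in Python of one possible stream prefetcher. It is a model under stated assumptions, not the BlackParrot design: it tracks ascending unit-stride streams only, keeps a small LRU table, and the names StreamPrefetcher, Stream, num_streams, and degree are all hypothetical. The input is the block number of each demand read that reaches the L2.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stream:
    next_block: int  # block the stream expects the next demand read to hit
    frontier: int    # highest block already requested for this stream


class StreamPrefetcher:
    """Ascending unit-stride stream table with LRU replacement (assumed policy)."""

    def __init__(self, num_streams: int = 4, degree: int = 2):
        self.num_streams = num_streams  # table entries (hypothetical parameter)
        self.degree = degree            # how far ahead of demand to run
        self.streams: List[Stream] = [] # ordered from LRU to MRU

    def observe(self, block: int) -> List[int]:
        """Train on one demand read; return the block numbers to prefetch."""
        for i, s in enumerate(self.streams):
            if s.next_block == block:
                self.streams.pop(i)
                # Only request blocks beyond what was already prefetched.
                start = max(s.frontier, block) + 1
                end = block + self.degree
                s.next_block = block + 1
                s.frontier = max(s.frontier, end)
                self.streams.append(s)  # move to MRU position
                return list(range(start, end + 1))
        # No stream matched: allocate a new one, evicting the LRU entry if full.
        if len(self.streams) >= self.num_streams:
            self.streams.pop(0)
        self.streams.append(Stream(next_block=block + 1, frontier=block))
        return []


pf = StreamPrefetcher(num_streams=2, degree=2)
first = pf.observe(100)   # allocates a stream, nothing to prefetch yet
second = pf.observe(101)  # confirms the stream, prefetches two blocks ahead
third = pf.observe(102)   # stream advances, one new block is requested
```

In this model a stream only issues prefetches once a second sequential read confirms it, which keeps isolated misses from generating traffic. Because the model only consumes the demand stream arriving at the L2, it does not care whether a CCE or a UCE produced that stream.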



Panat Taranat

Dec 4, 2020, 6:49:16 PM
to black-parrot

Thank you for your explanation of CCE memory commands. 

We will proceed with a hardware stream prefetcher near the L2 cache.
You mentioned that examining the stream of e_bedrock_mem_rd commands from the CCE is a good way to keep track of L1 misses,
and that the bp_cce_fsm module is a good design example to learn from when implementing a stream prefetcher module (using a hardware FSM?).

So whenever the CCE sends an access to L2, this is a signal that we have an L1 cache miss.

How do we capture this signal as an input?
Will we be writing a module like the cce_fsm module (i.e. a new .v file)?
How will our module fit in with the rest of the architecture? E.g., what signal do we send to initiate a prefetch?
What steps do we need to take to test that our module works? I assume that we will have to write a test bench where a stream buffer is useful, and another where it is not.

The challenge level for this task is relatively high, not because of the logic but because we are unfamiliar with the codebase and architecture. So we need a little bit of handholding.
If there are helpful documents or resources you could point us to, we'd appreciate it.
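
One way to think about the two test cases mentioned above, before writing any RTL test bench, is a small software model. The sketch below is in Python and is only an idealized assumption: it uses a simple policy (after two consecutive sequential reads, prefetch the next few blocks), an unbounded prefetch buffer, and a hypothetical function name prefetch_coverage. It compares a sequential trace, where streaming should help, with a scattered trace, where it should not.

```python
import random
from typing import Sequence


def prefetch_coverage(trace: Sequence[int], degree: int = 2) -> float:
    """Fraction of demand reads whose block had been prefetched earlier.

    Idealized model: the prefetch buffer is unbounded and a prefetch is
    assumed to complete before the next demand read arrives.
    """
    prefetched = set()
    covered = 0
    prev = None
    for block in trace:
        if block in prefetched:
            covered += 1
        # Train: a read to the block right after the previous one
        # triggers prefetches for the next `degree` blocks.
        if prev is not None and block == prev + 1:
            prefetched.update(range(block + 1, block + degree + 1))
        prev = block
    return covered / len(trace)


# Case 1: a 64-block sequential sweep, where a stream buffer is useful.
sequential = list(range(1000, 1064))

# Case 2: 64 distinct blocks that are all multiples of 4, so no two reads
# are ever adjacent and the model never trains.
rng = random.Random(0)
scattered = rng.sample(range(0, 1 << 20, 4), 64)

seq_cov = prefetch_coverage(sequential)
rand_cov = prefetch_coverage(scattered)
```

In the sequential case every read after the first two is covered, while the scattered case gets no coverage at all. A hardware test bench could replay the same two kinds of traces and compare L2 hit rates with the prefetcher enabled and disabled.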