Looking for Graduate Research Project ideas related to RISC-V


Muhammad Ali Akhtar

unread,
Jan 18, 2018, 4:22:39 AM1/18/18
to RISC-V HW Dev
Hello All,

Well, as the topic suggests, I am looking for open ideas / issues that can be made part of Graduate (Master's) level research work, related to RISC-V and its FPGA implementations or its application as an FPGA soft processor.

Any info regarding this will be highly appreciated.


Thanks and Regards,


Muhammad Ali Akhtar
Principal Design Engineer
http://www.linkedin.com/in/muhammadakhtar

高野茂幸

unread,
Jan 18, 2018, 4:42:11 AM1/18/18
to Muhammad Ali Akhtar, RISC-V HW Dev
Indeed Berkeley knows.
On Thu, Jan 18, 2018 at 18:22 Muhammad Ali Akhtar <muhamma...@gmail.com> wrote:

Max Hayden Chiz

unread,
Jan 20, 2018, 10:56:57 AM1/20/18
to RISC-V HW Dev
Since one of the goals of the RISC-V project is to get a standardized platform for research, I've been keeping a list of micro-architectural designs that we should have in the tree so that people designing new stuff have a broad basis for comparison. Implementing some of this for BOOM might make for a good research project depending on how confident you are and how much time you have. Also, Chris Celio was keeping a list of "low hanging fruit" that needed addressing in BOOM's design and may be able to make additional suggestions. We also need some tooling and simulator improvements, but since you want to do FPGA work, I'm leaving the ones I've noticed out and focusing on microarchitectural stuff.

1) This paper www.cs.virginia.edu/~skadron/Papers/boyer_federation_taco.pdf shows how to build a very minimal OoO processor that has about 10% area overhead from an in-order one (and about 90% of the performance of a "normal" OoO design). Their design is primarily about aggregating 2 in-order cores into one OoO one, but I think you could just stick with implementing their pieces as alternatives for BOOM's and let someone else worry about the "combining cores" part. Back-of-the-envelope calculations show that these mini-OoO cores are probably on the efficient frontier for the vast majority of usage cases. (If you do this, you may also want to look at _SEED: Scalable, Efficient Enforcement of Dependences_, which has a newer issue queue design that may be better than the one in the original paper. And if you want to try to work on the register file as well you probably want to start with this paper: _A Speculative Control Scheme for an Energy-Efficient Banked Register File_.) An advantage of this approach is that there are multiple parts to it, so if you only want to replace the Load-Store unit or just the scheduler and leave the rest for someone else, that's doable.

2) We don't have conditional move instructions but about 1/3rd of TAGE's branch mispredictions can be avoided by predication of some sort. There are a couple of ways of dealing with this. One is a control independence scheme (https://hal.inria.fr/inria-00539647/document). Another option is to predicate in the rename unit by extending the RAT and implementing a select uop. The simplest option is to predicate the instructions and implement the predication in the decoder by splitting each predicated instruction into 2 uops, like the Alpha's cmove instruction. (IIRC, these are discussed in the paper I linked.) Ideally you'd do all three and compare them. But doing even one of these would be a huge help. The catch is that this would require some minor preliminary work on BOOM's TAGE predictor to add confidence estimates to the branch predictions so that you only do this on "low" confidence short forward branches. (The paper I linked suggests heuristics that seem plausible. A rough sketch of such a confidence counter appears after this list.)

3) It would be good if someone could add a (global) statistical corrector unit with an innermost-loop iteration component like the newer experimental iterations of TAGE have. (They also have a *local* statistical corrector, and for research purposes we'd ideally want that as well, but I think that the implementation costs are likely to be prohibitive in practice.) It would also be helpful if we had the ability to use branch confidence predictions to avoid check-pointing on very high confidence branches so that we get more bang for the buck out of the remap table check-pointing hardware.

4) You could implement a VTAGE+2D-stride value predictor using Perais's EOLE design. (He's got a paper where he shows how to do this with variable length instructions. BeBoP, IIRC.) This might be a bit too much work though. A related idea is _Continuous Optimization_ by Fahs et al. (Theoretically value optimization can be combined with value prediction, but no one has done it yet. Still, we have to have both in tree for someone to try combining them. Also there's a related idea from the same people called RENO that used a rename-based optimization scheme, but requires modifying the pipeline.) If you do CO or RENO, it requires register sharing, which is explained really well in _Cost Effective Physical Register Sharing_ by Perais.

5) If you want something ambitious, you could try implementing the ideas from pages.cs.wisc.edu/~vinay/pubs/isca-hybrid-arch.pdf and research.cs.wisc.edu/vertical/papers/2016/asplos16-exocore.pdf ; ideally you'd have a way to convert regular RISC-V instruction streams into instructions for these special accelerators. But just having them in the tree would be a huge help because it would let people see how useful they were in the presence of a "real" vector unit and if they prove useful, someone else could work on the hardware to detect the loops and do the conversion instead of requiring compiler profiling support. (B/c that's more of a PhD thing.)

6) Another project would be to replicate the ideas from _Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window_ and/or _A Performance-Correctness Explicitly-Decoupled Architecture: Technical Report_ (see also the thesis _Accelerating Decoupled Look-ahead to Exploit Implicit Parallelism_). These have some really promising performance and power numbers. But their designs rely on the presence of an L2 cache. So unless our L2 cache is forthcoming, this might not be a good option. Basically the idea here is to mimic a wide issue processor with a very large instruction window by pairing two narrower and shallower processors together with a queue. This eliminates about 90% of branch mispredictions and cache misses and allows you to simplify the design of the two cores.

7) Another ambitious project would be implementing _Exploiting Core Working Sets to Filter the L1 Cache with Random Sampling_ by Etsion. If this can be made to work, saving that power and area from the L1 cache in smaller RISC-V configurations would be nice. If we get an L2 cache soon, modifying/redoing it to support cache compression would be interesting as well. (I can provide citations for you if you need them.)

8) A very ambitious project would be to take the ideas from _Cache Restoration for Highly Partitioned Virtualized Systems_ and _Extending Data Prefetching to Cope with Context Switch Misses_ and come up with a way to use them in RISC-V. This seems to require an extension to the privileged ISA. And ideally it would work in a way that doesn't just allow you to save cache state, but also branch prediction, prefetcher, and other speculative state on a per-process basis. So this is a lot of work for a master's project.

9) I've been doing some preliminary design work on adding data cache prefetching to Rocket and BOOM. If you want to take that over, I'll share my work (and my literature review) and you can implement (some/all) of it. (A sketch of the sort of stride prefetcher I have in mind follows this list.)
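
To make the confidence estimates mentioned in items 2 and 3 concrete, here is roughly the kind of counter I mean, written as a behavioral C++ sketch rather than Chisel. The resetting-counter scheme and every threshold below are illustrative assumptions on my part, not anything BOOM's TAGE currently implements.

    #include <cstdint>

    // Minimal sketch of a resetting confidence counter, one per predictor entry.
    // "Low confidence" short forward branches are candidates for if-conversion
    // (item 2); "very high confidence" branches could skip RAT checkpointing
    // (item 3).  The thresholds are placeholders.
    struct ConfidenceCounter {
        uint8_t correct_streak = 0;          // saturating count of consecutive hits

        void update(bool prediction_was_correct) {
            if (prediction_was_correct) {
                if (correct_streak < 15) ++correct_streak;
            } else {
                correct_streak = 0;          // reset on any misprediction
            }
        }
        bool low_confidence()  const { return correct_streak < 4;   }
        bool high_confidence() const { return correct_streak >= 15; }
    };

    // Illustrative gate for item 2: only if-convert short forward branches that
    // the predictor is unsure about.
    inline bool should_if_convert(bool is_forward, int offset_bytes,
                                  const ConfidenceCounter& c) {
        return is_forward && offset_bytes <= 8 && c.low_confidence();
    }

The same counter can serve both decisions: if-convert only when low_confidence() holds, and skip remap-table checkpointing only when high_confidence() holds.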
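
And for item 9, this is the sort of thing I mean by a data prefetcher: a classic PC-indexed stride prefetcher, sketched behaviorally in C++. The table size, indexing, and confidence threshold are placeholders, not a proposed Rocket/BOOM configuration or interface.

    #include <cstdint>
    #include <vector>

    // Behavioral sketch of a PC-indexed stride prefetcher: remember the last
    // address and stride seen by each load PC, and prefetch one stride ahead
    // once the same non-zero stride repeats.
    class StridePrefetcher {
        struct Entry { uint64_t tag = 0, last_addr = 0; int64_t stride = 0; int conf = 0; };
        std::vector<Entry> table;
    public:
        explicit StridePrefetcher(size_t entries = 64) : table(entries) {}

        // Called on every demand load; returns 0 when nothing should be prefetched.
        uint64_t access(uint64_t pc, uint64_t addr) {
            Entry& e = table[(pc >> 2) % table.size()];
            if (e.tag != pc) {                       // new PC: (re)allocate the entry
                e.tag = pc; e.last_addr = addr; e.stride = 0; e.conf = 0;
                return 0;
            }
            int64_t stride = (int64_t)addr - (int64_t)e.last_addr;
            e.conf = (stride != 0 && stride == e.stride) ? e.conf + 1 : 0;
            e.stride = stride;
            e.last_addr = addr;
            return (e.conf >= 2) ? addr + stride : 0; // prefetch one stride ahead
        }
    };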

Hope this helps and/or inspires some ideas of your own.

--Max

Michael Chapman

unread,
Jan 20, 2018, 1:56:46 PM1/20/18
to Max Hayden Chiz, RISC-V HW Dev
On 20-Jan-18 16:56, Max Hayden Chiz wrote:

> 2) We don't have conditional move instructions

We do. Kind of. e.g. for the conditional statement:-

    r6 = (r2 >= r3) ? r4 : r5

    ; scratch register s1
    slt  s1, r2, r3  ; s1 := 0 if r2 >= r3 else 1
    addi s1, s1, -1  ; s1 := -1 if r2 >= r3 else 0
    and  r6, s1, r4  ; r6 := r4 if r2 >= r3 else 0
    xori s1, s1, -1  ; s1 := -1 if !(r2 >= r3) else 0
    and  s1, s1, r5  ; s1 := r5 if !(r2 >= r3) else 0
    add  r6, r6, s1  ; r6 := (r2 >= r3) ? r4 : r5

Michael Chapman

unread,
Jan 20, 2018, 2:13:30 PM1/20/18
to Max Hayden Chiz, RISC-V HW Dev
Not sure what happened with the formatting.
We do kind of have a conditional move. E.g.

    r6 = (r2 >= r3) ? r4 : r5

can be written as follows with no branching:-

    ; scratch register s1
    slt     s1, r2, r3  ; s1 :=  (r2 >= r3) ?  0 : 1
    addi    s1, s1, -1  ; s1 :=  (r2 >= r3) ? -1 : 0
    and     r6, s1, r4  ; r6 :=  (r2 >= r3) ? r4 : 0
    xori    s1, s1, -1  ; s1 := !(r2 >= r3) ? -1 : 0
    and     s1, s1, r5  ; s1 := !(r2 >= r3) ? r5 : 0
    add     r6, r6, s1  ; r6 :=  (r2 >= r3) ? r4 : r5


Stefan O'Rear

unread,
Jan 20, 2018, 2:18:43 PM1/20/18
to Michael Chapman, Max Hayden Chiz, RISC-V HW Dev
On Sat, Jan 20, 2018 at 10:58 AM, Michael Chapman
<michael.c...@gmail.com> wrote:
> On 20-Jan-18 16:56, Max Hayden Chiz wrote:
>
>> 2) We don't have conditional move instructions
>
> We do. Kind of. e.g. for the conditional statement:- r6 = (r2 >= r3)

That is not germane to this discussion. Those are not conditional
move instructions and implementing them as if they were would be
inappropriate.

-s

Stefan O'Rear

unread,
Jan 20, 2018, 2:21:45 PM1/20/18
to Max Hayden Chiz, RISC-V HW Dev
On Sat, Jan 20, 2018 at 7:56 AM, Max Hayden Chiz <max....@gmail.com> wrote:
> Since one of the goals of the RISC-V project is to get a standardized
> platform for research, I've been keeping a list of micro-architectural
> designs that we should have in the tree so that people designing new stuff
> have a broad basis for comparison. Implementing some of this for BOOM might
> make for a good research project depending on how confident you are and how
> much time you have. Also, Chris Celio was keeping a list of "low hanging
> fruit" that needed addressing in BOOM's design and may be able to make
> additional suggestions. We also need some tooling and simulator
> improvements, but since you want to do FPGA work, I'm leaving the ones I've
> noticed out and focusing on microarchitectural stuff.

While I don't anticipate having time to do anything with it in the
near future, I'd be interested to see more on the tooling and
simulator projects. Maybe material for the riscv-boom wiki? (Thanks
for this list!)

-s

Prof. Michael Taylor

unread,
Jan 20, 2018, 5:40:51 PM1/20/18
to Muhammad Ali Akhtar, RISC-V HW Dev
Hi,

Schematics and detailed descriptions of the operation of the various subsystems of Rocket would be very valuable to the community.


Jacob Bachmeyer

unread,
Jan 20, 2018, 8:46:13 PM1/20/18
to Michael Chapman, Max Hayden Chiz, RISC-V HW Dev
Michael Chapman wrote:
> On 20-Jan-18 16:56, Max Hayden Chiz wrote:
>
>> 2) We don't have conditional move instructions
>>
>
> We do. Kind of. [...remove branchless "logical" conditional move...]

The ISA spec also suggests (in commentary while describing branch
instructions) that microarchitectures could translate short forward
branches into predicated instructions. I would like to see this made
the standard encoding of predicated instructions on RISC-V, so actually
implementing conditional move this way could be a good step in the right
direction.

For example: (pseudo-assembler)

.macro CMOVE <cond>, <crs1>, <crs2>, <rd>, <rs>
B<!cond> <crs1>, <crs2>, 1f
MV <rd>, <rs>
1:
.endm

Hardware can recognize this because the branch offset literal will be 2
(or 1 if RVC is used for the MV instruction), corresponding to a 4-byte
(2-byte if RVC used) forward branch.
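
As a concrete illustration (a behavioral C++ sketch, not Rocket's actual decode logic), a fetch/decode stage could flag the idiom whenever a conditional branch jumps exactly over the single instruction that follows it. Only the uncompressed 32-bit branch encoding is handled here; RVC branches would need a second decoder path.

    #include <cstdint>

    static int32_t b_type_imm(uint32_t inst) {            // B-type immediate, in bytes
        uint32_t imm = ((inst >> 31) & 0x1) << 12 |       // imm[12]
                       ((inst >> 7)  & 0x1) << 11 |       // imm[11]
                       ((inst >> 25) & 0x3f) << 5 |       // imm[10:5]
                       ((inst >> 8)  & 0xf)  << 1;        // imm[4:1]; imm[0] is always 0
        return (int32_t)(imm << 19) >> 19;                // sign-extend from bit 12
    }

    // `branch` is the 32-bit branch word, `next_len` the byte length (2 or 4)
    // of the instruction that follows it.
    static bool skips_one_instruction(uint32_t branch, unsigned next_len) {
        if ((branch & 0x7f) != 0x63) return false;        // not the BRANCH major opcode
        return b_type_imm(branch) == (int32_t)(4 + next_len);
    }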


-- Jacob

Stefan O'Rear

unread,
Jan 20, 2018, 8:54:10 PM1/20/18
to Jacob Bachmeyer, Michael Chapman, Max Hayden Chiz, RISC-V HW Dev
This is what I have in mind, and I've had several discussions with
Christopher Celio about the feasibility of using it, but idiom
encoding is not a project idea and not germane as an answer to
"Graduate Research Project Ideas Related to RISC-V".

-s

Max Hayden Chiz

unread,
Jan 20, 2018, 9:00:03 PM1/20/18
to RISC-V HW Dev, max....@gmail.com
Well, I'm not a simulator/tools guy so this reads more like a wish list instead of the tight micro-architectural stuff from my first post. (Also, I left out a whole bunch of generic "make it better" type stuff. E.g. improve our documentation and wiki, widen the fetch unit, increase the cache bandwidth, make the LSU capable of more than one instruction per cycle, add an optimizing? trace cache, improve memory level parallelism, do macro-op fusion, increase the pipeline's scalability so that we can have a deeper instruction window, more units, and higher performance, etc.)

As for the tools and simulator stuff:

1) According to Celio, right now BOOM is slower than it should be on an FPGA b/c we can't register retime the FPUs. It would be good if someone could come up with a workaround or a fix for this.

2) For doing design exploration, you often don't need or want to simulate a full processor. We should have a way to replace parts of the chip with dummy parts that simulate in C++ (or on an FPGA) much faster by just approximating their results. E.g. when I'm researching cache prefetching, I don't really care about anything other than the stream of memory addresses and PCs that the cache sees. So we want to speed the design iteration process up as much as possible by having our code support both trace-driven and execution-driven simulations (and otherwise being flexible). Being able to "fast-forward" a large number of warm-up instructions (e.g. 50B) in some way, maybe with check-pointing, and then fully simulate only the 100M or so that you are using as your steady state sample would be helpful too. In my ideal world, we'd have all the features of the good software simulators but in a pluggable way so that you could gradually move from "fast and rough" to "slow and accurate" for individual parts of the processor. (And it would work in a way that you only had to code your design once instead of first writing it in C++ for some design simulator and then redoing it with more detail in Chisel so that you could test it for real.) Also, this kind of ties into the next point, but there are lots of papers out there about how to use statistical sampling to get good simulation results with a fraction of the computation. Our tools should make using those techniques straightforward. (A sketch of the kind of plug-in boundary I mean appears after this list.)

3) For *reporting* results, the current academic standards are pretty bad. You usually report performance, area, and power against some arbitrary baseline and there's little in the way of hypothesis testing or other statistical techniques. I've certainly never seen a meta-analysis done for micro-architecture research. I think what we need is a statistical modelling / convex optimization framework with a database of existing results. This way there's a known efficient frontier in terms of designs and the tool kit would run the right number of configurations and tests with your new design and then show how it impacted the efficient frontier. And it would use all the statistical tricks that the literature reports for creating reliable but efficient samples. After it was done, it would auto-upload your runs to the database so that others don't have to rerun them and can see how their results interact with yours. Thus it would support open science and good statistical practice. Furthermore, chip designers are not statisticians and it would really help everyone if there was a straightforward set of tools that could generate the right comparative statistics and graphs for your publication. (And beyond just reporting results, since the processors are configurable, anyone using them as a jumping off point will need to know what the reasonable configurations are in some scientific way. As we add more options, this problem gets worse unless we have such a statistical framework.) I've got some ideas here, but they are very rough and the literature on this is pretty vast. My general idea is to have something like Henry Cook's GPRS toolkit generate log-convex response surface functions that can then be fed to a convex optimizer.

4) Our current options for testing performance numbers are not all that good. Coremark is very limited. SPEC requires running a full OS and is mostly a technical benchmark. And things like cloudsuite or TPC OLTP and DSS seem to require more memory than FPGA boards are likely to have. We need to start building a workable collection of example code sequences and "badly behaved" code so that it's easier to get a broad look at how a new design works. (E.g. lots of cache prefetchers really only shine on cloudsuite or OLTP benchmarks b/c SPEC is too simple in terms of memory access patterns.) Ideally we'd have some well written statistically validated (synthetic?) benchmarks that can run on a raw core or at least can be run on a practical FPGA board and that stress the various parts of the processor in creative ways. (Can you recreate the branching and memory behavior of those complex database benchmarks with a smaller database and SQLite? I don't know...) It kind of defeats the point of doing the design work in Chisel and using RISC-V as a research tool if you can't actually do a reliable FPGA test of your results and have to fall back on a set of software simulator tools.

5) Right now if you want to contribute seriously, you need to own an FPGA board (or several). Setting up an FPGA farm and some open source infrastructure so that people can contribute with just a simulator and then offload the necessary FPGA simulations to the Free Chips Project would help us attract more developers. It needs to be easy to get started.

6) If you aren't at an academic institution, you don't have access to a technology library and other tools that will let you do area, power, and speed estimations. So it isn't possible for someone to e.g. work on 4-way BOOM to optimize its clock-rate in real silicon. I don't know how to solve this, but long-term we need to figure something out so that we can be accessible for contributors.

(5 & 6 have been discussed at length at RISC-V conferences. I don't really have much new to add. I'm just noting that they are big, open issues on which we've so far seen little in the way of progress.)
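
On point 2, this is the kind of plug-in boundary I have in mind: the same behavioral model can be driven by a recorded trace ("fast and rough") or by a full execution-driven simulator. The class and method names below are invented for illustration; this is not an existing rocket-chip or FireSim API.

    #include <cstdint>
    #include <vector>

    // One observation the memory system would feed to any prefetcher model.
    struct MemAccess { uint64_t pc; uint64_t addr; bool is_store; };

    // Any prefetcher design implements this one interface, whether it is a
    // rough C++ approximation or a wrapper around detailed RTL simulation.
    class PrefetcherModel {
    public:
        virtual ~PrefetcherModel() = default;
        // Observe one demand access; return the addresses the model wants prefetched.
        virtual std::vector<uint64_t> observe(const MemAccess& a) = 0;
    };

    // Trace-driven harness: evaluate a model against a pre-recorded access trace.
    inline uint64_t count_prefetches(PrefetcherModel& m,
                                     const std::vector<MemAccess>& trace) {
        uint64_t issued = 0;
        for (const auto& a : trace) issued += m.observe(a).size();
        return issued;
    }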



Max Hayden Chiz

unread,
Jan 20, 2018, 9:34:16 PM1/20/18
to RISC-V HW Dev, jcb6...@gmail.com, michael.c...@gmail.com, max....@gmail.com
There's also a more general issue with this specific case: even very good compilers have trouble catching all of the situations where if-conversion could be used. So a hardware solution is preferable.

Also, you can't use the idiom to encode a conditional move unless you implement it in one of the three ways I listed. And if you have that implementation, there seems to be no real point to confining yourself to just that idiom instead of being intelligent and using some decent heuristics and dynamic info from the branch predictor to try to identify the ones the compiler missed.

Additionally, I didn't mention it above because I don't know how to deal with it microarchitecturally, but another 30% or so of the TAGE misses (i.e. ~half of the non-if-convertible misses)  are relatively control-independent. (See https://people.engr.ncsu.edu/ericro/publications/conference_MICRO-45.pdf) So with *architectural* support, it's pretty trivial to "vectorize" the computation of the branch condition and then use the stored branch results to get those branches right. Doing this on a micro-architectural level (without having to use the dual core execution trick from idea #6) would be really cool, but I don't see how you would implement it. (At least in a conventional processor. On a dataflow processor this is trivial and basically free.)

It wouldn't be hard to make this an instruction extension (you just need an instruction that computes a branch result and rotates it into a register and another instruction that branches on the least significant bit of the register and rotates it out), but it is not really in the spirit of RISC-V. So coming up with a micro-architectural solution would be ideal.
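
To spell out what that hypothetical instruction pair would do (register-level semantics only, sketched in C++; this is not a proposed encoding, and it assumes the producing loop is blocked so that exactly 64 outcomes fill the accumulator before the consuming loop reads them back):

    #include <cstdint>

    // "Compute a branch result and rotate it into a register": shift the 1-bit
    // comparison outcome in at the top of an accumulator register.
    inline uint64_t brcollect_lt(uint64_t acc, int64_t rs1, int64_t rs2) {
        return (acc >> 1) | ((rs1 < rs2 ? 1ull : 0ull) << 63);
    }

    // "Branch on the least significant bit of the register and rotate it out":
    // after a full 64-entry block the oldest outcome sits at bit 0.
    inline bool brconsume(uint64_t& acc) {
        bool taken = acc & 1;
        acc >>= 1;
        return taken;
    }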

FWIW, this is one of the reasons I like #6's speculative-access/execute thing -- it gets us a "wide" processor for basically no engineering cost and it takes care of lots of little issues like this that seem otherwise expensive to handle. In theory it's worse than just doubling the cache bandwidth and the issue width and making a huge instruction window (because it can only execute from the front and the end of the window instead of anywhere in between), but all of those things are prohibitively expensive and require scaling structures that are worse than linear. So I don't think that in practice you could do better than what this approach gives you: roughly double the area, the same power, and a 40-50% IPC boost.



Sagar Karandikar

unread,
Jan 20, 2018, 11:48:39 PM1/20/18
to Max Hayden Chiz, RISC-V HW Dev
The FireSim (https://content.riscv.org/wp-content/uploads/2017/12/Wed1724_FireSim_Karandikar.pdf and https://fires.im) and MIDAS (https://bar.eecs.berkeley.edu/projects/2015-midas.html) projects at Berkeley are working towards addressing many of these goals on the FPGA side of things. FireSim is a complete FPGA-accelerated simulator for RocketChip-based systems (including very large clusters of them), while MIDAS provides tools for building FPGA-based simulators like FireSim. While we've demonstrated simulating rocket at this point, we're currently working on adding BOOM support to FireSim. 

> 1) According to Celio, right now BOOM is slower than it should be on an FPGA b/c we can't register retime the FPUs. It would be good if someone could come up with a workaround or a fix for this.

I believe this has been fixed in the version of BOOM that we're bumping to, assuming that BOOM is sharing FPU components from upstream RocketChip. For instance, for rocket, we pulled in a change from upstream rocket-chip a couple of months ago that addresses this and let us push FPGA frequency from 90 MHz to 190 MHz on the FPGAs on EC2 F1.

> 4) Our current options for testing performance numbers are not all that good...
> 5) Right now if you want to contribute seriously, you need to own an FPGA board (or several)...

While not free, EC2 F1 (which FireSim runs on) is quite cheap compared to building your own FPGA farm, especially when using spot instances, which are usually fine for running benchmarks. Amazon also gives EC2 credits to researchers/academics (https://aws.amazon.com/grants/).

We didn't emphasize this in the linked FireSim talk at the RISC-V workshop that focused on datacenter simulation, but we can "disable" the cycle-accuracy of the simulated network in FireSim, which essentially gets you a big, "functionally networked" cluster of rockets (and soon BOOMs) that you can run workloads on, without paying the performance cost of global-synchronization for network simulation. The simulated nodes still have a NIC and are bridged together into a network, so you can still ssh into them and run workloads as if they were real machines. For rocket, the individual simulations run at ~150 MHz (obviously this will vary based on workload). We also have the scripting/deploy tools necessary to automatically bring up large simulations (users just give a RocketChip config, desired # of nodes, and a list of EC2 F1 Instance IPs) and run large workloads (e.g. running all of the SPECInt2006 benchmarks in parallel on a bunch of F1 instances and automatically collecting results). The FPGAs on EC2 F1 also have relatively large amounts of DRAM (64 GB per FPGA), which should enable running large workloads.

We're hoping to release FireSim sometime before the summer. Among other things, one goal of FireSim is to essentially get a magic "run fast on an FPGA in the cloud" button for RocketChip based systems (and hopefully others in the future). Your wish list item #2 is also one of the goals of FireSim/MIDAS. These tools do currently perform heterogeneous simulation (software + FPGA), but users can't yet draw an arbitrary boundary in the RTL and have the simulation automatically split between software and FPGA; that split has to be done by hand.

-Sagar


Max Hayden Chiz

unread,
Jan 21, 2018, 12:31:17 AM1/21/18
to RISC-V HW Dev, max....@gmail.com
Thanks for the info. This is very good news. I look forward to the release.

Max Hayden Chiz

unread,
Jan 21, 2018, 1:35:51 PM1/21/18
to Luke Kenneth Casson Leighton, RISC-V HW Dev


On Sun, Jan 21, 2018 at 2:59 AM, Luke Kenneth Casson Leighton <lk...@lkcl.net> wrote:
On Sat, Jan 20, 2018 at 3:56 PM, Max Hayden Chiz <max....@gmail.com> wrote:

awesome set of suggestions, max!


> 5) If you want something ambitious, you could try implementing the ideas
> from pages.cs.wisc.edu/~vinay/pubs/isca-hybrid-arch.pdf and
> research.cs.wisc.edu/vertical/papers/2016/asplos16-exocore.pdf ; ideally
> you'd have a way to convert regular RISC-V instruction streams into
> instructions for these special accelerators.

how about taking the general-purpose nyuzi compute engine, currently
being used to research 3D graphics, and leaving the (execution)
pipeline as-is, replacing its instruction fetch / decode part with
RISC-V?

Creating a RISC-V GPU with GPGPU capabilities is a good idea. But I was putting it under the general "make it better" category since I was confining my list to microarchitectural things you could do for our existing processors. And because designing a whole new chip is a lot more work than adding a microarchitectural refinement, I think it might be more of a PhD thing. (But could be wrong.) Also, for GPGPU, you'd ideally want it to be as compatible as possible with the forthcoming vector instruction set. So you'd want to look at Hwacha and our eventual vector unit in addition to Nyuzi when you were coming up with the design.

Anyway, to save people from reading the linked papers, and because it lets me link another possible research idea for the OP: the linked papers are about accelerating *non-vectorizable* loops. This is a new research frontier because vectorizable ones are mostly a "solved" problem modulo some design variations.

For non-vectorizable loops, we have lots of different possible options and no one knows the best way to handle them or what the trade-offs are. The area is largely unexplored and getting the various research ideas into the tree would help future researchers because it would make it possible to give a solid comparison.

E.g. Ideally you'd want to know how these accelerators stack up to something simpler like generalized chaining (https://snehasish.net/docs/sharifian-micro16.pdf) which could either be an ISA extension or could just be an under-the-hood thing for hot traces picked up by a trace-cache. (The idea here is that you group instructions with single dependencies into chains so that you amortize the decoding cost and avoid the communication and scheduling cost of handling them separately.)
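
A toy illustration of that grouping idea (a C++ sketch under my own simplifying assumptions; the real chain selection in the Sharifian et al. paper is done offline on profiled traces): walk a hot trace and append an instruction to the open chain when its single live input is produced by the chain's tail and that value has no other consumer.

    #include <vector>
    #include <map>

    // Registers are numbered >= 0; use_count maps each produced register to its
    // number of consumers in the trace.
    struct Inst { int id; int dst; std::vector<int> srcs; };

    std::vector<std::vector<int>> form_chains(const std::vector<Inst>& trace,
                                              const std::map<int,int>& use_count) {
        std::vector<std::vector<int>> chains;
        int open_dst = -1;                       // register produced by the open chain's tail
        for (const Inst& in : trace) {
            bool extends = in.srcs.size() == 1 && in.srcs[0] == open_dst &&
                           use_count.at(open_dst) == 1;   // tail value has a single consumer
            if (extends && !chains.empty()) {
                chains.back().push_back(in.id);  // fold into the open chain
            } else {
                chains.push_back({in.id});       // start a new chain
            }
            open_dst = in.dst;
        }
        return chains;
    }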

The "behavior specialized accelerators" in the above papers get more performance, save more power, and work for more loops (in particular loops with unpredictable branches), but as-is they require a compiler to take some profiling information and then generate special dataflow instructions for the loop accelerators. That's probably fine for embedded use. But for general purpose use, you either need to come up with a good ISA extension that isn't directly tied to the hardware implementation or (ideally) we'd have some kind of loop detector that would detect when we were in an acceleratable loop and do the conversion itself.

So if you put them in-tree, there's a lot of different research opportunities they enable, but it isn't something that's going to be usable in industry right away.

Personally, I think that the code coverage and estimated power saving of these BSAs are overstated. First, the papers are normally detecting "vectorizable" loops using simple SIMD instructions instead of real vector units. Second, they aren't considering optimizations that make nested-parallel loops vectorizable (see https://www.cs.cmu.edu/~guyb/papers/sc90.pdf; although that version suffers from excessive data copying, ideally we'd have an ISA extension for *predicated* segmented scan to avoid this). And third, they aren't considering reasonable power-saving microarchitectural steps in the general purpose processor.
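
For reference, the segmented scan primitive that paper builds on has very simple serial semantics (sketched below in C++); a vector-ISA version would be the masked/predicated form of the same thing.

    #include <vector>

    // Inclusive segmented sum-scan: flags[i] == true marks the start of a new
    // segment, and the running sum restarts there.
    std::vector<int> segmented_scan(const std::vector<int>& data,
                                    const std::vector<bool>& flags) {
        std::vector<int> out(data.size());
        int run = 0;
        for (size_t i = 0; i < data.size(); ++i) {
            run = flags[i] ? data[i] : run + data[i];
            out[i] = run;
        }
        return out;
    }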

OTOH, from an Amdahl's law perspective the performance boost being reported is too low. They are being selected for 60-80% of the computation time but are throughput saturated. While this gives a significant performance boost for smaller processors, the numbers indicate that provisioning them with more hardware is a net-win. (And doing that might make them beneficial to larger processors as well.)

So I'd like to see some better treatment of the idea and think we'd be able to do that in the RISC-V context. The accelerators clearly have benefits in the embedded context. But working out what to do with them in the context of say 4-way BOOM is going to need more research.


one of the challenges will be that it is designed as a "tagged"
[general-purpose] vector processor engine, modelled after the Cray
Supercomputers which in turn were advancements of the CDC Cyber Series
[worth mentioning because it means any patents will have expired].

i know of the concept of "tagged" registers from working with Aspex
Semiconductors "Massively parallel / Deep SIMD" architecture, which
was a 4096-wide SIMD array of 2-bit processors, each processor having
256 bits of Content-Addressable Memory.  "tagged" registers is also
known as vector or SIMD "masking" i.e. you issue an instruction but
there is a "mask" which actually CONDITIONALLY stops one (or none, or
all) of the operations from being executed on one (or none, or all) of
the elements of the vector, regardless of whether the instruction be a
load, store, compare or anything else.

when you also have the ability within the processor for the "tag" to
be modified by the results of former operations (such as carry or EQ
or NE operations) it becomes an extremely powerful technique.  when
you also have the ability within the processor to treat the *tag* as a
general-purpose register it becomes *even more* of a powerful
technique, such as being able to turn a vector into a 512-bit number,
by having the carry of each part of the 32/64-bit vector go into the
"tag", then shifting the "tag" register by 1, now you can
conditionally "add 1" on the next cycle (which will ONLY happen if the
carry from the vector element next door is set to 1) and now you've
got yourself a 512-bit add operation.

the current "problem" with Nyuzi is that due to the addition of the
"mask" to the *actual instructions*, the length of the Nyuzi RISC
instruction *exceeds* that of a RISC-V instruction.

so... one of the first tasks that would need to be carried out would
be to remove that "mask" part from the Nyuzi instructions and make the
"mask" an actual (persistent) general-purpose register.  the result
would be that what previously took 1 instruction would now take 2
(first set up the "tag" then execute the SIMD vector instruction
*using* that tag) but the advantage would be that *subsequent
instructions may not need the tag/mask register to be changed*.

it would be an extremely interesting and quite ambitious project,
which would have the strategic advantage of benefitting the libre
world significantly by bringing an open general-purpose vector
processing compute engine and associated research into 3D Graphics to
the forefront.

there do not exist any complete truly open and libre modern GPUs right
now and this is a severe problem that has far-reaching implications.
*not a single* modern processor available today has a truly open 3D
engine.  not one!  in the embedded (SoC) space they are all entirely
proprietary: the only one that has been successfully and fully
reverse-engineered (but for certain models only) is Vivante (Etnaviv)
and Vivante's *own* software library is so poorly implemented, causing
huge unreliability in end-user deployment, that their reputation right
across the industry, with huge influential companies like Rockchip for
example, has been completely shot to s**t.

nyuzi is not perfect - it is a general-purpose vector-processing
compute engine - but the research that jeff has done at least will
allow decisions to be made about where or whether certain hardware
accelerated hard macros should be developed, and where the best
"return on investment" would be.

it would be a good start.

l.

lkcl .

unread,
Jan 21, 2018, 3:10:47 PM1/21/18
to Max Hayden Chiz, RISC-V HW Dev
[apologies, max, re-posting as the google group list isn't configured
correctly, i've seen this before, it's not set up in the headers
correctly with the right "Reply-to" such that when i hit reply -
ironically with gmail - gmail picks my *default* From address... which
is not one that's subscribed to the list! doh!]

On Sun, Jan 21, 2018 at 6:35 PM, Max Hayden Chiz <max....@gmail.com> wrote:

> Creating a RISC-V GPU with GPGPU capabilities is a good idea.

it would be absolutely amazing, wouldn't it?

> But I was
> putting it under the general "make it better" category since I was confining
> my list to microarchitectural things you could do for our existing
> processors. And because designing a whole new chip is a lot more work than
> adding a microarchitectural refinement, I think it might be more of a PhD
> thing. (But could be wrong.)

the... ah... what's the name... ORSOC Graphics Accelerator guys, they
were a 2-man MSc team. got a heck of a lot done.

i'm a big fan of not redoing work that's already been done, i tend to
find wildly disparate systems and work out the minimum amount of work
needed to join them together. that way you draw on the expertise of
more people, and they appreciate it a lot.

> Also, for GPGPU, you'd ideally want it to be as
> compatible as possible with the forthcoming vector instruction set.

yyeah that's a tough ask in the case of nyuzi. more on this below.

> So you'd
> want to look at Hwacha and our eventual vector unit in addition to Nyuzi
> when you were coming up with the design.

... except nyuzi has been published, and hwacha hasn't. at least,
there's no public repositories that i can easily find - just a couple
of pages. nyuzi on the other hand has two follow-on research papers,
a complete repository and full and complete documentation.

i've been in touch with jeff and he's a lot of fun to talk to,
extremely knowledgeable. he's done some amazing analysis and has a
clear understanding and breakdown of the tasks and number of
instructions per task in each phase of modern (shader-based) 3D
Graphics Procesing. he also makes it clear that the metric to focus
on, for optimisation and evaluation purposes, is "instructions per
pixel".

> For non-vectorizable loops, we have lots of different possible options and
> no one knows the best way to handle them or what the trade-offs are.

well, this is where jeff's approach - and focus - would come in
handy. and whilst i appreciate that hwacha may, technically, have a
better approach, the fact that nobody outside of the group can *look*
at what they're doing means that, in my mind, sadly it is off the
table for consideration.

> The area is largely unexplored

*precisely*. it's... *sigh* we (royal we) kinda left it a bit late
in the game, for the incumbent proprietary companies to get what... a
20 year head start? it reminds me of a professor i met once who left
Seagate because within that *one* company - all of them are
reverse-engineering each others' hard drives down to the molecular
level - they're *literally* 20 years ahead of academia in
electro-magnetism and he said he just couldn't stand how they were
keeping all that knowledge secret. so... he left and has been
publishing papers ever since.

> and getting the various research ideas into the
> tree would help future researchers because it would make it possible to give
> a solid comparison.

very much so.

> E.g. Ideally you'd want to know how these accelerators stack up to something
> simpler like generalized chaining
> (https://snehasish.net/docs/sharifian-micro16.pdf) which could either be an
> ISA extension or could just be an under-the-hood thing for hot traces picked
> up by a trace-cache. (The idea here is that you group instructions with
> single dependencies into chains so that the you amortize the decoding cost
> and avoid the communication and scheduling cost of handling them
> separately.)

niiice. ha, i told jeff about how Ingenic did it, when they added
X-Burst to their ultra-low-power MIPS processor, he loved it but also
had a lot of respect for what they did. i'll explain why: X-Burst is a
Vector SIMD pipeline that they usually run at 500mhz when the main
processor is running at 1ghz. get this: they get 30 million triangles
per second.... by running awk/sed macros on the standard mesagl
software library, doing pattern-matching on c code and grafting /
substituting X-Burst assembly code in its place! frickin awesome hack
or _what_? :)

but here's the thing: when you try adding this stuff "properly" to
say gcc, you only have to look at say how long it took to get the
altivec support into gcc to know that it's just.. yeah, it's too much.

so instead jeff focussed on a specialised set of tools, and on adding
LLVM support for the Nyuzi (general-purpose) instruction set, and has
left it at that. sometimes, it's easier to do that, y'know?

> The "behavior specialized accelerators" in the above papers get more
> performance, save more power, and work for more loops (in particular loops
> with unpredictable branches), but as-is they require a compiler to take some
> profiling information and then generate special dataflow instructions for
> the loop accelerators.

yehyeh, which is a whooole research area on its own. this is one of
the reasons why i suggested nyuzi, because jeff and the team he's
worked with did all that, already, some years back. that's not to say
that *everything* is done - far from it: for the published papers the
team *specifically* focussed on the core algorithms of 3D engines (the
inner loops). but they also got actual 3D rendering demos up and
running which is a huge achievement (teapot, quake, others).

so "bang-per-buck" wise (and also "getting up and running fast"-wise)
a conversion of nyuzi's processing front-end to RISC-V would be a
higher "return on investment" than anything else i know of [that's
publicly available]. MIAOW is a totally different focus: it's
*specifically* compatible with ATI (now AMD)'s OpenCL Engine and it
would be... unfair to take that achievement away, because you'd be
throwing away the opportunity to utilise an entire pre-existing
*well-tested* - and proven - toolchain.

apologies, i think in these kinds of pragmatic, practical terms,
taking into consideration both the hardware *and* software aspects,
based on what's available and already been done *right* now.


> That's probably fine for embedded use. But for
> general purpose use, you either need to come up with a good ISA extension
> that isn't directly tied to the hardware implementation or (ideally) we'd
> have some kind of loop detector that would detect when we were in an
> acceleratable loop and do the conversion itself.

the warning here, if i may make one, comes in the form of Imagination
Technologies absolute f****** dog's dinner
cluster****-of-an-architecture. it's. universally. HATED. by.
engineers.

luc verhaegen's "state of free software graphics" talk from i think
2014 is the most informative and insightful, but i also know a little
bit about its background. it was developed as a general-purpose
processor by an Imperial College Professor, some time over 20 years
ago. it was supposed to be "flexible" as well as powerful. however,
the level of control-freak-ism adopted by ImgTec, in combination with
the many many changes *per customer* that were made left the code in
such an absolute mess that not even ImgTec's own engineers could
properly understand it... *even when* they charged customers USD
$150,000 to grant them access to the source code... under NDA... no
access to anyone outside of the company permitted to talk about it.

it's the absolute absolute worst of all worlds, and it comes down to
an attempt to turn a general-purpose processor into a specialist
heavily-customisable 3D-capable one. then try to keep it proprietary,
and prevent and prohibit all and any discussion amongst *top*
researchers and experts in the world who could... y'know... actually
HELP?!?!

so there is a warning, there, which could be, y'know, absolutely
fine, *as long as people communicate*, but also to not, for goodness
sake, try to expect general-purpose compilers like gcc to do all the
heavy lifting.

bottom line is, for a first implementation it's *okay* to use
hilarious awk/sed scripts, or raw assembly blocks, or to call out to
DMA-based hard macros. this _is_ 3D after all, y'know?

> OTOH, from an amdahl's law perspective the performance boost being reported
> is too low. They are being selected for 60-80% of the computation time but
> are throughput saturated. While this gives a significant performance boost
> for smaller processors, the numbers indicate that provisioning them with
> more hardware is a net-win. (And doing that might make them beneficial to
> larger processors as well.)

photos of the die area of GPUs show that they're absolutely eeenooormous.

> So I'd like to see some better treatment of the idea and think we'd be able
> to do that in the RISC-V context.

oh, the other thing is: nyuzi does *not* have any specialist
optimised acceleration blocks. it's *very* deliberately focussed at
being a *software-only* general-purpose processor, with performance on
OpenCL that rivals / equals MALI engines.


> The accelerators clearly have benefits in
> the embedded context.

oh! yes, for things like crypto, definitely, an embedded
co-processor is essential for example, as a general-purpose low-power
1ghz processor would just be completely overwhelmed otherwise.

but for video it's actually really really important: the power
savings on an intel laptop from using vaapi, particularly on this
laptop i'm using which has a 3000 x 1800 LCD, are insane. i get
something like 25% CPU usage @ 800mhz per core when
using vaapi, and something like 50 to 60% CPU usage and cpufreq has to
bump things up to... 1.2 1.6 ghz without. it means i can watch a 2
hour 720p film without the battery running out. or without getting
burned by the back of the laptop!

... and that's a *modern* skylake quad-core i7. accelerators aren't
just good for embedded use-cases, is my point.


> But working out what to do with them in the context of
> say 4-way BOOM is going need more research.

yyeah which is in some ways, i feel, a good reason for keeping things
separated in some way. i'm *partly* talking my way out of
recommending the idea / suggestion that i had, but if someone doesn't
start a 3D / OpenCL Vector Processor for RISC-V we'll never _get_ to
the point where 4-way BOOM research into Vector Processing could even
start, ehn?

:)

oh, before i forget, that reminds me of something, which i should
start as a separate thread, if that's ok.

Max Hayden Chiz

unread,
Jan 21, 2018, 9:56:34 PM1/21/18
to RISC-V HW Dev, max....@gmail.com
I think you are talking past me. (And maybe I'm talking past you too.)

I agree that a GPU is an interesting project, but that has nothing to do with idea #5 and the BSAs (loop accelerators) it is talking about. These work on completely different types of code, and in fact the BSAs in the papers I cited are specifically for accelerating things that *can't* be accelerated by a GPU or a vector unit. They've got nothing to do with graphics or vectors and are just a new research area for a type of code that people have recently started figuring out how to accelerate. They work on code that is *not* data parallel, but still has high potential ILP.

In the first paper it works with nested loops where the control instructions are off of the critical path. In the second paper it adds one that works with inner loops that have a consistent execution trace. (So there are control instructions but they are always resolved the same way.)

The idea is that much of the ILP in "irregular" code still has enough regularities that you can get away with much simpler hardware than a full OoO GPP. So instead of an 8-way processor, you can have a very simple 2-way one with these BSAs and get comparable performance on most code but with much smaller area and using much less power.

As for your idea, I've skimmed the documentation and the papers and broken this down into more concrete steps. Nyuzi is basically the Xeon Phi using RISC cores and a hardware rasterizer. We can certainly do something like it with RISC-V. (In fact, I think the people at Esperanto are actively working on it.)

We would need to:
1) Finish the vector extension and get a working vector unit.
2) Finish and implement the bit manipulation extension.
3) Finish the L2 cache in a way that supports efficient cache coherency with a massive number of cores.
4) Modify Rocket (and possibly Boom) to support a barrel processor configuration (hyperthreading in the case of Boom).
5) Reimplement Nyuzi's System Verilog-based rasterizer in Chisel.

People are already working on #1-3. #4 is something that we should do eventually anyway and would make a good master's project if someone isn't already doing it. #5 is a solid idea, but it won't be worth doing until the others are finished. That's especially true because it isn't really essential to the design. It gets a 5% performance boost over software and saves a lot of power, but we could do without that optimization in an initial iteration.
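
To give a feel for step 4 in the list above: the heart of a barrel processor is just a round-robin thread select in fetch that skips stalled hardware threads. A behavioral C++ sketch (not Rocket's fetch stage; names and the stall model are placeholders):

    #include <cstdint>
    #include <vector>

    struct HartState { uint64_t pc; bool stalled; };   // e.g. waiting on a cache miss

    // Returns the next hart to fetch from, starting after `last`; -1 means all
    // harts are stalled and fetch inserts a bubble this cycle.
    int select_next_hart(const std::vector<HartState>& harts, int last) {
        for (size_t i = 1; i <= harts.size(); ++i) {
            int h = (last + (int)i) % (int)harts.size();
            if (!harts[h].stalled) return h;
        }
        return -1;
    }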

David Chisnall

unread,
Jan 22, 2018, 5:39:37 AM1/22/18
to jcb6...@gmail.com, Michael Chapman, Max Hayden Chiz, RISC-V HW Dev
On 21 Jan 2018, at 01:46, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
> The ISA spec also suggests (in commentary while describing branch instructions) that microarchitectures could translate short forward branches into predicated instructions. I would like to see this made the standard encoding of predicated instructions on RISC-V, so actually implementing conditional move this way could be a good step in the right direction.
>
> For example: (pseudo-assembler)
>
> .macro CMOVE <cond>, <crs1>, <crs2>, <rd>, <rs>
> B<!cond> <crs1>, <crs2>, 1f
> MV <rd>, <rs>
> 1:
> .endm
>
> Hardware can recognize this because the branch offset literal will be 2 (or 1 if RVC is used for the MV instruction), corresponding to a 4-byte (2-byte if RVC used) forward branch.

Implementing good support for this in Rocket would be a good idea for an undergraduate student project. I had a student a couple of years ago who added a conditional move instruction to a Bluespec in-order RISC-V core and found that it got about a 20% speedup for less than a 1% area overhead (and needed about four times as much branch predictor state to get the same benefit), so for simple cores there’s definitely a big win to be had.

The suggestion in the ISA spec effectively amounts to treating the conditional branch encoding as a limited form of Thumb-2’s IT instruction. It would be nice if it could be combined with slightly longer instruction sequences (IT supports up to 4 predicated instructions). For a complex pipeline that already has support for multiple sets of speculated instructions and can merge different speculated paths, there’s little advantage in doing this optimisation, but on low-end cores it’s very useful. PowerPC is particularly annoying in this regard, from a compiler perspective, because different microarchitectures run the cmove or conditional-branch-one-instruction-forward sequence faster, so having a uniform encoding of this makes life easier for the compiler.

David

lkcl .

unread,
Jan 22, 2018, 6:00:59 AM1/22/18
to Max Hayden Chiz, RISC-V HW Dev
On Mon, Jan 22, 2018 at 2:56 AM, Max Hayden Chiz <max....@gmail.com> wrote:
> I think you are talking past me. (And maybe I'm talking past you too.)

this is highly likely, i apologise: if anyone learns from this,
that's cool with me. mind you... open discussion topic with 6+ ideas
suggested so far to choose from, a bit of cross-purposes was bound to
happen :)

so, reading the paragraphs you wrote, and summarising: i think you do
a fantastic job of outlining some state-of-the-art research for
accelerating general-purpose loops, and a second additional idea you
suggest (esperanto partly working on it already) is in effect a
"reimplementation" of an OpenMP / GPU-suitable engine that would fit,
structurally, into pre-existing research that's already underway.

a summary of what i am suggesting is more of an engineer's approach:
to take a pre-existing OpenMP / GPU-suitable (ish) engine (it happens
to be published under the name Nyuzi), and make the absolute minimum
necessary change to shoe-horn it into RISC-V.

i think we can say and agree that there are actually three perfectly
valid brainstorm-style ideas here, each with a different focus /
outlook, for anyone (anywhere, any time) with a fixed amount of time /
resources to assess whether to run with any one of them.


> OoO GPP. So instead of an 8-way processor, you can have a very simple 2-way
> one with these BSAs and get comparable performance on most code but with
> much smaller area and using much less power.

veery niiice.


> As for your idea, I've skimmed the documentation and the papers and broken
> this down into more concrete steps. Nyuzi is basically the Xeon Phi using
> RISC cores and a hardware rasterizer.

background: https://en.wikipedia.org/wiki/Xeon_Phi and associated
link to Larrabee Research Project, i don't believe jeff implemented a
hardware rasterizer, he wanted specifically to in effect replicate -
publicly not keep secret - the original work of the Intel team. what
jeff's come up with can definitely be said to have replicated both the
"successful General-Purpose High-End Compute Engine" aspect *and* the
"failed GPU" aspect, the important bit being that he's quantified and
published exactly *where* the Larrabee approach doesn't work [as a
GPU].

> We can certainly do something like it
> with RISC-V. (In fact, I think the people at Esperanto are actively working
> on it.)
>
> We would need to:
> 1) Finish the vector extension and get a working vector unit.
> 2) Finish and implement the bit manipulation extension.
> 3) Finish the L2 cache in a way that supports efficient cache coherency with
> a massive number of cores.
> 4) Modify Rocket (and possibly Boom) to support a barrel processor
> configuration (hyperthreading in the case of Boom).
> 5) Reimplement Nyuzi's System Verilog-based rasterizer in Chisel.

this looks like a very sensible outline, where each phase *on its
own* has huge benefits in other areas, not just related to the
[indirect, arbitrary] goal of making a RISC-V-related GPGPU.

> People are already working on #1-3. #4 is something that we should do
> eventually anyway and would make a good master's project if someone isn't
> already doing it. #5 is a solid idea, but it won't be worth doing until the
> others are finished.

... and that's where the "pragmatic" or "engineer's" approach comes
in, which was to simply not try to fit Nyuzi in to the vector ISA at
all, instead to go with adding an entirely new ISA extension into
RISC-V, and to drop Nyuzi into place pretty much as-is, pretty much
intact.

the disadvantage being, that approach does not take full advantage of
the advancements made by RISC-V.... on balance, although it would be
longer, now that i've written it out, i kinda prefer the approaches
you're recommending, max.

> That's especially true because it isn't really
> essential to the design. It gets a 5% performance boost over software and
> saves a lot of power, but we could do without that optimization in an
> initial iteration.

ok whew i was going to say. not a fan of reimplementation without
good cause! :)

l.

David Lanzendörfer

unread,
Jan 22, 2018, 7:14:08 AM1/22/18
to hw-...@groups.riscv.org, Max Hayden Chiz
Hi
We've rented the equipment here at HKUST in Hong Kong and are developing
our own process right now for manufacturing CMOS technology in combination
with non-volatile memory.
The goal is to manufacture a RISC-V based MCU on the new process and to sell
physical products made from this free design (which will be on GitHub
with layout and process specification and everything else).
We are also going to standardize the CMOS process itself through the IEEE.
If you're interested in joining, please do so.

With the money from chip sales we're intending to employ some of
the community contributors who have contributed the most to the project.

Cheers
David

Max Hayden Chiz

unread,
Jan 22, 2018, 10:46:43 AM1/22/18
to RISC-V HW Dev, jcb6...@gmail.com, michael.c...@gmail.com, max....@gmail.com, David.C...@cl.cam.ac.uk


On Monday, January 22, 2018 at 4:39:37 AM UTC-6, David Chisnall wrote:
On 21 Jan 2018, at 01:46, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
> The ISA spec also suggests (in commentary while describing branch instructions) that microarchitectures could translate short forward branches into predicated instructions.  I would like to see this made the standard encoding of predicated instructions on RISC-V, so actually implementing conditional move this way could be a good step in the right direction.
>
> For example: (pseudo-assembler)
>
> .macro CMOVE <cond>, <crs1>, <crs2>, <rd>, <rs>
> B<!cond> <crs1>, <crs2>, 1f
> MV <rd>, <rs>
> 1:
> .endm
>
> Hardware can recognize this because the branch offset literal will be 2 (or 1 if RVC is used for the MV instruction), corresponding to a 4-byte (2-byte if RVC used) forward branch.

Implementing good support for this in Rocket would be a good idea for an undergraduate student project.

Do you think there are enough situations like this (e.g. 3-operation add, auto-increment loads and stores, combining small shifts with ALU instructions, etc.) that someone could make a master's project by doing them all and backing it up with some kind of quantitative analysis of RISC-V assembly and the possibilities (similar to what was done for the compressed instruction set)?
 
 I had a student a couple of years ago who added a conditional move instruction to a Bluespec in-order RISC-V core and found that it got about a 20% speedup for less than a 1% area overhead (and needed about four times as much branch predictor state to get the same benefit), so for simple cores there’s definitely a big win to be had.

Was this on unmodified code or was this after telling the compiler that there was a cmove instruction convention? (i.e. was the compiler using it for an if-conversion optimization or was this just the performance that was lying around in existing code?)

Given those benefits, maybe we *do* want to look at the possibility of having a "branch on least significant bit" instruction in the bit manipulation extension since the benefits would be similarly large for a small processor...

Jacob Bachmeyer

unread,
Jan 22, 2018, 7:04:13 PM1/22/18
to Max Hayden Chiz, RISC-V HW Dev, michael.c...@gmail.com, David.C...@cl.cam.ac.uk
Max Hayden Chiz wrote:
> On Monday, January 22, 2018 at 4:39:37 AM UTC-6, David Chisnall wrote:
>
> On 21 Jan 2018, at 01:46, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> >
> > The ISA spec also suggests (in commentary while describing
> branch instructions) that microarchitectures could translate short
> forward branches into predicated instructions. I would like to
> see this made the standard encoding of predicated instructions on
> RISC-V, so actually implementing conditional move this way could
> be a good step in the right direction.
> >
> > For example: (pseudo-assembler)
> >
> > .macro CMOVE <cond>, <crs1>, <crs2>, <rd>, <rs>
> > B<!cond> <crs1>, <crs2>, 1f
> > MV <rd>, <rs>
> > 1:
> > .endm
> >
> > Hardware can recognize this because the branch offset literal
> will be 2 (or 1 if RVC is used for the MV instruction),
> corresponding to a 4-byte (2-byte if RVC used) forward branch.
>
> Implementing good support for this in Rocket would be a good idea
> for an undergraduate student project.
>
>
> Do you think there are enough situations like this (e.g. 3-operation
> add, auto-increment loads and stores, combining small shifts with ALU
> instructions, etc.) that someone could make a master's project by
> doing them all and backing it up with some kind of quantitative
> analysis of RISC-V assembly and the possibilities (similar to what was
> done for the compressed instruction set)?

I suggested CMOVE as a special case of predicated ADDI. Short forward
branches can encode general predication, even on JAL instructions.
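
For concreteness, here is roughly how the macro above could expand, along with the same pattern predicating a call (the register choices and the "helper" label are purely illustrative):

    # CMOVE eq, a0, a1, a2, a3  --  "a2 = a3 when a0 == a1"
    bne   a0, a1, 1f      # B<!cond>: skip the move when the condition fails
    mv    a2, a3          # the single predicated instruction
1:
    # Predicating a JAL the same way:
    beqz  a0, 2f          # skip the call when a0 == 0
    jal   ra, helper      # "helper" is a hypothetical routine
2:

An implementation that recognizes the short forward branch can turn either sequence into predicated micro-ops; one that does not simply executes the branch as written.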

Some of the situations suggested are features that RISC-V intentionally
omits to simplify implementations, while many of these features could be
achieved using macro-op fusion. (Arguably, branch conversion is a
special case of macro-op fusion.) I think that more research into
macro-op fusion possibilities in RISC-V could be interesting; UCB
technical report EECS-2016-130 examined some of these and offered
suggestions for further research.
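
To make the fusion candidates mentioned earlier concrete, here are adjacent pairs a fusing decoder might recognize (the register choices are only illustrative; the port counts assume the pair is fused into a single macro-op):

    # shift-and-add: two reads (a0, a1), one write (t0) -- fits a 2r1w register file
    slli  t0, a0, 2
    add   t0, t0, a1
    # three-operand add: three reads (a0, a1, a2), one write (t0) -- needs an extra read port
    add   t0, a0, a1
    add   t0, t0, a2
    # auto-increment load: one read (a0), two writes (t0, a0) -- needs an extra write port
    lw    t0, 0(a0)
    addi  a0, a0, 4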

> I had a student a couple of years ago who added a conditional
> move instruction to a Bluespec in-order RISC-V core and found that
> it got about a 20% speedup for less than a 1% area overhead (and
> needed about four times as much branch predictor state to get the
> same benefit), so for simple cores there’s definitely a big win to
> be had.
>
>
> Was this on unmodified code or was this after telling the compiler
> that there was a cmove instruction convention? (i.e. was the compiler
> using it for an if-conversion optimization or was this just the
> performance that was lying around in existing code?)
>
> Given those benefits, maybe we *do* want to look at the possibility of
> having a "branch on least significant bit" instruction in the bit
> manipulation extension since the benefits would be similarly large for
> a small processor...

There are two function codes (3'b01x) remaining in the baseline BRANCH
opcode, but that discussion belongs on isa-dev.

> The suggestion in the ISA spec effectively amounts to treating the
> conditional branch encoding as a limited form of Thumb-2’s IT
> instruction. It would be nice if it could be combined with
> slightly longer instruction sequences (IT supports up to 4
> predicated instructions). For a complex pipeline that already has
> support for multiple sets of speculated instructions and can merge
> different speculated paths, there’s little advantage in doing this
> optimisation, but on low-end cores it’s very useful. PowerPC is
> particularly annoying in this regard, from a compiler perspective,
> because different microarchitectures run the cmove or
> conditional-branch-one-instruction-forward sequence faster, so
> having a uniform encoding of this makes life easier for the compiler.
>

Recognizing short forward branches (presumably up to the pipeline
"shadow" in length) as predicated execution provides a single encoding
for this that is optimal on all implementations: complex pipelines that
can merge speculated paths and very simple pipelines would simply treat
the branches literally, while the "in-between" pipelines that can
benefit from this branch conversion can use it.

The maximum length that can be predicated from a forward branch is
implementation-dependent and different values are optimal for different
implementations. CMOVE only requires one instruction predicated, but
the actual cut-over point where predication turns to an actual branch
can vary by implementation. The same code works in all cases; CMOVE on
implementations that do not recognize it still executes as a conditional
move, just less quickly.


-- Jacob

Kevin Cameron

unread,
Jan 23, 2018, 2:44:17 AM1/23/18
to hw-...@groups.riscv.org

If so, I have this approach to making legacy code run on PiM -

Wandering Threads - the easy way to go parallel

- works best with some processor/hardware support.

The original target application was circuit simulation, but the code pattern for that is similar to neural-networks and database search.

Kev.

PS: I posted on this topic before, but the patent has since been granted.

Max Hayden Chiz

unread,
Jan 23, 2018, 6:07:03 PM1/23/18
to RISC-V HW Dev, max....@gmail.com, michael.c...@gmail.com, David.C...@cl.cam.ac.uk, jcb6...@gmail.com
You are correct that I was treating branch conversion as a type of macro-op fusion. Thanks for linking that report. It was helpful and interesting.
 

>      I had a student a couple of years ago who added a conditional
>     move instruction to a Bluespec in-order RISC-V core and found that
>     it got about a 20% speedup for less than a 1% area overhead (and
>     needed about four times as much branch predictor state to get the
>     same benefit), so for simple cores there’s definitely a big win to
>     be had.
>
>
> Was this on unmodified code or was this after telling the compiler
> that there was a cmove instruction convention? (i.e. was the compiler
> using it for an if-conversion optimization or was this just the
> performance that was lying around in existing code?)
>
> Given those benefits, maybe we *do* want to look at the possibility of
> having a "branch on least significant bit" instruction in the bit
> manipulation extension since the benefits would be similarly large for
> a small processor...

There are two function codes (3'b01x) remaining in the baseline BRANCH
opcode, but that discussion belongs on isa-dev.

So, I thought about how to implement the trick from 

but without requiring any special architectural mandate.

Here's my thought:

You don't even need the bit manipulation instructions. You can tell if the least significant bit of a register is one by just using an ANDI followed by a BEQ/BNE. (And then both paths of the branch will need to do a SRLI to move you to the "next" branch.) To get the branch outcomes *into* the register, it takes several instructions (which are potentially fusable), but they are straightforward. So to implement the trick from that paper, you just need to recognize the idiom and then use a (cached) copy of the register's contents as your "prediction" for that branch.
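
A minimal sketch of the consumer side of that idiom (register names are arbitrary; t0 holds the pre-computed branch outcomes, least significant bit first):

    andi  t1, t0, 1       # test the pre-computed outcome bit
    bnez  t1, 1f          # the ANDI + BNE pair the hardware would recognize
    srli  t0, t0, 1       # not-taken path: discard the consumed bit
    # ... not-taken work ...
    j     2f
1:  srli  t0, t0, 1       # taken path: same shift, so both paths stay in sync
    # ... taken work ...
2: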

So if someone could teach LLVM to do the right thing and obey this idiom for control-flow decoupled branches, we could easily implement it in hardware without any architectural extensions. Between that and cmove / if-conversion, we could eliminate around 2/3rds of our branch mispredictions.

vegh....@gmail.com

unread,
Jan 28, 2018, 1:57:27 PM1/28/18
to RISC-V HW Dev

vegh....@gmail.com

unread,
Jan 28, 2018, 1:58:59 PM1/28/18
to RISC-V HW Dev
Did you receive any ideas you like? I have a somewhat unusual idea.


On Thursday, January 18, 2018 at 10:22:39 AM UTC+1, muhammadali201 wrote:

David Chisnall

unread,
Jan 29, 2018, 4:36:07 AM1/29/18
to vegh....@gmail.com, RISC-V HW Dev
On 28 Jan 2018, at 18:57, vegh....@gmail.com wrote:
>
> Well, as the topic suggests, I am looking for open ideas / issues that can be made part of Graduate (Master's) level research work, related to RISC-V and its FPGA Implementations or Application as FPGA soft processor.

A good FPGA TLB would be a good Master's project. On an FPGA, BRAMs are cheap and TCAMs are very expensive, so conventional TLB designs are not a great fit, but you can fit very large direct-mapped TLBs with little overhead, and there are a bunch of ideas, like the nearly-associative memory from UPenn, that could improve this, perhaps combined with a micro-TLB using a very small TCAM. A small design-space exploration evaluating performance vs FPGA resource usage would be interesting.

David

Muhammad Ali Akhtar

unread,
Jan 29, 2018, 5:20:49 AM1/29/18
to Max Hayden Chiz, RISC-V HW Dev, michael.c...@gmail.com, David.C...@cl.cam.ac.uk, jcb6...@gmail.com
Hello All,

I really appreciate the response from the community, especially Max Hayden and all others. This was extremely helpful. I'll discuss the ideas with other students and will keep everyone in the loop.

Thanks again everyone. I really feel obliged here.:)

Muhammad Ali Akhtar
Principal Design Engineer
http://www.linkedin.com/in/muhammadakhtar


Max Hayden Chiz

unread,
Jan 31, 2018, 1:47:54 AM1/31/18
to RISC-V HW Dev
Various people have asked for more information / details / non-FPGA hardware ideas so rather than spamming the list with multiple threads, I'm going to just put it all in one email here. If there's enough interest in a topic, we can fork it out to a new thread.

On Saturday, January 20, 2018 at 9:56:57 AM UTC-6, Max Hayden Chiz wrote:

5) If you want something ambitious, you could try implementing the ideas from pages.cs.wisc.edu/~vinay/pubs/isca-hybrid-arch.pdf and research.cs.wisc.edu/vertical/papers/2016/asplos16-exocore.pdf ; ideally you'd have a way to convert regular RISC-V instruction streams into instructions for these special accelerators. But just having them in the tree would be a huge help because it would let people see how useful they were in the presence of a "real" vector unit and if they prove useful, someone else could work on the hardware to detect the loops and do the conversion instead of requiring compiler profiling support. (B/c that's more of a PhD thing.)

You don't actually have to implement these two designs to research this area. The core issue is that integer loops just haven't seen the kind of intensive research that floating point ones have. So we don't actually have a good characterization of why these accelerators work. They extract lots of parallelism from integer loops dynamically by using a dataflow design. But do we need that kind of dynamic behavior or will something simpler suffice?

Personally, I'm of the belief that upon thorough investigation, we'll be able to reuse the vector unit for the majority of these loops given the right ISA extensions. (But I could be proven wrong here.)

A possible route (more ISA research than hardware) would be to look at the programs in CloudSuite and TPC's DSS and OLTP benchmarks (or other similar "modern" workloads, alongside SPEC Int) and figure out through profiling where the important loops are. Then you want to try to understand why they "can't" be vectorized. You want to try to come up with some typology for the loops and document whether that "can't" is something fundamental, due to compilers not being good enough, or due to a lack of a necessary instruction.

There's been some initial research in this area:
Hayes et al. _Future Vector Microprocessor Extensions for Data Aggregations_ suggests two new vector instructions so that you can implement the VSR routine in database applications. Their earlier work showed that although SIMD instructions don't work on database workloads, manual vectorization with a real vector unit generally did.

Similarly, segmented scan primitives can be used to handle nested data parallelism (i.e., parallelism over irregular data structures). The approach of Chatterjee et al. _Scan primitives for vector computers_ (https://www.cs.cmu.edu/~guyb/papers/sc90.pdf) works with "normal" vector units, but suffers from excessive data copying. (See section 2.1 of http://manticore.cs.uchicago.edu/papers/ppopp13-flat.pdf) The problem is that the segmented scan algorithm uses the predicate register to define segment boundaries. Consequently the predicate register can't be used to handle control flow, and algorithms built on top of it have to do lots of splitting and merging as a result. (For additional reference, here is a GPU version of these primitives: www.idav.ucdavis.edu/func/return_pdf?pub_id=915 and here is the Haskell version for multi-core processors: https://wiki.epfl.ch/edicpublic/documents/Candidacy%20exam/papers-ndph.pdf)

Given what we have been told about the vector ISA extension, it would be relatively trivial in hardware (and in the ISA) to implement a predicated segmented scan. Essentially, instead of just using the least significant bit of a vector register as the predicate, you use the two least significant bits, with the second bit marking segment boundaries.
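
For anyone unfamiliar with the primitive, here is a purely scalar reference of an inclusive segmented sum-scan (the register usage, the byte-per-element flag array, and the in-place update are just assumptions for illustration). In the predicated vector form described above, the segment flag would instead live in the second least significant bit of each predicate element:

segscan:                  # a0 = values (words, updated in place)
                          # a1 = flag bytes (non-zero = start of a new segment)
                          # a2 = element count
    li    t0, 0           # running sum for the current segment
    beqz  a2, 2f
1:  lbu   t1, 0(a1)
    beqz  t1, 3f
    li    t0, 0           # segment boundary: restart the running sum
3:  lw    t2, 0(a0)
    add   t0, t0, t2
    sw    t0, 0(a0)       # write the inclusive scan result back in place
    addi  a0, a0, 4
    addi  a1, a1, 1
    addi  a2, a2, -1
    bnez  a2, 1b
2:  ret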

Ideally research in this area would lead to a proposal for an "integer / irregular loop" ISA extension for the vector unit that would get us most/all of the performance the specialized loop accelerators do. The trick however is designing it in a way that a compiler can automatically do as much as possible (much like how they can auto-vectorize floating point loops; for that see the book _Optimizing Compilers for Modern Architectures: A Dependence-based Approach_). Sheffler and Chatterjee's approach in _An Object-Oriented Approach to Nested Data Parallelism_ is suboptimal because it requires rewriting code.

(NB: Talk to the software guys, but I'm not sure if our compiler infrastructure is ready for long-vectors in general. If you want a software project, helping with that could be it.)

As for detecting the loops in running code and optimizing them on the fly, Michaud's _Hardware acceleration of sequential loops_ shows how to do this for inner loops, but I'm unaware of any structure that detects nested loops in an efficient fashion.


6) Another project would be to replicate the ideas from _Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window_ and/or _A Performance-Correctness Explicitly-Decoupled Architecture: Technical Report_ (see also the thesis _Accelerating Decoupled Look-ahead to Exploit Implicit Parallelism_). These have some really promising performance and power numbers. But their designs rely on the presence of an L2 cache. So unless our L2 cache is forthcoming, this might not be a good option. Basically the idea here is to mimic a wide-issue processor with a very large instruction window by pairing two narrower and shallower processors together with a queue. This eliminates about 90% of branch mispredictions and cache misses and allows you to simplify the design of the two cores.

The approach I suggested is the simplest to implement. But there are other ideas in the literature that achieve a similar effect. E.g. _Continuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads_. The ones I originally suggested effectively execute the program twice: once speculatively and without waiting on an LLC miss, and again to ensure correctness. (Thus you could call it speculative-access/execute.) The first pass trips all of the branch mispredictions and the cache misses that it can (i.e., everything that isn't dependent on a previous miss). The advantage of this approach is that you don't have to muck around with the constituent cores except as an optimization. The disadvantage is that it's area intensive and breaks even on power. But it gets a 40-50% performance boost.
 
In contrast, Continuous Runahead and similar designs work by detecting specific chains of instructions that cause pipeline stalls and only executing them. So you have to add hardware to the pipeline to detect those chains. But the accelerator unit is much smaller than a full core (and can even be shared between the processors in a core). Unfortunately it doesn't provide as much coverage as the dual-core thing, and it doesn't fix branch mispredicts either. So the performance is not as high. OTOH, it saves power instead of breaking even.

Both ideas could be further developed. E.g. instead of just executing everything or relying on the compiler to mark unneeded instructions, the speculative access core could detect backwards dependency chains itself and focus on low-confidence branches and loads predicted to miss. It could also probably be made integer-only to save area and it could use a real value predictor instead of just a zero prediction. Similarly, maybe there's a way to make the runahead thing also resolve low-confidence branches. I think the latter is probably harder to do in practice because the runahead unit only executes traces of 32 instructions. But I would think that the reason you can't just preexecute those branches yourself (using the control-flow decoupling idea I spoke about in the thread) is because there's a long dependency chain.

Ideally we'd have designs for both so that we could take a look at them and figure out exactly where the performance difference is coming from beyond just "higher coverage". If you could get the runahead unit to work with the coverage of the speculative preexecution thing, that would be ideal. Similarly, it would be good if you could get the efficiency of the preexecution core to the point that it could do a 4-way or an 8-way barrel processor design so that the area cost could be more easily amortized.

But at this point we just want to explore the design space and compare and contrast the two ideas on equal footing.



7) Another ambitious project would be implementing _Exploiting Core Working Sets to Filter the L1 Cache with Random Sampling_ by Etsion. If this can be made to work, saving that power and area from the L1 cache in smaller RISC-V configurations would be nice. If we get an L2 cache soon, modifying/redoing it to support cache compression would be interesting as well. (I can provide citations for you if you need them.)

It seems like we will be trying this eventually because this fits naturally with how we are designing the prefetcher. The idea here is that we simplify the L1 caches by filtering what gets put into them so that you can get comparable performance out of a direct mapped cache of half the size of the original 4-way one. (Whether our actual cache will get comparable performance with our design is the question.) It is unclear how well this will work when combined with prefetching. If the associative buffer has to be too large when used for both purposes, the power and area savings will go away.

A related idea (that isn't incompatible with the above) is an L0 cache. A small (256B) flip-flop based cache can eliminate 70%+ of the cache accesses entirely. So this trades area for power (and possibly performance). The interesting thing is that this should let you get most of the performance of dual-porting the cache at a much smaller cost (with a 30% miss rate, the odds of both accesses needing the L1 cache are only 9%). The question is how well these L0 caches work with workloads with larger working sets (like the benchmarks listed above).

For this see, _Guaranteeing Hits to Improve the Efficiency of a Small Instruction Cache_ for the instruction cache and _Designing a Practical Data Filter Cache to Improve Both Energy Efficiency and Performance_ for the data cache. The latter is for an in-order processor. So for BOOM you probably need to put it in a different place in the pipeline similar to the location of the line buffer from _Increasing Cache Port Efficiency for Dynamic Superscalar Microprocessors_ (and implementing the other ideas there for BOOM's Load-store queue is not a bad idea either). For more on this, see _Revisiting Level-0 Caches in Embedded Processors_.

As a note, I really like Sembrant's cache ideas (for example _The Direct-to-Data (D2D) Cache: Navigating the Cache Hierarchy with a Single Lookup_, but see his papers more generally and his thesis in particular.) Unfortunately, I don't see a straightforward way to implement these ideas in our code. You'd basically have to make your own TLB and cache hierarchy.

 
(Also, I left out a whole bunch of generic "make it better" type stuff. E.g. improve our documentation and wiki, widen the fetch unit, increase the cache bandwidth, make the LSU capable of more than one instruction per cycle, add an optimizing? trace cache, improve memory level parallelism, do macro-op fusion, increase the pipeline's scalability so that we can have a deeper instruction window, more units, and higher performance, etc.)

Jourdan _Exploring Configurations of Functional Units in an Out-of-Order Superscalar Processor_ and _An Investigation of the Performance of Various Instruction-Issue Buffer Topologies_ are relevant to making sure that our processors are adequately provisioned. In general 4-way Boom as it has been tested isn't. Fixing this and using our reconfigurable nature to see how well our FPGA results track the results of these simulations would be interesting. (The main culprit is that the LSU and cache are both single-ported. If the L0 cache idea cited above works on general workloads, then this is a partial solution. If it doesn't, that's more problematic.)

We can actually support a 6-wide processor fairly easily by making some very minimal adjustments to our fetch unit. (Along the lines of extended one block look-ahead from e.g. Michaud _An Exploration of Instruction Fetch Requirement in Out-of-Order Superscalar Processors_). This is worth doing eventually because a 6-wide with only 2 memory ports is roughly comparable to recent high-performance designs and so is a good reference point for research. (And because if you really want to test value prediction, you are going to need the extra fetch bandwidth.)

Going beyond 6-wide fetch is *much* harder. For reasonably resourced designs (i.e. less than 128KB of front-end state), it seems that a two-block ahead predictor is better than a trace cache up to about 12-wide. (Michaud, _Alternative Schemes for High-Bandwidth Instruction Fetching_). The trace cache however requires less in the way of invasive changes to the existing front-end code and involves less ambitious rename and decode infrastructure. So realistically an implementation could start there. (There's a ton of literature on various trace cache designs and the optimizations they support. Only some are cited in the Michaud article, but it should get you started. Researching ways to improve the hit rate and the next-trace prediction rate would be interesting, but perhaps a bit academic.)

Arguably, we should eventually support at least 8-wide Boom as a default configuration because so much research takes an 8-wide design as the "extremely aggressive" design point and as the "very high performance" baseline for comparing micro-architectural innovations for smaller processors. The 4 memory access requirement seems doable because if the L0 idea above works, then we'd only need to bank or dual-port the DCache. But this is off in the future and not really appropriate for a master's project at this time.

Interestingly, cache issues aside, it isn't really that much harder to go crazy wide. E.g., it would be trivial to go from 8-wide to 16-wide by clustering two of them together. And there are designs from the 90s (trace processors, Ultrascalar) that can go wider with more innovative approaches. That said, as neat as such a thing would be, once we have a vector unit and we assume some integer loop acceleration extension, how much ILP is really left over for such a processor to take advantage of? (This question might be appropriate for research. If you can easily tag loops as vectorizable or acceleratable on one of the loop accelerators I've cited, then you can examine the remaining code and do an ILP limit study. I'd be shocked if anything wider than a 4-issue processor was needed for what was left. If there is any high ILP code left, it's probably acceleratable on some reasonable modification of the hardware. I.e. a dataflow processor with speculation support.)


On Saturday, January 20, 2018 at 4:40:51 PM UTC-6, prof.taylor wrote:
Hi,

Schematics and detailed description of the operation of various subsystems of Rocket would be very valuable to the community.

I know it isn't glamorous or exciting, but good documentation, especially for "onboarding" new developers, is essential to the success of a project like this. Ours could be better. So I second Prof. Taylor's remarks. I'd also suggest documenting how the Rocket source tree uses Chisel and the Scala abstractions it allows. (If it helps, no one has written any books on this yet. So it's a lot easier to become a go-to expert on Chisel and Rocket's use of it.)
 
6) If you aren't at an academic institution, you don't have access to a technology library and other tools that will let you do area, power, and speed estimations. So it isn't possible for someone to e.g. work on 4-way BOOM to optimize its clock-rate in real silicon. I don't know how to solve this, but long-term we need to figure something out so that we can be accessible for contributors.

David Lanzendörfer's project (http://libresilicon.com/) seems like a solid attempt to resolve this problem. If someone wanted to work on hardware processes for their master's, this could be a good project to contribute to. This is especially important long-run because projects will want to tape out and we'd like to ultimately make the process as smooth as possible. (And we'd like for that work to be more accessible as well. There are lots of things about the most recent BOOM design that could be improved, such as the register file, but it seems like kind of a waste without an open source technology library because other researchers can't use it without access to the same proprietary tools and info.)

I'll add that I think there's a lot of potential for MOS current-mode logic (MCML). Given that modern processes have high leakage current anyway, the constant power draw of CML is not as much of a problem and its lower power density and roughly constant power (independent of frequency) is a benefit. It is also extremely high-speed, and because of the lower noise, works better in mixed-signal applications. There are some limited studies in using this for processors. The most extensive I'm aware of is a thesis by Abdullah Al Owahid, _Design of 3.33GHz CML Processor Datapath_. He got 3.33 GHz in old 130nm tech using only about 40 watts of power. And he didn't use any of the old tricks for saving power in ECL circuits that could potentially apply to CML. You can also apply sub-threshold and near-threshold techniques. See, e.g., _MOS Current Mode Logic Near Threshold Circuits_. The only real issue is the exact same one we are trying to solve for CMOS -- there isn't a solid technology library for using this logic family as part of your research. So I think it would be a good project to work towards fixing this so that more research can be done with this logic family. And since this is entirely a MOS process, this is easier to do than BiCMOS or a logic family for some exotic substrate.

lkcl .

unread,
Jan 31, 2018, 2:08:51 PM1/31/18
to Max Hayden Chiz, RISC-V HW Dev
[sorry max, once again hit "reply" and didn't see that the google
group doesn't have the correct "reply-to" headers... could the admin
for the group please fix that in the settings?]

On Wed, Jan 31, 2018 at 6:47 AM, Max Hayden Chiz <max....@gmail.com> wrote:

>> 6) If you aren't at an academic institution, you don't have access to a
>> technology library and other tools that will let you do area, power, and
>> speed estimations. So it isn't possible for someone to e.g. work on 4-way
>> BOOM to optimize its clock-rate in real silicon. I don't know how to solve
>> this, but long-term we need to figure something out so that we can be
>> accessible for contributors.
>
>
> David Lanzendörfer's project (http://libresilicon.com/) seems like a solid
> attempt to resolve this problem.

ah! that reminds me. jean-paul who wrote coriolis2 has *already
solved* the problem of not having libraries for gates (cells) i
believe all the way down to 90nm, and they're already entirely
libre-licensed. you'll have to do some digging as it was over 18
months ago that i encountered coriolis2 and last spoke to him.

for those people who may not be aware what coriolis2 is, it's an ASIC
design tool that is entirely libre licensed. it does have
auto-routing capability, but only for 4 layers at a time. it's
designed around the concept of "cells" so if you do not have a
particular "cell" but you at least know the size and its inputs and
outputs you can at least continue to design the rest of the layout.
very handy for when you're forced to use a proprietary foundry's
"cells" to fill in the missing blanks under NDA but wish to at least
have some team members *NOT* sign proprietary NDAs.

David Lanzendörfer

unread,
Feb 1, 2018, 2:51:00 AM2/1/18
to hw-...@groups.riscv.org, lkcl ., Max Hayden Chiz
Hi
> ah! that reminds me. jean-paul who wrote coriolis2 has *already
> solved* the problem of not having libraries for gates (cells) i
> believe all the way down to 90nm, and they're already entirely
> libre-licensed. you'll have to do some digging as it was over 18
> months ago that i encountered coriolis2 and last spoke to him.
coriolis2 is the back-end of Alliance. I've looked at it; however, coriolis
doesn't have the features QtFlow provides.
QtFlow is really novel in the sense that it's a Qt5-based frontend
integrating all the parts from simulation & verification up to the point of
generating a pad frame and wiring it, as well as sending the layout to the
selected foundry with one button.
Coriolis2 is based on Qt as well, as far as I've seen, so if you know the guy
developing it, please tell him to hook up with me so we can coordinate him
joining our project.

> for those people who may not be aware what coriolis2 is, it's an ASIC
> design tool that is entirely libre licensed. it does have
> auto-routing capability, but only for 4 layers at a time. it's
> designed around the concept of "cells" so if you do not have a
> particular "cell" but you at least know the size and its inputs and
> outputs you can at least continue to design the rest of the layout.
> very handy for when you're forced to use a proprietary foundry's
> "cells" to fill in the missing blanks under NDA but wish to at least
> have some team members *NOT* sign proprietary NDAs.
With us you don't need an NDA or anything.
The standard cells and the process with which we will manufacture the cells
are available on GitHub.
You can virtually watch us develop the process with which we are going to
build your ASIC in realtime when you select "Watch" on my GitHub repository[1]
;-)
Same goes for Hagen's repository[2] for the standard cells.
Our goal is to synthesize the very first LibreSilicon 1um minimal RISC-V-32 MCU
in December, so stay tuned.
As you may remember from us nearly ripping each other's heads off in the
EOMA project, I have the policy that "just free schematics/PCB" isn't good
enough; the silicon needs to be fixed as well. And I've kept this policy since then :-)

The license we're working on for this manufacturing process has even shocked
Richard Stallman recently, when a friend of mine who is friends with RMS
showed him the draft while having Chinese food.
We're providing options which allow you to go even further than the GPL goes.
e.g.:
* a non-weaponization option
* a no-unfree-electronics option (no unfree chips allowed on the same PCB /
no unfree PCB or schematics allowed)

I'll post the final version 1.0 of the license on the libresilicon.com website
as well as this mailing list as soon as we're there.
My friend with whom I've founded this startup is a lawyer and very passionate
about his profession. We both consider this license to be a "juristical piece
of art", so getting it right takes time and artists are ashamed to show their
unfinished portraits :-)

About the process: If everything goes as planned you should be able to buy a
35c3 limited edition LibreSilicon 1um MCU at the next Chaos Communication
Congress. (A "fairy dust" lasered onto the package).

Cheers
David

[1] https://github.com/leviathanch/libresiliconprocess
[2] https://github.com/chipforge/StdCellLib

Dr Jonathan Kimmitt

unread,
Feb 1, 2018, 3:06:00 AM2/1/18
to hw-...@groups.riscv.org
Dear David,

I don't think you can call your license a libre license if it has
discriminatory clauses in it.

A license won't stop bad guys from using your designs anyway.

The defensible stance is the same stance as any scientist or engineer
would take.

"I've created something that benefits society or else is morally neutral,
and I shan't be bound to enquire into all the possible or actual uses
of my creation that are outside of my control".

Regards,
Jonathan

lkcl .

unread,
Feb 1, 2018, 5:44:50 AM2/1/18
to David Lanzendörfer, RISC-V HW Dev, Max Hayden Chiz
On Thu, Feb 1, 2018 at 7:52 AM, David Lanzendörfer
<david.lan...@o2s.ch> wrote:

> QtFlow is really novell in the sense that it's a Qt5 based frontend
> integrating all the parts from simulation&verification up to the point of
> generating a pad frame and wiring it, as well as sending the layout to the
> selected foundry with one button.

niiice

> Coriolis2 is based on Qt as well, as far as I've seen, so if you know the guy
> developing it, please tell him to hook up with me in order to coordinate him
> joining our project.

yeh i will (bcc'ing him now) - i think it would be a really _really_
smart idea to agree some file-format interoperability.

>> for those people who may not be aware what coriolis2 is, it's an ASIC
>> design tool that is entirely libre licensed. it does have
>> auto-routing capability, but only for 4 layers at a time. it's
>> designed around the concept of "cells" so if you do not have a
>> particular "cell" but you at least know the size and its inputs and
>> outputs you can at least continue to design the rest of the layout.
>> very handy for when you're forced to use a proprietary foundry's
>> "cells" to fill in the missing blanks under NDA but wish to at least
>> have some team members *NOT* sign proprietary NDAs.
> With us you don't need an NDA or anything.
> The standard cells and the process with which we will manufacture the cells
> are available on GitHub.
> You can virtually watch us develop the process with which we are going to
> build your ASIC in realtime when you select "Watch" on my GitHub repository[1]
> ;-)

neat!

> Same goes for Hagens repository[2] for the standard cells.
> Our goal is to synthesize the very first LibreSilicon 1um minimal RISC-V-32 MCU
> in December, so stay tuned.

that would be superb. is it samuel's MCU? btw i've raised a gsoc2018
project, with librecores,
http://rhombus-tech.net/riscv/shakti/m_class/gsoc2018/

the critical thing that's missing from quite literally every single libre
processor i've ever seen and investigated in the past... 10 years is:
a pinmux.

by contrast: every single commercial processor from ST, ATMEL, NXP/Freescale,
Texas Instruments, Allwinner, Rockchip - *all* of them, right from
the lowest-cost
STM8S @ $0.24 up to some of TI's 1000 pin monsters, *all* of them have
multiple I/O pins per function.

> As you can remember from us nearly ripping each others heads appart off in the
> EOMA project

*sigh* aaron did a _lot_ of damage. he lied to a _lot_ of people.
i only found out last year what he'd been saying. mind you, the
suggestion he had of doing a Certification Mark was extremely
valuable, so yes i'm much more strict, now as well. interesting
learning experience all round.

> I'm having the policy that "just free schematics/PCB" aren't good
> enough, the silicon needs to be fixed. And I've kept this policy since then :-)

_great_. well, since hearing from an anonymous sponsor and also from madhu
(shakti project team leader), both of them independently wanting a 64-bit
RISC-V mobile-class processor to happen, in the meantime people have
been asking, "hey can you use the RK3399 as an intermediary SoC?"
and i thought, "y'know what? with ARM messing about with Luc
Verhaegen's livelihood, and also trying to offer a $24m bribe to have the
shakti project shut down, i don't *want* to be involved in empowering ARM
to do those kinds of things".

> Our license we're working on this process of designing this manufacturing
> process has even shocked Richard Stallman recently, when a friend of mine who
> is friends with RMS showed him the draft while having Chinese food.

cooool

> We're providing options which allow you to go even further than the GPL goes.
> e.g.:
> * non weaponize option

mmm *stress* - a strong case can be made that restricting people's
choices in such ways is in fact highly unethical. such restrictions
amount to deciding *on other people's behalf* whether they are
responsible or not for their decisions and actions. removing peoples'
right to make such decisions [and witness the consequences] can be
viewed as being unethical. additionally, in the case of "war",
sometimes fighting is not such a bad thing, particularly if your
survival is at stake [and you have an unequivocally clear case for
making the world better by doing so].

but hey, i'm sure that if a country is going to go to war, the last
thing that will be on their mind will be "copyright law". or, they
will decide, "it's an emergency, copyright law is suspended"....

so weaponisation... if it's truly an emergency for an entire country,
they'll just suspend copyright law - not a problem, however if the
license has the option to be extended to prevent and prohibit use in
*medical* scenarios, that _would_ be a problem i feel [same ethical
logic as above].

oh, reading ahead in the thread... summary: "What Dr Kimmitt Said".


> * non unfree electronics compatibility (no unfree chips on the same PCB
> allowed/no unfree PCB or schematics allowed)

that's very smart. you'll be pleased to know that madhu has been given
carte blanche to go after absolutely everything except memory and storage
ICs. WIFI, 2G, 3G, 4G, LTE - everything.

he can't go after memory and storage yet because if he annoys all the
memory and NAND manufacturers (S.Korea, Taiwan, China...) what you
rate the team's chances of ever being able to get any silicon made,
using their foundries, ehn?

but in the case of libresilicon, well, you're starting right at the bottom of
the geometries, i don't think they'll particularly care or take any notice
of you whatsoever. by the time you're manufacturing 800mhz DDR3
RAM ICs in 90nm or slightly less, that's a market they've pretty much
abandoned so i don't think you'll really ever be a threat to any of the
Triads in any of the three main countries. but.. just... watch out, ok?


> About the process: If everything goes as planned you should be able to buy a
> 35c3 limited edition LibreSilicon 1um MCU at the next Chaos Communication
> Congress. (A "fairy dust" lasered onto the package).

cooool. hey you should offer people the chance to sponsor the project through
an auction, by having a bitmap of their choice engraved in the
silicon. if that
turns out to be an... um... inappropriate image of say um... a naked person
um... you could start a *second* bidding process to have it removed... :)

l.

David Lanzendörfer

unread,
Feb 1, 2018, 7:25:20 AM2/1/18
to hw-...@groups.riscv.org, Dr Jonathan Kimmitt
Hi
> I don' t think you can call your license a libre license if it has
> discriminatory clauses in it.
Well, the difference between BSD/MIT and the GPL, I guess, is that the GPL
already has conditions, for instance "infecting" code that relies on it.
We just go a step further and give the developer of an IP core the right to
say that she/he doesn't want her/his code/silicon being used in weaponry.

> A license won't stop bad guys from using your designs anyway.
>
> The defensible stance is the same stance as any scientist or engineer
> would take.
>
> "I've created something that benefits society or else is morally neutral,
> and I shan't be bound to enquire into all the possible or actual uses
> of my creation that are outside of my control".
For us as a company we need all the customers we can get, so we don't use this
option of "non-weaponization", but it will be available to anyone
who can't stand the thought of having helped to end lives.
Someone who wouldn't develop an ASIC because he thinks it's the opposite of a
benefit for society to build electronics that kill people can then choose
this option and publish designs after all.
Also, our license will be legally binding in mainland China as well (which we
are working on right now), so as opposed to the GPL, it will be possible to
enforce the license in a court of law, even in mainland China.

Peace
David

Max Hayden Chiz

unread,
Feb 1, 2018, 8:29:22 AM2/1/18
to RISC-V HW Dev
People should be less shy about emailing this list vs. contacting me off-list. I'm just some guy with FPGA experience at a previous job and an interest in this stuff. There are lots of people on this list smarter and more knowledgeable than me. Plus many brains are better than one. You'll learn a lot more by asking questions of the whole list and you'll benefit others as well.


On Saturday, January 20, 2018 at 9:56:57 AM UTC-6, Max Hayden Chiz wrote:
9) I've been doing some preliminary design work on adding data cache prefetching to Rocket and Boom. If you want to take that over, I'll share my work (and my literature review) and you can implement (some/all) of it.

I haven't thought as deeply about instruction cache prefetching because you also have to prefetch the BTB. So a modern front-end prefetcher is much more invasive and touches lots of critical path things. Working on this is a good idea, though I expect it to be fairly challenging. For many workloads, adding front-end prefetching would give us a ~40% performance boost. (And my impression is that the design in BOOMv2 is not really that satisfactory anyway b/c of the cost of fetch bubbles.)

I haven't done the extensive literature review that I've done for data cache prefetching (so you should do your own, and maybe other people on the list have better ideas than me), but my impression is that the best option is branch-predictor-directed prefetching like the design in _Boomerang: a Metadata-Free Architecture for Control Flow Delivery_. This design also seems to have the benefit of taking the BTB off of the critical path and letting it be pipelined more simply than in e.g. Seznec's _Effective ahead pipelining of instruction block address generation_. (You can also pipeline the L1 cache easily in this design.) And since you could keep the fetch target queue and just not have the prefetching pieces, I think this is a reasonable approach to Boom's front-end in general.

The base design would use the FTQ and our current L1 cache. Minimally, this should have less in the way of front-end bubbles than the current approach. The step up would be to add hit-under-miss support and prefetching (with, maybe, an optional probe port for the L1 cache's tags). Another step up would (maybe) have the cache fetch two consecutive lines and have the BTB support extended basic blocks (bypassing one not taken branch). If you wanted to go further, you'd have to bank the BTB and the I-cache and use two-block ahead or extended two-block ahead prediction (which allows either deeper pipelining or wider fetch). Because this doesn't add much state to the BTB and the banking cost is mitigated by the ability to pipeline the steps, this isn't nearly as expensive as it would be in our current design. (And actually you could probably just make extended two-block ahead the default since it simplifies pipelining the BTB in any event.) So one benefit of this prefetching design is that it makes it more cost effective to go wider/deeper.

The real engineering problem is that it isn't clear that Rocket benefits much from this work. And while it could probably live in Boom's tree, you'd really like to avoid having *completely* separate front-end designs (BTBs, I-caches, etc.) so you could share innovations. Chisel is probably sufficiently powerful to allow for code sharing here, but you'd have to coordinate with multiple people to work out exactly how. I don't know who is in charge of Rocket's front-end. And with Celio now at Esperanto, I also don't know who is in charge of Boom.

A further note is that a trace cache isn't really a solution here because it would also have to be banked for dual porting to allow the fill unit to work while the tCache served traces. So doing something like putting discontinuous traces into a trace cache (and fetching the continuous ones from the I-cache) as in _Trace cache redundancy: Red & Blue traces_ doesn't really help you. (It only helps if you go wider since that requires dual-porting vs the quad-porting you'd have to do on the I-cache.) In fact, if you could get the coverage and branch prediction for a tCache to comparable levels of the L1, you could just ditch the L1 entirely and fill the tCache directly from the L2 using the same prefetching idea discussed above.

What might help however is developing an efficient trace predictor (maybe a TAGE-based version of the one from _Wide and Efficient Trace Prediction using the Local Trace Predictor_) and storing trace meta-data (similar to how _The Block-based Trace Cache_ works; in this case by storing info about two extended basic blocks). If this works, you could get the benefits of a renamed trace cache without actually having the cache. See, _Energy Efficiency Improvement of Renamed Trace Cache through the Reduction of Dependent Path Length_ for how a renamed tCache works.

Max Hayden Chiz

unread,
Feb 1, 2018, 4:20:46 PM2/1/18
to RISC-V HW Dev, vegh....@gmail.com, David.C...@cl.cam.ac.uk
Relatedly, you can also simplify Boom's FPGA register file using various techniques (probably a live value table for port indirection). I'd start with the "Related work section" of _A Multiported Register File with Register Renaming for Configurable Softcore VLIW Processors_. (Though I think the TLB thing is a much better project.)

Several people have asked me if I have ideas related to Rocket instead of Boom. I've posted all of the project ideas I have at this point. If anything comes up via discussions with others, I'll post here. But generally my ideas are geared towards a higher performance envelope b/c that's what I've historically worked on. Maybe others lurking in the thread can suggest some low power / single-issue / Rocket-related projects.

That's not to say that Rocket isn't important. Single-issue in-order processors probably cover 40% of the efficient frontier. And in terms of manufacturing volume, single-issue in-order cores dwarf everything else. I just don't have any ideas in this area beyond the ones already mentioned in the thread.

Max Hayden Chiz

unread,
Feb 5, 2018, 5:51:05 PM2/5/18
to RISC-V HW Dev
A few more *possible* projects (some of them Rocket-related); these have had less thought put into them than the original ones, so you'll have to do more research to see if these are really viable and appropriate:

10) Kaveh Aasaraai in a chapter of his 2014 thesis, _High Performance Soft Processor Architectures for Applications with Irregular Data- and Instruction-Level Parallelism_, manages to improve the clock rate of a soft processor to over 270MHz. This was a 90% increase in clock rate and (after the reduced IPC) led to an 80% increase in performance. He documents his methods and you could follow his approach with Rocket. (Though you'd probably have to use the blocking cache with direct mapping, and change the TLB and the BTB to be direct mapped as well.) Even if you can't get his results, documenting the critical paths in Rocket would be helpful. (As would documenting how various instruction set options impact things.) NB: he stops at 270MHz because that's the speed of the commercial processor he's using as a reference point, but this is arbitrary. Ideally you'd go until the increase in clock frequency is offset by the loss of IPC.

11) You could add run-ahead to Rocket. Possibly, incorporate the tricks for softcore implementation that Aasaraai uses in his thesis. (You could reuse these tricks for single-issue Boom, but unless you come up with a way to generalize them, I don't think they can be used for efficient checkpointing in a wider processor. And it isn't clear that an OoO core can use his pseudo-NBCache. Though if secondary misses are rare, then it would be an option to just halt or swap to run-ahead on a secondary miss.)

12) You could also incorporate Boom's better branch predictors into Rocket. The literature suggests this is good for at least a 5% performance gain. There are two ways to do this, one is to feed the data into the decode stage and redirect from there. The other is to use ahead pipelining (cited above) and do it in the fetch stage. A comparison of the costs and benefits of the two approaches might be interesting. (Also, while you are in there, you could add the statistical corrector stuff I suggested before. You could also consider an indirect jump predictor to make us better with interpreted code. I have in mind a tage-based option, but see _VPC Prediction: Reducing the Cost of Indirect Branches via Hardware-Based Dynamic Devirtualization_ for something that might be more cost effective but less performant. OTOH, see _Short-Circuit Dispatch: Accelerating Virtual Machine Interpreters on Embedded Processors_ which might get incorporated into RISC-V via an extension at some point and would moot this work; if you are going to work on this, ask the people on the ISA list if their dynamic languages extension is going to include something like this.)

13) In a recent thesis, _A Superscalar Out-of-Order x86 Soft Processor for FPGA_, Henry Ting-Hei Wong gets a faster softcore than we currently do for Boom. You could port over some/all of his micro-architectural ideas to Boom to try to get us similarly speedy on an FPGA. (If you are doing #10 you want to read this as well b/c he gives a very good explanation of why certain things are fast or slow on FPGAs.) He and Aasaraai actually disagree about the merits of doing wide-issue in a softcore. Aasaraai says that the IPC gain is outweighed by the clock speed loss and that you are better off with a single-issue OoO. Wong disagrees. (I think the reason for the disagreement may be that Aasaraai used *in-order* super-scalar cores for his comparison.) In any event if you do this, you want to rule on who is right in the context of Boom's design.

14) You could also expand on Wong's microarchitectural comparisons. E.g., He uses a 7r1w register file and has the register rename stage do the equivalent of the live value table. There are alternative options you could explore like banking the register file and clustering the execution units as in _CRAM: Coded Registers for Amplified Multiporting_. He also examines a variety of distributed single-issue schedulers. (And rules out using multi-issue ones for performance reasons). He settles on a hybrid matrix scheduler, but it seems like his schedulers could be improved on via methods like _Efficient Dynamic Scheduling Through Tag Elimination_, _Instruction Packing: Toward Fast and Energy-Efficient Instruction Scheduling_, or _Half-Price Architecture_ (the last of which also cuts the size of the register file). He also reports that the pointer-based scheduler I suggested in my first post doesn't work well in a softcore, but he doesn't explain why. It could be that the issue is with the use of a scoreboard to handle the select logic (and so would crop up in a two stage scheduler like _Select-Free Instruction Scheduling Logic_ but not one like _An Enhancement for a Scheduling Logic Pipelined over two Cycles_ that doesn't use a scoreboard). It could also be an issue with the redispatch logic. So consider trying an alternative pointer design like _Direct Instruction Wakeup for Out-Of-Order Processors_ and comparing them. (You could also use the original design from the Federation core report, but I think that one requires too many ports for a softcore implementation.) Similarly, his design for the Load/Store unit uses the original SVW design instead of the more efficient (but less performant) memory alias table from the federation design. (There's also the possibility that a completely different type of micro-architecture might work well in a softcore implementation. See, e.g.,  _CRIB: Consolidated Rename, Issue, and Bypass_.) Also, if you have access to a tech library, you could see how this stuff synthesizes in hardware vs the current boom design.

15) Re: the macro-op fusion we were discussing. There are basically three options for multi-operand fusion. You can fuse only when you have no more than two register reads and one write. (So in 3-op addition or shift-and-add, one of the operands would be an immediate.) You can add a read port to the register file and fuse for things that require three register reads but only one write. Or you could go whole-hog and add both a read and a write port and fuse as much as possible. I would be interested in seeing area, performance, and power comparisons for both hard and softcore implementations of this vs just going with a 4r2w file and going super-scalar. Is there some space on the efficient frontier between the Rocket core we currently have and a very simple two-way core? (NB: At the other end of the single-issue spectrum, assuming you have a modern tech library, you could see how a very minimal Rocket stacks up against a Cortex M0 and the others from that line. If the comparison is not favorable, you could try to fix this.)

16) Re: the dual-core execution / continuous runahead thing; the alternative option is to go for a very large (~2-4k) instruction window by replacing the ROB with checkpointing-only or with a validation buffer (or one of the hybrid designs from the literature). I.e. enough of a window to just keep going past any LLC miss until the memory returns. The validation buffer makes it easy to do things like control independence, but the checkpoint-only approach is perhaps easier to scale. You'd also have to come up with a way to save on LSQ entries (probably with a noSQ design), registers (early release and/or late allocation), and issue queue entries (a ROB-free variation of long-term parking or some kind of latency prediction followed by either prescheduling or Zephyr-like latency sorting). This is way more work than either of the things I originally proposed. A complete comparison of all the options would be a PhD thesis. But maybe you could do a piece of this for a masters. Adding counters to all the registers (see the register sharing paper I cited above) and making the modifications to have the ROB be a validation buffer might be a good starting point. (I already suggested the noSQ thing as part of idea #1 of my original previous post. If people did both of these, that would leave just the issue queue entries for future work.) If you do this, it's important to quantify the cost of the counters that let you do early register release. In the literature there are disagreements about how expensive this would be in practice. (Also, in addition to what I cited before for continuous runahead, see _Micro-Architectural Techniques to Alleviate Memory-Related Stalls for Transactional and Emerging Workloads_ by Islam Atta.)

17) If you can track down Wong (I can't find his email address; let me know if you do) and get him to share his code with you, you could use his x86 decoder as a black-box unit and work on hybrid DBT for Rocket and/or Boom. See _Efficient Binary Translation for Co-Designed Virtual Machines_ by Shiliang Hu. Even though some of the commercial vendors are working in this direction, I don't personally find it that interesting. I'd be more excited about DBT for one of the loop accelerators I mentioned. Or something like _HW/SW Mechanisms for Instruction Fusion, Issue and Commit in Modern μ-processors_ by Abhishek Deb where the complexity of the OoO processor is reduced by selectively using DBT methods. (If you do this, there are *lots* of suggestions in the literature for DBT-specific micro-ops for various sorts of optimizations.)

--Max

P.S. Someone asked me why I was concerned with having lots of fetch and issue width available as configuration options if I didn't think it was an optimal design point for actual usage. This is a specific example of a more general problem. If we are doing *research* on some area of a micro-architecture, the other parts must not become bottlenecks. For an example of what happens if you don't do this, see _Dependence-Based Scheduling Revisited: A Tale of Two Baselines_.

Tanveer Ahmad

unread,
Apr 16, 2018, 7:56:01 AM4/16/18
to RISC-V HW Dev
Hi Max,

Thank you for your valuable suggestions.

This is Henry Wong's page: http://www.stuffedcow.net/

Could you share more insights on hybrid DBT for Rocket and/or Boom cores? Thanks.

Andrea Merlo

unread,
Apr 28, 2018, 5:17:11 AM4/28/18
to Tanveer Ahmad, RISC-V HW Dev
Hi everyone,

I think that the ideas that have been proposed here are really
interesting and that they could be a driver for lots of students to
join the RISC-V ecosystem.

Since I would like to contribute to some of the projects proposed
here, I was wondering if we have a "Project Ideas" page where we could
collect and structure the information stored in this mailing list. I
was thinking of a wiki page like the coreboot one
(https://www.coreboot.org/Project_Ideas), where detailed information
on each topic, literature, and mentors could be included.

Is it something reasonable to build?

Andrea

Max Hayden Chiz

unread,
Apr 29, 2018, 6:39:16 PM4/29/18
to RISC-V HW Dev
Sorry for taking so long to get back to you. I messed up my elbow and had to have surgery. I'll try to get to you by the end of the week.