I'm not sure what the appropriate process for providing this kind of input is, so I'm sending this to isa-dev list and cc'ing the head of the vector extension working group.
I'm very impressed by RISC-V and everything that has happened since it launched. And I'm strongly considering buying an FPGA board just to play with it and Chisel. (I used to do FPGA development for work, but haven't in years.)
I had a few thoughts regarding the vector extension you are working on. While others have probably had similar ideas, I wanted to share mine in case they are helpful since there doesn't seem to have been much discussion on the list.
1) Even though a vector architecture allows code to run on different hardware implementations, people will still compile fat binaries that detect the microarchitecture and use code tuned to a particular implementation because of the performance gains from cache blocking. But the same ideas that work for variable vector length also work for variable cache size. So I propose having a length mechanism for blocking at least the first cache level used by the vector unit and possibly the full cache hierarchy. You may want to think of the ATLAS BLAS library as a test case. To the extent that you can eliminate machine specificity from their optimizations, it will improve the effective performance of your vector architecture by bringing generic binaries closer to the machine optimized case. Having the cache blocking be determined like vector length also has the benefit that the optimization can be confident that higher level algorithms built on top of BLAS will agree as to the cache blocking size. (ATLAS currently has some limitations on the extent of its blocking optimizations because it can't be confident that libraries calling it will agree with it.)
2) I haven't seen any mention of how you plan to implement segmented scan primitives. Because segmented scan is performance critical for vectorizing code with irregular data structures, how the processor will implement these operations needs careful thought. The approach of Chaterjee et al. _Scan primitives for vector computers_ (
https://www.cs.cmu.edu/~guyb/papers/sc90.pdf) suffers from excessive data copying. (See section 2.1 of
http://manticore.cs.uchicago.edu/papers/ppopp13-flat.pdf) So, there's a case to be made for doing at least some aspects in hardware if you can significantly improve on the performance or memory usage.
The problem with Chaterjee's implementation is that the segmented scan algorithm uses the predicate register to define segment boundaries. Consequently the predicate register can't be used to handle control flow and algorithms built on top of it have to do lots of splitting and merging as a result. Therefore I suggest that the ISA allow for an efficient version of *predicated* segment scan.
Unfortunately, I've never done assembly programming for a vector unit so I'm not confident enough to tell you how to adjust the ISA to do it, but the algorithms that rely on these primitives are an important test case and all existing implementations suffer from an inability to do predication and segmentation at once. To the extent that allowing for both requires hardware, I don't think it will be a major expense and the performance benefits would be significant. (For reference, here is a GPU version of these primitives:
www.idav.ucdavis.edu/func/return_pdf?pub_id=915 and here is the Haskell version for multi-core processors:
https://wiki.epfl.ch/edicpublic/documents/Candidacy%20exam/papers-ndph.pdf)
3) According to Sheikh et al. _Control-Flow Decoupling_ (
https://people.engr.ncsu.edu/ericro/publications/conference_MICRO-45.pdf) about one third of branch mispredictions are branches that can be separated from their loops and executed as vectorized code. Because these are half of all the TAGE mispredictions that can't be if-converted, investigating how many of these occur in scalar loops that can't be auto-vectorized is worthwhile. Intuitively one would expect that there are loops with vectorizable control flow but that have loop dependencies in the data that prevent full vectorization. To the extent that this is common, an extension to deal with it should be worthwhile.
The architecture proposed in that paper uses a branch queue and an optional value queue, but a cleaner design might be possible. My thinking is that the the vector version this would let you calculate the branch outcomes vectorially and then fill the branch queue (or whatever) from a predicate register. If you also allow for reading of vector data registers into the scalar processor, then that should also allow for any partially vectorizable loop to use the vector unit for as much of its code as possible. That said, I'm not sure how common partial vectorization is, especially given that the code examples in the paper all seem fully vectorizable. So this is worth investigating, but as a possible extension.
I hope these thoughts have been helpful. Keep up the good work.
Best wishes,
--Max H. Chiz
New Orleans, LA