YouTube Live Interview on the PERCIVAL Posit RISC-V Core


David Mallasén Quintana

Mar 14, 2023, 5:33:13 AM
to Unum Computing
Hello all,

Next Thursday (March 23rd) at 19:00 CET, I will be interviewed live on YouTube by Antonio L. Flores-Galea about our PERCIVAL posit RISC-V core. We will be talking about our work at UCM on posits and their impact. Make sure to tune in and bring any questions or comments you may have!

This interview will be aimed mainly at the general public, but if you want more depth you can also find a more detailed presentation on PERCIVAL from ARITH 2022 here: https://video.computer.org/ARITH22/14SvIRmYEyLOv0L5ccbnam-session4-DavidMallasen.mp4

Best,
David Mallasén

MitchAlsup

Mar 14, 2023, 3:00:07 PM
to Unum Computing
Congratulations, well done.

Do you think there are means to shrink the quire part of the posit unit?

David Mallasén Quintana

Mar 16, 2023, 5:44:28 AM
to Unum Computing
Thanks. As always, there are tradeoffs in that area. We explored part of this in our paper:
R. Murillo, D. Mallasén, A. A. Del Barrio, and G. Botella, “Energy-Efficient MAC Units for Fused Posit Arithmetic,” in 2021 IEEE 39th International Conference on Computer Design (ICCD), Oct. 2021, pp. 138–145. doi: 10.1109/ICCD53106.2021.00032.

John Gustafson

Mar 26, 2023, 5:56:51 PM
to MitchAlsup, Unum Computing
I have suggested a way to make the quire very inexpensive in silicon, but so far no one has taken me up on it. Use 16 of the general-purpose registers as a quire, with hardware support for the shifting and carry propagation. There's a long history of treating register pairs as double-precision, going back at least as far as IBM's System/360 in the 1960s. Just extend the idea to 16 registers. If the code uses quire accumulation, just push the 16 registers onto the stack, use them as a quire, and when the quire work is done, pop the saved registers back into place.

Perhaps a hardware expert in the Unum Computing group can explain to me why it's better to create a separate register for the quire instead of ganging the registers already defined in RISC-V.

Best,
John

--
You received this message because you are subscribed to the Google Groups "Unum Computing" group.
To unsubscribe from this group and stop receiving emails from it, send an email to unum-computin...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/unum-computing/1192fd51-7c0c-4397-8954-39173dc75f3en%40googlegroups.com.


John Gustafson

Mar 28, 2023, 6:24:05 PM
to MitchAlsup, Unum Computing
Thanks, Mitch and others, for the detailed explanation of why ganging registers together is a losing idea. I will stop bringing it up!

Let me cut to your final four questions, and I will in-line my responses.

I understand the motivation behind the quire, and I admire whoever thought it up, but there are a few points I do not understand about a quire:

The person who deserves the most credit is Professor Ulrich Kulisch of the University of Karlsruhe. He built the first computing system with support for an exact dot product in the late 1970s, and also built an amazing mathematical foundation for how to use it. He fought to get it into the IEEE 754 Standard, but it was rejected as too expensive by Kahan, who said double the working precision would be good enough. Kulisch also fought to get exact dot products into the IEEE 1788 Standard for Interval Arithmetic, and they were there for about six months until someone (I think at IBM) objected, and they were silently dropped from that standard.

I think there are two main reasons the exact dot product never caught on with floats. One is that 64-bit floats have an absurdly large dynamic range (since exponent adders were much cheaper to build than significand multipliers, and transistors were millions of times more expensive than they are now). The accumulator needed to be, like, 4,664 bits wide or something like that. Not an integer power of 2. The other reason, and no one ever seems to bring this up, is the issue of how you deal with all the exception cases in IEEE floats. An exact dot product is a big fixed-point number, so how do you handle infinities and NaN? It's a mess. Posit arithmetic avoids both problems. The quire is always 16 times the size of the posit it supports, and like posits it has just the one NaR exception (a 1 followed by all 0 bits).
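To make the idea concrete, here is a toy Python model of exact accumulation, not posit hardware: the "quire" is just one wide scaled integer, and the names `quire_dot` and `SCALE` are made up for this sketch. A real quire is a fixed-width register sized by the Standard; here the scale is an assumption chosen to hold the products exactly.

```python
from fractions import Fraction

# Toy model of a quire: one wide fixed-point accumulator.
# For a posit of n bits, the 2022 Posit Standard fixes the quire at 16*n bits;
# here we just pick a fractional scale large enough for the example (an assumption).
SCALE = 2**128

def quire_dot(xs, ys):
    """Accumulate sum(x*y) exactly in a scaled integer; round once at the end."""
    acc = 0  # the "quire": exact, no per-term rounding
    for x, y in zip(xs, ys):
        # Fraction keeps each product exact; scaling makes it an integer
        term = Fraction(x) * Fraction(y) * SCALE
        assert term.denominator == 1, "inputs must fit the toy model's scale"
        acc += term.numerator
    return acc / SCALE  # a single rounding, back to a float here

# Classic cancellation example: naive float accumulation loses the answer,
# the exact accumulator keeps it.
xs = [1e20, 1.0, -1e20]
ys = [1.0, 1.0, 1.0]
naive = sum(x * y for x, y in zip(xs, ys))   # 0.0: the 1.0 is absorbed and lost
exact = quire_dot(xs, ys)                    # 1.0: exact until the final rounding
```

The point of the model is only that deferring all rounding to a single final step makes the dot product independent of summation order and cancellation.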

a) is there always 0 or 1 quire ? One would think that once a feature has risen into source code expression, SW can get/use as many as it desires !?!

Since it is possible to read a quire from memory and store it to memory, you can keep as many of them going as you can afford. When doing parallel processing with message passing, the quire has to be passed between processors when a dot product is computed on more than one of them, so obviously there has to be a way to get quires to and from memory. While the quire is envisioned as a register, the Posit Standard (2022) makes no such assumption or implication in its text. It can certainly be supported entirely in software, and there is full support for quire operations in Cerlane Leong's SoftPosit if you want to see a reference implementation.

b) is there (will there ever be) any quire OP quire arithmetic ?

Yes, add and subtract. You have to be able to do that for parallel add reduction and parallel dot products. If you're wondering about multiplying two quires: eek! That is what the Germans would call a Schnapsidee. Very expensive or very slow or both, and almost invariably unnecessary. There's nothing forbidding it in the Standard, but don't expect it to be supported in posit environments.

There's nothing forbidding people from constructing their own library routines to do operations on quires outside the Standard. We were very tempted to standardize an operation of shifting the quire by N bits (in either direction) before rounding back to a posit, since that could help keep the calculations in the high-accuracy "golden zone."

c) if you had a Padé polynomial to evaluate, with only 1 quire you have to evaluate the numerator, and then evaluate the denominator separately; this might be a burden on the compiler (writer) since the polynomials are generally expressed in pairs of Horner or Estrin evaluations finishing up with FDIV. (this smells like more of (a) to me)

Evaluate the numerator and use the quire to get it right to 0.5 ULP. Kulisch has a lot of papers and books about how to do this. For safety, you can increase accuracy this way, where numeratorHigh and numeratorLow are posit data types:

numeratorHigh = qToP(quire);
quire -= numeratorHigh;
numeratorLow = qToP(quire);

That gives the numerator in extended precision, like the "double-double" libraries that simulate quad precision floats. The numerator is stored as an unevaluated sum of high-order and low-order bits.

The quire is then free to evaluate the denominator to within 0.5 ULP. No need for two quires. With the denominator in posit format, compute

quotient = numeratorHigh / denominator + numeratorLow / denominator;

It's not bombproof, but it will be within 0.5 ULP of the exact quotient most of the time. If you're really fanatical, you can use the quire to compute the remainder exactly:

quire = numeratorHigh + numeratorLow - quotient * denominator;

which can then be used to correct the quotient.
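The high/low extraction above can be sketched in a few lines of Python, with ordinary floats standing in for posits and an exact `Fraction` standing in for the quire. This is a hedged model, not the Standard's API: `q_to_p` is a stand-in name for the quire-to-posit rounding, and the example inputs are chosen so the first rounding visibly loses a bit that the low part recovers.

```python
from fractions import Fraction

# Floats stand in for posits; Fraction stands in for the exact quire.
def q_to_p(quire):
    """Round the 'quire' to the working format (stand-in for qToP)."""
    return float(quire)

# Accumulate the numerator exactly in the quire. The true sum 1e16 + 1
# needs more precision than a double can hold in one value.
num_terms = [1e16, 1.0]
quire = sum(map(Fraction, num_terms), Fraction(0))

numerator_high = q_to_p(quire)        # correctly rounded: loses the +1
quire -= Fraction(numerator_high)
numerator_low = q_to_p(quire)         # residual: exactly the lost 1.0

# numerator_high + numerator_low is the unevaluated sum ("double-double" style).
# The quire is now free to accumulate the denominator:
denominator = q_to_p(sum(map(Fraction, [2.0]), Fraction(0)))

# Divide term by term, as in the text:
quotient = numerator_high / denominator + numerator_low / denominator
```

The pair (`numerator_high`, `numerator_low`) carries the numerator to twice working precision, which is the whole trick: one quire, reused sequentially, is enough.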

d) swapping of quires into these semi-dedicated registers will not be fun for the HW or compiler--leading to code explosion and poor performance.

Whether you are doing Machine Learning or LINPACK, you have long dot products but only one accumulator. There should be very few cases where you need more than one quire at a time. Hardware implementations show that doing a multiply-accumulate into the quire is cheap, even cheaper than doing it with fused multiply-adds since there is no normalization and no rounding needed. Converting the quire back to a posit is admittedly expensive, but you amortize that cost over thousands of (exact) multiply-accumulates.

Interval arithmetic is a tough sell because it requires a lot of study to figure out how to avoid results that all look like [–∞, ∞]. Only a small group of people have that knowledge and skill set. The quire is a much, much cheaper way to let people access rigorous calculations instead of just crossing fingers and hoping that rounding error won't silently ruin your results.


Best,
John

Oscar Smith

Mar 28, 2023, 11:41:53 PM
to Unum Computing
> Whether you are doing Machine Learning or LINPACK, you have long dot products but only one accumulator.

What about matrix-matrix multiply? The naive matmul with inner products works out OK for quires, but a modern CPU will be significantly load-starved if you implement your matmul that way (and for all the benefits of posits, they don't make your RAM or cache any faster).

Once you start looking into how you want to do an optimized matmul with register tiling, posits start looking a lot worse. For an example of a relatively simple kernel, see https://github.com/flame/blis/blob/master/kernels/haswell/3/bli_gemm_haswell_asm_d8x6.c (lines 151 to 253). This is an AVX-2 kernel, which is on the smaller side for modern desktops, but even it accumulates into 12 registers (96 elements at a time). More modern systems like AVX-512, or ARM/RISC-V CPUs with bigger or more registers, will accumulate in even more pieces (e.g., RISC-V with 256-bit vectors would want to accumulate into 27 registers = 216 float32s). Doing so gives a compute/load ratio of 3.4 for AVX-2 and 4.2 for RISC-V.
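As a rough sanity check on those figures, the compute/load ratio of a register tile can be modeled in a few lines of Python. This is a simplified model I'm assuming (it ignores cache reuse and loop overheads), and `compute_load_ratio` is a made-up name, not anything from BLIS.

```python
# Simplified register-tiling model for GEMM: with an m x n tile of
# accumulator registers, each step along the k dimension loads m elements
# of A and n elements of B, then performs m*n multiply-accumulates.
def compute_load_ratio(m, n):
    """FMAs per element loaded for an m x n accumulator tile."""
    return (m * n) / (m + n)

# The Haswell kernel linked above accumulates an 8x6 tile of doubles:
haswell = compute_load_ratio(8, 6)    # about 3.43, matching the ~3.4 figure
# Bigger tiles raise the ratio, which is why wider register files help:
wider = compute_load_ratio(16, 14)
```

The ratio grows with tile size, so an architecture that dedicates registers to a single quire gives up exactly the resource that keeps a tiled kernel compute-bound.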

How would you implement a gemm using a single quire without being completely load bound?

Tommy Thorn

Apr 2, 2023, 3:41:14 PM
to MitchAlsup, Unum Computing
For a superscalar renaming microprocessor, abusing integer registers for the quire is a flat-out non-starter. It would severely impact the highly optimized data paths.

Also, even if you architecturally have only a single quire, your execution window is 300-plus instructions (and growing), and there will likely be more than a single quire active.

Tommy

On Tue, Mar 28, 2023 at 13:05 'MitchAlsup' via Unum Computing <unum-co...@googlegroups.com> wrote:


On Sunday, March 26, 2023 at 4:56:51 PM UTC-5 johngustafson wrote:
I have suggested a way to make the quire very inexpensive in silicon, but so far no one has taken me up on it. Use 16 of the general-purpose registers as a quire, with hardware support for the shifting and carry propagation. There's a long history of treating register pairs as double-precision going back at least as far as IBM's System 360 back in the 1960s. Just extend the idea to 16 registers. If the code uses a quire accumulation, just push the 16 registers onto the stack, use them as a quire, and when the quire work is done, pop the saved registers back into place.

HW people really hate ganging registers, and compiler writers really hate ganging registers, but for different reasons; architects have other reasons to choose differently.
<-----------------------------
Over on the HW side, we hate having instructions that require odd get-started sequences and odd finish-up sequences, which is what your current suggestion would do (I think).
<-----------------------------
Over on the compiler side, compiler writers would have to "see" that a quire was required before doing any register allocation (probably not hard), then set up a prologue and epilogue to deal with the swath of registers required (also not very hard). But if you ran into a subroutine that had one path using a quire and another path devoid of a quire, then the subroutine would have to pay the quire overheads even for the majority of subroutine calls that did not need the quire. Pushing prologue and epilogue sequences into nested blocks is quite hard (although doable, it is rarely done).

Then you have the situation where the compiler needs 20+ FP registers to perform the source algorithm, and by allocating a quire you end up with a bunch of spill/fill instructions in the subroutine. This is likely the biggest enemy of your current train of thought.
<-----------------------------
Architects generally have a free hand in choosing certain features of their architecture; one of these is the separate versus unified register model. My 66000 has chosen the unified register model, while many have chosen the separate-registers model (more choose separate than unified, by a considerable margin). For those who choose the unified model, mapping a quire into the register file is a high enough hill to climb that it would discourage adopting posits over IEEE 754. Something I suspect you don't want at this moment.

I have been studying various numeric benchmarks using the LLVM RISC-V compiler and the My 66000 LLVM compiler (same source code, same compiler flags, trying to make instruction-set comparisons as fair as possible), and I have run into several subroutines where the RISC-V LLVM compiler had to generate spill/fill instructions even though it has 32 64-bit GPRs and 32 64-bit FPRs, while I have only 32 64-bit GPRs and need no spill/fill instructions. So the choice of unified versus separate is a bit more delicate than one might surmise--it depends on other features of the ISA--and in this case, features RISC-V did not have.

Perhaps a hardware expert in the Unum Computing group can explain to me why it's better to create a separate register for the quire instead of ganging the registers already defined in RISC-V.

I understand the motivation behind the quire, and I admire whoever thought it up, but there are a few points I do not understand about a quire:
a) is there always 0 or 1 quire ? One would think that once a feature has risen into source code expression, SW can get/use as many as it desires !?!
b) is there (will there ever be) any quire OP quire arithmetic ?
c) if you had a Padé polynomial to evaluate, with only 1 quire you have to evaluate the numerator, and then evaluate the denominator separately; this might be a burden on the compiler (writer) since the polynomials are generally expressed in pairs of Horner or Estrin evaluations finishing up with FDIV. (this smells like more of (a) to me)
d) swapping of quires into these semi-dedicated registers will not be fun for the HW or compiler--leading to code explosion and poor performance.

You ask;
Mitch

Best,
John

John Gustafson

Apr 2, 2023, 8:16:59 PM
to Oscar Smith, Unum Computing, Lenore Mullin
Oscar,

You raise an interesting point. The quire will require re-thinking some of the ways of tiling the matrices for optimum speed without sacrificing accuracy. In the architectures for which I've optimized GEMM, the submatrices are seldom square. They wind up being rectangular, like 2:1 or 4:1 on a conventional microprocessor, or more eccentric, like 24:1, on a ClearSpeed accelerator (a 64-bit PCI board that predated GPGPUs).

Pipelined architectures tend to encourage the computation of as many dot products at once as there are stages in the floating-point adder. Typically four. A well-designed quire can keep the multiplier-accumulator busy doing a single dot product, so you get top speed less the amortized cost of converting the final sum back to posit format. That suggests to me that you can finish a dot product with one quire, put it into temporary storage if necessary (in general its inputs won't be the full length of the row of A and column of B, so it's a partial sum), then do the next dot product with re-use of input operands to keep memory bandwidth from limiting performance. The work done by Biancolin et al. on the ASPIRE project at Berkeley has shown that full-speed exact dot products can be built for floats. They're much easier to build with posits.

There is a science to optimizing GEMM and a range of other tensor product operations, and Prof. Lenore Mullin has the whole mathematical framework for it if you want to contact her directly; I've cc'd her. In a nutshell, for any given architecture it is possible to write down a closed-form expression for the time taken by a GEMM once you know the costs and parallelism of every level of the memory hierarchy and the speed and number of parallel multiply-accumulators, and it will be a function of the tiling dimensions.

Once you have that closed form expression, you can find the optimum tiling dimensions. And it may be that you only get 80% of peak speed and discover that with two quires you might get close to 90% of peak speed, so yes, it might be worth it to have another quire. The optimum for a quire-enabled system will be quite different from the optimum for an AVX-512 or an ARM or a GPU. This looks like a good idea for a technical paper, gang!

Best,
John


Oscar Smith

Apr 4, 2023, 12:41:09 PM
to Unum Computing
> Pipelined architectures tend to encourage the computation of as many dot products at once as there are stages in the floating-point adder

This is the minimum, but the more you do at once, the lower the load pressure. Most CPUs are memory-bound, not compute-bound (even for matmuls). Matmuls require a cubic number of loads, and using a bigger kernel (more concurrent accumulators) reduces the coefficient.

While "a well-designed quire can keep the multiplier-accumulator busy doing a single dot product", on modern vectorized CPUs you can do 512 bits of accumulation at a time, and there is no way for a well-designed quire to keep the multiplier-accumulator busy while doing 8 dot products at a time (and some CPUs now have the capability to do 1024 bits of FMA per cycle).

Lenore Mullin

Apr 4, 2023, 3:58:55 PM
to John Gustafson, Oscar Smith, Unum Computing
The MoA matrix multiplication accesses all arrays contiguously, and that includes a new representation for sparse arrays.
I attach a paper under review to help you understand the design of the dense version, but what is most cool is that the same algorithm works for sparse MM. I can demo that. I would love to be able to combine my work with John's. How can I help?
--
"Great spirits have always encountered violent opposition from mediocre minds" - Albert Einstein

MitchAlsup

Apr 4, 2023, 4:33:54 PM
to Unum Computing
On Sunday, April 2, 2023 at 2:41:14 PM UTC-5 tommy...@gmail.com wrote:
For a superscalar renaming microprocessor, abusing integer registers for the quire is a flat-out non-starter. It would severely impact the highly optimized data paths.

I would assume that consuming ½ (or even ¼) of your register file for one container would be a non-starter in any architecture, not just GBOoO ones; no matter how clever you are about its position in the register file and ISA.

Also, even if you architecturally only have a single quire, your execution window is 300 and growing instructions wide and there will likely be more than a single quire active.

And if it takes 16 STs to push one out and 16 LDs to bring the next one in, you have a lot of memory references to hide.

But even John has withdrawn his support for this proposal, so it is pretty much dead. This throws "how a quire is manipulated, accessed, and addressed" back into the hands of ISA and data-path architects.

Mitch

Angelo Bulfone

Sep 1, 2023, 3:23:24 AM
to Unum Computing
What about the RISC-V vector registers? They already support register grouping with LMUL and EMUL. Whatever the current LMUL for vectorized posit arithmetic is, quire operations would implicitly use an EMUL of LMUL×16. Given the available LMUL options, this would at least support LMUL 1/8 with a quire EMUL of 2, LMUL 1/4 with EMUL 4, and LMUL 1/2 with EMUL 8. Alternatively, there could be a multiply-accumulate-reduce instruction that multiplies two posit vectors and then accumulates the results directly into a quire, which, depending on the circumstances, may even use only a single vector register for the target quire.

If the quire needs a lot of register space, why not use the existing large register file that will be implemented anyway? There's nothing stopping you from using VL=1 to perform scalar operations in the vector registers while still having the full quire available, even if the registers in question are only 64 bits wide.

Angelo Bulfone

Sep 8, 2023, 3:37:37 AM
to Unum Computing
While currently reserved, the RISC-V V extension has encodings for Selected Element Width and Extended Memory Element Width that seem intended to eventually allow 128-, 256-, 512-, and 1024-bit elements in the vector registers. As a bonus, the encoding for the extended element widths can have the lower bits be exactly the same as the width of the posits they correspond to. Combine these with the usual integer operations, and most of what you need to support a quire is already available with the existing instructions. A minor kink is that the Zve* extensions currently require that each vector register be at least as wide as the widest element it supports, but given the extended element widths, that can probably be fudged. That just leaves instructions to convert to and from a quire with an extended element width, and maybe even a mul_add_reduce_into_quire instruction, but that seems a bit too ambitious.

If the posits themselves are stored in the vector registers as well, then the usual comparisons, negations, and shifts become available. vnclip is perfect for converting wider posits into smaller ones, thanks to the fixed-point rounding mode (vxrm) register allowing round-to-nearest-even behavior, and the opposite direction can use vwsll, though that's only available in Zvbb rather than the usual vector extensions.
