Reasoning behind the removal of FMA in Posit


Oscar Smith

Mar 5, 2023, 1:50:02 AM
to Unum Computing
Hi John and all,
I've been really interested in posits for a while, as I believe they solve a lot of significant problems with IEEE numbers. However, in my opinion, the lack of a fused multiply-add operation in the Posit Standard is a major omission that makes posits less viable as a format for writing efficient algorithms.

Arguments in favor of FMA:
1. FMA can be implemented efficiently in hardware but not in software. Quires help a lot with FMA emulation but are likely to be a fairly slow replacement (see point 5 below).
2. Polynomial evaluation computing x*(x*(x*(x+d)+c)+b)+a (e.g. Horner's algorithm for polynomials) is 2x faster and more accurate (0.5 vs 1.5 ULP) with FMA. This is especially important if we want CPUs in the future to be posit-only, since pretty much every elementary function needs accurate and fast polynomial evaluation.
3. Convolution evaluation.
4. The fMM function is already a required 3-argument function, so any posit instruction set already has to deal with functions of 3 arguments.
5. Quires don't work well with vectorization. SIMD and vector instruction sets are vital on modern computers for achieving high performance, but vectorized quire instructions will inherently be dramatically slower due to the increased amount of data that needs to be transferred. For example, a 512-bit-wide SIMD FMA instruction on 16-bit floats requires reading 512×3 = 1536 bits and writing 512 bits (and requires 1536 bits of register space). To perform an equivalent operation with posits currently requires 3 instructions: pToQ, qMulAdd, and qToP. Since the 32 posit16 quires require 8192 bits, and this operation requires two reads and writes of all of them, it will currently require 9728 bits of register space (8192 for the quires and 1536 for the posits) and over 16 kilobits of reads and writes. The posit solution thus requires over 10x as much data transfer and 6x as much register space, so we are likely to see at least one of 3 things:

1. No vectorized quire instructions (making the operations 32x slower)
2. Limited vector width
3. Dramatically lower performance.
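The register and bandwidth bookkeeping in point 5 can be sketched as a few lines of arithmetic. This is a rough model of my reading of the numbers above (the 256-bit posit16 quire follows from the 16n-bit quire size in the 2022 Posit Standard; the traffic accounting for the three-instruction sequence is an assumption, not a measurement):

```python
# Register-footprint and traffic arithmetic for one 512-bit SIMD FMA
# versus the quire-based pToQ / qMulAdd / qToP sequence described above.
VECTOR_BITS = 512
ELEM_BITS = 16
LANES = VECTOR_BITS // ELEM_BITS              # 32 elements per vector

# Plain SIMD FMA: read three vector operands, write one result.
fma_reg_bits = 3 * VECTOR_BITS                # 1536 bits of register space
fma_traffic = 3 * VECTOR_BITS + VECTOR_BITS   # 2048 bits moved

# Quire path: each posit16 quire is 16 * 16 = 256 bits wide.
QUIRE_BITS = ELEM_BITS * ELEM_BITS
quire_reg_bits = LANES * QUIRE_BITS           # 8192 bits of quire state
total_reg_bits = quire_reg_bits + fma_reg_bits   # 9728 bits

# The three instructions together read and write the full quire state
# twice over (rough model), on top of the posit operand traffic.
quire_traffic = 2 * 2 * quire_reg_bits + fma_traffic

print(quire_reg_bits, total_reg_bits)                       # 8192 9728
print(quire_traffic / fma_traffic, total_reg_bits / fma_reg_bits)
```

Under this model the quire path moves roughly 17x the data and holds over 6x the register state, consistent with the "over 10x" and "6x" figures above.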

All in all, I understand that FMA isn't as essential for posits, but I think that not requiring them in the spec will do a significant disservice to implementers and users of posit-focused machines.
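For concreteness, here is what the Horner evaluation from point 2 looks like when each step is one fused operation. This is a minimal sketch: the `fma` parameter stands in for a hardware fused multiply-add (one rounding per step), and the default lambda rounds twice per step, which is exactly the accuracy difference being argued about.

```python
def horner(coeffs, x, fma=lambda a, b, c: a * b + c):
    """Evaluate sum(coeffs[i] * x**i) by Horner's rule, one fused
    multiply-add per coefficient. The default `fma` is an unfused
    stand-in (two roundings per step); a real FMA rounds once."""
    acc = coeffs[-1]
    for c in reversed(coeffs[:-1]):
        acc = fma(acc, x, c)
    return acc

# x*(x*(x*(x+d)+c)+b)+a with a=1, b=2, c=3, d=4 at x=2:
print(horner([1, 2, 3, 4, 1], 2.0))   # 65.0
```

Note that the whole loop is a single serial dependency chain on `acc`, which is the latency issue discussed later in this thread.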

MitchAlsup

Mar 5, 2023, 7:13:19 PM
to Unum Computing
On Sunday, March 5, 2023 at 12:50:02 AM UTC-6 oscard...@gmail.com wrote:
Hi John and all,
I've been really interested in posits for a while, as I believe they solve a lot of significant problems with IEEE numbers. However, in my opinion, the lack of a fused multiply-add operation in the Posit Standard is a major omission that makes posits less viable as a format for writing efficient algorithms.

Arguments in favor of FMA:
1. FMA can be efficiently implemented in hardware but not in software. Quires help a lot with fma emulation but are likely to be a fairly slow replacement (see point 5 below)
2. Polynomial evaluation computing x*(x*(x*(x+d)+c)+b)+a (e.g. Horner's algorithm for polynomials) is 2x faster and more accurate (0.5 vs 1.5 ULP) with FMA. This is especially important if we want CPUs in the future to be posit-only, since pretty much every elementary function needs accurate and fast polynomial evaluation.

Yes, polynomial evaluation needs some mechanism to be fast and efficient.
No, Horner's method is not one of them.
Estrin's method tree-ifies the multiplications and is significantly superior to Horner's method.
However, as explained in my patent (10,671,806), inside HW there is an evaluation means that is superior to Estrin's method.
This is one reason I have put Transcendental instructions back in my ISA... so I can evaluate in HW.
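For reference, Estrin's tree-ified evaluation of a degree-7 polynomial pairs adjacent coefficients and then combines the pairs with powers of r². A minimal sketch (coefficients low-order first is an assumed convention here):

```python
def estrin_deg7(C, r):
    """Estrin's scheme for C[0] + C[1]*r + ... + C[7]*r**7.
    The four K-pair FMAs are mutually independent, and so are the
    squarings, which is what exposes the instruction-level
    parallelism that Horner's serial chain lacks."""
    r2 = r * r
    r4 = r2 * r2
    k0 = C[0] + C[1] * r
    k2 = C[2] + C[3] * r
    k4 = C[4] + C[5] * r
    k6 = C[6] + C[7] * r
    return (k0 + k2 * r2) + (k4 + k6 * r2) * r4

C = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(estrin_deg7(C, 2.0))   # 1793.0, matching direct evaluation
```

Note that every operation here is still a multiply-add, so Estrin's method benefits from an FMA instruction just as Horner's does.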
 
3. Convolution evaluation.

Big convolutions are done with FFT, multiplication, inverse FFT.
Big correlations are done with FFT, conjugate multiplication, inverse FFT.

FFTing pays off once your number of points gets above about 32.
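A rough operation-count model illustrates the ~32-point crossover. The constants here (radix-2 FFTs of the zero-padded length 2n, counting complex multiplies, ignoring real-vs-complex savings) are assumptions for the sketch, not measurements:

```python
import math

def direct_conv_mults(n):
    # Direct (time-domain) convolution of two length-n sequences:
    # roughly n*n multiply-adds.
    return n * n

def fft_conv_mults(n):
    # Two forward FFTs + pointwise product + one inverse FFT, each
    # FFT of length m = 2n costing ~(m/2)*log2(m) complex multiplies.
    m = 2 * n
    per_fft = (m // 2) * int(math.log2(m))
    return 3 * per_fft + m

for n in (16, 32, 64):
    print(n, direct_conv_mults(n), fft_conv_mults(n))
```

Under this model the direct method still wins at n = 16 (256 vs 272 multiplies) and loses by n = 32 (1024 vs 640), consistent with the crossover stated above.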
 
4. The fMM function is already a required 3-argument function, so any posit instruction set already has to deal with functions of 3 arguments.
5. Quires don't work well with vectorization.
I take no issue with that statement.
SIMD and vector instruction sets are vital on modern computers for achieving high performance, but vectorized quire instructions will inherently be dramatically slower due to the increased amount of data that needs to be transferred. For example, a 512-bit-wide SIMD FMA instruction on 16-bit floats requires reading 512×3 = 1536 bits and writing 512 bits (and requires 1536 bits of register space). To perform an equivalent operation with posits currently requires 3 instructions: pToQ, qMulAdd, and qToP. Since the 32 posit16 quires require 8192 bits, and this operation requires two reads and writes of all of them, it will currently require 9728 bits of register space (8192 for the quires and 1536 for the posits) and over 16 kilobits of reads and writes. The posit solution thus requires over 10x as much data transfer and 6x as much register space.
 
You are assuming that SIMD instructions and Vector instructions are the only reasonable means to "strip mine data through function units". There are other means: Virtual Vector Method is one.

As such, we are likely to see at least one of 3 things.

1. No vectorized quire instructions (making the operations 32x slower)
2. Limited vector width
3. Dramatically lower performance.

All in all, I understand that FMA isn't as essential for posits, but I think that not requiring them in the spec will do a significant disservice to implementers and users of posit-focused machines.

I guess I find myself surprised that there is no FMAC instruction requirement in posits and
I guess I find myself surprised that there is no FMAC numerical specification in posits.
I find myself surprised that there is no equivalent to Kahan-Babuška summation in posits, but perhaps
this is exactly what is attacked with quires from a different direction (perfection rather than good enough).

Is there an arithmetic reason for leaving this 3-operand functionality out of posits ?? 

Oscar Smith

Mar 5, 2023, 7:51:53 PM
to Unum Computing
Horner's method is definitely one of them. Estrin's will be faster in some cases (mainly when you have nothing else for the CPU/GPU to do), but Horner's method is slightly more accurate and uses strictly fewer operations, which leads to higher speed as long as there is enough other work to be done at the same time. With the amount of register renaming and operation re-ordering modern CPUs do, Horner's will often win out (or the difference will be in the noise). All of this is a tangent anyway, since Estrin's algorithm also requires FMA operations. Your patent is interesting, but since it's been ~25 years and I've yet to see an implementation of it, it doesn't seem especially relevant.

Do you have a reference for the Virtual Vector Method? A quick google only turns up ViVA from the IBM Cray supercomputers. If this is what you're talking about, I don't understand why this is a contradiction of what I have written above. The concern I have with speed doesn't have to do with the UI, but rather the sheer amount of data that must be read and written by registers, and the amount of processing that must be done by the ALU.

John Gustafson

Mar 5, 2023, 8:05:17 PM
to Oscar Smith, Unum Computing
Hi Oscar and all,

Responses embedded:

Hi John and all,
I've been really interested in posits for a while, as I believe they solve a lot of significant problems with IEEE numbers. However, in my opinion, the lack of a fused multiply-add operation in the Posit Standard is a major omission that makes posits less viable as a format for writing efficient algorithms.

It's not omitted, because fused dot product is in the Posit standard, and fused multiply-add is a subset of that. The Working Group endeavored to make the Standard as concise as possible, and maybe it was so concise that this point did not come across. But we realize that fused multiply-add is easier to build and possibly faster to execute than accumulating in the quire and converting back to posit, so we put this line in the Standard:

Fused expressions (such as fused multiply-add and fused multiply-subtract) need not be performed using quire representations to be posit compliant.

There are no performance requirements in the Standard, intentionally. It's perfectly OK to support all the quire operations in software, for example… but they HAVE to be supported. We envisioned that hardware engineers would provide fused multiply-add for posits, and would exploit the fact that you don't need a full quire for that.

Arguments in favor of FMA:
1. FMA can be implemented efficiently in hardware but not in software. Quires help a lot with FMA emulation but are likely to be a fairly slow replacement (see point 5 below).

Yes, that's why the sentence was put in the Standard, excerpted above. You say they are likely to be slow, but those who have actually built quires and tested them have found them to be faster than fused multiply-add. To accumulate in the quire is simply a shift-and-add of an exact integer product, whereas a fused multiply-add must align the radix points and round for the add part, with a possible need to count leading zeros if there is digit cancellation on the left. Kulisch discovered this over 50 years ago, and the Berkeley ASPIRE researchers discovered it again just a few years ago when they built an exact dot product accumulator for 64-bit IEEE floats.
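The "accumulate exactly, round once" idea can be sketched in software with exact rationals standing in for the quire. This is purely an illustration of the numerical behavior, not of the Standard's quire operations: a real quire is a wide fixed-point register, and `Fraction` is just a convenient exact accumulator.

```python
from fractions import Fraction

def quire_dot(xs, ys):
    """Dot product the quire way: every product is accumulated
    exactly (Fraction stands in for the wide fixed-point quire),
    and the result is rounded to the working format exactly once."""
    acc = Fraction(0)
    for x, y in zip(xs, ys):
        acc += Fraction(x) * Fraction(y)   # exact product, exact add
    return float(acc)                      # the single final rounding

def float_dot(xs, ys):
    """Conventional dot product: one rounding after every operation."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc += x * y
    return acc

xs = [1e16, 1.0, -1e16]
ys = [1.0, 1.0, 1.0]
print(float_dot(xs, ys))   # 0.0: the 1.0 is absorbed, then cancelled
print(quire_dot(xs, ys))   # 1.0: exact accumulation keeps it
```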

2. Polynomial evaluation computing x*(x*(x*(x+d)+c)+b)+a (e.g. Horner's algorithm for polynomials) is 2x faster and more accurate (0.5 vs 1.5 ULP) with FMA. This is especially important if we want CPUs in the future to be posit-only, since pretty much every elementary function needs accurate and fast polynomial evaluation.

Actually, Horner's algorithm is unstable near the roots of polynomials and this is a considerable source of grief. The cure is to express Horner's rule as a lower triangular system of equations, like this:

(attachment: Lt=b.pdf)
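Presumably the attachment shows something like the following: writing the Horner recurrence $s_k = c_k + x\,s_{k+1}$ (with $p(x) = s_0$) as a lower bidiagonal linear system $L s = c$. This is a reconstruction from the surrounding text, not the actual attachment; for a cubic it reads:

```latex
\begin{pmatrix}
 1 &    &    &   \\
-x &  1 &    &   \\
   & -x &  1 &   \\
   &    & -x & 1
\end{pmatrix}
\begin{pmatrix} s_3 \\ s_2 \\ s_1 \\ s_0 \end{pmatrix}
=
\begin{pmatrix} c_3 \\ c_2 \\ c_1 \\ c_0 \end{pmatrix},
\qquad p(x) = s_0 .
```

Forward substitution on this system reproduces Horner's recurrence term by term, which opens the door to linear-system techniques (iterative refinement, error analysis) for polynomial evaluation.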

MitchAlsup

Mar 5, 2023, 8:53:16 PM
to Unum Computing
On Sunday, March 5, 2023 at 6:51:53 PM UTC-6 oscard...@gmail.com wrote:
Horner's method is definitely one of them. Estrin's will be faster in some cases (mainly when you have nothing else for the CPU/GPU to do), but Horner's method is slightly more accurate and uses strictly fewer operations, which leads to higher speed as long as there is enough other work to be done at the same time. With the amount of register renaming and operation re-ordering modern CPUs do, Horner's will often win out (or the difference will be in the noise). All of this is a tangent anyway, since Estrin's algorithm also requires FMA operations. Your patent is interesting, but since it's been ~25 years and I've yet to see an implementation of it, it doesn't seem especially relevant.

My patent, granted and issued in September 2021, is 25 years old ???

In any event, using FMAC (which John, below, states IS part of posits) with a 4-cycle latency, one
can start a new Horner multiplication once every 4 cycles. Horner's method::

The calculation sequence for the degree-7 polynomial from beginning to end is:

     S6   = C6 + r×C7
     S5   = C5 + r×S6
     S4   = C4 + r×S5
     S3   = C3 + r×S4
     S2   = C2 + r×S3
     S1   = C1 + r×S2
     p(r) = C0 + r×S1

And taking data dependencies into account it goes::

     S6   = C6 + r×C7
                        // 3 cycles waiting for S6
     S5   = C5 + r×S6
                        // 3 cycles waiting for S5
     S4   = C4 + r×S5
                        // 3 cycles waiting for S4
     S3   = C3 + r×S4
                        // 3 cycles waiting for S3
     S2   = C2 + r×S3
                        // 3 cycles waiting for S2
     S1   = C1 + r×S2
                        // 3 cycles waiting for S1
     p(r) = C0 + r×S1

Whereas Estrin's evaluation goes::

     r2   = r×r
     K6   = C6 + C7×r
     K4   = C4 + C5×r
     K2   = C2 + C3×r
     K0   = C0 + C1×r
     r4   = r2×r2
     K'4  = K4 + K6×r2
                        // waiting for K0
     K'0  = K0 + K2×r2
                        // waiting for K'0 and K'4
                        // waiting for K'0
                        // waiting for K'0
     p(r) = K'0 + K'4×r4

So there are many fewer wait cycles in Estrin's method than in Horner's.
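The cycle counts implied by the two listings can be checked with a toy scheduler. The model here is an assumption matching the listings above: one FMA issued per cycle, in program order, 4-cycle FMA latency, coefficients preloaded.

```python
FMA_LAT = 4  # assumed fused multiply-add latency, as in the listings

def finish_cycle(ops):
    """Toy in-order, one-issue-per-cycle scheduler. Each op is
    (dest, [source registers]); an op issues once the issue slot is
    free and all of its sources are ready, and completes FMA_LAT
    cycles later. Constant coefficients are omitted from the deps."""
    ready = {"r": 0}            # the argument r is available at cycle 0
    slot = 0
    for dest, srcs in ops:
        start = max([slot] + [ready[s] for s in srcs])
        ready[dest] = start + FMA_LAT
        slot = start + 1
    return max(ready.values())

# Horner: a single serial dependency chain of 7 FMAs.
horner_ops = [("S6", ["r"])] + [
    (f"S{i}", ["r", f"S{i + 1}"]) for i in range(5, 0, -1)
] + [("p", ["r", "S1"])]

# Estrin: the pairings and squarings are mutually independent.
estrin_ops = [
    ("r2", ["r"]), ("K6", ["r"]), ("K4", ["r"]),
    ("K2", ["r"]), ("K0", ["r"]), ("r4", ["r2"]),
    ("K'4", ["K4", "K6", "r2"]), ("K'0", ["K0", "K2", "r2"]),
    ("p", ["K'0", "K'4", "r4"]),
]

print(finish_cycle(horner_ops), finish_cycle(estrin_ops))  # 28 16
```

Under this model Horner's chain finishes at cycle 28 and Estrin's at cycle 16, matching the wait-cycle counts drawn in the two listings.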

Now, in comparison, working inside the FMAC unit, I can use a straight power series and get::

     r2 = r×r
     S0 = C0 + C1×r
     r3 = r×r2
     S1 = S0 + C2×r2
     r4 = r2×r2
     S2 = S1 + C3×r3
     r5 = r3×r2
     S3 = S2 + C4×r4
     r6 = r4×r2
     S4 = S3 + C5×r5
     r7 = r5×r2
     S5 = S4 + C6×r6
                        // waiting for S5
     p(r) = S5 + C7×r7

with only 1 wait state. As illustrated, this is much faster than Horner's method, and I can get this because the multiplier tree is fully pipelined.

Do you have a reference for the Virtual Vector method? A quick google only turns up ViVA from the IBM Cray supercomputers. If this is what you're talking about, I don't understand why this is a contradiction to what I have written above. The concern I have with speed doesn't have to do with the UI, but rather the sheer amount of data that must be read and  written by registers, and the amount of processing that must be done by the ALU.

Well, I invented it, and although I have spoken often about it on comp.arch, it appears little has reached the outside world. Would you like a *.pdf ?? Find my e-mail address(*) and I will send a copy.
(*) hint:: it is at the top of my responses right here...

Oscar Smith

Mar 5, 2023, 9:09:57 PM
to Unum Computing
Right, my point with Horner's method is that if you are, say, calculating 4 polynomials at the same time, it turns into

S6,1 = C6 + r1×C7
S6,2 = C6 + r2×C7
S6,3 = C6 + r3×C7
S6,4 = C6 + r4×C7
S5,1 = C5 + r1×S6,1
S5,2 = C5 + r2×S6,2
S5,3 = C5 + r3×S6,3
S5,4 = C5 + r4×S6,4
S4,1 = C4 + r1×S5,1
S4,2 = C4 + r2×S5,2
S4,3 = C4 + r3×S5,3
S4,4 = C4 + r4×S5,4
S3,1 = C3 + r1×S4,1
S3,2 = C3 + r2×S4,2
S3,3 = C3 + r3×S4,3
S3,4 = C3 + r4×S4,4
S2,1 = C2 + r1×S3,1
S2,2 = C2 + r2×S3,2
S2,3 = C2 + r3×S3,3
S2,4 = C2 + r4×S3,4
S1,1 = C1 + r1×S2,1
S1,2 = C1 + r2×S2,2
S1,3 = C1 + r3×S2,3
S1,4 = C1 + r4×S2,4

All the pipeline stalls disappear, and the result is roughly 2x faster than Estrin.

In less highly ordered computation the benefit of Horner is reduced, but given the out-of-order capabilities of modern chips, even if you don't explicitly write code that fills the pipeline stalls, there is a reasonable likelihood that something will fill them anyway.
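The interleaving above can be sketched as follows. In Python the benefit is only illustrative, since the point is hardware pipelining, but the transformation itself is visible: each round's four updates are mutually independent.

```python
def horner4(coeffs, rs):
    """Horner's rule at four points with the dependency chains
    interleaved, as in the listing above. The four updates in each
    round depend only on the previous round, so a pipelined FMA unit
    can issue one per cycle with no stalls between them."""
    s = [coeffs[-1]] * 4
    for c in reversed(coeffs[:-1]):
        s = [c + r * acc for r, acc in zip(rs, s)]
    return s

# Same quartic as earlier in the thread, evaluated at four points:
print(horner4([1.0, 2.0, 3.0, 4.0, 1.0], [1.0, 2.0, 3.0, 4.0]))
```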

Oscar Smith

Mar 5, 2023, 9:26:43 PM
to Unum Computing
That sentence in the spec does clear things up a lot, but I still think an FMA instruction should be added to section 5.7, as otherwise it isn't clear that it is an operation that can be expected to exist in posit arithmetic (and it can be implemented using quires on chips without dedicated FMA hardware).

Posits without FMA are definitely a lot better than floats without FMA, since quires make it relatively simple and fast to emulate an FMA instruction.

> those who have actually built quires and tested them have found them to be faster than fused multiply-add

The papers I've seen evaluating the speed of quires have done so in the context of dot products, where you don't have to deal with lots of posit-to-quire and quire-to-posit conversions (which are required for things like Horner's algorithm). The speed of a pToQ, qMulAdd, and qToP sequence will necessarily be slower than an FMA, since it computes the same thing and also has to keep the quire in a working state.
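The emulation path being compared can be sketched in a few lines. Here `Fraction` stands in for the quire's exact accumulation (an illustration, not the Standard's operations), and the example shows the one-rounding-vs-two difference that makes an FMA an FMA:

```python
from fractions import Fraction

def fma_via_quire(a, b, c):
    """FMA emulated the quire way: form the product exactly, add
    exactly, and round to the working format only once at the end.
    Fraction is a stand-in for the wide fixed-point quire."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

a = 2.0**27 + 1           # (2**27 + 1)**2 = 2**54 + 2**28 + 1,
c = -(2.0**54 + 2.0**28)  # which doesn't fit exactly in a float64

print(a * a + c)              # 0.0: the +1 is lost when a*a rounds
print(fma_via_quire(a, a, c)) # 1.0: the single rounding preserves it
```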

I don't see what you get from expressing Horner's rule as a lower triangular system of equations. The way you solve a bi-diagonal system is a chain of FMAs (i.e. Horner's algorithm). Also, the instability near roots isn't a major problem in the context of implementing special functions, since the implementer has the freedom to choose a method where you aren't evaluating the polynomial near numerically unstable roots.

MitchAlsup

Mar 6, 2023, 1:47:44 PM
to Unum Computing
What percentage of the time do you have more than 2 polynomials of similar length at the same time ??

Oscar Smith

Mar 6, 2023, 1:57:42 PM
to Unum Computing
This shows up in a number of cases. Some examples are Bessel functions, or evaluating rational functions (there you only have 2 polynomials, but the crossover between Horner and Estrin is roughly at 2).
Also, any time you need to evaluate a vectorized special function (e.g. `tanh.(v)` shows up all the time in neural nets), you can interleave 4 separate invocations of `tanh` to remove the data dependency.
The more important part, though, is that even if you don't, modern CPUs have very long reorder buffers (Intel and AMD are over 200 entries, and Apple's M1 is about 600). This means that all of the gaps you drew in the scalar case will be filled in as long as there is any work to be done in the next couple hundred instructions that isn't dependent on the polynomial (and you don't need perfect fill-in, since Estrin has more instructions than Horner).

MitchAlsup

Mar 6, 2023, 9:32:37 PM
to Unum Computing
On Monday, March 6, 2023 at 12:57:42 PM UTC-6 oscard...@gmail.com wrote:
This shows up in a number of cases. Some examples are Bessel functions, or evaluating rational functions (there you only have 2 polynomials, but the crossover between Horner and Estrin is roughly at 2).
Also, any time you need to evaluate a vectorized special function (e.g. `tanh.(v)` shows up all the time in neural nets), you can interleave 4 separate invocations of `tanh` to remove the data dependency.

Perhaps you should know I hold a patent on performing transcendental calculations in(side of) an FMAC unit,
and that ln2() takes 14 cycles, exp2() 16 cycles, and sin() and cos() take 19 cycles including Payne-Hanek
argument reduction! So these evaluations are not a series of instructions but a series of steps taking place
entirely within the FMAC unit, just like FDIV and SQRT are a series of steps inside the unit assigned to do
FDIVs and/or SQRTs. The latency-versus-instruction-count tradeoffs are very much different. Once you get down
to these cycle counts for a fairly large number of transcendentals (~40), much of your argument vanishes;
you end up comparing the latency of something akin to FDIV against a SW polynomial that, when executed
natively in a library that cares about accuracy, would be taking 150-odd cycles {maybe per SIMD lane}.

In a year or so you may be able to buy one of my machine implementations.
 
The more important part, though, is that even if you don't, modern CPUs have very long reorder buffers (Intel and AMD are over 200 entries, and Apple's M1 is about 600). This means that all of the gaps you drew in the scalar case will be filled in as long as there is any work to be done in the next couple hundred instructions that isn't dependent on the polynomial (and you don't need perfect fill-in, since Estrin has more instructions than Horner).

That is the theory! However, this is one of those places where theory works better in theory than in practice.

Athlon and Opteron had reorder buffers on the order of 40 instructions (x86 instructions) and were achieving
2/3rds of the cycle-by-cycle throughput of the modern ultra-deep ROBs, while the instruction decode width
has gone from 3 to 4 and the cache hierarchy has become more effective. {I was there at AMD in that time period...}

I think you are overplaying the SIMD hand and the deep ROB hand, but for your applications they just might be true.
Me, on the other hand, I will believe it when I see compilers/linkers/JITers/* performing at 3+ native-inst/cycle.
(*) I would even believe it if the OS were performing at 3 I/C levels...
<snip>