Hi John and all,

I've been really interested in Posits for a while, as I believe that they solve a lot of significant problems with IEEE numbers. However, in my opinion, the lack of a fused multiply-add (FMA) operation in the Posit standard is a major omission that makes them less viable as a format for writing efficient algorithms.

Arguments in favor of FMA:

1. FMA can be efficiently implemented in hardware but not in software. Quires help a lot with FMA emulation but are likely to be a fairly slow replacement (see point 5 below).
2. Polynomial evaluation computing x*(x*(x*(x+d)+c)+b)+a (e.g. Horner's method for polynomials) is 2x faster and more accurate (0.5 vs. 1.5 ULP) with FMA. This is especially important if we want future CPUs to be Posit-only, since pretty much every elementary function needs accurate and fast polynomial evaluation.
3. Convolution evaluation.
4. The fMM function is already a required 3-argument function, so any Posit instruction set already has to deal with functions of 3 arguments.
5. Quires don't work well with vectorization.
SIMD and vector instruction sets are vital on modern computers for achieving high performance, but vectorized quire instructions will inherently be dramatically slower due to the increased amount of data needing to be transferred. For example, a 512-bit-wide SIMD FMA instruction on 16-bit floats requires reading 512*3 = 1536 bits and writing 512 bits (and requires 1536 bits of register space). To perform an equivalent operation with Posits currently requires 3 instructions: pToQ, qMulAdd, and qToP. Since the 32 Posit16 quires require 8192 bits, and this operation requires two reads and writes of all of them, it will currently require 9728 bits of register space (8192 for the quires and 1536 for the Posits), and read and write over 16 kilobits. As such, the quire-based solution requires over 10x as much data transfer and over 6x as much register space.
As such, we are likely to see at least one of three things:

1. No vectorized quire instructions (making the operations 32x slower).
2. Limited vector width.
3. Dramatically lower performance.

All in all, I understand that FMA isn't as essential for Posits, but I think that not requiring it in the spec will do a significant disservice to implementers and users of Posit-focused machines.
> I've been really interested in Posits for a while as I believe that they solve a lot of significant problems with IEEE numbers. However, in my opinion, the lack of a fused multiply-add operation in the Posit standard is a major omission that makes them less viable as a format for writing efficient algorithms.
Fused expressions (such as fused multiply-add and fused multiply-subtract) need not be performed using quire representations to be posit compliant.
> Arguments in favor of FMA:
>
> 1. FMA can be efficiently implemented in hardware but not in software. Quires help a lot with FMA emulation but are likely to be a fairly slow replacement (see point 5 below).
> 2. Polynomial evaluation computing x*(x*(x*(x+d)+c)+b)+a (e.g. Horner's method for polynomials) is 2x faster and more accurate (0.5 vs. 1.5 ULP) with FMA. This is especially important if we want future CPUs to be Posit-only, since pretty much every elementary function needs accurate and fast polynomial evaluation.
Horner's method is definitely one of them. Estrin's will be faster in some cases (mainly when you have nothing else for the CPU/GPU to do), but Horner's method is slightly more accurate and uses strictly fewer operations, which leads to higher speed as long as there is enough other work to be done at the same time. With the amount of register renaming and operation reordering modern CPUs do, Horner's will often win out (or the difference will be in the noise). All of this is a tangent anyway, since Estrin's algorithm also requires FMA operations. Your patent is interesting, but since it's been ~25 years and I've yet to see an implementation of it, it doesn't seem especially relevant.
For a degree-7 polynomial p(r), Horner's method goes:
S6 = C6 + r×C7
S5 = C5 + r×S6
S4 = C4 + r×S5
S3 = C3 + r×S4
S2 = C2 + r×S3
S1 = C1 + r×S2
p(r) = C0 + r×S1
And taking data dependencies into account, it goes:
S6 = C6 + r×C7
// waiting for S6
// waiting for S6
// waiting for S6
S5 = C5 + r×S6
// waiting for S5
// waiting for S5
// waiting for S5
S4 = C4 + r×S5
// waiting for S4
// waiting for S4
// waiting for S4
S3 = C3 + r×S4
// waiting for S3
// waiting for S3
// waiting for S3
S2 = C2 + r×S3
// waiting for S2
// waiting for S2
// waiting for S2
S1 = C1 + r×S2
// waiting for S1
// waiting for S1
// waiting for S1
p(r) = C0 + r×S1

While Estrin's method goes:
r2 = r×r
K6 = C6 + C7×r
K4 = C4 + C5×r
K2 = C2 + C3×r
K0 = C0 + C1×r
r4 = r2×r2
K'4 = K4 + K6×r2
// waiting for K0
K'0 = K0 + K2×r2
// waiting for K'0 and K'4
// waiting for K'0
// waiting for K'0
p(r) = K'0 + K'4×r4

A third variant, accumulating the terms directly with precomputed powers of r, goes:

r2 = r×r
S0 = C0 + C1×r
r3 = r×r2
S1 = S0 + C2×r2
r4 = r2×r2
S2 = S1 + C3×r3
r5 = r3×r2
S3 = S2 + C4×r4
r6 = r4×r2
S4 = S3 + C5×r5
r7 = r5×r2
S5 = S4 + C6×r6
// wait for S5
p(r) = S5 + C7×r7

Do you have a reference for the Virtual Vector method? A quick Google only turns up ViVA from the IBM Cray supercomputers. If this is what you're talking about, I don't understand why this is a contradiction to what I have written above. The concern I have with speed doesn't have to do with the UI, but rather the sheer amount of data that must be read from and written to registers, and the amount of processing that must be done by the ALU.
This shows up in a number of cases. Some examples are Bessel functions, or evaluating rational polynomials (there you only have 2 polynomials, but the crossover between Horner and Estrin is roughly at 2).
Also, any time you need to evaluate a vectorized special function (e.g. `tanh.(v)`, which shows up all the time in neural nets), you can interleave 4 separate invocations of `tanh` to remove the data dependency.
The more important part, though, is that even if you don't, modern CPUs have very long reorder buffers (Intel's and AMD's are over 200 entries, and Apple's M1 is around 600). This means that all of the gaps you drew in the scalar case will be filled in as long as there is any work to be done in the next couple hundred instructions that isn't dependent on the polynomial (and you don't need perfect fill-in, since Estrin has more instructions than Horner).