Getting Started with Implementation

Marcello Mariani

Feb 21, 2023, 4:45:10 AM
to Unum Computing
Hello,

I am getting started on an implementation for the Posit type. The main goal will be to support the type and basic arithmetic operations. I was wondering if there are any recommendations of papers or articles describing the technical functionality (bit level) for Posit numbers.

Thanks in advance!

MitchAlsup

Feb 21, 2023, 12:23:04 PM
to Unum Computing
On Tuesday, February 21, 2023 at 3:45:10 AM UTC-6 marcellom...@gmail.com wrote:
> Hello,
>
> I am getting started on an implementation for the Posit type. The main goal will be to support the type and basic arithmetic operations. I was wondering if there are any recommendations of papers or articles describing the technical functionality (bit level) for Posit numbers.

At a hand-waving accuracy level:

To a large first order, the insides of any floating-point unit are a series of parallel fixed-point operations {exponent and fraction}. To this end, much of the literature on IEEE 754 FP units is applicable, with a few additions and modifications:

There will be find-first circuits used to isolate the fraction from the "exponent" portion.
The multiplexers used to gate the fraction into its fixed-point calculation are necessarily different. 

The multiplexers used to compress regime+exponent into a binary exponent are necessarily different.
The multiplier array (tree) is going to be larger. 754 FMUL or FMAC can use a 53×53 multiplier; posit(64,2) appears to need a 60×60 multiplier,*

and there will be corresponding growth in the adder widths (53→60) for FADD and FMAC. The FADD unit will be delayed a few gates due to isolating the fraction from the container and aligning the fractions for add/subtract.

The divider needs to be scaled up in width, as does SQRT. Goldschmidt dividers will likely need 1 more iteration; SRT dividers will need 2 more cycles (when the regime has few bits), so the iteration count in an optimal divider will end up dependent on the width of the regime.

(*) With the multiplier tree already 60×60 (1.28× the size of the IEEE tree), for another minor increase in size to 64×64 one can get integer multiply and divide in the same FU as posit FMUL and FMAC. There are opportunities here for building a "do everything else" function unit that were not present in IEEE-land.

The exponent "path" in these function units will be a bit different due to the 'exponent' being represented as "regime+exponent" in the posit container. I suspect some clever designer will figure out a find-first circuit to convert regimes into binary so that most of the exponent path runs in biased binary. Then, in the round-normalize section, this gets converted back into regime||exponent. Binary to unary (regime) is straightforward; the exponent is likely just the lower 2 bits of the binary. The regime width will be used to shift the fraction (and exponent) to its (their) desired position.
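In software, the decode step described above can be sketched in a few lines. This is an illustrative Python sketch assuming the 2022 posit standard layout (es = 2); the function name and field conventions are mine, not a reference implementation:

```python
def decode_posit(bits: int, n: int = 16, es: int = 2):
    """Decode an n-bit posit into (sign, scale, fraction), i.e. convert
    regime+exponent into one binary exponent. The fraction comes back as an
    n-bit left-aligned field: value = (-1)^sign * (1 + frac/2^n) * 2^scale."""
    mask = (1 << n) - 1
    if bits == 0:
        return (0, 0, 0)                       # zero
    if bits == 1 << (n - 1):
        raise ValueError("NaR")                # not-a-real: no decode
    sign = bits >> (n - 1)
    if sign:                                   # posits are 2's complement:
        bits = (-bits) & mask                  # negate, then decode magnitude
    body = (bits << 1) & mask                  # drop the sign bit
    first = (body >> (n - 1)) & 1              # leading regime bit
    run = 1                                    # "find-first": length of the
    while run < n and ((body >> (n - 1 - run)) & 1) == first:
        run += 1                               # run of identical regime bits
    regime = run - 1 if first else -run
    rest = (body << (run + 1)) & mask          # skip regime + terminating bit
    exp = rest >> (n - es)                     # next es bits: exponent
    frac = (rest << es) & mask                 # remainder: fraction bits
    return sign, (regime << es) + exp, frac    # scale = 4*regime + exp
```

For example, `decode_posit(0x5000)` gives `(0, 2, 0)`, i.e. +2² = 4.0 in posit16.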

Rounding will be a few gates easier than IEEE, but not enough to worry about.

David Mallasén Quintana

Feb 22, 2023, 3:18:21 AM
to Unum Computing
Hello,

You might want to take a look at the posit standard if you haven't already (https://posithub.org/docs/posit_standard-2.pdf). Then, a couple of useful and more specific articles (biased towards the work we have been doing at my university, but you can also find many other references in the articles) could be the following:

Some parameterized posit units with their detailed description.
[1] R. Murillo, A. A. Del Barrio, and G. Botella, “Customized Posit Adders and Multipliers using the FloPoCo Core Generator,” in 2020 IEEE International Symposium on Circuits and Systems (ISCAS), Oct. 2020, pp. 1–5. doi: 10.1109/ISCAS45731.2020.9180771.

Some more advanced quire fused MAC units.
[2] R. Murillo, D. Mallasén, A. A. Del Barrio, and G. Botella, “Energy-Efficient MAC Units for Fused Posit Arithmetic,” in 2021 IEEE 39th International Conference on Computer Design (ICCD), Oct. 2021, pp. 138–145. doi: 10.1109/ICCD53106.2021.00032.

A newer, more efficient way of decoding posit numbers.
[3] R. Murillo, D. Mallasén, A. A. Del Barrio, and G. Botella, “Comparing Different Decodings for Posit Arithmetic,” in Next Generation Arithmetic, vol. 13253, J. Gustafson and V. Dimitrov, Eds. Cham: Springer International Publishing, 2022, pp. 84–99. doi: 10.1007/978-3-031-09779-9_6.

Integrating all of these posit and quire units into a RISC-V core capable of booting Linux.
[4] D. Mallasén, R. Murillo, A. A. D. Barrio, G. Botella, L. Piñuel, and M. Prieto-Matias, “PERCIVAL: Open-Source Posit RISC-V Core With Quire Capability,” IEEE Transactions on Emerging Topics in Computing, vol. 10, no. 3, pp. 1241–1252, 2022, doi: 10.1109/TETC.2022.3187199.

Best regards,
David

Marcello Mariani

Feb 22, 2023, 5:26:11 AM
to Unum Computing
This is very helpful. Thanks a lot!

John Gustafson

Feb 22, 2023, 7:26:43 AM
to David Mallasén Quintana, Unum Computing
David, you beat me to it. I highly recommend any paper on hardware implementation that has the name R. Murillo on it. He was the first hardware designer (after Isaac Yonemoto, that is) to see how to exploit the 2's complement nature of posit implementation, and his designs are efficient and getting more so. He has another paper at the CoNGA ’23 conference, which happens in a few days, March 1–2 in Singapore.

The earlier designs took the absolute value (noting the sign), then decoded the 2^n scale factor and the fraction, and applied conventional floating-point algorithms. That approach makes posit arithmetic always take more work than float arithmetic (on normal floats, at least), obviously.

Some additional thoughts:

The Count-Leading-Zeros and Count-Leading-Ones operations are critical for decoding the regime bits, obviously, and there are a number of shortcuts for doing them. The original log-cost circuit was designed by Vojin Oklobdzija (a very high Scrabble® score, that name) in the early 1990s. But circuits that quickly add two regimes (for multiplication of two posits) haven't been designed yet, as far as I know. It should be possible to catenate or cancel the strings of 0 or 1 bits directly, without first turning them into some kind of positional-notation integer.
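In software, the log-cost structure is just a binary search on the leading bits. A Python sketch for illustration, returning 64 for an all-zero 64-bit input:

```python
MASK64 = (1 << 64) - 1

def clz64(x: int) -> int:
    """Count leading zeros of a 64-bit value by binary search -- the same
    divide-and-conquer shape as the log-cost circuit. Returns 64 for x == 0."""
    if x == 0:
        return 64
    n = 0
    for shift in (32, 16, 8, 4, 2, 1):
        if (x >> (64 - shift)) == 0:       # top `shift` bits all zero?
            n += shift                     # count them, and
            x = (x << shift) & MASK64      # slide the window left
    return n
```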

A critical design decision is the extent to which you need constant-time operation. It's better from a RISC standpoint if all your operations take a single clock, but it's better for arithmetic performance if you instead allow every operation to go as fast as it can. Half of all posit bit patterns have either 01 or 10 for the regime bits and can be decoded as quickly as a float, since you know where all the bits are and can decode them in parallel. You could decode speculatively under the assumption that there are only two regime bits while testing, in parallel, whether that's true, interrupting the calculation to count the regime bits for the other half of the cases. And although the fast path covers only half of the bit patterns, it is taken much more often than that, because the more commonly used numbers fall in that center range of magnitudes.
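The speculation test itself is cheap. An illustrative Python sketch (the helper name is mine) that checks whether the regime is the minimal two-bit pattern:

```python
def fast_path(bits: int, n: int = 16) -> bool:
    """After taking the absolute value, the regime is the minimal two-bit
    '01' or '10' pattern exactly when the two bits below the sign differ,
    so every field sits at a known position and can be decoded in parallel."""
    mask = (1 << n) - 1
    if bits >> (n - 1):                  # negative: 2's-complement negate
        bits = (-bits) & mask
    top2 = (bits >> (n - 3)) & 3         # the two bits after the sign bit
    return top2 in (1, 2)                # '01' or '10': two-bit regime
```

For standard 16-bit posits (es = 2) this fast path covers exactly the magnitudes from 1/16 to 16, which is why it fires for far more than half of the values seen in practice.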

Notice that IEEE 754 floats only have constant-time operation for normal operands. If anything is exceptional (subnormals, infinities, NaNs) the operation traps to microcode or software and can take 50 to 200 clock cycles to process. So in a way, designers have already accepted variable-time execution for arithmetic on real values. Having hardware that takes far fewer clocks when multiplying by 1 or 0, or adding 0, or passing through a NaN (NaR) value seems like a performance win to me even if it tends to mess up the pipeline.

One thing I don't think anyone has tried for multiplication is multiplying the fraction bits, right to left, while decoding the regime bits from left to right (and adding the two regime-exponent pairs). That should have a lower worst-case latency because the more work there is to decode the regime, the less work there is to multiply the significands and vice versa.

If you write up a comparison of the cost of a posit arithmetic unit with a float arithmetic unit for same-precision values please

1) Note that there is no need for comparison instructions on the posit arithmetic unit since they are the same as the integer comparison instructions, whereas the float unit will have a rather complicated job to do if it is at all IEEE-compliant.

2) If the float unit does not handle exception cases in hardware (IBM's POWER seems to be the only CPU left standing that does so), include the cost of the microcode for exception handling, and the time required for an exception operation.

That will lead to much fairer comparison of posits and floats.
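Point 1 is easy to demonstrate: posit comparison is ordinary signed-integer comparison of the raw bit patterns. A Python illustration (the helper name is mine; NaR, the most negative pattern, sorts below everything):

```python
def posit_lt(a: int, b: int, n: int = 16) -> bool:
    """Compare two n-bit posit bit patterns as 2's-complement integers --
    exactly what an integer compare instruction already does."""
    def signed(x: int) -> int:           # reinterpret n-bit pattern as signed
        return x - (1 << n) if x >> (n - 1) else x
    return signed(a) < signed(b)
```

For example, `posit_lt(0xC000, 0x4000)` is true (−1 < 1 in posit16), with no decode of regime, exponent, or fraction needed.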

One last suggestion: include two registers that retain the smallest-magnitude and largest-magnitude posits encountered during a program (or sub-program) execution, resettable by the compiler. These can then replace underflow and overflow warnings to the user. Hitting minPos or maxPos in magnitude somewhere during execution means a severe loss of accuracy, and it would be incredibly useful to have the option of reporting such incidents to programmers and users.

Best,
John


--
You received this message because you are subscribed to the Google Groups "Unum Computing" group.
To unsubscribe from this group and stop receiving emails from it, send an email to unum-computin...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/unum-computing/b8aab891-9d10-4886-8a1b-fd11f47572a7n%40googlegroups.com.

Marcello Mariani

Feb 22, 2023, 8:12:56 AM
to Unum Computing
Hey, thanks everyone for the help and orientation.

A quick follow-up question. The implementation will be a software-only library, similar to SoftPosit. Can this be achieved on existing hardware, even if it means not taking advantage of the possible performance improvements? I ask because the programming language does not support floating-point numbers natively, and the goal is to experiment with possible alternatives.

Regards,
Marcello

Oscar Smith

Feb 22, 2023, 8:59:24 AM
to Unum Computing
> Notice that IEEE 754 floats only have constant-time operation for normal operands. If anything is exceptional (subnormals, infinities, NaNs) the operation traps to microcode or software and can take 50 to 200 clock cycles to process.

This is only correct for Intel. On AMD and most ARM chips, subnormals and the other exceptional cases take the same amount of time as regular numbers.

On an AMD EPYC 7513:
```
julia> using BenchmarkTools
julia> x1 = rand(1000); x2 = copy(x1);

julia> x2[rand(1:1000, 300)] .= nextfloat(0.)
julia> @benchmark sum(x1)
BenchmarkTools.Trial: 10000 samples with 963 evaluations.
 Range (min … max):  84.350 ns …  3.969 μs  ┊ GC (min … max): 0.00% … 97.48%
 Time  (median):     85.409 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   87.541 ns ± 49.800 ns  ┊ GC (mean ± σ):  0.57% ±  1.54%

    ▄█▅                                                        
  ▂▅████▅▃▂▂▂▂▂▂▂▂▂▃▃▃▃▃▃▅▆▅▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂ ▃
  84.3 ns         Histogram: frequency by time        97.6 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.

julia> @benchmark sum(x2)
BenchmarkTools.Trial: 10000 samples with 963 evaluations.
 Range (min … max):  84.319 ns …  5.595 μs  ┊ GC (min … max): 0.00% … 98.18%
 Time  (median):     92.263 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   93.339 ns ± 78.632 ns  ┊ GC (mean ± σ):  0.76% ±  1.58%

                        ▂▂    ▁█▅                              
  ▃▅▃▃▃▄▄▄▄▄▄▄▅▄▄▅▅▅▅▆▇▇███▇▇▆████▆▄▄▄▄▄▃▃▂▂▃▅▄▂▂▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  84.3 ns         Histogram: frequency by time         103 ns <

 Memory estimate: 16 bytes, allocs estimate: 1.
```

John Gustafson

Feb 22, 2023, 9:42:26 AM
to Marcello Mariani, Unum Computing
Marcello,

It absolutely can be achieved in software, and be surprisingly fast. Cerlane and I were amazed that SoftPosit could do plus-minus-times operations in 20–30 clock cycles on a 2016-era Intel x86… I don't remember the exact processor, but I think it was an i7 multicore with 44 cores. So it gets around 100 megapops per core at 3 GHz, and over 4 gigapops for the entire chip. We were also surprised that 64-bit integer operations are now faster than 32-bit or 16-bit integer operations on x86; they apparently default to 64-bit integer ops and have to do extra work to trim them down for smaller sizes. This is when everything is register-register operations, of course, which is by far the most common type of data reference in SoftPosit. And get this… 64-bit integer multiplication completes in one clock cycle!

The last time I checked, Count Leading Zeros runs in 1 clock cycle on Intel, but on AMD it used to take 3 clock cycles. That was before Epyc hit the market, so they may have sped it up. There is a subtle incompatibility between Intel and AMD when doing CLZ on a 64-bit integer that is all zero bits; I think Intel returns 0, and AMD (correctly) returns 64. It's as if the Intel designers only allowed six bits to store the count of leading zeros, so an all-zero-bit value overflows that counter. So watch out! I'm not sure how portable you are trying to make your library.

John

John Gustafson

Feb 22, 2023, 10:05:40 AM
to Oscar Smith, Unum Computing
Oscar,

You may be right that AMD has now fixed this. I'm attaching a paper from UC San Diego researchers on subnormal floats and abnormal timing, and how that can be exploited for side-channel attacks; it shows AMD and ARM as guilty as Intel of sweeping exception handling under the rug when talking about floating-point performance.

The issue may be whether SSE is used or not, and whether a processor mode switch has been set. Because the traps to microcode reduce performance, compilers typically have a "noieee" switch to turn off the more expensive parts of IEEE 754 rules and behave like everything is a normal float. This is a particularly common practice when doing vector operations like the SSE or on GPU architectures, since you need SIMD similarity in what every functional unit is doing.

Notice that it is also common for scalar float instructions on x86 to use the x87 instructions with 80-bit stack registers, whereas the SSE uses 32-bit or 64-bit operands. This can also introduce differences both in the timing and in the result, and is one more reason why IEEE 754 compliance does not result in bitwise identical results on different computer systems.

An interesting experiment is to try squaring x = 2.6469783e–23, then taking the square root of the result, using 32-bit IEEE 754 floats. (The bit pattern of x is hexadecimal 0x1A000001.) If subnormals are getting flushed to zero, as almost all GPUs do, you'll get 0 back. If the subnormal you get when you square is handled correctly, the square root will come back as √(x²) = 3.7433921e–23, a relative error of 41%, and there should be some report that the calculation suffered underflow. I'm guessing most computing environments won't say anything about underflow; underflow reporting is often turned off because it's usually harmless and it clutters the output (and slows execution).
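The experiment is easy to run from any language with raw-bit access; here it is in Python (CPython keeps subnormals, so this shows the non-flushed outcome; on a flush-to-zero environment `sq` would come back 0.0):

```python
import struct

def f32(v: float) -> float:
    """Round a Python double through IEEE 754 binary32."""
    return struct.unpack('<f', struct.pack('<f', v))[0]

x = struct.unpack('<f', struct.pack('<I', 0x1A000001))[0]   # 2.6469783e-23
sq = f32(x * x)        # the smallest subnormal, ~1.4e-45 (0.0 if flushed)
root = f32(sq ** 0.5)  # 3.7433921e-23: a 41% relative error vs. x
print(x, sq, root)
```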

Even though IEEE 754 says dividing a positive number by (signed) zero should yield (signed) infinity, I suspect you'll see a significantly different execution time when dividing by zero. Try it.

Thanks for the info about EPYC. I hope you're right, but I'm not sure a Julia benchmark is sufficiently low-level to expose the timing differences.

Best,
John

P.S. The paper apparently was updated in 2017: https://cseweb.ucsd.edu/~dkohlbre/papers/subnormal.pdf. I don't know if that's the same as the PDF attached here.

subnormal.pdf

Oscar Smith

Feb 22, 2023, 11:47:43 AM
to Unum Computing
Thanks for the reference! I'm able to reproduce the `sum` results with
```
function mysum(xs)
    s=0.0
    for x in xs
        s += x
    end
    s
end
```
Which yields native code
```
julia> @code_native syntax=:intel debuginfo=:none mysum(x1)
    .text
    .file    "mysum"
    .globl    julia_mysum_730                 # -- Begin function julia_mysum_730
    .p2align    4, 0x90
    .type    julia_mysum_730,@function
julia_mysum_730:                        # @julia_mysum_730
    .cfi_startproc
# %bb.0:                                # %top
    #APP
    mov    rax, qword ptr fs:[0]
    #NO_APP
    mov    rax, qword ptr [rax - 8]
    mov    rax, qword ptr [rax + 16]
    mov    rax, qword ptr [rax + 16]
    #MEMBARRIER
    mov    rax, qword ptr [rax]
    #MEMBARRIER
    mov    rax, qword ptr [rdi + 8]
    test    rax, rax
    je    .LBB0_1
# %bb.2:                                # %L17
    mov    rcx, qword ptr [rdi]
    vxorpd    xmm0, xmm0, xmm0
    vaddsd    xmm0, xmm0, qword ptr [rcx]
    cmp    rax, 1
    je    .LBB0_5
# %bb.3:                                # %L35.preheader
    mov    edx, 1
    .p2align    4, 0x90
.LBB0_4:                                # %L35
                                        # =>This Inner Loop Header: Depth=1
    vaddsd    xmm0, xmm0, qword ptr [rcx + 8*rdx]
    inc    rdx
    cmp    rax, rdx
    jne    .LBB0_4
.LBB0_5:                                # %L41
    ret
.LBB0_1:
    vxorps    xmm0, xmm0, xmm0
    ret
.Lfunc_end0:
    .size    julia_mysum_730, .Lfunc_end0-julia_mysum_730
    .cfi_endproc
                                        # -- End function
    .section    ".note.GNU-stack","",@progbits
```
(Ignore the preamble at the top, which is just a GC safepoint.)

I'm also unable to see any performance difference when dividing by zero on the same CPU.

Marcello Mariani

Feb 28, 2023, 4:38:35 AM
to Unum Computing
Are there standardised methods for binary posit arithmetic? I have not been able to find any specifications about this as of now. Thanks in advance.

John Gustafson

Mar 4, 2023, 8:31:36 PM
to Marcello Mariani, Unum Computing
Hi Marcello,

The CoNGA’23 conference held in Singapore is now over and I'm back in the US and catching up with unanswered email. I see that I never responded to this, and it looks like no one else in the Unum Computing email group did, either:


On Feb 28, 2023, at 2:38 AM, Marcello Mariani <marcellom...@gmail.com> wrote:

> Are there standardised methods for binary posit arithmetic? I have not been able to find any specifications about this as of now. Thanks in advance.

I'm not sure I understand your question. There are many papers on methods for performing posit arithmetic both in hardware and software. The best hardware implementation I've seen so far is PERCIVAL, from Universidad Complutense de Madrid. It's open-source. Here's a link:

They seem to be a very capable team; I think Raul Murillo, the second author, may be the one we should credit with finally exploiting the 2's complement nature of posits and the simplifications that follow from it.

When you get to the part about supporting the quire, don't reinvent the wheel in trying to make exact dot products fast. Kulisch wrote papers about how to do that going all the way back to the 1970s, and actually built hardware to do it as well. The trick of using multiple integer adders was rediscovered by Berkeley people on the ASPIRE project (Koenig and Biancolin), and like Kulisch, they found that exact dot-product accumulation can be as much as six times faster than doing it with the radix-point alignment, leading-zero count, and round-to-nearest-ties-to-even hardware you need for a conventional fused multiply-add operation.
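In software, the multiple-integer-adders idea reduces to accumulating every product in one wide fixed-point integer and rounding once. A toy Python sketch (not Kulisch's or the ASPIRE hardware; `FRAC = 200` is an arbitrary demo width, whereas a real quire is sized from the posit format):

```python
def exact_dot(terms):
    """Accumulate exact products of (m, e) pairs (meaning m * 2^e) in one
    wide fixed-point accumulator -- a Python int here, banks of integer
    adders in hardware -- and round only once, at the end."""
    FRAC = 200                              # demo fraction width
    acc = 0
    for (mx, ex), (my, ey) in terms:
        shift = FRAC + ex + ey
        assert shift >= 0, "term below the accumulator's range"
        acc += (mx * my) << shift           # exact: integers only
    return acc / float(1 << FRAC)           # the single rounding step

# 2^60 + 1 - 2^60: ordinary float64 summation loses the 1; the wide
# accumulator keeps it.
terms = [((1, 60), (1, 0)), ((1, 0), (1, 0)), ((-1, 60), (1, 0))]
print(exact_dot(terms))                     # 1.0
print((2.0**60 + 1.0) - 2.0**60)            # 0.0
```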

A frontier for improvement is the way two regime-exponent pairs are added when multiplying two posits. I think designers are still decoding the regime-exponent into a signed (positional-notation) integer, and not thinking about ways that regimes might add by concatenation (if of the same sign) or cancellation (if of opposite sign), using the fact that the exponent is always two bits long. Not converting to positional notation would save the time (and logic gates) needed to re-encode the regime-exponent form of the scale factor. It's one of those things we will struggle to do the first time, but once we see the way to do it, it will seem easy and obvious from then on.

Best,
John

P.S. I'm writing from my academic address this time, jlgu...@asu.edu, which does not have a spam filter as far as I can tell. That should make it easier to write back to me for anyone who wants to comment.

Marcello Mariani

Jun 26, 2023, 5:25:27 AM
to Unum Computing
Hello,

A follow-up on this thread: is it possible to do multiple posit additions and only pack and round the result at the end (once)? If so, has it been done before?

Thanks in advance

Theodore Omtzigt

Jun 26, 2023, 8:06:40 AM
to Unum Computing
Yes: use the quire.

John Gustafson

Jun 26, 2023, 4:33:26 PM
to Marcello Mariani, Unum Computing
Marcello,

Yes, multiple posit additions can be done in the quire, a wide fixed-point register, with no rounding error until you convert the quire value back into posit form. This restores the associative property of addition, for situations where that is desirable. 

For 8-bit (standard) posits, the quire is 128 bits long, so you can sum up to about 3.6e16 numbers with no possibility of overflow (and they would all have to be the maximum posit value, 2^24, to overflow even then).

For 16-bit posits, the quire is 256 bits long, and you can sum up to about 1.5e26 numbers with no possibility of overflow. You might be able to hit that maximum with the fastest existing supercomputers and a lot of patience, but otherwise you're safe to consider overflow impossible.

For 32-bit posits, you'd have to add the largest posit to a sum over 2.9e45 times before there is any way to overflow. Now we're talking about numbers you can't reach even if every atom on Earth were also a computer. The quire is 512 bits long, the same size as a cache line on many microprocessors.
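Those capacity figures can be sanity-checked with a little arithmetic. Assuming the standard layout (quire width 16n, with 8n - 16 fraction bits to hold minPos squared, one sign bit, hence 8n + 15 integer bits, and maxPos = 2^(4n-8)), the number of maxPos additions before overflow is 2^(8n+15) / 2^(4n-8) = 2^(4n+23):

```python
# capacity (in additions of maxPos) of the standard n-bit posit quire
for n, quoted in ((8, 3.6e16), (16, 1.5e26), (32, 2.9e45)):
    capacity = 2 ** (4 * n + 23)        # 2^(8n+15) / 2^(4n-8)
    assert abs(capacity / quoted - 1) < 0.05   # matches the figures above
    print(n, capacity)
```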

John G.


Troy Benjegerdes

Jun 29, 2023, 10:57:12 PM
to Unum Computing
Are there any implementations or Linux kernel patches to support quire save/restore on a context switch? Same question for glibc or uclibc setjmp/longjmp...

MitchAlsup

Jun 30, 2023, 3:44:48 PM
to Unum Computing
On Thursday, June 29, 2023 at 9:57:12 PM UTC-5 Troy Benjegerdes wrote:
> Are there any implementations or Linux kernel patches to support quire save/restore on a context switch? Same question for glibc or uclibc setjmp/longjmp...

I would argue that one should not be using a quire when setjmp/longjmp is in the control-flow model {including try-throw-catch}.

Oscar Smith

Jun 30, 2023, 3:54:14 PM
to Unum Computing
Doesn't that mean that pretty much no programming languages should be allowed to use a quire?

Theodore Omtzigt

Jun 30, 2023, 6:01:17 PM
to Oscar Smith, Unum Computing
All functional languages and hybrid imperative/functional languages would do fine. Almost all of DL consists of functional graphs, in particular the ONNX model definition.



John Gustafson

Jun 30, 2023, 9:31:34 PM
to Oscar Smith, Unum Computing
There are many issues with adding posits to conventional languages. This is a major project in itself, perhaps the biggest single obstacle to posit adoption once we get past high-speed hardware availability. The I/O formatting needs attention. And to pick just one example, suppose floats and posits coexist in a language the way floats and integers do now; if f is a 32-bit float and p is a 32-bit posit, and you multiply them and store the result in a float, what should the rule be for how the math and exception handling get done? If you think the answer is simple or obvious, you haven't thought it through!

And that doesn't even begin to address the introduction of the quire as a data type.

The Posit Standard (2022) was careful to avoid dictating anything about computer languages, and restrict itself to statements about how the format behaves. The Working Group bumped up against the language issues a number of times, and realized it was far beyond the scope of a format standard.

Just doing it for C sounds like a pretty good PhD dissertation idea to me…

John
