fast-posit: A correct, clean, and fast software implementation of posits (the fastest?)

André

Oct 29, 2025, 7:52:08 PM
to unum-co...@googlegroups.com
Hi! I've been following the development of posit arithmetic on and off for the past 2 years or so. A couple months ago I decided to try my hand at writing a software implementation. I believe it might be an interesting contribution to the posit ecosystem, for a few different reasons.

`fast-posit` is a software implementation of posit arithmetic, written in Rust (bindings to C are coming soon). You can find the source code and README at https://github.com/andrepd/posit-rust. It was written with three goals in mind:

  • Correctness: all functions are correct as defined in the standard (i.e. they give the correct results, for all inputs, with the correct rounding). 
  • Performance: this library aims to be faster than, or at least as fast as, any freely available software implementation of posits.
  • Readability: notwithstanding that a fast implementation is necessarily quite byzantine and difficult to understand at first glance, the code tries to be well structured and _extensively_ documented.

I believe it achieves all three. First of all, its correctness is extensively tested via checking against an oracle (which uses arbitrary precision arithmetic to calculate the correct unrounded result), exhaustively where possible, and probabilistically where we cannot enumerate all inputs. Furthermore, there is only a single implementation for each function, parametrised by the values of N and ES, rather than a separate one for each N; this gives us more confidence that the exhaustive test coverage for the smaller sizes transfers to the instantiations for the larger sizes.

Also, it's very fast! Comparing it to Cerlane Leong's SoftPosit (the most mature software impl as I understand it), it's ~2–10 times faster, depending on the operation! Some of the tricks I used to make it go fast may or may not be already known; I confess I did not do much prior lit review besides reading some of the code for Cerlane's library. Benchmarks are included and are easy to run, so feel free to give it a test on your machines.

Finally, it's easy to use (the README includes example code that showcases its usage, please do give it a read), and it's well documented, both the user-facing API as well as the implementation details. The user-facing API documentation is live here: https://docs.rs/fast_posit.

While being very fast, I believe the implementations are still reasonably understandable for anyone who is interested. I took care to document all the non-trivial algorithms (decode a posit, add two posits, round quire to posit, etc.) in great detail and, as mentioned above, there is only one implementation for each function, which is parametrised at compile-time on the size N and exponent size ES of the posit type. See e.g. decode.rs. My hope is that if you are interested in learning more about posits, or about software implementation of floating point formats in general, you might benefit from reading through this code!

-----

This library allows the user to define posit types with arbitrary N and ES (up to 128); the standard p8, p16, p32, p64 are already defined. Rust allows operator overloading, so you can write code like `a + b.ceil() < -c`, just as you would with native types. Conversions to/from signed ints and IEEE floats are included. Conversions between posit types are too, duly optimised when the ES are the same (as is the case with the standard posits with ES=2). Quire arithmetic is there, but I have ideas on how to optimise it (the current implementation is quite ugly and kinda slow).
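
To give a flavour of usage, here's a minimal sketch (the exact conversion spelling is illustrative; see the README and docs.rs for complete, up-to-date examples):

    use fast_posit::p32; // the standard 32-bit posit (ES = 2)

    fn main() {
        // Conversion method names here are illustrative.
        let a = p32::from(1.5_f32);  // from an IEEE float
        let b = p32::from(0.25_f32);
        let c = -(a + b);            // overloaded arithmetic and negation
        assert!(a * b > c);          // overloaded comparisons
    }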

It is not yet feature complete as of v0.1.3: the main things missing are the elementary functions (apart from + - × ÷). There's a checklist at the bottom of the README detailing exactly what is left to be done.

Anyways, thanks for reading, and any feedback would be greatly appreciated! I'd be happy to answer any questions you might have.

For those who aren't familiar with Rust, don't be discouraged! There's a "tour" of the main features in the README, and I'm confident that someone who hasn't read a line of Rust can still pretty much understand everything. C bindings are on the to-do list, so you will be able to call into this fast implementation from C, C++, or any language that can do FFI into a C API.

Diego Masotti

Nov 3, 2025, 8:25:58 AM
to Unum Computing
Hi,

Congratulations on the library. Some time ago I also implemented a posit library. It's written in Zig; however, I didn't publicize it because the language lacks maturity.
It implements the 'simple' operations (+ - * /, quire, sqrt, mul_add, fused_multiply); however, I never got around to implementing the transcendental functions in a satisfactory way.
I tried some iterative algorithms like CORDIC and BKM, but I'm not sure they converge correctly for all values.
Are you up for collaborating on implementing these features? And for some friendly performance competition 😉?

Btw, for correctness I tested against Cerlane's SoftPosit and MPFR. However, a helping hand from some smart mathematician here would surely be appreciated.

Regards,

Diego

Theodore Omtzigt

Nov 3, 2025, 9:03:33 AM
to Unum Computing
The Universal Number Systems library (https://github.com/stillwater-sc/universal) uses an inductive proof mechanism to validate arithmetic operations across thousands of number systems. This, of course, only works for parameterized types; you couldn't do this in C without undue effort.

The Universal library now also includes arbitrary precision and adaptive precision number systems, such as those by Priest and Shewchuk. This means that if you need to represent intermediate values that do not fit in native types, such as posit<32,2>, you have a mechanism to solve that problem.

For the Priest number systems, which use a composite pattern around floatcascade<>, the regression tests use corner case enumeration instead. 

There is also a random testing framework, but we all know that random testing is not effective for arithmetic validation; the randoms are there more as a code-quality check for the CI.

Diego Masotti

Nov 3, 2025, 9:15:47 AM
to Unum Computing
Thanks for the pointers.

What do you mean by inductive proof mechanism? Do you test every possible combination, or something more advanced? (Sorry, I'm really unfamiliar with the field.)

As for the transcendental functions, for example, if you want to compute the atan2 of a posit<32,2>, you convert it to Priest (or Shewchuk), use its facilities and then convert back?

Theodore Omtzigt

Nov 3, 2025, 9:27:58 AM
to Unum Computing
When you design parameterized implementations, the core problem you encounter is that you are relocating the corner cases: corner cases mostly in terms of how to deal with bits and where they fall in the different limbs of the representation. So, on the one hand, you get this inductive proof mechanism in which you can validate nbits=n and then inductively prove that nbits=n+1 works. However, as we discovered with the classic floats and fixed-points, you actually need to run the induction mechanism, as it needs to cover these shifting corner cases.

Universal is designed for mixed-precision algorithm design and optimization, with an eye toward utilizing hardware acceleration to deliver the compute. This creates the opportunity to use a general-purpose CPU that doesn't support the type, such as posits, to still work with posits and prepare data structures that have the correct bit-encoding for the hardware accelerator to process natively. The reverse works as well; the accelerator can produce a posit tensor and dump it in main memory, and the general-purpose CPU can serialize it and introspect the bits for testing and validation. However, for all of that to work, you need the ability to specify the address alignment, which drives the limb architecture of these implementations and introduces the subtle complexity of shifting corner cases in bit field manipulations.

Diego Masotti

Nov 3, 2025, 9:51:43 AM
to Unum Computing
I'm a bit confused.
What do you do in practice when you run the inductive proof?

Also, is this correct? 
"As for the transcendental functions, for example, if you want to compute the atan2 of a posit<32,2>, you convert it to Priest (or Shewchuk), use it's facilities and then convert back?"


Theodore Omtzigt

Nov 3, 2025, 1:57:57 PM
to Unum Computing
What do you do in practice when you run the inductive proof?
    They run in the regression suites. For posits in particular, you can use the n+1 version to know where the rounding point is for the nbits=n version. That is how it is practically used.

As for testing functions or arithmetic operations where you do not have the required bits in a native type to calculate a reference to compare against, you can pick the Priest versions as the reference. If you look at the regression tests, they all follow the same pattern:

   // Enumerate all bit patterns of the operands; use double as the
   // reference type (it has enough precision for small posits). Shown
   // for addition; the same pattern applies to each operator.
   for (size_t i = 0; i < NR_ENCODINGS; ++i) {
       for (size_t j = 0; j < NR_ENCODINGS; ++j) {
           posit<nbits, es> a, b, c, ref;
           a.setbits(i); b.setbits(j);
           double da = double(a);   // get the reference values
           double db = double(b);
           double dref = da + db;   // do reference op
           ref = dref;              // round the reference to the posit
           c = a + b;               // do native op
           if (c != ref) { /* mark fail */ }   // compare
       }
   }

In the case of operators that need more precision bits than double provides, you need to pick a reference number system that has enough bits, like double-double for posit<32,2>, for example.

Nathan Waivio

Nov 3, 2025, 9:57:32 PM
to André, unum-co...@googlegroups.com
Welcome to the group,
I've developed the Haskell posit library: posit
I'm interested in working on benchmarks, to see how the Haskell implementation would compare to others.  I've not put much work into performance yet.

I have implemented the elementary functions in my library in a polymorphic way.
The typical algorithm is:
1.  Increase the word size
2.  Perform domain reduction (usually similar to what was done in the CORDIC paper)
3.  Recursively calculate the Taylor Series until it reaches a fixed point (see the sketch after this list)
4.  Put everything back together
5.  Decrease the word size back to the original 
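
The fixed-point idea in step 3 can be sketched like this (written in Rust for concreteness; illustrative, not the library's Haskell code):

    // Sum the Taylor series of exp(x) until adding the next term no longer
    // changes the accumulated sum, i.e. until a fixed point is reached.
    // (In the library this would run at the increased word size of step 1.)
    fn exp_taylor(x: f64) -> f64 {
        let mut sum = 0.0;
        let mut term = 1.0; // x^0 / 0!
        let mut n = 1.0;
        loop {
            let next = sum + term;
            if next == sum {
                return sum; // fixed point: the term no longer affects the sum
            }
            sum = next;
            term *= x / n; // next term: x^n / n!
            n += 1.0;
        }
    }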

I think there's a lot of room for improvement but the results seem pretty good so far.

I've also developed the cl3-posit library in Haskell, to gain some experience using Posits in practice.  
It also has the elementary functions for R, C, H, and other "numbers" in the Algebra of Physical Space (i.e. M(2,C)).

Thanks,
Nathan.

Diego Masotti

Nov 4, 2025, 4:04:56 AM
to Unum Computing
I see, thank you.
I spelunked a bit through the codebase for native implementations of the transcendental functions, but I can't find anything specific to posits. For example, I searched for the log function, but found only a stub that converts the input value to a double, uses the standard library implementation, and then converts back. I'm very unfamiliar with C++ code, so I'm not sure if I missed something somewhere.

Diego Masotti

Nov 4, 2025, 4:11:49 AM
to Unum Computing
Ah, a nice Clifford library. It always puts a smile on my face. One day I hope we standardize math around a version of it.

Btw, how much do you increase the word size? Also, does the Taylor series have good convergence properties?

I searched around a bit, and the atan implementation in musl uses a polynomial with some magical values (after doing domain reduction). I don't think we can use those values directly for a posit implementation, because posits have more (or fewer) bits in the mantissa.
Maybe it is possible to compute such values for posits too. Time to study!
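
For reference, such a polynomial evaluation has roughly this shape (a Rust sketch; the coefficients below are just Taylor terms for illustration, whereas musl's magic values, and a posit version, would come from a minimax fit, e.g. with Sollya, at the target precision):

    // atan(x) ≈ x * P(x^2) on a reduced domain, with P evaluated by
    // Horner's rule. Coefficients are illustrative Taylor terms.
    fn atan_poly(x: f64) -> f64 {
        let z = x * x;
        let c = [1.0, -1.0 / 3.0, 1.0 / 5.0, -1.0 / 7.0, 1.0 / 9.0];
        let mut p = c[c.len() - 1];
        for i in (0..c.len() - 1).rev() {
            p = p * z + c[i]; // fold in the next-lower coefficient
        }
        x * p
    }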

Thanks,
Diego

Theodore Omtzigt

Nov 4, 2025, 7:59:35 AM
to Unum Computing
You read that correctly: there is no posit native math library in Universal. At the time we wrote the posit emulator, we didn't yet have a mechanism to implement the Minefield method that John postulated would yield correctly rounded math libraries. So, what you see is the shunt. 

Jay Lim and Santhosh Nagarakatte did that research and developed Rlibm-32: https://dongura.me/publications/rlibm32_arxiv.pdf

We have not brought that into Universal for two reasons: one, the Rlibm work shows that it is not feasible to generate the Minefield polynomial beyond nbits = 32; and two, there is no clear path yet to parameterize the Minefield polynomial as a function of the posit parameters nbits and es. Therefore, a library like Universal would not be able to adhere to the standard's requirement that all math functions are correctly rounded across the state space of its arithmetic operations.

The experiments we have done to approach the math library through traditional function approximations have elucidated the problem that you cannot compute iterates in the given precision; you need enough precision to drive an accurate residual. So an adaptive mathlib for posits is still a work in progress.

Diego Masotti

Nov 4, 2025, 12:38:54 PM
to Unum Computing
Ah, thank you. That article was very interesting.
This seems a very difficult problem.
The best path forward appears to be to decompose the posit and then hand the values off to some generic arbitrary-precision library.
Maybe by studying their algorithms it's possible to find some optimisation opportunities, since we know the number of bits needed.

Thank you again for the clarifications.
Diego

John Gustafson

Nov 4, 2025, 3:50:43 PM
to André, unum-co...@googlegroups.com
André,

Congratulations, and thanks for this contribution to next-generation arithmetic!

The tricky part will be correct rounding for transcendental functions for 64-bit posits. You may have to use Abraham Ziv's approach (see https://inria.hal.science/hal-03721525) and tolerate considerable variation in the speed of each function evaluation depending on the argument.

For precisions up to 32, the Minefield Method works great, and I hear the Github from Rutgers for RLIBM is easy to use. Its functions are supposed to be faster than Intel's math library, yet every result is correctly rounded (unlike Intel's library). 

I'm especially amazed that you're getting 2–10 times faster than SoftPosit. Cerlane and I tuned for a 44-core Xeon processor from around 2016, and I wonder if optimal strategies have changed. We were seeing about 20 clock cycles for a multiply and 30 for an add or subtract. The multiply was faster than a float multiply in SoftFloat, so if you're improving on that, it's impressive. When you consider that it takes 7 or 8 clock cycles to do a multiply on a modern CPU with a fused multiply-adder (ignoring pipelining), the gap between software and hardware performance is not nearly as large as one would guess.

Is there any chance you could add a parameter limiting the maximum regime size, rS? It's not in the Posit Standard (2022), but this will further improve the speed and it sets a guaranteed minimum accuracy within the dynamic range. It looks like eS = 5 and rS = 6 provide excellent dynamic range coverage even for astronomical calculations, and once we have done enough experiments, we may want to extend the Standard with the "b-posit" (bounded posit) type.

Thanks,
John

André

Nov 7, 2025, 10:56:27 AM
to unum-co...@googlegroups.com
Thanks for all the replies!

Some time ago I also implemented a posit library. It's written in Zig; however, I didn't publicize it because the language lacks maturity.

I skimmed the code: even though I'm not very familiar with Zig, it's very impressive how powerful the "comptime" (compile-time computation) mechanism of that language is. I can see you have used it a lot! I try to do the same thing in Rust, but the compile-time system is much less powerful.

Let me know when/if you add some benchmarks! If you add a C interface I can also integrate it into my own benchmarks (where I already compare against Cerlane's SoftPosit and Stillwater's Universal). Hopefully we can steal ideas from each other to make things faster! :)

To clarify: for testing I use an arbitrary-precision rational / bignum library, and define a function to convert a posit to a rational. Then I define an explicit, slow, well-tested function that rounds an arbitrary-precision rational to the nearest posit, using the standard's rounding rule, and I use it as an oracle. Testing is then very simple: e.g. for addition I just assert, for all pairs of posits a and b, that "rational(a) + rational(b)", performed in arbitrary-precision arithmetic, rounds to "a + b", performed in posit arithmetic.
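
In code, the check looks roughly like this (a sketch: `to_rational` and `round_to_nearest` are stand-in names for my internal test helpers, not part of the public API):

    // Oracle check for addition: the exact rational sum, rounded by the
    // slow well-tested oracle, must equal the fast posit-arithmetic sum.
    fn check_add(a: p16, b: p16) {
        let exact = to_rational(a) + to_rational(b); // exact, unrounded
        let expected = round_to_nearest(&exact);     // standard rounding rule
        assert_eq!(a + b, expected);
    }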

Of course this is only exhaustive for up to ~2^30 cases; larger types use random testing + manual corner cases. Still, given that there is only one implementation for arbitrary N and ES, rather than a specialised one for each size, I'm more confident that the exhaustive testing for e.g. p16 or p20 transfers to the bigger posits like p64, which only get ~10M random test cases.

Universal is designed for mixed-precision algorithm design and optimization, with an eye toward utilizing hardware acceleration to deliver the compute. This creates the opportunity to use a general-purpose CPU that doesn't support the type, such as posits, to still work with posits and prepare data structures that have the correct bit-encoding for the hardware accelerator to process natively. The reverse works as well; the accelerator can produce a posit tensor and dump it in main memory, and the general-purpose CPU can serialize it and introspect the bits for testing and validation. However, for all of that to work, you need the ability to specify the address alignment, which drives the limb architecture of these implementations and introduces the subtle complexity of shifting corner cases in bit field manipulations.

I'm curious about this part. What exactly do you mean by "limb" architecture? And what are the implications of "shifting corner cases in bit field manipulations"? This part is unclear to me.

Regards,

Theodore Omtzigt

Nov 7, 2025, 11:12:51 AM
to Unum Computing
Multi-limb architectures for arithmetic divide the representation into chunks so that you can represent sizes that are larger than the native underlying hardware architecture. For example, say you want to model a 128-bit IEEE-754 floating-point on an x86. You could divide the 128 bits into 4 limbs of 32-bit unsigned integers, and then use the integer machinery of the ISA to implement the floating-point arithmetic operations. But you could also do 16 limbs of 8-bit unsigned integers. As most computer architectures like to have operands aligned to their granularity in memory, the memory address alignment of the latter will be 8-bit, and of the former 32-bit.

The pattern we use in Universal is to make that alignment parameterized: the user can pick their limb structure. In the example above, the declarations in Universal would be cfloat<128, 15, std::uint32_t, HAS_SUBNORMALS> and cfloat<128, 15, std::uint8_t, HAS_SUBNORMALS>. When you now look at the implementation, the places where particular field bits hit the limbs change, and thus impact the implementation algorithm of the arithmetic.
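
A tiny sketch of the effect (in Rust for brevity, not Universal code): extracting a bit field from a 4×32-bit limb representation takes one path when the field sits inside a limb and another when it straddles a boundary, and those straddle positions move when you change the limb type.

    // Extract `width` bits (1..=32) starting at bit `lsb` from a 128-bit
    // value stored as four 32-bit limbs, least-significant limb first.
    // Assumes the field does not run past the last limb.
    fn extract(limbs: &[u32; 4], lsb: u32, width: u32) -> u32 {
        let limb = (lsb / 32) as usize;
        let off = lsb % 32;
        let mut bits = limbs[limb] >> off;
        if off + width > 32 {
            // the field straddles into the next limb: a shifting corner case
            bits |= limbs[limb + 1] << (32 - off);
        }
        (bits as u64 & ((1u64 << width) - 1)) as u32
    }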

André

Nov 8, 2025, 3:11:38 PM
to John Gustafson, unum-co...@googlegroups.com
For precisions up to 32, the Minefield Method works great, and I hear the Github from Rutgers for RLIBM is easy to use. Its functions are supposed to be faster than Intel's math library, yet every result is correctly rounded (unlike Intel's library). 

Indeed, this is my next step. I plan on adding support just for the standard's types (2-bit exponent) for now; as Theodore mentioned above, there's no clear way that I'm aware of to make the minefield method parametrised on arbitrary N and ES (or even for N = 64).

I'm especially amazed that you're getting 2–10 times faster than SoftPosit. Cerlane and I tuned for a 44-core Xeon processor from around 2016, and I wonder if optimal strategies have changed. We were seeing about 20 clock cycles for a multiply and 30 for an add or subtract. The multiply was faster than a float multiply in SoftFloat, so if you're improving on that, it's impressive. When you consider that it takes 7 or 8 clock cycles to do a multiply on a modern CPU with a fused multiply-adder (ignoring pipelining), the gap between software and hardware performance is not nearly as large as one would guess.

You are right that the gap is surprisingly low! For example, the fastest p32 operation in my benchmarks is multiply, which on my machine runs at ~300 Mops. Since the same machine does native FPU f32 multiplies at ~1400 Mops, that's just about a 4–5× slowdown. Which is significant, of course, but perhaps not as much as one might think.

I believe one of the keys is exploiting instruction-level parallelism. Modern CPUs have huge amounts of ILP; if you can write your code such that there are as few dependencies as possible, or at least such that two or more parts of the algorithm are independent, then the instructions may be able to run and retire almost as fast as they can be decoded :)
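
A toy illustration of the principle (nothing posit-specific): both functions below do the same work, but the second keeps two independent accumulator chains, so an out-of-order CPU can overlap the multiply-adds instead of waiting on one long dependency chain.

    // One long dependency chain: every add waits for the previous one.
    fn dot_serial(a: &[f32], b: &[f32]) -> f32 {
        let mut s = 0.0;
        for i in 0..a.len() {
            s += a[i] * b[i];
        }
        s
    }

    // Two independent chains: the CPU can keep both in flight at once.
    fn dot_ilp(a: &[f32], b: &[f32]) -> f32 {
        let (mut s0, mut s1) = (0.0, 0.0);
        let mut i = 0;
        while i + 1 < a.len() {
            s0 += a[i] * b[i];
            s1 += a[i + 1] * b[i + 1];
            i += 2;
        }
        if i < a.len() {
            s0 += a[i] * b[i]; // odd-length tail
        }
        s0 + s1
    }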

The code in SoftPosit is quite branchy. Perhaps those branches are not so well predicted? If so, there will be pipeline stalls that hurt performance in the benchmark. Tools like Linux perf can look into this in great detail, but I've not done such an analysis. But even if the branches are well predicted, they might introduce unwanted data dependencies.

Just as an example: when I rewrote the "encode" (frac and exp → posit) function to be branchless, composing the "regime bits" in one register and the "fraction bits" in another and only or'ing them together in the final step, the "multiply" benchmarks got 1.8× faster. Not only did that "encode" portion become slightly faster, it also unlocked ILP opportunities (and probably better codegen from the compiler) throughout the whole "multiply" operation (comprising: decode posit operands to frac and exp → multiply fracs, add exps → encode the resulting frac and exp to a posit).

Beware that this is mostly supposition. Performance optimisation in modern CPUs is a dark art :)

Just a final note: as you noted, it takes "7–8 cycles to do an FMA ignoring pipelining". That's true, but the peak throughput is 0.5 cycles/instruction if there are no data dependencies, which is a massive difference. The same thing applies to the functions in this library: they are much faster in a loop than if we were measuring the start-to-end latency of a single call. The benchmarks I ran loop through 1M pairs of random numbers stored in an array, so they're actually measuring the fully pipelined speed. I'll consider adding other types of benchmarks that are more representative.
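
For the curious, the benchmark loop has roughly this shape (simplified): because the iterations are independent, the CPU is free to pipeline them, so this measures throughput rather than single-call latency.

    // Stream through pre-generated random operands; each iteration is
    // independent, so the loop runs at the fully pipelined speed.
    fn bench_mul(xs: &[p32], ys: &[p32], out: &mut [p32]) {
        for i in 0..xs.len() {
            out[i] = xs[i] * ys[i];
        }
    }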

Is there any chance you could add a parameter limiting the maximum regime size, rS?

This is in the "every bit counts" book, correct? I have this on the radar, I will pick it up when I manage to get my hands on the book and when I have some more free time, probably towards the end of the year :)