David, you beat me to it. I highly recommend any paper on hardware implementation that has the name R. Murillo on it. He was the first hardware designer (after Isaac Yonemoto, that is) to see how to exploit the 2's complement nature of the posit encoding, and his designs are efficient and getting more so. He has another paper at the CoNGA ’23 conference, which takes place in a few days, March 1–2 in Singapore.
The earlier designs took the absolute value (noting the sign), then decoded the 2^n scale factor and the fraction, and applied conventional floating-point algorithms. That makes posit arithmetic always take more work than float arithmetic (comparing against floats on normal operands only), obviously.
Some additional thoughts:
The Count-Leading-Zeros and Count-Leading-Ones operations are critical for decoding the regime bits, obviously, and there are a number of shortcuts for doing that. The original log-cost circuit for it was designed by Vojin Oklobdzija (very high Scrabble® score for that name) in the early 1990s. But the circuits that quickly add two regimes (for multiplication of two posits) haven't been designed yet, as far as I know. It should be possible to concatenate or cancel the strings of 0 or 1 bits directly without first turning them into some kind of positional-notation integer.
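For concreteness, here is a rough C sketch of regime decoding built on a count-leading-zeros primitive (decode_regime is a made-up name, __builtin_clz is the GCC/Clang builtin, and the sign is assumed to have been handled already):

    #include <stdint.h>

    /* Decode the regime of a 32-bit posit whose sign has already been
       handled (i.e., p is the magnitude encoding). Returns the regime
       value k and stores the regime field width (run plus terminating
       bit) in *len. A sketch only; p == 0 and an all-ones run that
       fills the word are ignored here. */
    int decode_regime(uint32_t p, int *len) {
        uint32_t bits = p << 1;                  /* drop the sign bit      */
        if (bits >> 31) {                        /* run of 1s: k = run - 1 */
            int run = __builtin_clz(~bits | 1);  /* count the leading 1s   */
            *len = run + 1;
            return run - 1;
        } else {                                 /* run of 0s: k = -run    */
            int run = __builtin_clz(bits | 1);   /* count the leading 0s   */
            *len = run + 1;
            return -run;
        }
    }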
A critical design decision is the extent to which you need constant-time operation. It's better from a RISC standpoint if all your operations take a single clock, but it's better for arithmetic performance if you instead allow every operation to go as fast as it can. Half of all posit values have either 01 or 10 for the regime bits and can be decoded as quickly as a float, since you know where all the bits are and can decode them in parallel. You could decode speculatively on the assumption that there are only two regime bits, while testing in parallel whether that's true and interrupting the calculation to count the regime bits for the other half of the cases. And while the fast path covers only half of the bit patterns, it is taken much more often than that in practice, because the most commonly used numbers fall in that central range of magnitudes.
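The fast-path test itself is tiny, under the same assumptions as the sketch above (two_bit_regime is an invented name; in hardware this is just a two-bit comparison steering a mux):

    #include <stdbool.h>
    #include <stdint.h>

    /* True when the two bits after the sign are 01 or 10, i.e. the
       regime is exactly two bits and every remaining field position
       is known. Half of all bit patterns pass this test. */
    static bool two_bit_regime(uint32_t p) {
        uint32_t r = (p >> 29) & 3;   /* bits 30..29: the regime bits */
        return r == 1 || r == 2;      /* pattern 01 or 10             */
    }

Decode speculatively as if this returns true; when it doesn't, fall back to the full regime count.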
Notice that IEEE 754 floats only have constant-time operation for normal operands. If anything is exceptional (subnormals, infinities, NaNs), the operation traps to microcode or software and can take 50 to 200 clock cycles to process. So in a way, designers have already accepted variable-time execution for arithmetic on real values. Having hardware that takes far fewer clocks when multiplying by 1 or 0, or adding 0, or passing through a NaN (NaR) value seems like a performance win to me, even if it tends to mess up the pipeline.
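As a software illustration of those early-out paths (slow_mul stands in for the full datapath and is hypothetical; the encodings 0x80000000 for NaR and 0x40000000 for 1.0 are the standard 32-bit posit ones):

    #include <stdint.h>

    #define NAR32 0x80000000u   /* NaR encoding for 32-bit posits */
    #define ONE32 0x40000000u   /* 1.0 encoding for 32-bit posits */

    uint32_t slow_mul(uint32_t a, uint32_t b);  /* full datapath, not shown */

    uint32_t posit32_mul(uint32_t a, uint32_t b) {
        if (a == NAR32 || b == NAR32) return NAR32; /* NaR passes through */
        if (a == 0 || b == 0)         return 0;     /* exact zero result  */
        if (a == ONE32)               return b;     /* multiply by 1      */
        if (b == ONE32)               return a;
        return slow_mul(a, b);                      /* the expensive case */
    }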
One thing I don't think anyone has tried for multiplication is multiplying the fraction bits, right to left, while decoding the regime bits from left to right (and adding the two regime-exponent pairs). That should have a lower worst-case latency because the more work there is to decode the regime, the less work there is to multiply the significands and vice versa.
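The reason the worst cases trade off is that the field widths are complementary; roughly (frac_bits is just an illustration, for an n-bit posit with es exponent bits):

    /* Fraction bits remaining after a regime field of r bits (run plus
       terminator) and es exponent bits: the longer the regime scan,
       the smaller the significand multiply that remains. */
    static int frac_bits(int n, int es, int r) {
        int f = n - 1 - r - es;   /* subtract sign, regime, exponent */
        return f > 0 ? f : 0;
    }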
If you write up a comparison of the cost of a posit arithmetic unit with that of a float arithmetic unit for same-precision values, please:
1) Note that there is no need for comparison instructions in the posit arithmetic unit, since they are the same as the integer comparison instructions (a one-liner; see the sketch below), whereas the float unit has a rather complicated job to do if it is at all IEEE-compliant.
2) If the float unit does not handle exception cases in hardware (IBM's POWER seems to be the only CPU left standing that does), include the cost of the microcode for exception handling, and the time required to handle an exceptional operation.
That will lead to a much fairer comparison of posits and floats.
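To make point 1 concrete, posit comparison in C is nothing more than this (posit32_less is a made-up name; NaR, encoded 0x80000000, lands at INT32_MIN and so sorts below everything, matching its ordering in the posit standard):

    #include <stdint.h>

    /* Posits compare exactly like two's complement integers, so no
       dedicated comparison hardware is needed. */
    static int posit32_less(uint32_t a, uint32_t b) {
        return (int32_t)a < (int32_t)b;
    }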
One last suggestion: include two registers that retain the smallest-magnitude and largest-magnitude posits encountered during a program (or subprogram) execution, resettable by a compiler. These can then replace underflow and overflow warnings to the user. Hitting minPos or maxPos in magnitude anywhere during execution means a severe loss of accuracy, and it would be incredibly useful to have the option of reporting such incidents to programmers and users.
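Here is what those registers would amount to, sketched in software (the names and initial values are illustrative; minPos and maxPos for 32-bit posits encode as 0x00000001 and 0x7FFFFFFF):

    #include <stdint.h>

    static uint32_t min_mag_seen = 0x7FFFFFFFu;  /* start at maxPos */
    static uint32_t max_mag_seen = 0x00000001u;  /* start at minPos */

    /* Call on every posit result. Afterward, min_mag_seen == 0x00000001
       or max_mag_seen == 0x7FFFFFFF flags that minPos or maxPos was hit,
       replacing underflow/overflow warnings. Works because posit bit
       patterns of like sign order the same way as their values. */
    static void track(uint32_t p) {
        if (p == 0 || p == 0x80000000u) return;   /* ignore 0 and NaR     */
        uint32_t mag = (p >> 31) ? (0u - p) : p;  /* two's-comp magnitude */
        if (mag < min_mag_seen) min_mag_seen = mag;
        if (mag > max_mag_seen) max_mag_seen = mag;
    }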
Best,
John