Hardware Implementations


jfcg...@gmail.com

Aug 16, 2019, 9:57:36 AM
to Unum Computing
Hi all,

What are the planned hardware (CPU/GPU/ASIC) implementations of posits? Is there any indication that we could buy hw with posit support in, say, 3 years?
Do any of the big players (AMD, Intel, NVIDIA, ARM, IBM) have any plans involving posits?

Thanks..

theo

Aug 16, 2019, 10:30:39 AM
to Unum Computing
There are already several commercial posit implementations available.

A full HPC cluster configuration has been available since the beginning of this year, targeting petascale application performance. It is being presented in RFPs to supercomputing centers.

A RISC-V core with posit arithmetic was demonstrated at the RISC-V workshop in Zurich a couple of months ago.

And the University of Washington has a full DL stack that is posit-enabled, with a posit-based tensor processor in the style of Google's TPU.

And at least three FPGA vendors, Achronix, Xilinx, and Intel, have posit arithmetic solutions for their platforms. There are examples of software-defined radios and image-processing pipelines.

Intel, with its MIC architecture, is in the best position to offer posit-based accelerators for its Xeon line. It is already offering bfloat16 as an accelerator, and it would take very little to incorporate a posit-based hw accelerator in this ecosystem.

But, posit hardware has been available from a variety of sources for more than 18 months.

theo

Aug 16, 2019, 7:02:08 PM
to Unum Computing
Some more data on posit hardware:

Both Huawei and Robert Bosch have commercial posit arithmetic chips for specialized products: Huawei in base-station silicon, and Robert Bosch in sensor-fusion modules for motor control and battery management. Both organizations started their R&D in the middle of 2017.

China has several AI and blockchain ASIC vendors that have been talking to us about posit arithmetic. Nothing ever comes back, so it is difficult to know whether they have integrated it into their ASICs, but it is not unreasonable to assume they have, as posits deliver a 2x performance benefit on memory-bound algorithms, making it an easy optimization. Again, this is function-specific silicon, so the software lift to adopt posits is much easier.

Facebook has built posit-based tensor processor RTL but they have progressed to another number system they call DeepFloat. They were talking about open sourcing both the run-time and the hardware design back in 2017, but it doesn't appear they have progressed to a working system.

James Brakefield

Sep 1, 2019, 9:35:46 PM
to Unum Computing
A French group has C++ source for posits that feeds into Xilinx's HLS (High-Level Synthesis).
Code: https://gitlab.inria.fr/lforget/marto
They also have a 2nd edition of their floating-point "encyclopedia": Handbook of Floating-Point Arithmetic, 2nd ed.
Perhaps the 3rd ed. will include posits.

Jim Brakefield

John L. Gustafson

Sep 13, 2019, 9:19:43 AM
to James Brakefield, Unum Computing
James, and everyone on Unum Computing,

If you go through the papers by de Dinechin et al., you discover the amazing claim that posits have to watch out for exception cases, but floats have no such burden. I am not making this up; read the paper carefully and you will see it. Those who have invested their technical careers in IEEE 754 floats are making desperate and disingenuous claims about the relative cost of posits and IEEE 754 float rules. The hardware complexity comparisons they are generating assume that the cases of exponent saturation (all 1s or all 0s) are so rare that they can say, "Nothing to see here, folks, move along…" That works to make people overlook some massive warts in the rules for processing IEEE 754 floats.

If you have a commercial investment in old-fashioned 1985-style floating-point, brace for impact. The only remaining reason for adhering to a grossly-antiquated Standard is looking more indefensible than ever.

John


James Brakefield

Sep 14, 2019, 12:22:42 AM
to Unum Computing
I've looked at "Evaluating the hardware cost of the posit number system," and it appears that the posit decode and encode circuits (to and from the intermediate format) are included in the add/subtract and multiply resource counts. On a RISC computer, wouldn't posit decode and encode be part of the memory paths rather than the ALU paths?
In any case, ASIC resource counts would be of greater value than FPGA resource counts. BTW, for low-cost FPGAs one gets ~1K LUTs per dollar.


>Those who have invested their technical careers in IEEE 754 floats are making desperate and disingenuous claims 
Ugh, keep one's eyes open: the multiple favorable posit accuracy reports coming from 3rd parties (using posit software emulation) are convincing to me.


Theodore Omtzigt

Sep 14, 2019, 1:57:48 PM
to Unum Computing
> and it appears that the Posit decode and encode to intermediate format circuits and resources are included in the add/subtract and multiply resource counts?  On a RISC computer Posit decode and encode are likely to be part of the memory paths and not part of the ALU paths?
Exactly: in any reasonable hw implementation, decode/encode are part of load/store and not part of the arithmetic pipeline. In sw emulation, decode/encode is part of the arithmetic, because a decoded posit stored in memory would negate all the bandwidth gains we get from posits being a denser information encoding for the real numbers.
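
For concreteness, a minimal C++ sketch of such a load-path decode for posit<16,1>, with illustrative names (a real load/store unit would do this combinationally rather than with loops):

    #include <cstdint>

    // Decoded register-file form (illustrative field widths).
    struct Decoded {
        bool     nar;    // NaR flag
        bool     sign;
        int      scale;  // power-of-two exponent: k * 2^es + e, here es = 1
        uint16_t frac;   // significand with the hidden bit, left-aligned
    };

    Decoded decode_posit16(uint16_t p) {
        Decoded d{false, false, 0, 0};
        if (p == 0)      return d;                    // zero: all fields zero
        if (p == 0x8000) { d.nar = true; return d; }  // NaR: sign bit only
        d.sign = (p & 0x8000) != 0;
        uint16_t v = d.sign ? uint16_t(0u - p) : p;   // 2's complement if negative
        uint16_t bits = uint16_t(v << 1);             // drop the sign bit
        int k = 0;
        if (bits & 0x8000) {                          // regime run of 1s: k = run - 1
            while (bits & 0x8000) { ++k; bits = uint16_t(bits << 1); }
            --k;
        } else {                                      // regime run of 0s: k = -run
            while (bits && !(bits & 0x8000)) { --k; bits = uint16_t(bits << 1); }
        }
        bits = uint16_t(bits << 1);                   // drop the regime terminator
        int e = (bits >> 15) & 1;                     // the single es = 1 exponent bit
        bits = uint16_t(bits << 1);
        d.scale = 2 * k + e;                          // scale = k * 2^es + e
        d.frac  = uint16_t(0x8000u | (bits >> 1));    // prepend the hidden bit
        return d;
    }

The store-path encoder is the mirror image, and it is where the rounding logic lives.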

Another benefit of posits over IEEE for high-performance microprocessors is the exception behavior. Posit arithmetic makes it easier to create high-frequency (read: deep) computational pipelines than IEEE floating point does.

florent.d...@gmail.com

Sep 16, 2019, 11:32:07 AM
to Unum Computing
I totally agree that having the decoder/encoder logic in the load/store unit is the most promising way to go in terms of the accuracy/performance trade-off. This is indeed what we suggest in our good/bad/ugly paper.

But then you have an implementation that is no longer compatible with the posit standard, since you are actually performing some (as much as possible) of the computation in an internal format that is not posits, with internal roundings that are not posit roundings.
Worse, you lose all forms of reproducibility, since the decision to keep data in registers or send it back to memory is made in the register-allocation phase of the compiler: you add a print statement for debugging, and you get a different result.
We have been through all this with IEEE floating point and the 80-bit double-extended format (look for the horror stories in our floating-point book, or in Monniaux's pitfall paper).
Don't reproduce the same mistakes 30 years later.
What is good for an accelerator is not necessarily good for general-purpose computing.

So you cannot claim superior reproducibility without paying the full conversion latency. As far as I know.

florent.d...@gmail.com

Sep 16, 2019, 12:24:03 PM
to Unum Computing
About warts:
I fully agree with John that the IEEE standard is antiquated; I was 15 when it was designed. That doesn't make all its choices automatically obsolete.
Posits get rid of signed zeroes, and that is progress.
Posits (currently) get rid of infinities, and I humbly believe that is a regression.
Saturated arithmetic is a good choice for most of signal processing and machine learning, and a very bad choice in other contexts: when I'm running safety-critical code, I don't want overflow to go unnoticed.
It is easy to fix in an upcoming standard, e.g. by giving an infinity semantics to maxposit, so that maxposit - maxposit = NaR instead of 0, etc.; the sketch below contrasts the two behaviors.
It could be an option stored in a status flag, so that the programmer decides when he wants saturated arithmetic, and when she wants infinity arithmetic.
But then it would become what John calls a wart.
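
A purely illustrative C++ sketch of the two semantics, with doubles standing in for posit values and std::nullopt standing in for NaR (names hypothetical; only the maxposit - maxposit case mentioned above is modeled):

    #include <optional>

    enum class OverflowMode { Saturate, InfinityLike };

    // Under Saturate (the current draft behavior), maxpos - maxpos == 0.
    // Under InfinityLike, maxpos acts as an overflow marker, so
    // maxpos - maxpos is indeterminate, like inf - inf, and yields NaR.
    std::optional<double> sub(double a, double b, double maxpos, OverflowMode m) {
        if (m == OverflowMode::InfinityLike &&
            ((a == maxpos && b == maxpos) || (a == -maxpos && b == -maxpos)))
            return std::nullopt;              // NaR
        double r = a - b;
        if (r >  maxpos) r =  maxpos;         // saturate on overflow
        if (r < -maxpos) r = -maxpos;
        return r;
    }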

I do believe that warts grown this way make a more attractive face.
Relatedly, I believe we are still missing all the warts that will make the quire usable and efficient in an actual software+hardware system.

And if you think this is the word of a blind IEEE-754 fanatic, please consider that over the last 10 years I have promoted in FloPoCo the use of a non-IEEE FP format.
That is what makes the most sense on FPGAs, while IEEE-754 makes more sense in the context of the general-purpose processors it was designed for.
(Not to mention the current investment of my group in posits.)



Theodore Omtzigt

Sep 16, 2019, 12:26:43 PM
to Unum Computing
Hi Florent:

The double-extended and Intel Itanium extended-precision histories are indeed great stories to remember.

However, I can ALWAYS guarantee that the representation of the triple (sign, scale, significand) is the EXACT representation of the posit, so the fact that we have a register file of triples is not a cause for failing posit compliance.

florent.d...@gmail.com

Sep 16, 2019, 12:51:56 PM
to Unum Computing

That's a very good design choice indeed.
Can you achieve it in significantly less hardware/latency than encoding then decoding the exact intermediate result of the operation? You need to determine at which position of the significand to round.

Theodore Omtzigt

Sep 16, 2019, 2:15:26 PM
to Unum Computing
Looking at our current hw pipeline, I think the answer is yes on both counts: significantly less hardware, and little impact on cycle time. There are a couple of bypass paths where we would need to round inside the pipeline, and there the rounding logic adds a little latency that would impact the cycle time of a raw arithmetic pipeline by a couple of percent. The bigger problems are in the unrounded path to the quire, as there is a conversion that requires a shift determination and a shift.

The caveat here is that all our current experiments are with relatively short pipelines (2 and 3 clocks). I am sure that we need to move stuff around if we want to go to higher frequencies.

John Gustafson

Sep 16, 2019, 10:28:06 PM
to florent.d...@gmail.com, Unum Computing
Florent,

Welcome to the discussion! Let me see if I can change your mind about one thing: representation of infinity.

Can you show me a real application where you are trying to calculate infinity? Or where infinity is a legitimate intermediate value? I am not aware of one. It would have to be a pretty silly calculation, one that does not require a computer so much as a little symbolic math.

I believe you are finding value in infinities in output as a way of alerting the user or programmer that overflow has occurred. Yet in IEEE 754 it is very easy to hit an infinity, compute with it (divide a finite value by it), and proceed to a result that looks correct in the final output. In fact, IEEE 754 even lets you bury a NaN value such that it does not propagate to the final answer! Check out the definition of hypot(x, y). If x is NaN and y is infinity, the result is infinity. So by those rules, 1 / sqrt( NaN² + ∞²) = 0. (For posits, arithmetic with a NaR always propagates to the answer, with no math library exceptions.)
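
That rule is easy to check for yourself; a minimal C++ demonstration, relying only on the standard C99/IEEE 754 definition of hypot:

    #include <cmath>
    #include <cstdio>

    int main() {
        // IEEE 754 / C99 define hypot(x, y) = +inf whenever either argument
        // is infinite, even if the other is NaN, so the NaN is absorbed.
        double h = std::hypot(NAN, INFINITY);
        std::printf("hypot(NaN, inf)   = %g\n", h);        // prints: inf
        std::printf("1/hypot(NaN, inf) = %g\n", 1.0 / h);  // prints: 0
        return 0;
    }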

Originally I was tempted to put into the Draft Standard the requirement that two registers, resettable to zero by the user, record the smallest magnitude and largest magnitude numbers that occur in a posit calculation. They can be output and reset to detect if the calculation has strayed into the inaccurate regions, or saturated at minpos or maxpos. I decided not to burden the hardware with that because it belongs in the debugger, not the hardware.

For that matter, many dubious features of IEEE 754 belong in the debugger, or are taken care of by the language environment. The idea that the hardware should babysit a poorly-designed algorithm and be responsible for reporting when it goes awry is simply not the way computing systems (or programmers, or users) work. Exceptions like overflow are an indication that the software is not yet finished and is not ready to ship! If a calculation in C, say, attempts a square root of a negative number, the logic in the function notices and puts out an error message, not the hardware. Exceptions are supposed to be rare; when you put complicated exception-handling into a hardware standard, you burden every single operation with waste, and you cannot turn it off. You say you don't want overflow in safety-critical code to go unnoticed; I don't want safety-critical code that can overflow to be released, ever! If a computation can overflow, put in a conditional test for it in the software and then do whatever the intelligent thing is to do for that situation… don't produce an "infinity" and take comfort that someone will notice the bad answer.

I once saw a life-critical application, CT medical imaging, where the application found the pixel with the maximum brightness and normalized the image to that so that image contrast would be maximized. This was decades ago, when the computation took 30 minutes, and a patient with a concussion needs a diagnosis ASAP. Well, one of the pixels was so bright that it overflowed to infinity. Dividing by that maximum made every other pixel zero, that is, black. After waiting for the image for half an hour, all the doctor got back was a black screen. Not cool. Bad programming, yes, but made worse by overflow-to-infinity rules, and by proceeding with a calculation that should have tested and handled the too-large value.

Status flags are evil, because the same program with the same data can produce different results depending on something utterly invisible and changeable. Is that really what you want? This is why posits have only one rounding mode. I'm glad you mentioned the paper by David Monniaux, because it is an excellent survey of all the ways IEEE 754 rules can lead to irreproducible results. (For others on this forum, you can find the paper at https://arxiv.org/abs/cs/0701192.)

Status flags are useful for letting hardware track things like integer carry, negative, zero, and so on, but notice that integer status flags are not something the programmer sets to control the behavior of an entire program. Imagine, say, an integer flag for 2's complement integers that says 10000000 means +128 instead of –128 for an 8-bit integer. Arguing, perhaps, "That way, absolute value will always work." Now you have irreproducibility in integer codes, too! So don't expect me to put such "mode of operation" flags into the Draft Posit Standard. Those aren't "warts." They're malignant tumors.

John


John L. Gustafson

Sep 17, 2019, 12:15:00 AM
to Theodore Omtzigt, Unum Computing
Theo,

I really don't think you want to store the sign separately from the significand; that's like sign-magnitude representation of integers, which we now realize makes integer calculations more complicated, not less. Store the significand as a 2's complement signed integer; that also takes care of the zero exception case for many operations. Similarly store the scale as a 2's complement signed integer. Many of the conditional tests needed for sign-magnitude representation disappear that way, and it's also fairly direct to convert a posit to its 2's complement significand just by sign-extending the bits beyond the es bits. 

However, you probably do want to have a bit that says whether it is NaR or not, and I know Florent agrees that doing so saves logic, time, and energy. So your triple is (scale, significand, NaR flag). Right?
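
As a minimal sketch of why that triple is convenient (illustrative names and field widths, not a prescription):

    #include <cstdint>

    // The register-file triple: a NaR flag, a 2's complement scale n, and
    // a 2's complement significand m, with value = m * 2^n.
    struct PositTriple {
        bool    nar;
        int32_t scale;
        int64_t sig;
    };

    // Multiplication needs no sign or zero special cases at all: the 2's
    // complement multiply handles the sign, sig == 0 handles zero, and NaR
    // propagates as a single OR. Rounding back to a posit happens at encode.
    PositTriple mul(PositTriple a, PositTriple b) {
        return { a.nar || b.nar, a.scale + b.scale, a.sig * b.sig };
    }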

John


florent.d...@gmail.com

Sep 17, 2019, 10:00:30 AM
to Unum Computing

I tend to agree that code should be written bug-free. I keep telling my students that.

Please show me the tool that tells you that your code will never overflow, so people can use it for their floating-point code immediately.
The quest for such tools is totally independent of the underlying arithmetic.
However, if the generalization of posits is conditioned on the existence of such a tool, that is a strong point against posits, because such tools don't exist yet and (with such annoyances as the halting problem) probably never will.
We've made quite a lot of progress with flagships such as Why or Herbie, but this is an endless quest, requiring deep changes to the software stack that will take much longer than the adoption of posits.

Meanwhile, overflow will happen in code, because code gets used in situations not predicted when it was written.
Bugs will still be detected after the code is released; I would say that is part of the definition of a bug.
Or they will not, and that is the question. Most programmers/users worry when an infinity (or a black screen) pops up in their code, whatever the source.
This is what infinity is useful for: an indication that something went wrong.
Posit code will do strange things and will eventually break at some point, and it will just be more difficult to trace to the source.

Back to the thread, in summary:
- On the one hand, infinity arithmetic or directed rounding support (a few bits to toggle) is an insufferable burden on hardware.
- On the other hand, improving accuracy from 24 to 27 bits (in the favorable cases) perfectly justifies a factor 2-4 in the area*latency of the unit (and probably power).
I see a contradiction here.
(Reducing this overhead is fun and we are doing our part, but there is little doubt it will remain an overhead.)

Finally, I know state bits are evil. No discussion about that. I didn't advocate them; I advocate choice.
Modern instruction sets give you the opportunity to encode the warts in each instruction word, without status bits.
In RISC-V, they even got rid of the integer carry flag; you have to use a full register instead.



William Tanksley, Jr

Sep 17, 2019, 11:41:20 AM
to Unum Computing
It strikes me that this problem of infinity is precisely what the ubit was intended for (along with the much more constrained problem of every other rounded result).

So perhaps this is an argument that it makes sense to lose something when you remove the ubit. You have to decide what you're going to lose: either all ability to detect overflow, OR the ability to distinguish overflow from NaR, OR your biggest +/- pair of numbers, which become your overflow buckets.
-Wm


John L. Gustafson

Sep 17, 2019, 10:26:56 PM
to florent.d...@gmail.com, Unum Computing
Florent,

From reading your paper on the hardware cost of posits, as I noted earlier on this forum, your claims of simplicity for IEEE float hardware assume that there is no exception handling needed for IEEE floats. To quote your paper verbatim:

"The posit decoder used is described in Figure 1. The expensive part of this architecture comes from: the OR reduce over N–1 bits to detect NaR numbers; and the leading zero or one count (LZOC + Shift) that consumes the regime bits while aligning the significand. The +EMin aligns the exponents to simplify following operators. This decoding cannot be compared to an IEEE floating-point equivalent as no decoding is needed." (boldfacing is mine.)

You then use this to claim that posits have "a factor 2-4 in the area*latency" compared to floats. I was baffled by your claim until I figured out that you must be assuming all exceptions, including the leading zero count needed for denormalized floats, are being handled separately. Like, with a trap to microcode or a software handler. Intel and AMD are quietly using this trick, and taking about 200 clock cycles to process denormalized floats. Setting aside the question of whether this is a good idea or not, a fair comparison of IEEE float hardware and posit hardware would include all the hardware needed for subnormal, quiet NaN, signaling NaN, and infinity exception handling. For IEEE floats, you will need LZOC + Shift to decode subnormal inputs; you will also need an OR reduce to detect all 0 or all 1 bits in the exponent, indicating a float that is not normal. And if that happens, you then need another OR reduce over the rest of the bits to determine what kind of exception case you have. It is clearly a superset of the hardware required for posit decoding. You should correct this in your paper, because it is a highly misleading, unscientific, and unfair comparison based on the false claim that IEEE float decoding only needs to handle normal floats.

The area required for the quire is a different matter; I understand that cost, and IEEE floats do not have that cost nor do they have the capabilities enabled by the quire. The feature costs something, and I have found that it frequently enables the replacement of 64-bit IEEE floats with 32-bit posits, which can save gigabytes or terabytes of storage. That seems worth spending a few hundred on-chip transistors for, don't you think? Now that processor chips have over 10 billion transistors?

I'm not sure where you get "improving accuracy from 24 to 27 bits (in the favorable cases)" from, because half of all 32-bit posits (the most commonly used half) have 28 bits of significand, not 27. (One-fourth of all 32-bit posits have 27 bits of significand.) That raises the (wobbling) accuracy of single precision from 7.2–7.5 decimals to 8.4–8.7 decimals for most calculations, which I assert can sometimes allow 32-bit posits to replace 64-bit IEEE floats even without the quire. And in an honest accounting of hardware cost, they do that with less area and latency, since the exception handling is trivial for posits.
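
Those decimal figures follow from the usual rule that a p-bit significand delivers between p · log10(2) and (p+1) · log10(2) decimal digits, depending on where in the binade a value falls:

    p = 24:  24 · log10(2) ≈ 7.22  up to  25 · log10(2) ≈ 7.53  (binary32 floats)
    p = 28:  28 · log10(2) ≈ 8.43  up to  29 · log10(2) ≈ 8.73  (most 32-bit posits)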

John


florent.d...@gmail.com

Sep 19, 2019, 2:05:33 AM
to Unum Computing
Dear John,

I stand corrected on the 28 bits instead of 27.

Besides this, this post is FACTUALLY WRONG.
We compare an IEEE adder with full hardware subnormal support to a (draft) standard posit adder.
And we consistently get almost double the size _and_ almost double the latency for posits.
With the same technology, the same tools, the same design effort by the same people, and open-source reproducibility.
Before that, we compare our posit adder to the best recently published ones, and it seems we have the state of the art in posit implementation, so don't accuse us of sabotaging posits.
The state of the art is by definition a moving target; for instance, Theodore already suggested a way that reduces the latency overhead (while increasing the area), and I'll take that as a valid comment.

But not the comment that we compare to non-standard IEEE.
John, please stand corrected on the false claims you make about our work.

For multiplication, what we have in the paper is indeed a comparison with the industry standard on FPGAs, which essentially flushes to zero, and this caveat is in the article:
"The comparison on multiplication is less definitive, as it lacks a fully compliant IEEE multiplier implementation with subnormal support."
We are working on this, but don't expect a miracle.

However, and this is my second point, flushing subnormals to zero or delegating them to software is a PERFECTLY SENSIBLE DESIGN CHOICE.
At least under the Gustafson-like argument that "no code should ever be released that even comes close to subnormals".
Let us remind everyone that in the IEEE subnormal range, posit16 has already lost half its accuracy, and posit32 and posit64 have long since flushed to zero.
Why should IEEE float designers be denied a freedom that posit people take?

If we take this liberty, then the area.delay overhead of posits is more like a factor of 10. Look at the tables in the paper.
I repeat: one order of magnitude worse area.delay for posits.

I don't like being accused of making false claims.
Any claim that the hardware cost of posits is comparable to that of floats of the same size is, as far as I could verify, very false.
The question would rather be: can posit32 be competitive in terms of area.delay with the float64 it is supposed to replace?
Look at the data in the paper; for addition we are not there yet. Not to mention that float64 has twice the accuracy on a wider range.

Now that we have this data, we can improve it, and anybody can balance the benefits of posits (e.g. better accuracy for the same energy when moving data from/to memory) against their cost.

Regards to all,

   Florent de Dinechin



marc.b....@gmail.com

Sep 19, 2019, 5:30:42 PM
to Unum Computing
In the spirit that it never hurts to state the obvious: adding a second load/store op pair to the "Lyon" design that operates on 2x the width (perhaps just the internal format?) would greatly increase usability. No more worries about hitting an extra rounding step when a register spills, for example.

John L. Gustafson

Sep 26, 2019, 1:25:36 AM
to florent.d...@gmail.com, Unum Computing
Dear Florent,

The hardware cost of posits vs. floats is one of the most important topics for this forum. Progress is being made on a number of fronts regarding this issue, but the question needs to be answered carefully, rigorously, and dispassionately.

I found a great many errors in your paper, not just the 28-versus-27-bits issue. There are numerical, conceptual, and grammatical errors that suggest it was not even edited or spell-checked before submission. And any reviewer surely would have caught the fact that you left out many of the requirements of the IEEE 754 Standard (for multiplication especially), while still labeling your circuits as IEEE.

If you handle exceptions with software, that software runs on hardware. You then need to include all of the hardware needed for that software in making a comparison with posits. Which I imagine adds quite a bit to the cost of floats, since a typical x86 now takes about 200 clock cycles to trap and handle subnormals. Actually, we don't have to imagine. We can simply look at the complexity of Berkeley's SoftFloat, which perfectly expresses the IEEE 754 Standard using only integer operations, and the complexity of SoftPosit, which perfectly expresses the Draft Posit Standard using only integer operations. SoftPosit has considerably fewer lines of code and is actually slightly faster than SoftFloat. John Hauser has been optimizing SoftFloat for over 20 years, but Cerlane Leong only spent about six months tuning SoftPosit, please note. What we have so far provides rather compelling evidence that a full hardware implementation of IEEE 754 will require more chip area and latency than a full posit hardware implementation.

When you and your co-authors wrote "Posits: The Good, The Bad, and the Ugly," I was similarly struck by the many errors in the paper, not the least of which was that you footnoted that it had been accepted for the CoNGA 2019 conference, and you published the paper with that claim on HAL; in fact, we had not even assigned it to reviewers yet! We were only able to accept the paper after major revision. It also was the only paper I've ever seen written about posit arithmetic that avoided (in its original version) any references to the original paper on posits, or the Draft Standard, or the extended "posit4" paper available on posithub.org. It does not appear that you are following well-established rules for scientific publication.

Let's step back and compare IEEE binary floats with posits using observations that I think everyone on this forum can agree with.

Both systems express numbers that are either exceptions or of the form m × 2^n, where m and n are integers.

For simpler hardware, both systems can maintain m and n in decoded form except when storing to memory. They can both also maintain single-bit flags to make it fast to detect exceptions without requiring an OR tree on a large number of bits. Float hardware might flag a number as subnormal or NaN or an infinity, and posits need a NaR flag and a small unsigned integer indicating the regime length. The most hardware-friendly way to store signed integers m and n is with 2's complement, as shown by several decades' worth of computer design experience. The actual hardware for plus, minus, times, divide, and square root is then pretty much the same for floats and posits except for the final rounding.

The rounding for floats is made more complicated since four rounding modes must be supported. The rounding for posits is made more complicated by the need to know the regime length before rounding. It's not clear which one is more expensive, especially if the regime length is kept in decoded form by the functional units.

The functional units are slightly larger for posits because they have to support greater accuracy. For instance, a 32-bit posit needs a multiplier that can produce a 56-bit product versus only a 48-bit product for floats. If we estimate that multiplier area grows as the square of the product size, that makes the posit multiplier 36% larger than the float multiplier. For functional units that grow linearly in the number of bits, the posit hardware is more like 17% larger. However, these are compensated by the simpler exception handling of posits. The 2008 version of the IEEE 754 Standard is 70 pages long, largely because of the complicated exception handling. Furthermore, floats require their own comparison instructions since they are quite different from the ones for 2's complement integers. Posits don't require any comparison instructions in a system that already supports 2's complement integers.

Let's compare the relative effort to find m and n from each format. Decoding a normal float requires subtracting a bias from the exponent bits, and converting the fraction from sign-magnitude to 2's complement. Decoding a subnormal float requires a Count Leading Zeros (CLZ) instruction that then adjusts both m and n; since a floating-point pipeline (or SIMD instruction) requires data-independent timing, the delay and chip area must include that worst-case cost of decoding a subnormal. (If you instead declare subnormals to be "rare" and handle them with software, then you still have to include the chip area for running that software and also state the delay for those cases.) Posits always have a CLZ operation for the regime decoding except for the two exception values, 0 and NaR. Finding n then requires a shift and an add; finding m is a sign-extended (arithmetic) shift since the fraction is already in 2's complement. So there seems to be very little difference between the float and posit decoding costs… certainly less than a factor of two.
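
To make the float side of that comparison concrete, here is a C++20 sketch of the decode just described, including the CLZ renormalization for subnormals (illustrative names; real hardware does this combinationally):

    #include <bit>       // std::countl_zero
    #include <cstdint>

    struct DecodedFloat {
        bool nan, inf, zero, sign;
        int  n;          // value = (-1)^sign * sig * 2^n
        uint32_t sig;    // significand, hidden bit at bit 23 once normalized
    };

    DecodedFloat decode_binary32(uint32_t bits) {
        DecodedFloat d{};
        d.sign = bits >> 31;
        unsigned expo = (bits >> 23) & 0xFF;
        uint32_t frac = bits & 0x7FFFFF;
        if (expo == 0xFF) {                        // AND-reduce: exponent all 1s
            if (frac) d.nan = true; else d.inf = true;
            return d;
        }
        if (expo == 0) {                           // OR-reduce: exponent all 0s
            if (frac == 0) { d.zero = true; return d; }
            // Subnormal: CLZ + shift to renormalize; a data-independent
            // pipeline must budget for this on every operand.
            int lz = std::countl_zero(frac) - 8;   // zeros above bit 23
            d.sig = frac << lz;
            d.n   = (1 - 127 - 23) - lz;
        } else {
            d.sig = frac | 0x800000;               // normal: prepend hidden bit
            d.n   = int(expo) - 127 - 23;
        }
        return d;
    }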

Instead of giving floats every shortcut possible (leaving out much of the IEEE 754 requirements) and assuming the hardware maintains m and n in decoded form, but then burdening posits with decoding and encoding everything for every operation, we need an apples-to-apples comparison. The choices to do otherwise in your paper are all in the direction of making floats look good and posits look bad, and if I were a referee for the paper, I would certainly say it needs major revision before it can be published. At the very least, please edit the paper to remove claims that your circuits support the IEEE 754 Standard, because they do not.

John G.





luc.fo...@gmail.com

Sep 27, 2019, 7:39:03 AM
to Unum Computing
Dear John,

I think it is worth pointing out a few misconceptions you seem to have about the floating-point number decoding process.

John L. Gustafson wrote:

> Let's compare the relative effort to find m and n from each format. Decoding a normal float requires subtracting a bias from the exponent bits, and converting the fraction from sign-magnitude to 2's complement.

We do not need the sign-magnitude conversion, and it is not always necessary to subtract the bias from the exponent.
For instance, when performing an addition you just need to compute the difference between the two exponents for proper operand alignment.
In the product case, we only need to subtract the bias from the exponent sum (which is also done in the posit case).
 
> Decoding a subnormal float requires a Count Leading Zeros (CLZ) instruction that then adjusts both m and n;

There is a more efficient way of doing so: let us call 'isNormal' the result of a wide OR on the exponent bits. Then your biased exponent is E (read directly from the encoding) with its last bit OR-ed with (not isNormal).
Your significand is isNormal.F, with F also read directly from the encoding. (It is not necessary to normalise the input operands in the case of addition and product; you can see how it is handled in our adder, for instance.)
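
In software terms, the trick looks like this; a minimal C++ sketch with illustrative names (in hardware, isNormal is just one wide OR gate):

    #include <cstdint>

    struct Unpacked { uint32_t biasedExp; uint32_t sig; };

    // Instead of CLZ-renormalizing, treat a subnormal as having biased
    // exponent 1 and hidden bit 0; downstream alignment absorbs the rest.
    Unpacked unpack_binary32(uint32_t bits) {
        uint32_t E = (bits >> 23) & 0xFF;
        uint32_t F = bits & 0x7FFFFF;
        uint32_t isNormal = (E != 0);        // wide OR over the exponent bits
        return { E | (isNormal ^ 1u),        // a subnormal's exponent reads as 1
                 (isNormal << 23) | F };     // significand = isNormal.F
    }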
 
> since a floating-point pipeline (or SIMD instruction) requires data-independent timing, the delay and chip area must include that worst-case cost of decoding a subnormal. (If you instead declare subnormals to be "rare" and handle them with software, then you still have to include the chip area for running that software and also state the delay for those cases.)

As shown above, it is not terribly expensive to handle the "decoding" of subnormals. It only adds a little more logic inside the operator.

 
> Posits always have a CLZ operation for the regime decoding except for the two exception values, 0 and NaR. Finding n then requires a shift and an add; finding m is a sign-extended (arithmetic) shift since the fraction is already in 2's complement. So there seems to be very little difference between the float and posit decoding costs… certainly less than a factor of two.

So let us compare: for floats we need a wide OR on WE bits for subnormal handling, plus a wide AND on WE bits and another wide OR on WF bits for infinity and NaN detection.

For posits we need a CLZO (which is a little more expensive than just a CLZ, by the way) plus a shifter on N-2 bits for the "normal" posits, and an N-2-wide OR to detect NaR and zero.

In this regard, the 2x factor in cost and latency is not so surprising.

 

> Instead of giving floats every shortcut possible (leaving out much of the IEEE 754 requirements) and assuming the hardware maintains m and n in decoded form, but then burdening posits with decoding and encoding everything for every operation, we need an apples-to-apples comparison. The choices to do otherwise in your paper are all in the direction of making floats look good and posits look bad, and if I were a referee for the paper, I would certainly say it needs major revision before it can be published. At the very least, please edit the paper to remove claims that your circuits support the IEEE 754 Standard, because they do not.

On the multiplier, it was never written that we support the IEEE 754 standard.
On the adder, the only grief you are expressing concerns the support of multiple rounding directions.
The rounding offset is a function of the rounding mode, the rounding bit, the sticky bit of the result, and the parity bit of the result.
You speak about 4 rounding modes, which can be encoded on 2 bits.
The "final rounding add" bit is then a Boolean function of 5 inputs, which fits in one FPGA LUT (or a small number of gates in ASIC).
In this regard, as it changes virtually nothing in the comparison, the results still allow comparing an IEEE adder vs. a posit adder.

--
Luc Forget

yo.u...@gmail.com

Sep 27, 2019, 7:39:45 AM
to Unum Computing
Dear John, 

As another co-author and a main developer of the papers you are discrediting, allow me to demonstrate how wrong nearly every paragraph of your previous message is.
I am very sorry if my grammar is not perfect; English is not my first language. If you believe this discredits our scientific level, well, that is sad.
 

On Thursday, September 26, 2019 at 7:25:36 AM UTC+2, John L. Gustafson wrote:
> Dear Florent,
>
> The hardware cost of posits vs. floats is one of the most important topics for this forum. Progress is being made on a number of fronts regarding this issue, but the question needs to be answered carefully, rigorously, and dispassionately.


First of all, we do not defend IEEE; we do not care who wins or loses; we only care about science and what is best. So please consider this message written "dispassionately".

 
> I found a great many errors in your paper, not just the 28-versus-27-bits issue. There are numerical, conceptual, and grammatical errors that suggest it was not even edited or spell-checked before submission. And any reviewer surely would have caught the fact that you left out many of the requirements of the IEEE 754 Standard (for multiplication especially), while still labeling your circuits as IEEE.

There is no mistake about the precision in our paper; only Florent made one in a previous message on this forum. We always compare 27 bits for posits and 23 bits for floating point (in both cases without the implicit bit), or 28 and 24 (when including the implicit bit).
Again, sorry for the grammatical errors. Regarding the other types of errors, it appears that the reviewers of our latest paper on posit hardware implementation did not find them, especially as our paper was a candidate for the best paper award at FPL 2019!
As Florent said previously, the operators that we implemented with the IEEE label are IEEE compliant. Regarding the multiplier, the paper states that the implementation is not IEEE compliant, and it is not labeled IEEE. So your sentence is just a lie. Or maybe you don't really read the papers that don't promote posits blindly.
 

> If you handle exceptions with software, that software runs on hardware. You then need to include all of the hardware needed for that software in making a comparison with posits. Which I imagine adds quite a bit to the cost of floats, since a typical x86 now takes about 200 clock cycles to trap and handle subnormals. Actually, we don't have to imagine.

We handle exceptions in hardware, so this remark is beside the point. Again, it's written in the paper.
 
> We can simply look at the complexity of Berkeley's SoftFloat, which perfectly expresses the IEEE 754 Standard using only integer operations, and the complexity of SoftPosit, which perfectly expresses the Draft Posit Standard using only integer operations. SoftPosit has considerably fewer lines of code and is actually slightly faster than SoftFloat. John Hauser has been optimizing SoftFloat for over 20 years, but Cerlane Leong only spent about six months tuning SoftPosit, please note. What we have so far provides rather compelling evidence that a full hardware implementation of IEEE 754 will require more chip area and latency than a full posit hardware implementation.

Well, this one is a funny one. Today, the 27th of September, I checked out both projects. SoftFloat has 23902 lines of code; SoftPosit has 27679 lines. Too bad you speak without knowing what you are talking about.
Plus, comparing the number of lines of code to estimate the hardware cost really deserves to be made fun of.

 

> When you and your co-authors wrote "Posits: The Good, The Bad, and the Ugly," I was similarly struck by the many errors in the paper, not the least of which was that you footnoted that it had been accepted for the CoNGA 2019 conference, and you published the paper with that claim on HAL; in fact, we had not even assigned it to reviewers yet!

And again... this is false. The paper was published as a preprint, with no associated conference. This is how, in France, we deal with copyright. Please stop lying at this point.
 

 
> We were only able to accept the paper after major revision. It also was the only paper I've ever seen written about posit arithmetic that avoided (in its original version) any references to the original paper on posits, or the Draft Standard, or the extended "posit4" paper available on posithub.org.

There were no citations of the draft or the "Beating floating-point…" paper, but they were discussed all along. Sorry we did not put your name on this paper that was to be published in your conference; it was later corrected. I suppose it is a better paper now.

 
> It does not appear that you are following well-established rules for scientific publication.

Oh. As stated before, our latest posit paper was nominated for the best paper award. I don't think committees do that for papers that do not follow the well-established rules.

To the rest of this community, who are actually asking themselves questions and answering them scientifically,

Best,

Yohann Uguen  

John L. Gustafson

Sep 30, 2019, 8:05:03 AM
to yo.u...@gmail.com, Unum Computing, florent.d...@gmail.com, Vassil Dimitrov, Cerlane, luc.fo...@gmail.com
Dear Yohann,

An excerpt from the recent thread about hardware cost of posits vs. floats:

>> When you and your co-authors wrote "Posits: The Good, The Bad, and the Ugly," I was similarly struck by the many errors in the paper, not the least of which was that you footnoted that it had been accepted for the CoNGA 2019 conference, and you published the paper with that claim on HAL; in fact, we had not even assigned it to reviewers yet!

> And again... this is false. The paper was published as a preprint, with no associated conference. This is how, in France, we deal with copyright. Please stop lying at this point.

Since you have called me a liar repeatedly on a public forum, I feel compelled to respond, and to cc the CoNGA organizers who witnessed this last December as well as your co-authors. On December 25, 2018, I wrote this email to Florent de Dinechin and Jean-Michel Muller, copying Cerlane Leong and Vassil Dimitrov (CoNGA 2019 co-organizers):


Dear Florent and Jean-Michel,

I've been meaning to write you both because I'm delighted to see you writing about posits, and very happy that you submitted a paper or two to CoNGA. I think in the panel discussion we plan at the end of the session, the topic should be on floats vs. posits, with you being the main one to advocate for floats, though you will not be the only one. That might be almost as interesting as the debate I had with Kahan! Our Program Chair Vassil Dimitrov, who I think you know, agrees with me that we are honored to have such strong interest and detailed analysis from world-class experts in IEEE 754 such as yourselves. I do not know the third author, Y Uguen, but please feel free to share this email with him or her.

However, I was a little alarmed to get this notice from Google Scholar, minutes ago:

Begin forwarded message:

From: Google Scholar Alerts <scholarale...@google.com>
Subject: Recommended articles
Date: December 25, 2018 at 4:48:46 PM GMT+8

F De Dinechin, L Forget, JM Muller, Y Uguen - 2018
Many properties of the IEEE-754 floating-point number system are taken for granted 
in modern computers and are deeply embedded in compilers and low-level softare 
routines such as elementary functions or BLAS. This article reviews such properties …

I opened the PDF, and it is shown as having been accepted for "CongA 2019, march 2019, Singapore". (We capitalize it as CoNGA, and March should have an initial capital.) Since we only recently assigned reviewers, and certainly haven't completed any reviews or sent you our decision, isn't this a bit premature? I glanced through the paper and saw a number of serious technical errors that really should be addressed before the paper is made public in any way. For instance, it says the Draft Standard for posits uses 2 bits of exponent for 64-bit numbers; the correct number is 3 bits of exponent, and that affects your arguments about relative merits, the size of the exact accumulator, and so on. There are several other conceptual mistakes that I'm sure the reviewers will describe and that I know you will want to correct.

You might want to use an original source for information about how posits work… instead of what others have written about them. Perhaps the best source is https://posithub.org/docs/Posits4.pdf.

Please consider taking this paper down until it can be properly refereed! Your fame in the area of computer arithmetic, like Kahan's, is such that mistakes in a pre-released paper like this will get recirculated and repeated in the community, leading to an unfair assessment of the relative merits of floats and posits. I think everyone will benefit if only a more polished, properly refereed version is eventually posted.

Thanks,
John


Florent quickly responded, to his credit:


Dear all,

Thank you for raising this.
I apologize for this situation.
It is the policy at our institutions that submitted papers developed using public money are published on open archives.
However, the mention of the conference is indeed a mistake of mine due to the end-of-year rush; I shouldn't have kept the conference template. It is very wrong indeed. The web page itself mentions that it is a "preprint or work in progress", but the PDF is wrong.
I sent a new version without the mention of CoNGA.
It may take a few days to be validated.

Then there will be a revision of this preprint taking into account the reviewers' comments (paper accepted or not).

Please accept my apologies.

Florent de Dinechin


Yohann, I would be very happy to partner with you and your colleagues in producing technical publications, even if it's just to copy-edit your papers and check for numerical errors. I don't need to be a co-author or even get mentioned in an Acknowledgments section. I have great respect for the expertise amassed in Lyon and I count on you as allies in figuring out the tradeoffs between floats and posits. There's no need to do the academic equivalent of a drive-by shooting, rushing a paper up on HAL before it has even been finished or reviewed by anyone. I know your policy requires that you publish on open forums all papers generated using public funds (good law!) but I don't believe it requires you to do so while the paper is still in process.

Best,
John

UGUEN Yohann

unread,
Sep 30, 2019, 8:40:31 AM9/30/19
to John L. Gustafson, Unum Computing, florent.d...@gmail.com, Vassil Dimitrov, Cerlane, luc.fo...@gmail.com
Dear John, 

It was pointed out to me that this was corrected as soon as it was noticed, with of course no intention of pretending that the paper was already accepted.
This was clearly an upload mistake.
However, discrediting our work based on such mistakes remains a bit puzzling.

Regarding the "drive-by shooting" of publishing a preprint on HAL before the reviews are assigned:
that is the point of pre-publication, so that there is an online version before anybody else (in this case the reviewers) can steal the work.

Now, if you have great respect for the expertise amassed in Lyon, stop publicly discrediting our work for silly or false reasons.

Yohann


James Brakefield

unread,
Sep 30, 2019, 3:15:47 PM9/30/19
to Unum Computing
One possible cause of confusion is the INRIA versioning system:
if you download an early version, you may not notice later versions.
The https://hal.inria.fr/hal-02130912 web page now contains links to all the versions of the paper.
Would this or similar links be the best way to reference work in progress?

Jim Brakefield

Cerlane

unread,
Oct 1, 2019, 7:58:06 AM10/1/19
to unum-co...@googlegroups.com
Hi all,

As the main developer of SoftPosit, I thought I should answer the comment from Yohann Uguen.
"Well, this one is a funny one. Today, the 27th of September,  I checked out both projects. SoftFloat has 23902 lines of code; Softposit has 27679 lines. Too bad you speak without knowing what you are talking about. Plus, comparing the number of lines of code to estimate the hardware cost is really to be made fun of. "

When developing SoftPosit, performance was my main concern, followed by maintainability. I made some design decisions that sacrificed maintainability for performance and thereby increased the total lines of code (LOC). For example, the decoding of the regime could have been made into a function, since it is used repeatedly, but I hesitated to do so because a function call would reduce performance. Consequently, when I thought of a smarter way to decode the regime, I had to update the code in multiple places. Please note that SoftPosit is still very much a work in progress and contains the many research ideas we had to make things more practical. Consequently, it cannot be compared to SoftFloat, an almost complete library with 20 years of background work behind it.

There is also non-Posit-Standard code in SoftPosit that should not be counted if you want a fair comparison. John mentioned it (the c_convertDecToPosit**.c files). These are files to convert a double to a posit's binary representation; I assume SoftFloat has no code doing anything similar. I also created additional functions, e.g. a round function that works much like the round function of the math library, to improve the functionality of the Python version of SoftPosit. In general, any C files with the c_ prefix have no equivalent in SoftFloat.

While I know there are still quite a few people who like to count LOC, personally I am against it. I could have reduced the number of lines by coding in a dirtier, uglier way; SoftFloat could likewise do the same. We avoided that for the sake of maintainability.

Similarly, I would not recommend comparing SoftPosit and SoftFloat based on LOC, nor judging the potential of a hardware design from software LOC. As SoftPosit is still a work in progress, I am sure there are many tricks one can still apply, both in software and in hardware, to improve performance; those should be fun to discover in the years to come. The current regime-decode algorithm in SoftPosit performs more than 10x faster than my first version. It was inspired by the many discussions (including disagreements) I had with John.

I look forward to the many great discussions in this Google Group. While our ideas might differ, these varying views help open our minds (at least mine). I am sure these discussions will be even more constructive and inspirational when conducted in a respectful manner independent of viewpoints. 

Cheers,
Cerlane
P.S. Apologies for the lack of progress in SoftPosit. With my new job/country, I am still struggling to settle down.

Benjamin Pedersen

unread,
Nov 24, 2019, 7:38:46 AM11/24/19
to Unum Computing
Hi Cerlane

On Tuesday, 1 October 2019 at 13:58:06 UTC+2, Cerlane wrote:
For example, the decoding of regime, could have been made into a function since it would be used repeatedly but I hesitated to do so since a function call would reduce performance. Consequently when I thought of a smarter way to decode regime, I had to update the code in multiple places.

Can't you inline such a function or write it as a macro?

James Brakefield

unread,
Nov 24, 2019, 11:02:56 PM11/24/19
to Unum Computing
If I read this right:

In VHDL and Verilog, functions are hardware generators: each function call instantiates a new copy of the hardware.

Jim Brakefield

John Gustafson

unread,
Nov 25, 2019, 10:58:23 PM11/25/19
to Benjamin Pedersen, Unum Computing, Cerlane
I'm not sure Cerlane is reading messages to unum-computing, and I worked closely with her on the regime decoding, so perhaps I can address this question. (And to James Brakefield: we are talking about SoftPosit, not HardPosit, so please don't think of VHDL or Verilog implications of these choices.) I'll cc her to get her attention, in case she wants to add comments.

There may be a good way to write the regime decoding as a macro and thereby improve the software engineering of SoftPosit. However, the experiments we did showed that general Count Leading Zeros (CLZ) tricks (and the equivalent for runs of 1 bits) slow down SoftPosit decoding on average. The CLZ instruction is supported on all major processors and can take as little as a single clock cycle (on AMD processors, anyway), but it is difficult to reach from C in a portable way. Cerlane and I gnashed our teeth a lot over this, because it would make SoftPosit both faster and easier to read if we could simply access the CLZ instruction in processors. C is great for accessing bitwise AND, OR, XOR, and shifts, but not CLZ.

The "while" loop approach surprised us with its speed, since at first glance it looks like it could require as many as 30 cycles when decoding the regime of a 32-bit posit. Then we realized: For half of all posits, the "while" tests only once and falls through to the next instruction! That's because half of all posits have either 01 or 10 as their regime. For those that don't, half of those have 001 or 110 as their regime. And so on. So the average regime length is a run of two bits! (Averaging over all possible posits with equal weighting). Thus, the software makes the common cases fast at the expense of a rather slow decoding for the rare cases where the regime is long. A hardware design for posits would use a constant-time CLZ so that worst-case latency is minimized.

I hope to find people to code up a HardPosit equivalent to Berkeley HardFloat, which was also done by John Hauser and is exceptionally concise (written in Chisel) and well-optimized. This will allow us to combine the CLZ-type regime decoding with left alignment of the fraction. We may also find that hardware is fast at adding two regimes (same run bit: catenate the runs; opposite run bit: annihilate bits until the longer regime remains), so there is no need to decode the regime-exponent field into a signed integer for things like multiplication of posits. The 2's complement aspect of posits can be exploited just as it simplifies integer operations; hardware designs to date have tended first to transform the posit into the sign-magnitude bit fields of familiar floating-point, operate on those, and transform back to posit form, but that seems to be the hard way to do things.
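
One example of that 2's complement payoff, as a sketch (the wrapper name is mine, not from any library): comparing two posits needs no decoding at all, because posits are ordered exactly like their raw bit patterns read as signed integers.

#include <stdbool.h>
#include <stdint.h>

/* Posit "less than" is signed-integer "less than" on the raw bits.
   NaR (0x80000000 = INT32_MIN) falls below every real-valued posit
   under this ordering. */
static inline bool posit32_lt(uint32_t a, uint32_t b)
{
    return (int32_t)a < (int32_t)b;
}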

John


Luc Forget

unread,
Nov 26, 2019, 2:52:06 AM11/26/19
to unum-co...@googlegroups.com

Le 26/11/2019 à 04:58, John Gustafson a écrit :
> The "while" loop approach surprised us with its speed, since at first
> glance it looks like it could require as many as 30 cycles when
> decoding the regime of a 32-bit posit. Then we realized: For half of
> all posits, the "while" tests only once and falls through to the next
> instruction! That's because half of all posits have either 01 or 10 as
> their regime. For those that don't, half of /those/ have 001 or 110 as
> their regime. And so on. So the /average/ regime length is a run of
> *two* bits! (Averaging over all possible posits with equal weighting).
> Thus, the software makes the common cases fast at the expense of a
> rather slow decoding for the rare cases where the regime is long. A
> hardware design for posits would use a constant-time CLZ so that
> worst-case latency is minimized.

It's true that half of all posits have a regime run of only one bit, but this reasoning is not good for predicting real code execution time: it only holds for values in [-16, -0.0625] ∪ [0.0625, 16] for standard 32-bit posits (with es = 2, useed = 2^(2^es) = 16, a one-bit regime run means k ∈ {-1, 0}, i.e. magnitudes in [1/16, 16)). The library user is likely to have values that go outside this range (if not, maybe fixed point would be a better fit than posits!).

Why not use the kind of algorithm described here?
https://stackoverflow.com/a/23862121

The idea is to first get a canonical representation of the value by setting to one all the bits below the leading one bit, and then map this canonical representation to the actual count.

You can adapt it to count ones by checking the first bit and inverting the value if needed. It has the advantage of an execution time that is almost independent of the data (a special case is needed for 0 and NaR).
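
A minimal sketch of that idea in C, assuming a popcount is available (here the GCC/Clang builtin, chosen purely for brevity):

#include <stdint.h>

static inline int clz32_smear(uint32_t x)
{
    /* Smear the leading one into every lower bit position; after
       these five steps, all bits below the leading one are set. */
    x |= x >> 1;
    x |= x >> 2;
    x |= x >> 4;
    x |= x >> 8;
    x |= x >> 16;
    /* 32 minus the number of set bits is the leading-zero count;
       x == 0 correctly yields 32. */
    return 32 - __builtin_popcount(x);
}

To count leading ones instead (a posit regime can be a run of ones), invert first: clz32_smear(~x).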


> I hope to find people to code up a HardPosit equivalent to Berkeley
> HardFloat, which was also done by John Hauser and is exceptionally
> concise (written in Chisel) and well-optimized. This will allow us to
> combine the CLZ-type regime decoding with left alignment of the fraction.

Just for the record, combining CLZ and shift is what we do in our
architecture.

> We may also find that hardware is fast at adding two regimes (same run
> bit: catenate the runs; opposite run bit: annihilate bits until the
> longer regime remains) so there is no need to decode the
> regime-exponent field into a signed integer for things like
> multiplication of posits.

There was a discussion earlier in this thread highlighting the fact that it would be a better design choice to have the posit decoding hardware be part of the memory path and to keep the decoded representation inside the processor.

As far as I understand it, this is not compatible with your suggestion.

Besides, what I understood from this suggestion is to first shift one of the operands right or left according to the regime bits of the other, before computing the regime of this new value (and normalizing it again).

So the algorithm would be: at a given step, shift right and fill with the run bit r if the two current regime bits are the same, and shift left if not.

Stop when one of the input operands' regimes has been "completely seen".

Maybe I misunderstood you, but if the idea is the above, it seems quite complicated: it adds two shifters (one for left shifts, one for right shifts) with a maximum shift size of N-2 in each direction.

All this only to save the addition of the range bits, which for 64-bit posits is a signed 7-bit addition...

And it is specialised for multiplication, as we cannot use such a trick for the other operations.

So I am not sure this kind of approach is a good idea.


> The 2's complement aspect of posits can be exploited for posits just
> as it simplifies integer operations; hardware designs to date have
> tended to first transform the posit into the sign-magnitude bit fields
> of familiar floating-point, operate on those, and transform back to
> posit form, but that seems to be the hard way to do things.

Our implementation uses a 2's complement internal representation to speed up decoding and encoding.
That, plus some optimisations in the arithmetic operators, is why the cost and latency are better than in previous implementations; but even so, the decoding and encoding operations remain quite costly.

--

Luc Forget


Theodore Omtzigt

unread,
Nov 26, 2019, 8:59:07 AM11/26/19
to Unum Computing
"I hope to find people to code up a HardPosit equivalent to Berkeley HardFloat"
That work is already underway at:

but for a different reason: it is to enable a posit-vector machine using the RISC-V ISA.

We are trying to combine the innovations that Luc's team demonstrated in their posit implementation with the work that we did on the posit data-path generator.

marc.b....@gmail.com

unread,
Nov 27, 2019, 1:32:19 PM11/27/19
to Unum Computing
FWIW: the short version is that all major compilers have intrinsics for leading/trailing-zero counting, and these are properly emitted as single opcodes.
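
For example, a sketch of a wrapper (clz32 is an illustrative name; the intrinsics are the actual compiler-specific spellings):

#include <stdint.h>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

static inline int clz32(uint32_t x)     /* caller guarantees x != 0 */
{
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_clz(x);            /* emitted as LZCNT/BSR on x86, CLZ on ARM */
#elif defined(_MSC_VER)
    unsigned long i;
    _BitScanReverse(&i, x);             /* index of the highest set bit */
    return 31 - (int)i;
#else
    int n = 0;                          /* generic fallback */
    while (!(x & 0x80000000u)) { n++; x <<= 1; }
    return n;
#endif
}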

Art Scott

unread,
Oct 30, 2020, 5:20:52 PM10/30/20
to Unum Computing
Aloha all.

What is the status of these efforts?
Ongoing?
Active?
Good news?

Is it possible? Is there an OPEN version that might not need a lot of work to run through OpenLane,
e.g. for the Google-sponsored FOSSi/efabless free SkyWater 130 nm shuttles (5 shuttles, 2020-2021)?
https://invite.skywater.tools/

Thanks