
The value of floating-point exceptions?


Marcus

Jul 21, 2021, 9:44:44 AM
Hi!

I would like to ask a couple of simple questions in this group, related
to the IEEE 754 standard and floating-point exceptions.

(Some of you may know that I've been pondering this subject before in
other forums).

As we all know, support for floating-point exceptions comes with a cost
(sometimes significant, since it dictates many aspects of a HW
implementation for instance). If we imagine an alternative universe in
which the dominant floating-point standard did not require exceptions,
I suspect that most hardware (and language) implementations would most
likely be simpler, more energy efficient and more suitable for parallel
and pipelined floating-point operations.

My questions are:

1) What (in your opinion) are the benefits of floating-point exceptions?

2) In what situations have you had use for floating-point exceptions?


/Marcus

Marcus

Jul 21, 2021, 9:46:47 AM
On 2021-07-21 15:44, Marcus wrote:
> Hi!
>
> I would like to ask a couple of simple questions in this group, related
> to the IEEE 754 standard and floating-point exceptions.
>
> (Some of you may know that I've been pondering this subject before in
> other forums).
>
> As we all know, support for floating-point exceptions comes with a cost
> (sometimes significant, since it dictates many aspects of a HW
> implementation for instance). If we imagine an alternative universe in
> which the dominant floating-point standard did not require exceptions,
> I suspect that most hardware (and language) implementations would most
> likely be simpler, more energy efficient and more suitable for parallel
> and pipelined floating-point operations.
>
> My questions are:

To answer my own questions... In my fields, which over the past 2-3
decades include 3D graphics, ray tracing, 3D polygonal mesh manipulation
algorithms, compression algorithms for triangle meshes, audio DSP, audio
synthesis, audio compression, space/satellite signal processing, image
processing and compression algorithms, AI / neural networks, image-based
eye- and head-tracking, least squares solvers, and so on, etc, I have to
say:

>
> 1) What (in your opinion) are the benefits of floating-point exceptions?
>

None

> 2) In what situations have you had use for floating-point exceptions?
>

None

John Dallman

Jul 21, 2021, 9:57:49 AM
In article <sd98c9$cln$1...@dont-email.me>, m.de...@this.bitsnbites.eu
(Marcus) wrote:

> 1) What (in your opinion) are the benefits of floating-point
> exceptions?

They allow detection of problems closer to the site of the bad code, or
when reading the bad data. The mathematical modeller I work on has many
iterative floating-point algorithms, and predicting what they will do for
large sets of input data is beyond human capabilities.

> 2) In what situations have you had use for floating-point
> exceptions?

Finding bugs of our own, compiler bugs and bad data in bug reports.

The lack of floating-point exceptions in actual implementations of ARM64
is one of my biggest worries about its increasing popularity.

John

Marcus

Jul 21, 2021, 10:17:24 AM
Oh, interesting. It seems that Microsoft has recognized this. Quoting
their Windows ARM64 ABI conventions [1]:

"For processor variants that do have hardware floating-point exceptions,
the Windows kernel silently catches the exceptions and implicitly
disables them in the FPCR register. This trap ensures normalized
behavior across processor variants. Otherwise, code developed on a
platform without exception support may find itself taking unexpected
exceptions when running on a platform with support."

...which of course is the natural solution. In our software we do the
same thing (even disabling subnormals) in order to normalize on the
least common denominator, so to say (as soon as one target platform
lacks a feature, it needs to be disabled on _all_ platforms).

/Marcus

[1]
https://docs.microsoft.com/en-us/cpp/build/arm64-windows-abi-conventions?view=msvc-160#floating-point-exceptions

Terje Mathisen

Jul 21, 2021, 10:25:38 AM
> 1) What (in your opinion) are the benefits of floating-point exceptions?

Mostly theoretical.

>
> 2) In what situations have you had use for floating-point exceptions?

Never.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Marcus

Jul 21, 2021, 10:32:55 AM
On 2021-07-21 15:56, John Dallman wrote:
> In article <sd98c9$cln$1...@dont-email.me>, m.de...@this.bitsnbites.eu
> (Marcus) wrote:
>
>> 1) What (in your opinion) are the benefits of floating-point
>> exceptions?
>
> They allow detection of problems closer to the site of the bad code, or
> when reading the bad data. The mathematical modeller I work on has many
> iterative floating-point algorithms, and predicting what they will do for
> large sets of input data is beyond human capabilities.
>
>> 2) In what situations have you had use for floating-point
>> exceptions?
>
> Finding bugs of our own, compiler bugs and bad data in bug reports.

For these use cases, would it not suffice with a compiler/software
solution similar to the sanitizers that are available in LLVM and GCC
for instance? E.g. see:

* https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html
* https://dl.acm.org/doi/epdf/10.1145/3446804.3446848

I regularly enable sanitizers during test & development, and for most
purposes they tend to have an acceptable performance overhead.

John Dallman

Jul 21, 2021, 10:35:40 AM
In article <sd9a9h$ro6$1...@dont-email.me>, m.de...@this.bitsnbites.eu
(Marcus) wrote:

> ...which of course is the natural solution.

It's the natural solution if you assume nobody will ever actually want
floating-point exceptions. A far more helpful solution would be what they
do on x86, which is to disable the exceptions by default and to provide a
CRT call to turn on the ones that you want.

> In our software we do the same thing (even disabling subnormals)
> in order to normalize on the least common denominator, so to say
> (as soon as one target platform lacks a feature, it needs to be
> disabled on _all_ platforms).

We're willing to try to make use of useful features of particular
platforms.

In our test harness, we turn on exceptions for overflow, invalid and
divide-by-zero, where we can. We leave denormal, inexact and underflow
disabled, since we want behaviour as much as possible like natural
numbers. We turn on flushing of denormals to zero where it's available.

The customers can use whatever floating-point handling they want, but we
warn them that exceptions on denormal, inexact or underflow will cause
problems.
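A software analogue of that harness policy can be sketched with numpy's error-state machinery. This is only an analogy to the FPU control word, not how a C runtime actually enables traps (that would presumably go through platform calls such as MSVC's `_controlfp_s` or glibc's `feenableexcept`); note that numpy has no inexact category at all, which happens to match leaving inexact disabled:

```python
import numpy as np

def run_checked(fn):
    """Run fn under a trap-like policy: overflow, invalid and
    divide-by-zero raise; underflow stays silent."""
    with np.errstate(over='raise', invalid='raise',
                     divide='raise', under='ignore'):
        return fn()

# Underflow passes silently, as in the harness policy...
tiny = run_checked(lambda: np.array([1e-308]) * np.array([1e-10]))[0]

# ...but an invalid operation (0/0) is caught at the offending site.
try:
    run_checked(lambda: np.array([0.0]) / np.array([0.0]))
    caught = False
except FloatingPointError:
    caught = True
```

The point of the sketch is the asymmetry: the noisy-but-harmless conditions stay quiet while the genuinely suspicious ones stop execution where they happen.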


John

Ivan Godard

Jul 21, 2021, 11:44:22 AM
It's not that the semantics of exceptions are useless, it's that the
IEEE standard's approach to them is mis-designed for modern
architectures and hence IEEE exceptions as a programming device are
effectively useless.

The definition of flags and modes in the standard makes them implicit
arguments to every instruction - which inherently linearizes the FP
instruction stream, while modern chips want to do a lot of FPs
concurrently. During my active time on the committee Kahan was still
plumping to use exceptions to support trap-value-replacement. You could
do that in Mathlab, but not in hardware. That linearization not only
breaks multi-issue and SIMD, it also precludes speculative execution of
FP ops in general - no computation overruns in loops for example.

Once you abandon value-replacement, exceptions become what the word
"exception" means in other contexts: a non-recoverable reporting
mechanism that things went off the rails somewhere. As such, as in any
debugging assist, the design as much as possible should serve to narrow
the scope of "somewhere" while not hindering computations that remain on
the rails.

In the Mill we do that by eliminating the global flags and modes and
moving the event reporting into metadata attached to the result of the
failing computation. You can speculatively or concurrently do FP ops on
a Mill, and when the erroneous result finally must be used
non-speculatively (if it ever is), and you have not asked to ignore
bogus results, then the resulting fault can report fairly well where
things went bad. That really helps, not because of some numerical nicety
but because real programmers spend their lives chasing bugs.
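The scheme described here can be caricatured in a few lines: a result carries either a payload or a "NaR"-like error tag, arithmetic propagates the tag without trapping (so speculation stays safe), and only a non-speculative use faults, reporting the operation that originally went bad. A toy model only; nothing here resembles the Mill's actual encoding, and all names are invented:

```python
class Val:
    """A result plus metadata: a number, or a record of the
    operation that went off the rails (a 'NaR'-like tag)."""
    def __init__(self, num=None, err=None):
        self.num, self.err = num, err

def fdiv(a, b):
    # Speculation-safe divide: an error becomes metadata, not a trap,
    # and an already-tagged operand just flows through.
    if a.err or b.err:
        return Val(err=a.err or b.err)
    if b.num == 0.0:
        return Val(err="divide-by-zero in fdiv")
    return Val(num=a.num / b.num)

def use(v):
    # Non-speculative consumption: only here does the fault fire,
    # carrying the report of where things first went wrong.
    if v.err:
        raise ArithmeticError(v.err)
    return v.num

x = fdiv(Val(1.0), Val(0.0))   # no trap: x is merely tagged
y = fdiv(x, Val(2.0))          # tag propagates through later ops
```

Speculatively executing `fdiv` twice cost nothing here; the report surfaces only if `use(y)` is ever reached on the non-speculative path.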

So I use exceptions everywhere that an exception would indicate a bug.
However, even in debugging I don't use inexact; I have never used an
algorithm where I cared that rounding had happened, and in practice
inexact always gets set and so is just nuisance noise. Inexact was
intended to be a support for using FP for fixed-point calculations, but
if I had a fixed-point problem I would use DFP.

Ivan

John Dallman

Jul 21, 2021, 11:48:52 AM
In article <sd9b6k$2q4$1...@dont-email.me>, m.de...@this.bitsnbites.eu
(Marcus) wrote:

> > Finding bugs of our own, compiler bugs and bad data in bug
> > reports.
>
> For these use cases, would it not suffice with a compiler/software
> solution similar to the sanitizers that are available in LLVM and
> GCC for instance? E.g. see:
>
> * https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html
> * https://dl.acm.org/doi/epdf/10.1145/3446804.3446848

Possibly. An advantage of floating-point exceptions is that they work
when testing the *same* binary that will be delivered to the customer.

We've been using floating-point traps with this product since the 1980s,
and Valgrind for over a decade, and it is time to try out compiler
sanitisers. When I get a chance...

The UndefinedBehaviour sanitiser looks most useful for:

-fsanitize=alignment, since the platforms that were fussy about that are
all dead or nearly so.

-fsanitize=float-divide-by-zero and -fsanitize=integer-divide-by-zero are
a partial substitute for floating-point traps.

These also look worth a try: -fsanitize=builtin, -fsanitize=bounds,
-fsanitize=null, -fsanitize=pointer-overflow,
-fsanitize=signed-integer-overflow, and
-fsanitize=unsigned-integer-overflow.

Less useful ones are:

-fsanitize=float-cast-overflow, because to a very close approximation, we
don't convert floats or doubles to ints.


The NSan sanitiser will catch overflows, but it's not clear to me that it
will catch invalid operands. It's also going to be pretty slow, because
everything is doubles, forcing the shadow values to be 128-bit and thus
partially done in software.

> I regularly enable sanitizers during test & development, and for
> most purposes they tend to have an acceptable performance overhead.

I'm guessing you don't have a couple of hundred machines and VMs
operating distributed testing systems to get the overnight build tested
in time for a ship/don't-ship decision the following afternoon?

Sanitiser overhead is going to demand separate builds that aren't
required to complete their testing in time for that ship/no-ship decision,
but may still be useful.

Thanks!

John

BGB

Jul 21, 2021, 11:49:57 AM
FWIW:
This was another feature I didn't bother with in BJX2;
If you want exceptions, they would need to be implemented in the
compiler as a "trap if you see a NaN here" instruction sequence.
But, this is unlikely, given their relative level of usefulness.


The compiler can go the other direction though, as I sit around
debugging some other arguably not-very-useful features:
  __int128
    Some arithmetic bugs/... showed up when "actually using it".
  _BitInt(n), special where n>128, ...
    eg, "_BitInt(768) ib0;", creates a 768-bit integer value...
    Decays into a basic integer type if n<=128.
    Where 'n' is required to be greater than zero.
  ...

But, at least stuff like this can be sometimes useful...
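For what it's worth, the unsigned wrap-around semantics C23 specifies for `_BitInt(N)` can be emulated with Python's arbitrary-precision integers; a minimal sketch (the helper name is invented):

```python
BITS = 768
MASK = (1 << BITS) - 1   # 2**768 - 1

def u768(x):
    """Reduce x to the unsigned _BitInt(768) range, i.e. wrap
    modulo 2**768, as C23 specifies for unsigned types."""
    return x & MASK

# 3 * 2**767 does not fit in 768 bits, so it wraps to 2**767,
# just as the compiler-supported type would.
wrapped = u768(3 * (1 << 767))
```

A model like this is handy as a reference oracle when debugging the compiler's wide-integer arithmetic.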

EricP

Jul 21, 2021, 1:19:25 PM
Why do you believe the HW cost of exception support
is sometimes significant? It looks to me that as long as one
triggers these at Retire/Write-Back, precise exceptions are dirt cheap.

There is an initial cost for the first exception,
and an incremental cost for each one after that.

Consider a DSP with no exceptions, no MMU so no access faults,
no privileged mode, no integer divide-by-zero etc. Nothing.

It seems to me that the initial cost of adding a precise exception to
that is 1 bit in the uOp along with the result. When the uOp reaches
write-back, WB sees the exception flag and inhibits the register write,
and diddles a wire that flushes the pipeline the same as an
indirect branch, and jams an unconditional jump IP into the fetch unit,
which jams an "I'm an exception" uOp into the pipeline,
and starts fetching from the new address.

In short, most of the logic is already present to handle
indirect branches so the initial cost is a few gates and a long wire.

The additional cost of FP exceptions also seems small.
The FP status flags have to travel with the FP result too,
and write-back merges them into the committed status flags.
Beyond that, it is a matter of carrying the exception mask
bits with the uOp, and an AND-OR on those bits in Retire/WB.
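That mechanism is small enough to model directly: one extra bit travels with each uOp, and write-back either commits the result or suppresses the write and redirects fetch, reusing the branch-flush path. A toy sketch; the structure names and handler address are invented:

```python
HANDLER_PC = 0xFFFF0000   # invented exception entry point

class UOp:
    def __init__(self, dest, value, exc=False):
        # exc is the single extra bit carried with the result
        self.dest, self.value, self.exc = dest, value, exc

def write_back(uop, regs, state):
    """Commit stage: on the exception bit, inhibit the register
    write and redirect fetch, like a branch-mispredict flush."""
    if uop.exc:
        state["flush"] = True          # same wire a mispredict uses
        state["next_pc"] = HANDLER_PC  # jam the handler address into fetch
    else:
        regs[uop.dest] = uop.value
    return state

regs, state = {}, {"flush": False, "next_pc": None}
write_back(UOp("r1", 3.14), regs, state)            # normal commit
write_back(UOp("r2", None, exc=True), regs, state)  # faulting uOp
```

The faulting uOp never touches architectural state, which is exactly what makes the exception precise.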


Anton Ertl

Jul 21, 2021, 1:34:49 PM
Marcus <m.de...@this.bitsnbites.eu> writes:
>As we all know, support for floating-point exceptions comes with a cost
>(sometimes significant, since it dictates many aspects of a HW
>implementation for instance).

Unfortunately, wrt. IEEE FP "exception" has a different meaning (it's
a condition that, by default, results in setting a sticky flag) than
in computer architecture (a control-flow change). Which one do you
mean?

>If we imagine an alternative universe in
>which the dominant floating-point standard did not require exceptions,
>I suspect that most hardware (and language) implementations would most
>likely be simpler, more energy efficient and more suitable for parallel
>and pipelined floating-point operations.

Detecting IEEE exceptions is part of computation of the result; you
have to handle divide-by-zero, inexact, invalid, overflow, and
underflow conditions when computing the result. Ok, you then need to
propagate the flags such that a later instruction can read it, but
looking at the integer condition codes present in the dominant
architectures, that does not seem to be a big problem.
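The two meanings can be put side by side from Python, purely as a software analogy: the language's own `/` gives the control-flow kind, while numpy's default behaviour is closer to the IEEE sticky-flag kind, delivering the default result (+inf) and merely recording the condition for later inspection:

```python
import warnings
import numpy as np

# Control-flow meaning: the operation never produces a value.
try:
    1.0 / 0.0
    trapped = False
except ZeroDivisionError:
    trapped = True

# Sticky-flag meaning: the operation completes with IEEE's default
# result and the condition is only noted on the side.
with warnings.catch_warnings(record=True) as log:
    warnings.simplefilter("always")
    with np.errstate(divide='warn'):
        r = (np.array([1.0]) / np.array([0.0]))[0]
flagged = any("divide by zero" in str(w.message) for w in log)
```

In the second style the computation keeps going; whether anyone ever looks at the recorded condition is a separate decision.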

The dynamic rounding mode seems to be a bigger problem and has been
discussed here repeatedly.

>My questions are:
>
>1) What (in your opinion) are the benefits of floating-point exceptions?

Numerical experts use them for special purposes. I have seen and
forgotten examples that looked sensible to me.

>2) In what situations have you had use for floating-point exceptions?

None. I hardly use FP.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

MitchAlsup

Jul 21, 2021, 2:27:52 PM
There are 6 people in the world who want FP exceptions with the semantic
anywhere close to what IEEE 754 specifies. It is these same 6 people (and
18 professors) that want rounding modes anywhere close to what was
specified.
<
Something like 99% of all FP applications turn these features off and
live with overflow->infinity, underflow->zero, and I still don't know anyone
other than Kahan and Coonen who wants inexact reported. 100% of
physics programs start with data that is ALREADY inexact........
>
> 2) In what situations have you had use for floating-point exceptions?
<
None:
<
3) Is making exception detection in HW hard at all ?
<
No, absolutely not. Detecting and raising exceptions in HW adds 0.0001%
complexity to the design and construction of HW FP units {In a processor
that has precise memory exceptions and recovers from branch mispredictions.}
<
It is for this reason that all HW provides FP exceptions--the cost is minuscule
at worst and nearly zero at best.
>
>
> /Marcus

MitchAlsup

Jul 21, 2021, 2:34:24 PM
On Wednesday, July 21, 2021 at 9:35:40 AM UTC-5, John Dallman wrote:
> In article <sd9a9h$ro6$1...@dont-email.me>, m.de...@this.bitsnbites.eu
> (Marcus) wrote:
>
> > ...which of course is the natural solution.
> It's the natural solution if you assume nobody will ever actually want
> floating-point exceptions. A far more helpful solution would be what they
> do on x86, which is to disable the exceptions by default and to provide a
> CRT call to turn on the ones that you want.
<
I mentioned above that FP exceptions add essentially zero cost to the FP units.
<
What I did not mention, is that My 66000 allows the thread to manage its own
rounding modes and exception recognition. {It can only damage itself!}
<
> > In our software we do the same thing (even disabling subnormals)
> > in order to normalize on the least common denominator, so to say
> > (as soon as one target platform lacks a feature, it needs to be
> > disabled on _all_ platforms).
<
Once you buy off on FMAC units, denorms add almost zero cost (around 2%
for 64-bit FMACs), so it makes no sense to set/assume denorms to zero.
This is NOT an argument that denorms were the right thing to put into
754 at the beginning. But it's like the steering wheel location in a car--
it's too late to change it now.

MitchAlsup

Jul 21, 2021, 2:44:09 PM
On Wednesday, July 21, 2021 at 10:44:22 AM UTC-5, Ivan Godard wrote:
> It's not that the semantics of exceptions are useless, it's that the
> IEEE standard's approach to them is mis-designed for modern
> architectures and hence IEEE exceptions as a programming device are
> effectively useless.
>
> The definition of flags and modes in the standard makes them implicit
> arguments to every instruction - which inherently linearizes the FP
> instruction stream, while modern chips want to do a lot of FPs
> concurrently.
<
Not really {when done right--which x86 did not do BTW}
<
As long as there is not a reader of the flags, the flags can be accumulated
with an OR gate from the writers in any exception free order.
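That OR-accumulation property is easy to check in miniature: bitwise OR is associative and commutative, so per-instruction flag contributions can merge in any completion order and the architectural flag word comes out identical. A small illustration with invented flag-bit positions:

```python
from functools import reduce
from itertools import permutations

# Invented bit assignments, one per IEEE condition.
INEXACT, UNDERFLOW, OVERFLOW, DIVZERO, INVALID = (1 << i for i in range(5))

# Flag contributions from five instructions completing out of order.
per_op_flags = [INEXACT, 0, OVERFLOW | INEXACT, 0, DIVZERO]

# Accumulate under every possible completion order; all 120 orders
# collapse to a single architectural flag word.
accumulated = {reduce(lambda acc, f: acc | f, order, 0)
               for order in permutations(per_op_flags)}
```

This is why flag writers need no ordering among themselves; only a reader of the flags forces a serialization point.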
<
As long as there is no writer to the mode bits, these can be passed en masse
to the function units. In fact, since these mode bits are necessarily different
between threads on the same core, one has to pass the mode bits with each
instruction issued (at least to the stations).
<
> During my active time on the committee Kahan was still
> plumping to use exceptions to support trap-value-replacement. You could
> do that in Mathlab, but not in hardware. That linearization not only
> breaks multi-issue and SIMD, it also precludes speculative execution of
> FP ops in general - no computation overruns in loops for example.
<
Yes, this was a step too far (way too far, essentially serializing FP)
<
However, I found ways to implement this at low cost--but the first cost to
lower is that of raising an exception, which has to come down from 1,000s of
instructions or cycles to 10s !! The exception control transfer has to be
efficient--within spitting distance of a subroutine call.
>
> Once you abandon value-replacement, exceptions become what the word
> "exception" means in other contexts: a non-recoverable reporting
> mechanism that things went off the rails somewhere. As such, as in any
> debugging assist, the design as much as possible should serve to narrow
> the scope of "somewhere" while not hindering computations that remain on
> the rails.
>
> In the Mill we do that by elimination the global flags and modes and
> moving the event reporting into metadata attached to the result of the
> failing computation. You can speculatively or concurrently do FP ops on
> a Mill, and when the erroneous result finally must be used
> non-speculatively (if it ever is), and you have not asked to ignore
> bogus results, then the resulting fault can report fairly well where
> things went bad. That really helps, not because of some numerical nicety
> but because real programmers spend their lives chasing bugs.
>
> So I use exceptions everywhere that an exception would indicate a bug.
> However, even in debugging I don't use inexact; I have never used an
> algorithm where I cared that rounding had happened, and in practice
> inexact always gets set and so is just nuisance noise. Inexact was
> intended to be a support for using FP for fixed-point calculations, but
> if I had a fixed-point problem I would use DFP.
<
But even when used to perform exact floating point arithmetic, the inexact
flag necessarily gets set !! by 754 mandated rules !! this is completely fubar !
<
At least My 66000 can perform these and not end up setting the inexact flag.
>

MitchAlsup

Jul 21, 2021, 2:48:46 PM
On Wednesday, July 21, 2021 at 12:19:25 PM UTC-5, EricP wrote:
> Marcus wrote:
> > Hi!
> >
> > I would like to ask a couple of simple questions in this group, related
> > to the IEEE 754 standard and floating-point exceptions.
> >
> > (Some of you may know that I've been pondering this subject before in
> > other forums).
> >
> > As we all know, support for floating-point exceptions comes with a cost
> > (sometimes significant, since it dictates many aspects of a HW
> > implementation for instance). If we imagine an alternative universe in
> > which the dominant floating-point standard did not require exceptions,
> > I suspect that most hardware (and language) implementations would most
> > likely be simpler, more energy efficient and more suitable for parallel
> > and pipelined floating-point operations.
> >
> > My questions are:
> >
> > 1) What (in your opinion) are the benefits of floating-point exceptions?
> >
> > 2) In what situations have you had use for floating-point exceptions?
> >
> >
> > /Marcus
> Why do you believe the HW cost of exception support
> is sometimes significant? It looks to me that as long as one
> triggers these at Retire/Write-Back, precise exceptions are dirt cheap.
<
Essentially zero, compared to the cost of denorm support in the 2% range
{gates, area, power}.
>
> There is an initial cost for the first exception,
> and an incremental cost for each one after that.
>
> Consider a DSP with no exceptions, no MMU so no access faults,
> no privileged mode, no integer divide-by-zero etc. Nothing.
>
> It seems to me that the initial cost of adding a precise exception to
> that is 1 bit in the uOp along with the result. When the uOp reaches
> write-back, WB sees the exception flag and inhibits the register write,
> and diddles a wire that flushes the pipeline the same as an
> indirect branch, and jams an unconditional jump IP into the fetch unit,
> which jams an "I'm an exception" uOp into the pipeline,
> and starts fetching from the new address.
>
> In short, most of the logic is already present to handle
> indirect branches so the initial cost is a few gates and a long wire.
<
Once you can flush instructions from the pipe (like a branch does)
the HW overhead in the pipe is essentially zero. The HW overhead
in the FU is essentially zero. There is a standard that mandates them
so why not just take the easy route through this feature space.
<
IEEE 754-2019 compatible :: YES in all respects.

Marcus

Jul 21, 2021, 3:10:07 PM
On 2021-07-21 17:47, John Dallman wrote:
> In article <sd9b6k$2q4$1...@dont-email.me>, m.de...@this.bitsnbites.eu
> (Marcus) wrote:
>
>>> Finding bugs of our own, compiler bugs and bad data in bug
>>> reports.
>>
>> For these use cases, would it not suffice with a compiler/software
>> solution similar to the sanitizers that are available in LLVM and
>> GCC for instance? E.g. see:
>>
>> * https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html
>> * https://dl.acm.org/doi/epdf/10.1145/3446804.3446848
>
> Possibly. An advantage of floating-point exceptions is that they work
> when testing the *same* binary that will be delivered to the customer.
>
> We've been using floating-point traps with this product since the 1980s,
> and Valgrind for over a decade, and it is time to try out compiler
> sanitisers. When I get a chance...

I strongly recommend it. I find that sanitizers are very useful and
accurate. It can be quite "hilarious" to run ubsan on old code bases
(lots of "this should never have worked" moments).

>
> The UndefinedBehaviour sanitiser looks most useful for:
>
> -fsanitize=alignment, since the platforms that were fussy about that are
> all dead or nearly so.
>
> -fsanitize=float-divide-by-zero and -fsanitize=integer-divide-by-zero are
> a partial substitute for floating-point traps.
>
> These also look worth a try: -fsanitize=builtin, -fsanitize=bounds,
> -fsanitize=null, -fsanitize=pointer-overflow,
> -fsanitize=signed-integer-overflow, and
> -fsanitize=unsigned-integer-overflow.
>
> Less useful ones are:
>
> -fsanitize=float-cast-overflow, because to a very close approximation, we
> don't convert floats or doubles to ints.
>
>
> The NSan sanitiser will catch overflows, but it's not clear to me that it
> will catch invalid operands. It's also going to be pretty slow, because
> everything is doubles, forcing the shadow values to be 128-bit and thus
> partially done in software.
>
>> I regularly enable sanitizers during test & development, and for
>> most purposes they tend to have an acceptable performance overhead.
>
> I'm guessing you don't have a couple of hundred machines and VMs
> operating distributed testing systems to get the overnight build tested
> in time for a ship/don't-ship decision the following afternoon?

I think it all depends on how you design your system. We have about 40
machines, each with 16-32 CPU cores (a total of about 2000 vCPUs in
cloud terms), and they build and test about 50 different SW
configurations, in roughly 10-15 minutes (it's been a full time job the
last few years to keep that time at an acceptable level). My philosophy
has always been to test as much as possible /before/ a change goes into
mainline (a.k.a. "stable mainline" or "not rocket science" - don't break
stuff so that developers can't work at full speed).

BTW we recently activated sanitizers for all unit tests running
on those machines. Just buy more hardware ;-) (it's not quite that
simple, but the sentiment is: don't allow HW limitations to get in the
way of SW quality).

Marcus

Jul 21, 2021, 3:41:50 PM
On 2021-07-21 20:27, MitchAlsup wrote:
> On Wednesday, July 21, 2021 at 8:44:44 AM UTC-5, Marcus wrote:
>> Hi!
>>
>> I would like to ask a couple of simple questions in this group, related
>> to the IEEE 754 standard and floating-point exceptions.
>>
>> (Some of you may know that I've been pondering this subject before in
>> other forums).
>>
>> As we all know, support for floating-point exceptions comes with a cost
>> (sometimes significant, since it dictates many aspects of a HW
>> implementation for instance). If we imagine an alternative universe in
>> which the dominant floating-point standard did not require exceptions,
>> I suspect that most hardware (and language) implementations would most
>> likely be simpler, more energy efficient and more suitable for parallel
>> and pipelined floating-point operations.
>>
>> My questions are:
>>
>> 1) What (in your opinion) are the benefits of floating-point exceptions?
> <
> There are 6 people in the world who want FP exceptions with the semantic
> anywhere close to what IEEE 754 specifies. It is these same 6 people (and
> 18 professors) that want rounding modes anywhere close to what was
> specified.
> <
> Something like 99% of all FP applications, turn these features off and
> live with overflow->infinity, underflow->zero, and I still don't know anyone
> other than Kahan and Coonen that want inexact reported. 100% of
> physics programs start with data that is ALREADY inexact........

Thanks, that confirms my feelings.....

>>
>> 2) In what situations have you had use for floating-point exceptions?
> <
> None:
> <
> 3) Is making exception detection in HW hard at all ?
> <
> No, absolutely not. Detecting and raising exceptions in HW adds 0.0001%
> complexity to the design and construction of HW FP units {In a processor
> that has precise memory exceptions and recovers from branch mispredictions.}
> <
> It is for this reason that all HW provides FP exceptions--the cost is minuscule
> at most and nearly zero at best.

I admit that I am just a layman in this matter, but I think that there
are more dimensions to this problem, and the related costs.

For instance, the very notion of floating-point exceptions (be they
exact HW traps, inexact "something went wrong" traps, or sticky flags
that you can poll, or something else) requires that you can make some
use of them - for instance take some corrective action in software,
using means that usually depend on the programming language.

For certain CPU and programming language combinations there are
straightforward ways to deal with this (e.g. HW traps + language
exceptions), but in other environments the answer is not as simple.

For instance, how should an overflow exception be handled in a GPU
shader language? I'm pretty sure that no HW/language combination
supports that scenario today (and thus by definition cannot be
IEEE 754 compliant), and I'm also pretty sure that the cost of adding
that support would be non-zero - not only for the GPU ALU pipeline, but
for the entire HW/driver/SW stack.

What's more - such support would most likely /get in the way/ for
programmers since even fewer than the 6 people that you mentioned
earlier would even be remotely interested in such functionality.

So the rebellion in me wants to say: "Ask not how hard it is to add
floating-point exceptions to a CPU pipeline - ask what the point is
of doing so."

/Marcus

Marcus

Jul 21, 2021, 3:57:23 PM
On 2021-07-21 18:58, Anton Ertl wrote:
> Marcus <m.de...@this.bitsnbites.eu> writes:
>> As we all know, support for floating-point exceptions comes with a cost
>> (sometimes significant, since it dictates many aspects of a HW
>> implementation for instance).
>
> Unfortunately, wrt. IEEE FP "exception" has a different meaning (it's
> a condition that, by default, results in setting a sticky flag) than
> in computer architecture (a control-flow change). Which one do you
> mean?

It does not matter for answering the questions. I am aware of the
different possibilities and interpretations. For instance, I think that
some TI DSPs use sticky flags rather than HW traps.

That said I see problems with both implementations (traps vs flags), if
nothing else from a programmer's point of view.

>
>> If we imagine an alternative universe in
>> which the dominant floating-point standard did not require exceptions,
>> I suspect that most hardware (and language) implementations would most
>> likely be simpler, more energy efficient and more suitable for parallel
>> and pipelined floating-point operations.
>
> Detecting IEEE exceptions is part of computation of the result; you
> have to handle divide-by-zero, inexact, invalid, overflow, and
> underflow conditions when computing the result. Ok, you then need to
> propagate the flags such that a later instruction can read it, but
> looking at the integer condition codes present in the dominant
> architectures, that does not seem to be a big problem.
>
> The dynamic rounding mode seems to be a bigger problem and has been
> discussed here repeatedly.

I agree - that's the second grudge that I have with the IEEE 754
standard. As a programmer I have gotten nothing but problems from the
dynamic rounding modes. As a hardware developer I have just ignored the
problem (I only implement a single rounding mode).
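Python's decimal module makes a handy analogy for why dynamic rounding modes bite programmers: the context's rounding mode is ambient, thread-local state, so a seemingly pure function returns different results depending on what some earlier code left behind:

```python
from decimal import Decimal, getcontext, ROUND_FLOOR, ROUND_CEILING

def to_whole(x):
    # Looks pure, but silently depends on the ambient rounding mode.
    return x.quantize(Decimal("1"))

getcontext().rounding = ROUND_FLOOR
lo = to_whole(Decimal("2.5"))      # rounds down under this mode

getcontext().rounding = ROUND_CEILING
hi = to_whole(Decimal("2.5"))      # same call, different answer
```

The same hazard exists with the IEEE dynamic rounding mode: any library call that forgets to save and restore the mode changes the results of unrelated code.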

>
>> My questions are:
>>
>> 1) What (in your opinion) are the benefits of floating-point exceptions?
>
> Numerical experts use them for special purposes. I have seen and
> forgotten examples that looked sensible to me.
>
>> 2) In what situations have you had use for floating-point exceptions?
>
> None. I hardly use FP.

Same - except I use FP all the time.

>
> - anton
>

MitchAlsup

unread,
Jul 21, 2021, 4:06:19 PM7/21/21
to
When I was doing a GPU, I actually asked the question in front of the majority
of the design team:: "What, exactly, does it mean when 1,357 FP calculations
all overflow in the same clock cycle? What do we want the semantics to be?"
<
Nobody had an answer...........which tells you more about the problem than
about the team......
<
> I'm pretty sure that no HW/language combination
> supports that scenario today (and thus by definition can not be
> IEEE 754 compliant), and I'm also pretty sure that the cost for adding
> that support would be non-zero - not only for the GPU ALU pipeline, but
> for the entire HW/driver/SW stack.
<
In a GPU you can isolate the thread with the exception and run the rest
to completion--then rerun the thread in isolation--you can get away with
this because of the embarrassingly large amounts of parallelism.
<
This, however, fails in GPGPU applications, so it provides no insight
in the large.
>
> What's more - such support would most likely /get in the way/ for
> programmers since even fewer than the 6 people that you mentioned
> earlier would even be remotely interested in such functionality.
>
> So the rebellion in me wants to say: "Ask not how hard it is to add
> floating-point exceptions to a CPU pipeline - ask what the point is
> of doing so."
<
The thing is that memory has exceptions, DIV has exceptions that are
already present, DECODE may have exceptions, Stores may have late
exceptions; and once the pipeline has been configured to deal with
memory and the others, the infrastructure is already present.
<
So I ask:: "Why not" ???
<
It is only when you can get rid of memory exceptions that you can get
rid of the pipeline infrastructure. {On the other hand this is already a
"solved problem" in computer design, so it is not perceived as even
hard, just work.}
>
> /Marcus

Stefan Monnier

Jul 21, 2021, 4:15:48 PM
> For instance, the very notion of floating-point exceptions (be they
> exact HW traps, inexact "something went wrong" traps, or sticky flags
> that you can poll, or something else) requires that you can make some
> use of them - for instance take some corrective action in software,
> using means that usually depend on the programming language.

I always assumed (for no concrete reason) that the main users of those
flags are very special-purpose code such as the implementation of the
`sin` or `atan` functions, or the implementation of 128bit floats
primitives on top of 64bit float hardware ops, ...


Stefan

Michael S

Jul 21, 2021, 4:21:40 PM
I am very certain that exceptions do not help sin or atan.
I am only like 95% certain that exceptions don't help your second case,
but I am not aware of environments that do anything like that in practice.

MitchAlsup

Jul 21, 2021, 4:34:22 PM
Things like ATAN do the following--to avoid exceptions !!
<
double ATAN2( double y, double x )
{ // IEEE 754-2019 quality ATAN2
// deal with NANs
if( ISNAN( x ) ) return x;
if( ISNAN( y ) ) return y;
// deal with infinities
if( x == +∞ && |y|== +∞ ) return copysign( π/4, y );
if( x == +∞ ) return copysign( 0.0, y );
if( x == -∞ && |y|== +∞ ) return copysign( 3π/4, y );
if( x == -∞ ) return copysign( π, y );
if( |y|== +∞ ) return copysign( π/2, y );
// deal with signed zeros
if( x == 0.0 && y != 0.0 ) return copysign( π/2, y );
if( x >=+0.0 && y == 0.0 ) return copysign( 0.0, y );
if( x <=-0.0 && y == 0.0 ) return copysign( π, y );
// calculate ATAN2 high performance style
if( x > 0.0 )
{
if( y < 0.0 && |y| < |x| ) return - π/2 - ATAN( x / y );
if( y < 0.0 && |y| > |x| ) return + ATAN( y / x );
if( y > 0.0 && |y| < |x| ) return + ATAN( y / x );
if( y > 0.0 && |y| > |x| ) return + π/2 - ATAN( x / y );
}
if( x < 0.0 )
{
if( y < 0.0 && |y| < |x| ) return + π/2 + ATAN( x / y );
if( y < 0.0 && |y| > |x| ) return + π - ATAN( y / x );
if( y > 0.0 && |y| < |x| ) return + π - ATAN( y / x );
if( y > 0.0 && |y| > |x| ) return +3π/2 + ATAN( x / y );
}
}
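The same pre-check idea, as a compilable C++ sketch (not the code above verbatim): screen out NaNs, infinities and signed zeros up front so the atan() core never sees an input that would raise invalid or divide-by-zero. This follows the usual (-π, π] atan2 convention, and it omits the |y| vs |x| argument reduction used above for accuracy, to keep the sketch short:

```cpp
#include <cmath>

// Special cases first, so the division and atan() are always safe.
static double my_atan2(double y, double x)
{
    if (std::isnan(x)) return x;
    if (std::isnan(y)) return y;
    if (std::isinf(x) && std::isinf(y))
        return std::copysign(x > 0 ? M_PI_4 : 3.0 * M_PI_4, y);
    if (std::isinf(x))
        return std::copysign(x > 0 ? 0.0 : M_PI, y);
    if (std::isinf(y))
        return std::copysign(M_PI_2, y);
    if (x == 0.0 && y == 0.0)        // signed-zero cases
        return std::signbit(x) ? std::copysign(M_PI, y)
                               : std::copysign(0.0, y);
    if (x == 0.0)
        return std::copysign(M_PI_2, y);
    double a = std::atan(y / x);     // finite, x != 0: safe to divide
    if (x < 0.0)                     // shift into the correct half-plane
        a += std::copysign(M_PI, y);
    return a;
}
```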

>
>
> Stefan

luke.l...@gmail.com

Jul 21, 2021, 5:19:47 PM
On Wednesday, July 21, 2021 at 4:44:22 PM UTC+1, Ivan Godard wrote:

> The definition of flags and modes in the standard makes them implicit
> arguments to every instruction - which inherently linearizes the FP
> instruction stream, while modern chips want to do a lot of FPs
> concurrently.

i solved this with data-dependent fail-first. the same trick
should be possible with VVM on an OoO engine by relying on
"Precise Shadowing"

> During my active time on the committee Kahan was still
> plumping to use exceptions to support trap-value-replacement. You could
> do that in Mathlab, but not in hardware. That linearization not only
> breaks multi-issue and SIMD, it also precludes speculative execution of
> FP ops in general - no computation overruns in loops for example.

not quite true. it breaks "systems which don't properly do Shadowing"
and, yes, SIMD is inherently and by design incapable of linear Shadowing.

by "Shadowing" i am referring (in Mitch's Scoreboard 2nd Chapter) to the
possibility of some operation causing irreversible damage, so what is done
is: the instruction *and all downstream instructions* are "Shadowed",
i.e. *all* are prevented and prohibited from writing (to memory, to regfiles)
until such time as it has been determined that the opportunity for "damage"
is 100% determined.

if "no" then the "Shadow" is dropped, and the instruction *and all downstream*
instructions - whether they be multi-issue or single-issue - are permitted
to write.

if "yes" (e.g. an FP exception) then the *instructions are cancelled*, that one
*and* all down-stream ones.

where that all goes to s**t is SIMD. SIMD inherently and by design is
completely incapable of even recognising the concept that some random
number of its elements had an exception.

which makes SIMD worse than useless, it makes it a massive hindrance.

where this can be "fixed" is if you have something called "Data-Dependent
Fail-First". like LOAD/STORE ffirst, you allow speculative execution in
a linear and sequential fashion. anything that could raise an exception
TRUNCATES the execution - automatically - to the point where it *would not*
have occurred.

this can also even be done on Predicated SIMD, by creating an automatic
(implicit) predicate mask, made up of the bits which did *not* fail, looking
for the first bit in that (implicit) mask, throwing away everything beyond
that point and ANDing it with the "real" Predicate mask on the SIMD
back-end ALUs.
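That implicit-mask construction can be sketched with plain scalar bit-twiddling (the 32-bit width and the names are stand-ins for whatever the vector unit really does in hardware):

```cpp
#include <cstdint>

// Given one bit per element that "would have raised an exception",
// keep only the elements strictly before the first failing one, then
// AND with the real predicate mask.
std::uint32_t ffirst_mask(std::uint32_t fail_bits, std::uint32_t pred)
{
    // (x & -x) isolates the lowest set bit; subtracting 1 turns it
    // into a mask of everything below it.  No failure => keep all.
    std::uint32_t ok = fail_bits ? (fail_bits & -fail_bits) - 1
                                 : 0xFFFFFFFFu;
    return ok & pred;
}
```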

> In the Mill we do that by elimination the global flags and modes and
> moving the event reporting into metadata attached to the result of the
> failing computation. You can speculatively or concurrently do FP ops on
> a Mill, and when the erroneous result finally must be used
> non-speculatively (if it ever is), and you have not asked to ignore
> bogus results, then the resulting fault can report fairly well where
> things went bad. That really helps, not because of some numerical nicety
> but because real programmers spend their lives chasing bugs.

i suspect that the Mill, which has those "result invalid" tags, could also
do the same trick, spotting when the exception occurred and invalidating
all downstream results from that point onwards.

all that would then be needed would be a way to report back to the
program exactly what that point was, and the program may either run
a TRAP or the programmer may explicitly check for that point...

... but what did *not* need to happen was that the parallel execution
of a batch of operations gets poisoned by explicit error checking,
drastically slowing down inner loops.

l.

Stephen Fuld

Jul 21, 2021, 8:31:31 PM
On 7/21/2021 6:46 AM, Marcus wrote:
> On 2021-07-21 15:44, Marcus wrote:
>> Hi!
>>
>> I would like to ask a couple of simple questions in this group, related
>> to the IEEE 754 standard and floating-point exceptions.
>>
>> (Some of you may know that I've been pondering this subject before in
>> other forums).
>>
>> As we all know, support for floating-point exceptions comes with a cost
>> (sometimes significant, since it dictates many aspects of a HW
>> implementation for instance). If we imagine an alternative universe in
>> which the dominant floating-point standard did not require exceptions,
>> I suspect that most hardware (and language) implementations would most
>> likely be simpler, more energy efficient and more suitable for parallel
>> and pipelined floating-point operations.
>>
>> My questions are:
>
> To answer my own questions... In my fields which over the past 2-3
> decades include 3D graphics, ray tracing, 3D polygonal mesh manipulation
> algorithms, compression algorithms for triangle meshes, audio DSP, audio
> synthesis, audio compression, space/satellite signal processing, image
> processing and compression algorithms, AI / neural networks, image-based
> eye- and head-tracking, least squares solvers, and so on, etc, I have to
> say:

Without in any way deprecating your experience, which is much greater
than mine, it seems to be oriented toward the kinds of situations where
errors are "contained" either in space or time. E.g. if an audio DSP
messes up, it probably won't be noticed by the vast majority of users,
and in any event, its effect will be gone in a small fraction of a second.

Contrast this with the types of applications where errors might
propagate or expand either in space or time, and thus cause noticeable
problems. I am thinking of perhaps FE modeling, or physics simulations.
John Dallman could probably provide more and better examples.

Thus your response might be biased toward the "they don't matter" side
of the question.



--
- Stephen Fuld
(e-mail address disguised to prevent spam)

BGB

Jul 21, 2021, 11:55:39 PM
On 7/21/2021 2:57 PM, Marcus wrote:
> On 2021-07-21 18:58, Anton Ertl wrote:
>> Marcus <m.de...@this.bitsnbites.eu> writes:
>>> As we all know, support for floating-point exceptions comes with a cost
>>> (sometimes significant, since it dictates many aspects of a HW
>>> implementation for instance).
>>
>> Unfortunately, wrt. IEEE FP "exception" has a different meaning (it's
>> a condition that, by default, results in setting a sticky flag) than
>> in computer architecture (a control-flow change).  Which one do you
>> mean?
>
> It does not matter for answering the questions. I am aware of the
> different possibilities and interpretations. For instance I think that
> some TI DSP:s use sticky flags rather than HW traps.
>
> That said I see problems with both implementations (traps vs flags), if
> nothing else from a programmer's point of view.
>

Some cases could at least be encoded using bit patterns in NaNs, though
cases which do not result in a NaN would effectively be lost.

Ideally, something like 'inexact' would be carried along with the value
so that one could determine whether or not a particular calculation
needed to be rounded. Though, alas, the standard FP formats make no
provision for this.


In theory, one could make the assumption that any result where the
low-order bits are non-zero is inexact, and potentially the FPU could
use a rounding hack where an inexact result is never rounded such that
the low-order bits are all zeroes. Spreading this over multiple bits
would mean that on average it only carries a fraction of a bit of
precision loss.

Ironically, this also plays well with "limited carry propagation"
rounding, since a chain of ones being rounded to zeroes is the main case
where this would occur, and will not typically occur with limited carry
rounding. Though, there is still the possibility of a previously-inexact
operation landing on zero by chance. In this case, there would be some
logic of "if result would land on zero, and inputs are marked inexact,
round so that it does not land on zero".

Worth making it a defined behavior? Probably not.

Ironically, it would make doing integer math using doubles actually less
accurate, since rather than one having a full 2^52 to play with (and
typically always getting an exact result over this range), they would
only (safely) have ~ 2^48 or so before the wacky rounding rules kick in
(assuming the 'inexact' status is based on the low-order 4 bits).


>>
>>> If we imagine an alternative universe in
>>> which the dominant floating-point standard did not require exceptions,
>>> I suspect that most hardware (and language) implementations would most
>>> likely be simpler, more energy efficient and more suitable for parallel
>>> and pipelined floating-point operations.
>>
>> Detecting IEEE exceptions is part of computation of the result; you
>> have to handle divide-by-zero, inexact, invalid, overflow, and
>> underflow conditions when computing the result.  Ok, you then need to
>> propagate the flags such that a later instruction can read it, but
>> looking at the integer condition codes present in the dominant
>> architectures, that does not seem to be a big problem.
>>
>> The dynamic rounding mode seems to be a bigger problem and has been
>> discussed here repeatedly.
>
> I agree - that's the second grudge that I have with the IEEE 754
> standard. As a programmer I have gotten nothing but problems from the
> dynamic rounding modes. As a hardware developer I have just ignored the
> problem (I only implement a single rounding mode).
>

Likewise. If anything, it might make sense to have it as part of the
instruction.

Say, one has instructions which either do round-to-nearest or
truncate-towards-zero. Global flags are kind of a poor option, more so
if they have a non-local range of effect.


Though, ironically, given the way the FPU worked in SH-4 (one needed to
reload the status register to switch between operating on single and
double precision values), this would also implicitly clear the sticky
flags and reset the rounding mode to whatever is encoded in the value
being reloaded into the register.


>>
>>> My questions are:
>>>
>>> 1) What (in your opinion) are the benefits of floating-point exceptions?
>>
>> Numerical experts use them for special purposes.  I have seen and
>> forgotten examples that looked sensible to me.
>>
>>> 2) In what situations have you had use for floating-point exceptions?
>>
>> None.  I hardly use FP.
>
> Same - except I use FP all the time.
>

I use FP fairly often, but it is still "a fairly small part of the pie"
if compared with integer operations.

Marcus

Jul 22, 2021, 1:43:54 AM
I have worked with things like IIR filters where errors will propagate
indefinitely. Same thing with many other of the problems that I've
worked with: if you get a NaN somewhere it will end up everywhere.

>
> Thus your response might be biased toward the "they don't matter" side
> of the question.
>

That is true. For all application areas that I mentioned, and all that I
can think of, I personally end up in the "they don't matter" camp.

Hence my questions - I am looking for valid reasons to use floating-
point exceptions, but since I have been unable to find them myself I
need help from others.

/Marcus

Marcus

Jul 22, 2021, 2:10:10 AM
Mitch gave a good example (ATAN2): You check conditions *before* doing
calculations, to *avoid* exceptions. On a machine with sticky flags you
have the option to check conditions after the calculation instead, but
that is probably more work (you would have to deduce *what* went wrong).

Plus (and this is the kicker IMO): If your code *relies* on certain
floating-point exceptions to be enabled, you would have to make sure
that they are. In a library function this means that you would have to
push the current FPU configuration, set up your desired FPU
configuration, and at the function exit pop the FPU configuration of the
caller.

This can get really messy if you also have to push/pop the exception
handler (in case of trapping HW), or preserve old exception state (in
case of HW with sticky flags). And also - how would you propagate
legitimate exceptions to the caller (if the caller has enabled
exceptions, it may be expecting to get exceptions from the library
call)?
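For the sticky-flag part, C/C++'s <cfenv> does provide a standard shape for this push/pop dance: feholdexcept() saves the caller's whole FP environment and enters non-stop mode, and feupdateenv() restores it while re-raising whatever the library body accumulated, so legitimate flags still reach the caller. A sketch (the library body here is a placeholder):

```cpp
#include <cfenv>

// The push/pop pattern: save the caller's FP environment, run the
// library body in non-stop mode, then restore the caller's environment
// and merge in the flags the body raised.
double library_call(double x)
{
    std::fenv_t caller_env;
    std::feholdexcept(&caller_env);   // push: save env, clear flags,
                                      //       disable trapping
    double result = x * x;            // ...library body runs non-stop...
    std::feupdateenv(&caller_env);    // pop: restore env, re-raise the
                                      //      flags accumulated above
    return result;
}
```

It does not solve the trapping-handler case, which is the messy one.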

/Marcus

Thomas Koenig

Jul 22, 2021, 2:48:59 AM
Marcus <m.de...@this.bitsnbites.eu> schrieb:

> Plus (and this is the kicker IMO): If your code *relies* on certain
> floating-point exceptions to be enabled, you would have to make sure
> that they are. In a library function this means that you would have to
> push the current FPU configuration, set up your desired FPU
> configuration, and at the function exit pop the FPU configuration of the
> caller.

This is exactly the model that Fortran follows if you use its IEEE
features (and they are supported).

https://j3-fortran.org/doc/year/18/18-007r1.pdf , clause 17, has all
the gory details.

You can change things like the underflow, halting or rounding mode
on entry to a procedure, but all the chapters specify that "the
processor shall not change the XXX mode on entry, and on return
shall ensure that the XXX mode is the same as it was on entry".

> This can get really messy if you also have to push/pop the exception
> handler (in case of trapping HW), or preserve old exception state (in
> case of HW with sticky flags). And also - how would you propagate
> legitimate exceptions to the caller (if the caller has enabled
> exceptions, it may be expecting to get exceptions from the library
> call)?

Fortran has no model of exceptions except for the IEEE ones.

If you want to see what IEEE support in a programming language
could look like, I think the draft Fortran standard is a good text
to look at.

Marcus

Jul 22, 2021, 3:52:05 AM
Thanks for the story. What did you do in the end? Just ignore overflow?

>> I'm pretty sure that no HW/language combination
>> supports that scenario today (and thus by definition can not be
>> IEEE 754 compliant), and I'm also pretty sure that the cost for adding
>> that support would be non-zero - not only for the GPU ALU pipeline, but
>> for the entire HW/driver/SW stack.
> <
> In a GPU you can isolate the thread with the exception and run the rest
> to completion--then rerun the thread in isolation--you can get away with
> this because of the embarrassingly large amounts of parallelism.

That's only part of the problem. Then what do you do with the exception?
Do you have different privilege levels? Stack traces and status
registers so that you can determine what went wrong, repair the damage
and resume execution? And how would you expose that functionality in a
shader language, so that the programmer can make use of the exception
functionality?

> <
> This, however, fails in GPGPU applications, so it provides no insight
> en the large.

How does it fail (compared to running a GLSL shader for instance)?

>>
>> What's more - such support would most likely /get in the way/ for
>> programmers since even fewer than the 6 people that you mentioned
>> earlier would even be remotely interested in such functionality.
>>
>> So the rebellion in me wants to say: "Ask not how hard it is to add
>> floating-point exceptions to a CPU pipeline - ask what the point is
>> of doing so."
> <
> The thing is that memory has exceptions, DIV has exceptions that are
> already present, DECODE may have exceptions, Stores may have late
> exceptions; and once the pipeline has been configured to deal with
> memory and the others, the infrastructure is already present.
> <
> So I ask:: "Why not" ???

My main argument for the "not" stand point is that it just complicates
the situation for programmers. Nobody wants floating-point exceptions,
and even fewer understand them, yet we all get to deal with them.

I don't know how many months of my career I have spent on FPU
configuration issues, but they have all been about ensuring that FPU
features are turned off. For almost all software that I have worked
with the priority order has been:

1. Reproducibility.
2. Performance.
3. Accuracy.

Reproducibility means that the behavior should be the same, regardless
of which machine the software is running on. If one target does not
support a certain feature (such as exceptions) that feature will be
turned off on all targets. An extra complication is when you're
developing a library - you need to take care of preserving the FPU
configuration and state across *all* API call boundaries, so that the
caller of the library is free to set up exception handling etc as it
wishes, without interfering with how the library works, and without the
library interfering with how the caller application works.

Performance usually means that exceptions are off the table (as are
subnormals - if any of the target platforms has a performance penalty
when subnormals are enabled).

Accuracy comes in at a distant third place, as most of the applications
I work with either use physical data as input (and that data is already
subject to measurement errors), or they produce results that only need
to be convincing or pleasing to an observer. In any case, if a floating-
point value is rounded or if a value is flushed to zero (underflow), so
be it. *If* accuracy is an issue - switch up from single precision to
double-precision. If you're already on double-precision, redesign your
algorithms (use a more suitable solver etc).

Yes, exceptions are only part of the problem (rounding modes and
subnormals are the other main parts), but my point is that all they have
ever done for me as a software developer is to cause problems & bugs,
and they have cost me many many hours that I would rather have spent on
more productive things.

Oh, and my second argument would be that by implementing floating-point
exceptions you are acknowledging parts of the standard that simply
should not be. (again - this is the little rebellion in me talking)

> <
> It is only when you can get rid of memory exceptions that you can get
> rid of the pipeline infrastructure. {On the other hand this is already a
> "solved problem" in computer design, so it is not perceived as even
> hard, just work.}

This is probably why some DSPs use sticky bits instead of traps for
floating-point exceptions: They don't have memory exceptions.

>>
>> /Marcus

Marcus

Jul 22, 2021, 4:18:45 AM
On 2021-07-22 08:48, Thomas Koenig wrote:
> Marcus <m.de...@this.bitsnbites.eu> schrieb:
>
>> Plus (and this is the kicker IMO): If your code *relies* on certain
>> floating-point exceptions to be enabled, you would have to make sure
>> that they are. In a library function this means that you would have to
>> push the current FPU configuration, set up your desired FPU
>> configuration, and at the function exit pop the FPU configuration of the
>> caller.
>
> This is exactly the model that Fortran follows if you use its IEEE
> features (and they are supported).
>
> https://j3-fortran.org/doc/year/18/18-007r1.pdf , clause 17, has all
> the gory details.

Thanks for the reference.

>
> You can change things like the underflow, halting or rounding mode
> on entry to a procedure, but all the chapters specify that "the
> processor shall not change the XXX mode on entry, and on return
> shall ensure that the XXX mode is the same as it was on entry".
>
>> This can get really messy if you also have to push/pop the exception
>> handler (in case of trapping HW), or preserve old exception state (in
>> case of HW with sticky flags). And also - how would you propagate
>> legitimate exceptions to the caller (if the caller has enabled
>> exceptions, it may be expecting to get exceptions from the library
>> call)?
>
> Fortran has no model of exceptions except for the IEEE ones.
>
> If you want to see what IEEE support in a programming language
> could look like, I think the draft Fortran standard is a good text
> to look at.
>

So things seem to be well covered in Fortran land.

BTW, one of my favorite war stories has to do with how a particular
virus killer on a particular machine managed to re-configure the FPU
control register in our process. It took me several weeks (IIRC) to
track down that bug (the only machine that could reproduce the bug was
400 km away and I had to use FTP & RDP over a 64kbit line to deploy and
debug the software). Anyway, I would assume that not even Fortran would
be immune to such errors.

Back in C++ land I've developed a "scoped FPU configuration" class
that's about 200 LOC (+200 LOC unit tests), littered with ifdefs,
compiler specific intrinsics and assembly language for different CPU
architectures. And if you forget to instantiate that class at the
start of a single API call, ...

In C the situation would be even worse (because you wouldn't have RAII).
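For illustration, a much-reduced sketch of such a scoped class, leaning only on the portable <cfenv> calls — the real 200-LOC version described above needs compiler intrinsics precisely because <cfenv> does not reach things like the SSE flush-to-zero bits:

```cpp
#include <cfenv>

// RAII guard: save the FP environment on construction, restore it on
// destruction, so early returns and C++ exceptions cannot leak a
// modified rounding mode or control word out of an API call.
class ScopedFpuEnv {
public:
    ScopedFpuEnv()  { std::fegetenv(&saved_); }
    ~ScopedFpuEnv() { std::fesetenv(&saved_); }
    ScopedFpuEnv(const ScopedFpuEnv&) = delete;
    ScopedFpuEnv& operator=(const ScopedFpuEnv&) = delete;
private:
    std::fenv_t saved_;
};

double api_entry(double x)
{
    ScopedFpuEnv guard;               // caller's env restored on any exit
    std::fesetround(FE_TOWARDZERO);   // library-local configuration
    return x + 0.1;
}
```

Forgetting to instantiate the guard at one entry point is still the failure mode, of course.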

So, just saying: the problem is real. Floating-point exceptions have a
cost that extends way beyond hardware support - language designers and
implementers as well as software developers *have* to deal with them,
at a non-zero cost.

/Marcus

Terje Mathisen

Jul 22, 2021, 6:20:58 AM
I know that directed rounding can help for NR-style sqrt() code, i.e. I
believe Alpha saved a final iteration by judicious use of directed
rounding in one of the final stages.

Personally I want my HW to support both, i.e. rounding modes are part of
the opcode, with one option (the default?) being "use the current system
default", i.e. the only option on x87.

This makes the global rounding mode a serializing resource unless you
rename it.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

luke.l...@gmail.com

Jul 22, 2021, 6:55:27 AM
On Wednesday, July 21, 2021 at 6:19:25 PM UTC+1, EricP wrote:

> It seems to me that the initial cost of adding a precise exception to
> that is 1 bit in the uOp along with the result.

... which has to propagate across all instructions issued from that point...
(therefore you need a Shadow *Matrix* not a Shadow Vector...)

> When the uOp reaches
> write-back, WB sees the exception flag and inhibits the register write,

... of *all* downstream instructions issued after the one that has
the exception flag....

> and diddles a wire that flushes the pipeline the same as an
> indirect branch,

...termed "Go Die"... (see diagram)

https://libre-soc.org/3d_gpu/shadow.svg

> and jams an unconditional jump IP into the fetch unit,
> which jams an "I'm an exception" uOp into the pipeline,
> and starts fetching from the new address.

just like any other "trap" or "illegal instruction": store some
state (at least the PC) and go.

> In short, most of the logic is already present to handle
> indirect branches so the initial cost is a few gates and a long wire.

wires plural. unless you are happy to stall after the first possible
instruction that could throw an exception.

do you want to have the instruction that *causes* the exception
to be cancelled, but the ones that were issued after it be allowed
to proceed and thus "damage" memory and regfiles just as badly
as if the exception itself was unnoticed? if so, use just one wire.
if not, it has to be one long wire per instruction that's permitted
to run ahead.

it's also how you can do Precise Exceptions on LD/ST, Write-after-Write
Hazard avoidance, Multi-issue, Predication, and also as you say
Branch speculation: it's all the same logic.

why is it all the same logic?

because all of those can create "damage" if allowed to proceed, therefore
the "writing" phase has to be held back, and that hold-back has to be
"cascaded"

l.

luke.l...@gmail.com

Jul 22, 2021, 7:05:14 AM
On Thursday, July 22, 2021 at 1:31:31 AM UTC+1, Stephen Fuld wrote:

> Without in any way deprecating your experience, which is much greater
> than mine, it seems to be oriented toward the kinds of situations where
> errors are "contained" either in space or time. E.g. if an audio DSP
> messes up, it probably won't be noticed by the vast majority of users,
> and in any event, its effect will be gone in a small fraction of a second.

ohh trust me, it will be noticed (i worked for CEDAR Audio for 18 months,
early in my career).

an overflow of a single audio sample is equivalent to a transient high-frequency
spike. the change can be so large that it can cause Power Amplifiers to
overload, which in turn amplifies the distortion, which in turn results in a higher
rate of change to the speaker coils, which in turn creates not just excessive
heating of the speaker coil it also drives them beyond the physical material
characteristics, causing the driver cone in some cases to rip apart or even
explode.

you REALLY do not want random transient artefacts introduced into audio.
the best worst-case behaviour is to "saturate" (to min/max), which results
in clipping, but at least it does not result in trying to drive the speaker one way
in under 0.025 milliseconds and then drive it back the other way again.

even a 1 Watt speaker driven to full distortion overload can put out over 96 dB
in the high frequency range, which is more than enough to create lasting hearing
damage if sustained.

l.

BGB

Jul 22, 2021, 11:08:48 AM
On 7/22/2021 6:05 AM, luke.l...@gmail.com wrote:
> On Thursday, July 22, 2021 at 1:31:31 AM UTC+1, Stephen Fuld wrote:
>
>> Without in any way deprecating your experience, which is much greater
>> than mine, it seems to be oriented toward the kinds of situations where
>> errors are "contained" either in space or time. E.g. if an audio DSP
>> messes up, it probably won't be noticed by the vast majority of users,
>> and in any event, its effect will be gone in a small fraction of a second.
>
> ohh trust me, it will be noticed (i worked for CEDAR Audio for 18 months,
> early in my career).
>
> an overflow of a single audio sample is equivalent to a transient high-frequency
> spike. the change can be so large that it can cause Power Amplifiers to
> overload, which in turn amplifies the distortion, which in turn results in a higher
> rate of change to the speaker coils, which in turn creates not just excessive
> heating of the speaker coil it also drives them beyond the physical material
> characteristics, causing the driver cone in some cases to rip apart or even
> explode.
>

For most smaller speakers, it tends to just result in fairly loud and
obvious pops. Eliminating these sorts of pops is one of the more
annoying aspects of working with audio.

> you REALLY do not want random transient artefacts introduced into audio.
> the best worst-case case behaviour is to "saturate" (to min/max) which results
> in clipping, but at least it does not result in trying to drive the speaker one way
> in under 0.025 milliseconds and then drive it back the other way again.
>

Yeah, this is fairly standard practice.


> even a 1 Watt speaker driven to full distortion overload can put out over 96 dB
> in the high frequency range, which is more than enough to create lasting hearing
> damage if sustained.
>

Pretty much.

It is also kinda funny that a speaker can generate a pop seemingly
somewhat louder than its normal level of audio playback.

Though, the result is that if one goes the PC speaker route, they can
get surprisingly loud results from a small speaker.

Interestingly though, if one uses a bigger speaker as the PC speaker in
a PC (with a current limiting resistor), then the sound is a lot
"softer" than the normal PC speaker. Turns out to not matter much though
as usually about the only time it sees much use is when the bios is
starting.

Mostly came up as an issue in my most recent PC build as apparently
newer PC cases have stopped including speakers, so if one wants one,
they need to supply their own.

EricP

Jul 22, 2021, 12:58:37 PM
luke.l...@gmail.com wrote:
> On Wednesday, July 21, 2021 at 6:19:25 PM UTC+1, EricP wrote:
>
>> It seems to me that the initial cost of adding a precise exception to
>> that is 1 bit in the uOp along with the result.
>
> .... which has to propagate across all instructions issued from that point...
> (therefore you need a Shadow *Matrix* not a Shadow Vector...)

Not for what I'm thinking.

>> When the uOp reaches
>> write-back, WB sees the exception flag and inhibits the register write,
>
> .... of *all* downstream instructions issued after the one that has
> the exception flag....

I haven't read '754 and am going by what x86/x64 basically does as a guide.
So there may be something I have missed.

Fp exceptions are like other faults except the FpStatus register
is updated before exceptions are checked for. If an unmasked exception
is then detected the data register write back is not performed.
If there is no unmasked exception, the result write back is performed.

For an In-Order uArch, the FpStatus register is written at WB like any
other register except the new status bits are OR'd into the FpStatus
rather than overwriting.

Next if the current FpControl register exception mask indicates
there is an unmasked exception, the data register WB is inhibited,
and the exception is triggered, purging the pipeline.

Each calculating FP uOp produces a fp result and 6 FpStatus bits.
If a _potential_ exception occurred, that fp operation is responsible for
producing the correct substitute result in case the exception is masked.
Note that the correct substitute result may depend on flags in the
FpControl register at the time the fp operation reaches WB.

The simplest way to manage FP instructions that explicitly read or write
the FpStatus and FpControl registers is to flush the FP pipelines
before and after those instructions. Faster alternatives are possible,
such as keeping a future version of the FpControl at the front
of the FP pipeline and merging control bits into each queued uOp,
and using those control bits to select the correct substitute result.
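The write-back rule above can be sketched as a small model (Python here; the status-bit names, the two-result convention, and the mask encoding are illustrative assumptions, not any real ISA's layout):

```python
# Minimal model of the write-back rule described above. Each FP uOp
# arrives at WB with a raw result, a substitute result (used when an
# exception occurred but is masked), and its 6 IEEE status bits.
# Bit names and encodings are illustrative, not from any real ISA.

INVALID, DIVZERO, OVERFLOW, UNDERFLOW, INEXACT, DENORMAL = (1 << i for i in range(6))

class FpState:
    def __init__(self, exception_mask):
        self.fpstatus = 0            # sticky accrued-exception bits
        self.mask = exception_mask   # per-exception bit: 1 = masked

def writeback(state, uop_status, result, substitute):
    """Return (value_written_or_None, pipeline_flush_needed)."""
    # FpStatus is OR'd in *before* exceptions are checked for.
    state.fpstatus |= uop_status
    if uop_status & ~state.mask:
        # Unmasked exception: inhibit the register write and trap.
        return None, True
    # No exception, or a masked one: commit the appropriate result.
    return (substitute if uop_status else result), False
```

The point of carrying the substitute result with the uOp is that WB stays a pure select: it never has to recompute anything.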

>> and diddles a wire that flushes the pipeline the same as an
>> indirect branch,
>
> ....termed "Go Die"... (see diagram)
>
> https://libre-soc.org/3d_gpu/shadow.svg
>
>> and jams an unconditional jump IP into the fetch unit,
>> which jams an "I'm an exception" uOp into the pipeline,
>> and starts fetching from the new address.
>
> just like any other "trap" or "illegal instruction": store some
> state (at least the PC) and go.
>
>> In short, most of the logic is already present to handle
>> indirect branches so the initial cost is a few gates and a long wire.
>
> wires plural. unless you are happy to stall after the first possible
> instruction that could throw an exception.

Yes, 1 wire per different exception IP vector to be stuffed into Fetch.
A priority selector in Fetch is hard wired to choose the signal
from the oldest source, so for example WB detected exceptions
take precedence over Decode detected exceptions.
The chosen wire is used to generate the vector IP.
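That priority selection is simple combinational logic; modeled in Python (the stage names are illustrative, the order just encodes "older stage beats younger stage"):

```python
# Model of the hard-wired priority selector in Fetch: the exception
# request from the oldest pipeline stage wins, so a WB-detected
# exception takes precedence over a Decode-detected one.
PRIORITY = ("writeback", "memory", "execute", "decode")

def select_exception_vector(pending):
    """pending maps stage name -> vector IP (or None if no request)."""
    for stage in PRIORITY:
        if pending.get(stage) is not None:
            return pending[stage]
    return None
```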

> do you want to have the instruction that *causes* the exception
> to be cancelled, but the ones that were issued after it be allowed
> to proceed and thus "damage" memory and regfiles just as badly
> as if the exception itself was unnoticed? if so, use just one wire.
> if not, it has to be one long wire per instruction that's permitted
> to run ahead.

I'm talking about normal exceptions in a scalar processor
so all the subsequent instructions are cancelled.
(Vector and register-SIMD or real-SIMD (multiple PE's) have to
sort out reasonable error semantics for each of their own situations.)

There are two kinds of exceptions, faults and traps.

Faults roll back leaving the IP pointing at the faulting
instruction and (mostly) leave registers and memory values unchanged.
I say "mostly" because as noted above, the FpStatus register is
defined as updated before the exception is checked for.

Traps complete the instruction updating the IP and registers and/or memory,
then trigger the exception. Single Step is an example of a trap.
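The fault/trap difference boils down to where the saved IP points (a sketch, purely illustrative):

```python
# A fault re-points at the offending instruction so it can be replayed
# (registers and memory mostly unchanged); a trap lets the instruction
# complete and points past it, as with Single Step.

def handler_ip(ip, insn_len, kind):
    if kind == "fault":
        return ip              # replay (or inspect) the faulting instruction
    if kind == "trap":
        return ip + insn_len   # instruction already completed
    raise ValueError(kind)
```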

> it's also how you can do Precise Exceptions on LD/ST, Write-after-Write
> Hazard avoidance, Multi-issue, Predication, and also as you say
> Branch speculation: it's all the same logic.
>
> why is it all the same logic?
>
> because all of those can create "damage" if allowed to proceed, therefore
> the "writing" phase has to be held back, and that hold-back has to be
> "cascaded"
>
> l.

Yes, which is part of what the Load Store Queue does -
synchronizing instruction Retire with release of its pending store.
An exception cancels all subsequent instructions thus preventing
future stores from reaching Retire.

MitchAlsup

Jul 22, 2021, 2:24:34 PM
On Thursday, July 22, 2021 at 11:58:37 AM UTC-5, EricP wrote:
> luke.l...@gmail.com wrote:
> > On Wednesday, July 21, 2021 at 6:19:25 PM UTC+1, EricP wrote:
> >
> >> It seems to me that the initial cost of adding a precise exception to
> >> that is 1 bit in the uOp along with the result.
> >
> > .... which has to propagate across all instructions issued from that point...
> > (therefore you need a Shadow *Matrix* not a Shadow Vector...)
> Not for what I'm thinking.
> >> When the uOp reaches
> >> write-back, WB sees the exception flag and inhibits the register write,
> >
> > .... of *all* downstream instructions issued after the one that has
> > the exception flag....
>
> I haven't read '754 and am going by what x86/x64 basically does as a guide.
> So there may be something I have missed.
>
> Fp exceptions are like other faults except the FpStatus register
> is updated before exceptions are checked for.
<
This sounds dangerous. You should not be updating visible state until
writeback. What you can do is to carry this information along with the
result and deal with it all at writeback.
<
> If an unmasked exception
> is then detected the data register write back is not performed.
> If there is no unmasked exception, the result write back is performed.
>
> For an In-Order uArch, the FpStatus register is written at WB like any
> other register except the new status bits are OR'd into the FpStatus
> rather than overwriting.
>
> Next if the current FpControl register exception mask indicates
> there is an unmasked exception, the data register WB is inhibited,
> and the exception is triggered, purging the pipeline.
<
Only younger instructions are flushed. There may still be older instructions
that still need to complete.
>
> Each calculating FP uOp produces a fp result and 6 FpStatus bits.
<
And probably another 5-8 bits so everything can be deferred until
writeback.
<
> If a _potential_ exception occurred, that fp operation is responsible for
> producing the correct substitute result in case the exception is masked.
<
The easiest way to ensure this is to send the FpControl stuff to the FU
(as an operand) so that if an exception needs to be raised, FU can
determine what is the appropriate result.
This selection process needs to occur the cycle before the FETCH of the
exception vector is performed. Sometimes this is more easily orchestrated
in the DECODE stage than in the fetch stage (because DECODE has the
branch target adder whereas the FETCH stage only has an incrementer).
<
> > do you want to have the instruction that *causes* the exception
> > to be cancelled, but the ones that were issued after it be allowed
> > to proceed and thus "damage" memory and regfiles just as badly
> > as if the exception itself was unnoticed? if so, use just one wire.
> > if not, it has to be one long wire per instruction that's permitted
> > to run ahead.
> I'm talking about normal exceptions in a scalar processor
> so all the subsequent instructions are cancelled.
> (Vector and register-SIMD or real-SIMD (multiple PE's) have to
> sort out reasonable error semantics for each of their own situations.)
>
> There are two kinds of exceptions, faults and traps.
>
> Faults roll back leaving the IP pointing at the faulting
> instruction and (mostly) leave registers and memory values unchanged.
> I say "mostly" because as noted above, the FpStatus register is
> defined as updated before the exception is checked for.
>
> Traps complete the instruction updating the IP and registers and/or memory,
> then trigger the exception. Single Step is an example of a trap.
<
My 66000 integrated these into a single concept. In both cases the IP is
left pointing at the offending instruction. One can return from exception
and replay the excepting instruction or one can return from trap and skip
over the instruction. Leaving the IP pointing at the instruction allows
the trap handler to look at the instruction for work qualification. Both
IP and the actual instruction-specifier are available to the handler, as
are operands to the instruction.

Quadibloc

Jul 22, 2021, 5:59:34 PM
On Wednesday, July 21, 2021 at 8:35:40 AM UTC-6, John Dallman wrote:
> In article <sd9a9h$ro6$1...@dont-email.me>, m.de...@this.bitsnbites.eu
> (Marcus) wrote:

> > In our software we do the same thing (even disabling subnormals)
> > in order to normalize on the least common denominator, so to say
> > (as soon as one target platform lacks a feature, it needs to be
> > disabled on _all_ platforms).

> We're willing to try to make use of useful features of particular
> platforms.

What he is doing may make perfect sense if his firm is engaged in
supporting a wide variety of platforms, and in order to do this at a
manageable cost, needs to ensure that its software is highly
portable.

So I can't criticize him for doing it that way, even if this might not
be a way of doing things that would ever occur to me in my
situation.

My habits were formed in the good old days, when software was
written for one particular platform, with no thought that anybody
would ever want to use it anywhere else.

John Savard

MitchAlsup

Jul 22, 2021, 7:14:22 PM
On Thursday, July 22, 2021 at 4:59:34 PM UTC-5, Quadibloc wrote:
> On Wednesday, July 21, 2021 at 8:35:40 AM UTC-6, John Dallman wrote:
> > In article <sd9a9h$ro6$1...@dont-email.me>, m.de...@this.bitsnbites.eu
> > (Marcus) wrote:
>
> > > In our software we do the same thing (even disabling subnormals)
> > > in order to normalize on the least common denominator, so to say
> > > (as soon as one target platform lacks a feature, it needs to be
> > > disabled on _all_ platforms).
<
The intent expressed here is the contrapositive of what IEEE 754 intended.
<
> > We're willing to try to make use of useful features of particular
> > platforms.
<
Do all platforms have a way of converting denorm operands into zero ?
Do all platforms have a way to suppress the creation of denorm results ?
.....and return zero instead ?
<
> What he is doing may make perfect sense if his firm is engaged in
> supporting a wide variety of platforms, and in order to do this at a
> manageable cost, needs to ensure that its software is highly
> portable.
>
> So I can't criticize him for doing it that way, even if this might not
> be a way of doing things that would ever occur to me in my
> situation.
>
> My habits were formed in the good old days, when software was
> written for one particular platform, with no thought that anybody
> would ever want to use it anywhere else.
<
Your age is showing...........
>
> John Savard

Marcus

Jul 23, 2021, 6:46:33 AM
On 2021-07-23 01:14, MitchAlsup wrote:
> On Thursday, July 22, 2021 at 4:59:34 PM UTC-5, Quadibloc wrote:
>> On Wednesday, July 21, 2021 at 8:35:40 AM UTC-6, John Dallman wrote:
>>> In article <sd9a9h$ro6$1...@dont-email.me>, m.de...@this.bitsnbites.eu
>>> (Marcus) wrote:
>>
>>>> In our software we do the same thing (even disabling subnormals)
>>>> in order to normalize on the least common denominator, so to say
>>>> (as soon as one target platform lacks a feature, it needs to be
>>>> disabled on _all_ platforms).
> <
> The intent expressed here is the contrapositive of what IEEE 754 intended.

I agree. I tried approaching the IEEE 754 working group [1] with a
suggestion to standardize a leaner subset of the current standard that
would better acknowledge the current reality (i.e. that many floating-
point implementations lack some of the mandatory IEEE 754 features),
and to analyze the effects of such implementations w.r.t. the numerical
and operational guarantees that the standard aims to provide. At least
to preempt an explosion of fragmented, non-conforming implementations.

To no avail...

> <
>>> We're willing to try to make use of useful features of particular
>>> platforms.
> <
> Do all platforms have a way of converting denorm operands into zero ?
> Do all platforms have a way to suppress the creation of denorm results ?
> .....and return zero instead ?
> <

Based on the platforms that I have worked with: Yes & yes (possibly
excluding x87, which is a different story altogether). Some implicitly
do this as the only way of operation (e.g. ARMv7 NEON, TI C66x), while
others provide a configuration or two (e.g. x86 SSE/AVX, ARMv8 NEON,
POWER).

There are surely platforms where you can't enforce that behavior, but I
have not yet used/programmed such CPUs (IIRC RISC-V has no such
option, for instance).

The most dodgy ISA specification in this regard that I have come across
is actually the POWER ISA. It has a "non-IEEE mode"-flag in the FPSCR
register that essentially activates unpredictable behavior (e.g. "in
non-IEEE mode an implementation _may_ return 0 instead of a
denormalized number").

You may be surprised by how much of the currently running software in
the world runs with denorms-are-zero & flush-to-zero semantics. For
instance the Intel C/Fortran compilers disable subnormals as soon as
you turn on optimizations. GCC and Clang do the same when you use
-ffast-math (_lots_ of performance sensitive programs do this! [2]).
D3D11 single precision has this as the specified behavior. The GLSL
and OpenCL specifications leave it as an implementation detail. Etc...
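For readers who have not run into it, the effect of flush-to-zero is easy to model (a Python sketch; real implementations set mode bits, such as the FTZ/DAZ bits in x86's MXCSR, rather than testing each result):

```python
import sys

def flush_to_zero(x):
    # Model of FTZ semantics: any nonzero result smaller in magnitude
    # than the smallest normal number is replaced by zero.
    smallest_normal = sys.float_info.min   # 2**-1022 for binary64
    return 0.0 if 0.0 < abs(x) < smallest_normal else x

tiny = sys.float_info.min / 4   # subnormal under default IEEE semantics
assert tiny != 0.0              # gradual underflow keeps it nonzero
assert flush_to_zero(tiny) == 0.0
```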

/Marcus

[1]
https://listserv.ieee.org/cgi-bin/wa?A1=ind20&L=STDS-754&X=O049A572CEC0474BAF0&Y=m%40bitsnbites.eu#23
[2] https://github.com/search?q=%22-ffast-math%22&type=code

Ivan Godard

Jul 23, 2021, 6:54:17 AM
Denorms, rounding modes, exceptions, and other IEEE features are there
to support algorithms for which stability is an issue. Rotating a view
in a game is not concerned with algorithmic stability; N-body
gravitational modeling is concerned. Some people care if the moon lander
lands on the moon surface, or ten meters above or below it.

Quadibloc

Jul 23, 2021, 8:06:01 AM
On Thursday, July 22, 2021 at 5:14:22 PM UTC-6, MitchAlsup wrote:

> Do all platforms have a way of converting denorm operands into zero ?
> Do all platforms have a way to suppress the creation of denorm results ?
> .....and return zero instead ?

If not, then there's no practical way to "disable" gradual underflow on
those platforms to make them compatible with those that aren't.

Of course, why _that_ level of compatibility is important is unclear.

After all, FORTRAN programs were regarded as 'transportable' if
they would run without error on platforms with wildly differing
floating-point formats; that they might yield slightly different numerical
results was not seen as an issue.

John Savard

Quadibloc

Jul 23, 2021, 8:15:12 AM
On Friday, July 23, 2021 at 4:46:33 AM UTC-6, Marcus wrote:

> I agree. I tried approaching the IEEE 754 working group [1] with a
> suggestion to standardize a leaner subset of the current standard that
> would better acknowledge the current reality (i.e. that many floating-
> point implementations lack some of the mandatory IEEE 754 features),
> and to analyze the effects of such implementations w.r.t. the numerical
> and operational guarantees that the standard aims to provide. At least
> to preempt an explosion of fragmented, non-conforming implementations.

> To no avail...

I can see why there would be issues.

The characteristics of the "leaner subset" would depend *strongly* on
the level of the implementation.

Thus, for the kind of floating-point arithmetic unit I might prefer, the
leaner subset would have these characteristics:

- Denormals would work. What it would not guarantee is that they
would *fail* when they were supposed to.

- Accurate rounding would be guaranteed for addition, subtraction,
and multiplication, but *not* division, which would instead be accurate
to something like 0.6 or 0.51 units in the last place.

That's because I would want to implement division using one of the
fast algorithms which require significant additional overhead to get
accurate rounding...

and I intend to support denormals with no loss in speed by converting
floats to a format with no hidden bit inside registers, so it would be
extra work to reproduce the exact numerical range and precision of
the IEEE 754 floating-point format. Instead, the denormals - and some
numeric range beyond them - would still have full precision, and only
get rounded down when it came time to store values in memory.
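The unpacking step for such an internal format might look like this (a Python sketch of the idea using binary32; the internal representation shown is hypothetical, not the actual design):

```python
import struct

def unpack_binary32(f):
    """Unpack an IEEE binary32 into (sign, unbiased_exp, significand)
    with an explicit integer bit, normalizing subnormal inputs so every
    nonzero value carries a full-width significand internally."""
    bits = struct.unpack(">I", struct.pack(">f", f))[0]
    sign = bits >> 31
    exp = (bits >> 23) & 0xFF
    frac = bits & 0x7FFFFF
    if exp == 0:                      # zero or subnormal
        if frac == 0:
            return sign, 0, 0
        e = -126
        while not (frac & 0x800000):  # normalize: shift until the
            frac <<= 1                # integer bit is explicit
            e -= 1
        return sign, e, frac
    return sign, exp - 127, frac | 0x800000  # make the hidden bit explicit
```

With this convention a nonzero value is `significand * 2**(exp - 23)`, and subnormal inputs lose nothing until the value is packed back to memory format.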

People implementing floating-point in a *different* manner would have
entirely different choices for what parts of the IEEE 754 standard they
would want omitted from a 'leaner subset'. An implementation with
significantly less hardware would omit the entire denormal range, but
support exact rounding for division, for example.

So there's a 'lightweight' leaner subset, and there's a 'high-performance'
leaner subset, at the very least. Which are disjoint rather than one being
a subset of the other.

John Savard

Quadibloc

Jul 23, 2021, 8:19:30 AM
On Friday, July 23, 2021 at 4:54:17 AM UTC-6, Ivan Godard wrote:
> Some people care if the moon lander
> lands on the moon surface, or ten meters above or below it.

That's definitely a good thing to care about.

Does the fancy stuff in IEEE 754 really help with that?

Or would doing everything in double precision, even if one
were using something like the old System/360 floating
format, do a better job?

Accurate and reliable calculations by computers are a very
important thing. How much the approach taken by IEEE 754
contributes to that goal, let alone the more elaborate notions
presented by people like John Gustafson, is, however, an open
question, I would think.

John Savard

Ivan Godard

Jul 23, 2021, 10:11:04 AM
Back when I was active on the IEEE committee, I once asked Kahan
whether, if quad (128-bit) were as fast as double, would he still have
denorms. He answered an unequivocal "No!".

MitchAlsup

Jul 23, 2021, 11:30:20 AM
On Friday, July 23, 2021 at 7:19:30 AM UTC-5, Quadibloc wrote:
> On Friday, July 23, 2021 at 4:54:17 AM UTC-6, Ivan Godard wrote:
> > Some people care if the moon lander
> > lands on the moon surface, or ten meters above or below it.
> That's definitely a good thing to care about.
>
> Does the fancy stuff in IEEE 754 really help with that?
<
Numerical stability has a LOT more to do with algorithm choice than
the underlying FP arithmetic.
>
> Or would doing everything in double precision, even if one
> were using something like the old System/360 floating
> format, do a better job?
<
No, doubling the widths just hides the problem and makes the actual
problem harder to find. You can take the point of view that that is
enough, but it is not in reality.
>
> Accurate and reliable calculations by computers are a very
> important thing. How much the approach taken by IEEE 754
> contributes to that goal, let alone the more elaborate notions
> presented by people like John Gustavson, is, however, an open
> question, I would think.
<
IEEE 754 was put together by people who really understood FP arithmetic
but HW not so much.
>
> John Savard

Quadibloc

Jul 23, 2021, 1:43:25 PM
On Friday, July 23, 2021 at 9:30:20 AM UTC-6, MitchAlsup wrote:

> IEEE 754 was put together by people who really understood FP arithmetic
> but HW not so much.

My own very limited understanding of hardware was put to use
in my reply to Marcus' comment about the need for a "leaner
subset" of IEEE 754; I noted that if one were designing a large
system aiming at the fastest possible speed, one would want
to leave _certain_ features of IEEE 754 out, whereas if one were
designing a really tiny system, one would want to leave *different*
features out... which was probably too much to ask, thus
skewering any hopes of adding such a subset to the standard.

Since IEEE 754, in its original form, simply ratified what Intel
had planned to include in its forthcoming (but then secret)
8087, though, a knowledge of hardware certainly was...
applied... to its contents, even if the committee that signed
off on it did not include hardware experts.

And what I see, then, as the issue would be...

The fact that IEEE 754 derived from Intel's 8087 work meant that
the standard was capable of being implemented in hardware once.

Of course, that meant that it was implementable many times in
other systems *of comparable size*.

But denormals were a pain in very small systems... and the
requirement for perfect rounding was a pain in larger systems
that aimed at the highest possible speed using either Goldschmidt
or Newton-Raphson for division.

So that's where I see that having HW knowledge on the IEEE
754 committee would have helped. With your much greater
knowledge of hardware, you may see other things which are
more important than what I thought of.

John Savard

Quadibloc

Jul 23, 2021, 1:45:12 PM
On Friday, July 23, 2021 at 9:30:20 AM UTC-6, MitchAlsup wrote:

> No, doubling the widths just hides the problem and makes the actual
> problem harder to find. You can take the point of view that that is
> enough, but it is not in reality.

It's enough to get one Moon lander on the surface. Solving the
real problem is still better, but it's nice if there's a way to meet
deadlines and the like.

John Savard

MitchAlsup

Jul 23, 2021, 1:59:21 PM
On Friday, July 23, 2021 at 12:43:25 PM UTC-5, Quadibloc wrote:
> On Friday, July 23, 2021 at 9:30:20 AM UTC-6, MitchAlsup wrote:
>
> > IEEE 754 was put together by people who really understood FP arithmetic
> > but HW not so much.
<
> My own very limited understanding of hardware was put to use
> in my reply to Marcus' comment about the need for a "leaner
> subset" of IEEE 754; I noted that if one were designing a large
> system aiming at the fastest possible speed, one would want
> to leave _certain_ features of IEEE 754 out, whereas if one were
> designing a really tiny system, one would want to leave *different*
> features out... which was probably too much to ask, thus
> skewering any hopes of adding such a subset to the standard.
>
> Since IEEE 754, in its original form, simply ratified what Intel
> had planned to include in its forthcoming (but then secret)
> 8087, though, a knowledge of hardware certainly was...
> applied... to its contents, even if the committee that signed
> off on it did not include hardware experts.
<
IEEE 754 had to compromise between the 8087 and the 68881, the Intel
and Motorola chips, both nearing tapeout, one maybe as far along
as having seen silicon. This is the part about when rounding occurs
relative to normalization.
>
> And what I see, then, as the issue would be...
>
> The fact that IEEE 754 derived from Intel's 8087 work meant that
> the standard was capable of being implemented in hardware once.
<
Remember that an FADD was 84-240 cycles..........but I digress......
>
> Of course, that meant that it was implementable many times in
> other systems *of comparable size*.
>
> But denormals were a pain in very small systems... and the
> requirement for perfect rounding was a pain in larger systems
> that aimed at the highest possible speed using either Goldschmidt
> or Newton-Raphson for division.
<
Having watched this from inside:
a) HW designers know a lot more about this today than in 1980
b) even systems that started out as merely IEEE-format (GPUs) moved
closer and closer to full IEEE compliance until there is no
useful difference in the quality of the arithmetic.
c) once 754-2009 came out the overhead to do denorms went to
zero, and there is no reason to avoid full speed denorms in practice.
(BGB's small FPGA prototyping environment aside.)
d) HW designers have learned how to perform all of the rounding
modes at no overhead compared to RNE.

Anton Ertl

Jul 23, 2021, 2:42:56 PM
Ivan Godard <iv...@millcomputing.com> writes:
>Back when I was active on the IEEE committee, I once asked Kahan
>whether, if quad (128-bit) were as fast as double, would he still have
>denorms. He answered an unequivocal "No!".

That is hard to believe, given that a major argument for denormal
numbers is that

a-b=0

should give the same result as

a=b
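
This is straightforward to check on any IEEE binary64 implementation
with gradual underflow (Python floats here, as a quick sketch):

```python
import sys

a = 1.5 * sys.float_info.min   # 1.5 * 2**-1022, a small normal number
b = sys.float_info.min         # 2**-1022, the smallest normal number
diff = a - b                   # 2**-1023: subnormal, but exact and nonzero

assert a != b
assert diff != 0.0             # gradual underflow: a - b == 0 iff a == b
# Under flush-to-zero, diff would be 0.0 even though a != b.
```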

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

MitchAlsup

Jul 23, 2021, 3:13:48 PM
On Friday, July 23, 2021 at 1:42:56 PM UTC-5, Anton Ertl wrote:
> Ivan Godard <iv...@millcomputing.com> writes:
> >Back when I was active on the IEEE committee, I once asked Kahan
> >whether, if quad (128-bit) were as fast as double, would he still have
> >denorms. He answered an unequivocal "No!".
> That is hard to believe, given that a major argument for denormal
> numbers is that
>
> a-b=0
>
> should give the same result as
>
> a=b
<
a==b
>
> - anton
> --
> 'Anton should try for "industrial quality" comp.arch responses.'
> Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

BGB

Jul 23, 2021, 3:19:27 PM
As mentioned various times, for example, my subset looks like:
No denormals, they are treated as zeroes and flushed to zero (1);
Hardware only does FADD/FSUB/FMUL, and optionally FMAC;
Rounding is only "probably" correct (2);
FDIV and FSQRT are done in software (cost reasons, mostly, 3);
No FP status flags or similar;
Rounding behavior is hard-wired
(typically either round-to-nearest or truncate);
...

I have basically tried to achieve an FPU that is "acceptably cheap" for
my uses.


1: Besides just their impact on FADD/FMUL, they have a fairly obvious
impact on the relative cost of format-conversion operators.

The dedicated FP conversion ops perform rounding at least, though the
converters used in SIMD ops may truncate instead (in which case, the
logic is mostly bit-repacking with a few special cases to deal with
out-of-range exponents, which causes the results to become either Inf or
Zero).

Denormals mostly make sense for getting more accuracy out of smaller
formats, but then again, if a person is using a smaller format, then
typically precision isn't that important in the first place.

It would almost make more sense IMO to standardize on denormal-as-zero
semantics.


2: There are only a small number of bits below the ULP in my case, and
the final rounding step has "limited carry propagation".

Note that the FADD/FSUB unit does implement two's complement operations
internally such that the limited carry propagation does not normally
affect the visible result of operations (unlike what would happen with
ones' complement).

Internally, the FADD/FSUB unit is using a mantissa big enough to manage
conversion to/from 64 bit integers, and which generally gives ~ 10 bits
below the ULP.

So, probably, FADD/FSUB falls within a 0.51 ULP window.

FMUL shaves it a little closer, but still falls within "most of the
time" territory.

If the program depends on bit-exact results, well, one is probably still
going to have problems. Hardly anything depends on this though.


In general, the implementation is correct enough that doing integer math
via the FPU works as expected (and there is some software that depends
on the assumption that doing integer math via FPU ops works correctly).

...


3: My initial attempts at hardware FDIV and FSQRT were "not free" and
were actually slower than what could be managed in software. This is
due mostly to software being able to use a "more proper" Newton-Raphson,
which converges more quickly. The operation is essentially serial
whether it is done in hardware or software, being limited
by the internal latency of the FADD and FMUL units.

The "starting guess" for these operations is easy enough to generate
using basic integer arithmetic.
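A software Newton-Raphson divide of this sort can be sketched as follows (Python; the frexp-based starting guess is an illustrative stand-in for the integer-arithmetic guess described above):

```python
import math

def newton_recip(d, iters=6):
    """Approximate 1/d (d > 0) by Newton-Raphson: x' = x * (2 - d*x).
    The starting guess 2**-e (from d = m * 2**e, 0.5 <= m < 1) keeps
    d * x0 in [0.5, 1), so the relative error squares each iteration."""
    e = math.frexp(d)[1]
    x = 2.0 ** -e
    for _ in range(iters):
        x = x * (2.0 - d * x)   # quadratic convergence
    return x

def soft_div(a, b):
    return a * newton_recip(b)
```

With a worst-case starting error of 0.5, six iterations drive the error below binary64 epsilon, which is why the FADD/FMUL latency, not the iteration count, dominates.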



I had partly experimented with a 96-bit format (binary128 with the low
32-bits ignored/set-to-zero), but this caused the FPU to be too
expensive to afford in a dual-core configuration with the FPGA I am using.

This FPU would also have given slightly higher precision (a higher
probability of a correctly-rounded result) for double operations.

( Decided to leave out a big thing related to how the 96-bit FPU could
potentially also be used for 64-bit integer multiply and divide, then
going into more about integer operations and large-integer formats in
general. )


Though, FWIW, even without a large-format FPU, one can still implement
float128 semi-acceptably with partial hardware-support for 128-bit
integer operations (which in-turn are also semi-useful in implementing
operations on larger integer types, ...).

Note that even with hardware support for the 96-bit format, the idea was
that "__float128" would still use runtime calls for the full 128-bit
format, whereas "long double" would use the same representation but may
use the hardware-supported operations for a little extra speed.



If I can afford the bigger FPU with SMT (with the FPU as probably a
shared resource between the two threads), it might be worthwhile, but I
have my own concerns about the viability of SMT (some of the code I have
written for a 12R6W register file is, well, concerning...).

I have a few possible tricks that could reduce SMT cost some, at the cost
of adding extra cycles (using interlocks pulling the results from a
snapshot of the pipeline outputs from the previous cycle, rather than
using forwarding of the current results, along most of the "lower
priority" paths). Where in this case, the main register-file module also
partly assumes control over the management of pipeline interlocks.

A more traditional VLIW might have avoided this issue by not using
register forwarding, which combined with no interlocks, would mean not
being able to use ALU results or similar until 2 cycles later (or 4
cycles if one needs to wait for writeback).

A few other parts of the core would likely require a fair bit of
restructuring (still deciding what I would do with "ExUnit", ...).

...

Thomas Koenig

Jul 23, 2021, 3:22:13 PM
MitchAlsup <Mitch...@aol.com> schrieb:

> IEEE 754 was put together by people who really understood FP arithmetic
> but HW not so much.

Who was on that committee?

I have read that Intel and Motorola were there, but what about
Cray, CDC or IBM? They should have had some understanding of the
difficulties involved (but then again at least Cray was famous
for not caring about such niceties, and IBM was still caught in
their horrible radix-16 system at that time and probably could
not even dream that they would one day implement another floating
point format).

MitchAlsup

Jul 23, 2021, 4:13:02 PM
On Friday, July 23, 2021 at 2:22:13 PM UTC-5, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> schrieb:
> > IEEE 754 was put together by people who really understood FP arithmetic
> > but HW not so much.
<
> Who was on that committee?
<
Ivan will be here shortly to list the perpetrators.
>
> I have read that Intel and Motorola were there, but what about
> Cray, CDC or IBM? They should have had some understanding of the
> difficulties involved (but then again at least Cray was famous
> for not caring about such niceties, and IBM was still caught in
> their horrible radix-16 system at that time and probably could
> not even dream that they would one day implement another floating
> point format).
<
IBM's was worse than simply radix 16; it was also a truncation system
with a guard digit (½ a byte) in calculations. To make matters worse, its
competitors were 36-bit, 48-bit, and 60-bit (single precision).

BGB

unread,
Jul 23, 2021, 5:16:14 PM7/23/21
to
Bleh...

Granted, I don't expect doing it in software on an 8086 would have been
all that much faster.

I guess a baseline requirement for "semi-fast" software floating point
is having hardware-supported integer operations which are comparable in
size to the floating-point type in question (if not greater).

Also having enough register space that the values can be held in
registers, which the 8086 didn't exactly have going for it either.

...

Though, kind of funny that Binary16 was a recent development; presumably
it would have been fairly useful in that era.
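
As a rough illustration (my sketch, not from the thread): decoding Binary16 in software only needs shifts, masks, and one small normalization loop for subnormals, which is why it would have been attractive on integer-only hardware. The layout assumed here is the IEEE 754-2008 one (1 sign bit, 5 exponent bits with bias 15, 10 mantissa bits).

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>
#include <math.h>

/* Binary16 -> Binary32 conversion using only integer operations. */
static float half_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h >> 15) << 31;
    uint32_t exp  = (h >> 10) & 0x1F;
    uint32_t man  = h & 0x3FF;
    uint32_t bits;
    if (exp == 0x1F) {                    /* Inf or NaN */
        bits = sign | 0x7F800000u | (man << 13);
    } else if (exp != 0) {                /* normal number: rebias */
        bits = sign | ((exp + (127 - 15)) << 23) | (man << 13);
    } else if (man == 0) {                /* signed zero */
        bits = sign;
    } else {                              /* subnormal: renormalize */
        uint32_t e = 127 - 15 + 1;
        while (!(man & 0x400)) { man <<= 1; e--; }
        bits = sign | (e << 23) | ((man & 0x3FF) << 13);
    }
    float f;
    memcpy(&f, &bits, sizeof f);          /* reinterpret the bit pattern */
    return f;
}
```

All Binary16 values are exactly representable in Binary32, so the conversion is lossless and needs no rounding logic at all.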


>>
>> Of course, that meant that it was implementable many times in
>> other systems *of comparable size*.
>>
>> But denormals were a pain in very small systems... and the
>> requirement for perfect rounding was a pain in larger systems
>> that aimed at the highest possible speed using either Goldschmidt
>> or Newton-Raphson for division.
> <
> Having watched this from inside:
> a) HW designers know a lot more about this today than in 1980
> b) even systems that started out as IEEE-format gradually went
> closer and closer to full IEEE-compliant (GPUs) until there is no
> useful difference in the quality of the arithmetic.
> c) once 754-2009 came out the overhead to do denorms went to
> zero, and there is no reason to avoid full speed denorms in practice.
> (BGB's small FPGA prototyping environment aside.)
> d) HW designers have learned how to perform all of the rounding
> modes at no overhead compared to RNE.

Probably.

I suspect the boards and FPGAs I am using are fairly common in the
hobbyist space.

But, yeah, I had started working on a dedicated FMAC unit at one point,
but partly shelved this effort for now, as my initial "viability tests"
didn't look promising (it would have been fairly expensive), and the
design would have had a higher latency for FADD and FMUL than my current
FPU.

John Dallman

unread,
Jul 23, 2021, 5:36:33 PM7/23/21
to
In article <fc5a33d0-7c17-4855...@googlegroups.com>,
jsa...@ecn.ab.ca (Quadibloc) wrote:

> Since IEEE 754, in its original form, simply ratified what Intel
> had planned to include in its forthcoming (but then secret)
> 8087, though, a knowledge of hardware certainly was...
> applied... to its contents, even if the committee that signed
> off on it did not include hardware experts.

At that stage, Intel do not seem to have understood very well how to make
/fast/ hardware. Remember the iAPX 432?

The obvious design goal of the 8087 was very compact code, leading to the
floating-point register stack, and its bad effects on IPC. Its arithmetic
works fine, but it seems to have been intended for assembly language
programming, rather than compiled languages.

John

BGB

unread,
Jul 23, 2021, 5:38:02 PM7/23/21
to
I would expect that for a physical system, the inherent "noisiness" of
physical reality would matter a lot more for the outcome than any
bit-exact arithmetic would help make the landing.

Some expertly crafted and computed flight plan could be thrown off by
much larger factors like which direction the wind was blowing during
liftoff, solar wind or flares blowing the spacecraft slightly off course,
collisions with space dust, ...

Presumably these missions have some sort of closed-loop control, such as
course-correction, ability to detect the distance from the target, ...


Granted, this is assuming it is using sufficient numerical precision.
Trying to do long range navigation using binary32 or similar is probably
just asking for it to crash into the surface (or miss the moon entirely,
say because jitter in the math miscalculated the location of the moon by
like 150km or something, ...).

MitchAlsup

unread,
Jul 23, 2021, 6:23:59 PM7/23/21
to
On Friday, July 23, 2021 at 4:38:02 PM UTC-5, BGB wrote:
> On 7/23/2021 12:45 PM, Quadibloc wrote:
> > On Friday, July 23, 2021 at 9:30:20 AM UTC-6, MitchAlsup wrote:
> >
> >> No, doubling the widths just hides the problem and makes the actual
> >> problem harder to find. You can take the point of view that that is
> >> enough, but it is not in reality.
> >
> > It's enough to get one Moon lander on the surface. But solving the
> > real problem is still better - but it's nice if there's a way to meet
> > deadlines and the like.
> >
> I would expect that for a physical system, the inherent "noisiness" of
> physical reality would matter a lot more for the outcome than any
> bit-exact arithmetic would help make the landing.
>
> Some expertly crafted and computed flight plan could be thrown off by
> much larger factors like which direction the wind was blowing during
> liftoff, solar wind or flares blowing the spacecraft slightly off course,
> collisions with space dust, ...
>
> Presumably these missions have some sort of closed-loop control, such as
> course-correction, ability to detect the distance from the target, ...
<
Only if you intend "closed loop" to incorporate ground based radars and
using a on spacecraft sextant to measure some star "angles", and having
ground based computers figure out the adjustments and radio them up
to the spacecraft, punching them into the computer and causing a correction
burn; as "closed loop".
<
But this is an important point:: many feedback systems have enough
time between corrections that the accuracy of the FP numbers did not
have to be "all that great".
>
>
> Granted, this is assuming it is using sufficient numerical precision.
> Trying to do long range navigation using binary32 or similar is probably
> just asking for it to crash into the surface (or miss the moon entirely,
> say because jitter in the math miscalculated the location of the moon by
> like 150km or something, ...).
<
Remember we are still flying fighter jets with computer controls where the
jet will invert itself when crossing the equator--all because it took too many
instructions to get either SIN() or COS() right.

Quadibloc

unread,
Jul 23, 2021, 7:22:44 PM7/23/21
to
On Friday, July 23, 2021 at 1:13:48 PM UTC-6, MitchAlsup wrote:
> On Friday, July 23, 2021 at 1:42:56 PM UTC-5, Anton Ertl wrote:

> > should give the same result as
> >
> > a=b
> <
> a==b

That's only a correction if you assume all pseudocode must be in C.

On my web page, I have defined a higher-level language where a
multiple assignment statement doesn't look like

a,b,c=5

or

a=b=c=5

but instead

. a/b/c=5

and there is a reason for that. Or maybe more than one reason:

1) The "." is a short form of the keyword "LET". By beginning all statements
with a keyword, the need for reserved words in the language can be completely
avoided, without complicated gyrations as done in FORTRAN.

2) A READ statement may look like this:

READ [5,10,END=99,ERR=999] X,Y,Z

therefore the syntax of the language must allow assignments, not just expressions,
to be passed as arguments to subroutines.

3) The following statement assigns _T (true) to L1 and L2 if I and J are equal,
and _F to L1 and L2 otherwise.

. L1/L2=I=J

Only the first equals sign in an assignment statement indicates an assignment.
Within the expression following that equals sign, any other equals signs are
equality operators.
Therefore, the equals sign can't be used as a separator for multiple assignments.
Neither can the comma, because then you're breaking the assignment into two
pieces; a READ statement may _also_ look like this:

READ [5,10,END/ERR=99] X,Y,Z

Essentially, the language is intended to look a lot like FORTRAN, but it
also borrows from AWK the idea of having a comma at the end of the text
on a line indicate a continuation (no semicolons after each statement).
Parentheses are used within expressions to control the order of evaluation,
for function argument lists, _and_ array subscripts; it's in the case of the
device, format clause in I/O statements that square brackets were needed
to avoid an ambiguity.

John Savard

BGB

unread,
Jul 23, 2021, 8:50:28 PM7/23/21
to
More or less, they would presumably have some way to determine if they
are on-course to the target, and some hydrazine rockets or similar to
allow for fine adjustment.


But, yeah, I meant as opposed to doing the space launch and then trying
to fly an entirely precomputed path and then assume that the spacecraft
gets to its destination if one did all the math correctly (and then have
it "all go south" because the wind changed direction on launch day
relative to what was calculated in the simulations; or because the
spacecraft was getting pushed by solar winds, ...).


>>
>>
>> Granted, this is assuming it is using sufficient numerical precision.
>> Trying to do long range navigation using binary32 or similar is probably
>> just asking for it to crash into the surface (or miss the moon entirely,
>> say because jitter in the math miscalculated the location of the moon by
>> like 150km or something, ...).
> <
> Remember we are still flying fighter jets with computer controls where the
> jet will invert itself when crossing the equator--all because it took too many
> instructions to get either SIN() or COS() right.
>

Hmm...

I think I had also remembered something like fighter jet computers
crashing when flying over parts of Jordan or similar, because they were
not meant to deal with the altitude going negative...

Chris M. Thomasson

unread,
Jul 23, 2021, 8:54:07 PM7/23/21
to
Skinwalker Ranch made a radar altimeter made by Garmin report around 40
feet to ground when they were several thousand feet off the ground.
5000 feet would be 10000 feet from sea level.

Quadibloc

unread,
Jul 24, 2021, 2:01:35 AM7/24/21
to
On Friday, July 23, 2021 at 4:23:59 PM UTC-6, MitchAlsup wrote:

> But this is an important point:: many feedback systems have enough
> time between corrections that the accuracy of the FP numbers did not
> have to be "all that great".

I remember specifically that one particular failed Ariane launch was
held up as what bad floating-point can cause... so, of course, while there
are cases where feedback makes things less of an issue, there are other
cases where numerical accuracy is critical.

John Savard

Quadibloc

unread,
Jul 24, 2021, 2:05:13 AM7/24/21
to
On Saturday, July 24, 2021 at 12:01:35 AM UTC-6, Quadibloc wrote:

> I remember specifically that one particular failed Ariane launch was
> held up as what bad floating-point can cause...

Ah. A Google search led me to what really happened.

The maiden flight of Ariane 5 failed because conversion from
floating-point to 16-bit integer caused an exception, as the
floating-point value was not within the range of such integers.

No doubt bad software design, but not really necessarily
about the kind of numerical analysis concern that IEEE 754
addresses.
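
For illustration (my sketch, not the Ariane code, which was Ada): the failure mode is an unguarded narrowing conversion. In C the out-of-range case is undefined behavior rather than an exception, so the range check is needed either way.

```c
#include <stdint.h>
#include <assert.h>

/* Convert a double to int16_t only when it is in range; report
   failure instead of converting (and instead of trapping/UB). */
static int to_int16_checked(double v, int16_t *out) {
    if (!(v >= INT16_MIN && v <= INT16_MAX))
        return 0;              /* would overflow: signal, don't convert */
    *out = (int16_t)v;         /* in range, truncates toward zero */
    return 1;
}
```

On Ariane 5 the horizontal-velocity value exceeded the 16-bit range (it could not on Ariane 4, where the code was originally validated), and the unhandled Operand Error shut down the inertial reference system.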

John Savard

BGB

unread,
Jul 24, 2021, 3:14:10 AM7/24/21
to
IME:
Inexact rounding is hardly ever the source of problems (excluding
integer-exact math in codecs, but this sort of stuff is typically
fixed-point).


However, more serious bugs, like unexpected overflow, saturation, or
traps, these sorts of things can ruin ones' day.

Likewise for algorithms whose domain is "all real numbers except zero",
which just so happens to fail catastrophically and nuke the program if
the input just so happens to be zero.


Insufficient precision can also be a source of problems.

Issues with the use of single precision coordinates in games trying to
deal with large worlds is a fairly common example (typically a lot of
workarounds are needed to make all the math work for worlds much larger
than a few km or so).


> John Savard
>

Thomas Koenig

unread,
Jul 24, 2021, 4:36:13 AM7/24/21
to
BGB <cr8...@gmail.com> schrieb:
> On 7/23/2021 12:45 PM, Quadibloc wrote:
>> On Friday, July 23, 2021 at 9:30:20 AM UTC-6, MitchAlsup wrote:
>>
>>> No, doubling the widths just hides the problem and makes the actual
>>> problem harder to find. You can take the point of view that that is
>>> enough, but it is not in reality.
>>
>> It's enough to get one Moon lander on the surface. But solving the
>> real problem is still better - but it's nice if there's a way to meet
>> deadlines and the like.
>>
>
> I would expect that for a physical system, the inherent "noisiness" of
> physical reality would matter a lot more for the outcome than any
> bit-exact arithmetic would help make the landing.
>
> Some expertly crafted and computed flight plan could be thrown off by
> much larger factors like which direction the wind was blowing during
> liftoff, solar wind or flares blowing the spacecraft slightly off course,
> collisions with space dust, ...

One major issue was gravitational anomalies of the Moon, which are
big enough to have a significant impact on the orbit of spacecraft.

> Presumably these missions have some sort of closed-loop control, such as
> course-correction, ability to detect the distance from the target, ...

Having a spacecraft travel around in space is an engineering problem.
You always have error bars on every measurement and assumption -
mass of your spacecraft, duration and efficiency of a burn, actual
mass expended on a burn, orientation, ...

Any of the above uncertainties is _much_ higher than the floating
point precision of a 32-bit real, let alone of a 64-bit real.

Of course, if you subtract two numbers of almost equal magnitude,
that could be much different...

This is why space missions have course corrections somewhere in
the middle of their trajectory. After sufficient time has elapsed
to gather data on the velocity and position of the spacecraft, but
not yet enough that a course correction would be too expensive,
you alter the spacecraft's velocity by a relatively small amount
of delta v.

> Granted, this is assuming it is using sufficient numerical precision.
> Trying to do long range navigation using binary32 or similar is probably
> just asking for it to crash into the surface (or miss the moon entirely,
> say because jitter in the math miscalculated the location of the moon by
> like 150km or something, ...).

A few years ago (pre-Corona) I visited the Space Center in Houston.
A tour guide for the Saturn-V on exhibition there told the group
that he had worked on the courses for Apollo, and that they had
actually mostly used analytical methods. I was a bit surprised
at that, I would have thought numerical solution of ODEs would
have been employed more.

Thomas Koenig

unread,
Jul 24, 2021, 4:38:27 AM7/24/21
to
MitchAlsup <Mitch...@aol.com> schrieb:

> Remember we are still flying fighter jets with computer controls where the
> jet will invert itself when crossing the equator--all because it took too many
> instructions to get either SIN() or COS() right.

From what I read, that bug was found and fixed in 1986.

Thomas Koenig

unread,
Jul 24, 2021, 4:58:02 AM7/24/21
to
MitchAlsup <Mitch...@aol.com> schrieb:
At the time it was designed, yes. Even Henry S. Warren, who worked
for IBM, has a scathing page dedicated to that particular decision.

At the time IEEE was formulated, the 32-bit systems had pretty
much taken over, I think (of course Cray used 64 bit for single
precision, they had no 32-bit format).

Anton Ertl

unread,
Jul 24, 2021, 5:27:29 AM7/24/21
to
Thomas Koenig <tko...@netcologne.de> writes:
>MitchAlsup <Mitch...@aol.com> schrieb:
>
>> IEEE 754 was put together by people who really understood FP arithmetic
>> but HW not so much.
>
>Who was on that committee?

Obviously enough hardware manufacturers that IEEE 754 was very quickly
implemented in most new hardware (often with microcode or software
assist for denormal numbers).

>I have read that Intel and Motorola were there, but what about
>Cray, CDC or IBM?

IBM certainly adopted IEEE 754. CDC was on the way out by that time.

Concerning Cray: the followons (Cray Y-MP) to the old Cray designs
were of course compatible with the old designs, but Cray (the company)
also did the T3D and T3E, which used Alpha CPUs and thus IEEE 754.
But I think the question was more what Crays customers wanted. They
had earlier bought machines that put speed above correctness (IIRC for
division), so Cray had a reason to believe that their customers would
also choose speed over IEEE 754. However, Cray suffered difficult
times in the 1980s, so maybe there were too few customers who agreed
with that choice.

Anyway, given that we have fast IEEE 754 hardware, the IEEE 754 people
either understood hardware enough or were lucky (I think it was the
former). Admittedly the dynamic rounding mode stuff is slow or
expensive to implement on OoO CPUs, and to a lesser degree on
pipelined CPUs, and who knows how well the people on the IEEE 754
committee understood that (pipelining was not widely used at the time,
and OoO even less), but I think that those who want dynamic rounding
mode changes prefer slowness to not having it at all, and those who
claim it is unnecessary should not care if it is slow; and the
hardware cost of slow dynamic rounding modes is small.

Anton Ertl

unread,
Jul 24, 2021, 6:17:58 AM7/24/21
to
j...@cix.co.uk (John Dallman) writes:
>At that stage, Intel do not seem to have understood very well how to make
>/fast/ hardware. Remember the iAPX 432?

The iAPX 432 seems to be the overambitious project that growing
companies tend to engage in, many of which fail. IBM Stretch did not
achieve its performance goals, either, and was also considered a
failure.

By contrast, the 8087 seems to be very successful, and its
architecture even more so, surviving until this day (although with
competition from SSE2 since its introduction in 2000 and especially
since AMD64 in 2003, where SSE2 became the standard FP instruction set).

>The obvious design goal of the 8087 was very compact code, leading to the
>floating-point register stack, and its bad effects on IPC.

What bad effects on IPC do you mean? Earlier this year
<2021Jan...@mips.complang.tuwien.ac.at>
<2021Jan...@mips.complang.tuwien.ac.at> I measured daxpy
implemented in C:

void daxpy(double ra, double *f_x, double *f_y, long stride, unsigned long ucount)
{
for (; ucount>0; ucount--) {
*f_y += ra * *f_x;
f_x = (double *)(((char *)f_x)+stride);
f_y = (double *)(((char *)f_y)+stride);
}
}

and compiled with

gcc -O -mfpmath=387

and with

gcc -O -mfpmath=sse

The resulting code was:

387 SSE2
start: start:
fld %st(0) movapd %xmm0,%xmm1
fmull (%rdi) mulsd (%rdi),%xmm1
faddl (%rsi) addsd (%rsi),%xmm1
fstpl (%rsi) movsd %xmm1,(%rsi)
add %rdx,%rdi add %rdx,%rdi
add %rdx,%rsi add %rdx,%rsi
sub $0x1,%rcx sub $0x1,%rcx
jne start jne start

Both loops execute at 2 cycles/iteration (IPC=4) on a Skylake. I
don't see a bad effect in IPC from the stack architecture of the 387,
despite the 387 having taken a back seat since AMD64 became the
dominant instruction set on Intel CPUs.

>Its arithmetic
>works fine, but it seems to have been intended for assembly language
>programming, rather than compiled languages.

Compilers can generate code for stack machines just fine, actually it
is very easy to generate code for stack machines. Therefore, the
Burroughs B5500 architecture was designed as a stack architecture to
go with the Burroughs Algol compiler. The transputer (another stack
machine) was designed in tandem with the Occam language and compiler.
Both machines were not intended to be programmed much in assembly
language.

One case where register machines have an advantage is for partial
redundancy elimination (a generalization of common subexpression
elimination and loop-invariant code motion). I am positively
surprised by gcc keeping ra on the stack in the 387 example above.
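
A minimal sketch of why stack-machine code generation is easy (my example, not Anton's): it is just a post-order walk of the expression tree -- emit code for both operands, then the operator, with no register allocation at all.

```c
#include <assert.h>

typedef struct Node { char op; double val; struct Node *l, *r; } Node;

enum { PUSH, ADD, MUL };
typedef struct { int op; double val; } Ins;

/* Emit stack code for n into out[k..]; returns the new count. */
static int emit(const Node *n, Ins *out, int k) {
    if (n->op == 0) {                     /* leaf: push the constant */
        out[k].op = PUSH; out[k].val = n->val; return k + 1;
    }
    k = emit(n->l, out, k);               /* left subtree */
    k = emit(n->r, out, k);               /* right subtree */
    out[k].op = (n->op == '+') ? ADD : MUL;
    out[k].val = 0.0;
    return k + 1;
}

/* Tiny stack-machine interpreter for the emitted code. */
static double run(const Ins *code, int n) {
    double st[64]; int sp = 0;
    for (int i = 0; i < n; i++) {
        switch (code[i].op) {
        case PUSH: st[sp++] = code[i].val; break;
        case ADD:  sp--; st[sp-1] += st[sp]; break;
        case MUL:  sp--; st[sp-1] *= st[sp]; break;
        }
    }
    return st[0];
}
```

For (2 + 3) * 4 this emits PUSH 2, PUSH 3, ADD, PUSH 4, MUL -- the same shape the 387's fld/fadd/fmul sequences take.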

Branimir Maksimovic

unread,
Jul 24, 2021, 6:31:20 AM7/24/21
to
Could you provide test program?
I have tested : https://github.com/siposcsaba89/eigen_fast_math_test
on M1
-rw-r--r-- 1 bmaxa staff 9 Jul 24 12:25 .gitignore
-rw-r--r-- 1 bmaxa staff 14120 Jul 24 12:26 CMakeCache.txt
drwxr-xr-x 13 bmaxa staff 416 Jul 24 12:26 CMakeFiles
-rw-r--r-- 1 bmaxa staff 453 Jul 24 12:25 CMakeLists.txt
-rw-r--r-- 1 bmaxa staff 1069 Jul 24 12:25 LICENSE
-rw-r--r-- 1 bmaxa staff 5418 Jul 24 12:26 Makefile
-rw-r--r-- 1 bmaxa staff 56 Jul 24 12:25 README.md
-rw-r--r-- 1 bmaxa staff 1550 Jul 24 12:26 cmake_install.cmake
-rw-r--r-- 1 bmaxa staff 476 Jul 24 12:25 main.cpp
-rwxr-xr-x 1 bmaxa staff 238077 Jul 24 12:26 test_mat_inverse
bmaxa@Branimirs-Air eigen_fast_math_test % ./test_mat_inverse
0.999303 0.0145721 -0.0343709 -0.00753874
-0.0145243 0.999893 0.00164037 0.52384
0.0343912 -0.00114001 0.999408 -5.68314
0 0 0 1
0.999303 -0.0145243 0.0343911 0.210591
0.0145721 0.999893 -0.00114001 -0.530153
-0.034371 0.00164037 0.999408 5.67865
-0 0 -0 1
1 4.93871e-19 1.89553e-18 -1.47451e-17
2.99736e-19 1 -3.26763e-19 -1.11022e-16
-5.65987e-18 1.09284e-19 1 1.77636e-15
0 0 0 1
and
bmaxa@Branimirs-Air eigen_perf_test % ls -la
total 2496
drwxr-xr-x 11 bmaxa staff 352 Jul 24 12:30 .
drwxr-xr-x 55 bmaxa staff 1760 Jul 24 12:29 ..
drwxr-xr-x 12 bmaxa staff 384 Jul 24 12:29 .git
-rw-r--r-- 1 bmaxa staff 12 Jul 24 12:29 .gitignore
-rw-r--r-- 1 bmaxa staff 14045 Jul 24 12:30 CMakeCache.txt
drwxr-xr-x 13 bmaxa staff 416 Jul 24 12:30 CMakeFiles
-rw-r--r-- 1 bmaxa staff 511 Jul 24 12:29 CMakeLists.txt
-rw-r--r-- 1 bmaxa staff 5609 Jul 24 12:30 Makefile
-rw-r--r-- 1 bmaxa staff 1540 Jul 24 12:30 cmake_install.cmake
-rwxr-xr-x 1 bmaxa staff 1235420 Jul 24 12:30 eigen_perf_test
-rw-r--r-- 1 bmaxa staff 1279 Jul 24 12:29 eigen_perf_test.cpp
bmaxa@Branimirs-Air eigen_perf_test % ./eigen_perf_test
Result: -0.000151013 -0.000110068 7.73141e-05 -1.32445e-05 1.02876e-05 -8.45934e-05
compiler is Clang, version 12.0.5
10000 iteration took 0.153339 seconds

Result: -6.36653e-08 -1.11854e-05 -1.19188e-05 -1.15604e-05 9.72996e-07 -1.43802e-05
compiler is Clang, version 12.0.5
100000 iteration took 1.28075 seconds

Result: -1.06923e-07 6.38881e-07 -3.13472e-07 6.69333e-07 -1.99308e-07 -7.20017e-07
compiler is Clang, version 12.0.5
1000000 iteration took 12.8297 seconds
https://github.com/siposcsaba89/eigen_perf_test


--
bmaxa now listens Smile (explicit version) by Lily Allen from Triple J: Hottest 100, Volume 14

Ivan Godard

unread,
Jul 24, 2021, 7:55:36 AM7/24/21
to
On 7/23/2021 1:13 PM, MitchAlsup wrote:
> On Friday, July 23, 2021 at 2:22:13 PM UTC-5, Thomas Koenig wrote:
>> MitchAlsup <Mitch...@aol.com> schrieb:
>>> IEEE 754 was put together by people who really understood FP arithmetic
>>> but HW not so much.
> <
>> Who was on that committee?
> <
> Ivan will be here shortly to list the perpetrators.

Can't say for the original 754 - I wasn't involved until the '00s

Anton Ertl

unread,
Jul 24, 2021, 12:51:44 PM7/24/21
to
Branimir Maksimovic <branimir....@gmail.com> writes:
>> Both loops execute at 2 cycles/iteration (IPC=4) on a Skylake. I
...
>Could you provide test program?

http://www.complang.tuwien.ac.at/anton/tmp/axpy.zip

Today I measure 1.8 cycles per iteration (IPC=4.44) from this program
(both compiled for 387 and SSE2) on a Skylake. Strange.

BGB

unread,
Jul 24, 2021, 2:03:35 PM7/24/21
to
On 7/24/2021 3:36 AM, Thomas Koenig wrote:
> BGB <cr8...@gmail.com> schrieb:
>> On 7/23/2021 12:45 PM, Quadibloc wrote:
>>> On Friday, July 23, 2021 at 9:30:20 AM UTC-6, MitchAlsup wrote:
>>>
>>>> No, doubling the widths just hides the problem and makes the actual
>>>> problem harder to find. You can take the point of view that that is
>>>> enough, but it is not in reality.
>>>
>>> It's enough to get one Moon lander on the surface. But solving the
>>> real problem is still better - but it's nice if there's a way to meet
>>> deadlines and the like.
>>>
>>
>> I would expect that for a physical system, the inherent "noisiness" of
>> physical reality would matter a lot more for the outcome than any
>> bit-exact arithmetic would help make the landing.
>>
>> Some expertly crafted and computed flight plan could be thrown off by
>> much larger factors like which direction the wind was blowing during
>> liftoff, solar wind or flares blowing the spacecraft slightly off course,
>> collisions with space dust, ...
>
> One major issue was gravitational anomalies of the Moon, which are
> big enough to have a significant impact on the orbit of spacecraft.
>

Makes sense, but yeah, it further supports the point that unpredictable
physical variables are a lot more likely to factor into mission success
or failure than whether or not floating-point operations have exact
rounding.


In cases where it matters, it is usually more a case of wanting multiple
implementations to give results which are consistent, rather than
necessarily maximizing accuracy.

One could have a floating point definition which specifies the use of
truncation, and potentially how the results are truncated internally
during operations, rather than one which assumes an infinitely precise
result.


For example, what if we defined double-precision FADD/FSUB relative to
the behavior of a 64-bit twos complement integer mantissa?...

Or, FMUL relative to a triangular multiplier which multiplies two 56-bit
inputs and produces a 56-bit (truncated) output?...


Not necessarily as a "universal standard", but rather as a target which
can be:
Relatively cost-effectively be implemented bit-exact in hardware;
Can be mostly emulated in software mostly using 64-bit integer
operations (*1).


*1: Well, sorta; FMUL would need special treatment to be bit-exact. One
would need to make some extra effort to approximate the behavior of a
truncated multiplier (naively using a 64*64->128 bit widening multiplier
and discarding the low bits would not produce bit-exact results).

Bit-exact would require getting the same results for the bits "hanging
off the bottom", with such a multiplier probably being defined relative
to the behavior of a collection of 16*16->32 sub-multipliers or similar.

There are other options, such as approximating a "smooth bottom"
truncated result, but this is not likely to be cost-effective relative
to the more "jagged" version.

...


Worthwhile is also debatable, as an FPU built to be able to support,
say, an FP96 format, would not produce bit-identical results with one
built to only support Binary64/Double, without special case logic to
artificially truncate or discard bits in the intermediate results, ...

Granted, an "FP standard" whose definitions depend on the size of the
largest output format supported by the FPU in question seems "kinda
useless" on this front.

The "cheaper" alternative is to not make any requirements here, and give
rounding behavior in terms of a probability.


As can be noted though, integer multiply via FP generally survives such
a multiplier because inputs which produce an in-range result will have
zeroes in the low-order parts of the mantissa, and thus all the
sub-multiplies which would have "fallen off the bottom" would have
contained multiplies against zero (and thus not had a visible effect on
the result).
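
The "jagged" truncated-multiplier idea can be sketched as follows (my illustration, scaled down to a 32x32->32 high-half product rather than the 56-bit mantissa case): build the product from 16x16->32 partials and simply omit the low*low sub-multiplier. The result is the exact high half or one less, since the dropped partials can contribute at most one carry into bit 32.

```c
#include <stdint.h>
#include <assert.h>

/* Truncated high-half multiply: the al*bl partial product (the bits
   "hanging off the bottom") is dropped entirely, as a hardware design
   omitting that sub-multiplier would do. */
static uint32_t mul_hi_trunc(uint32_t a, uint32_t b) {
    uint32_t al = a & 0xFFFF, ah = a >> 16;
    uint32_t bl = b & 0xFFFF, bh = b >> 16;
    uint64_t mid = (uint64_t)ah * bl + (uint64_t)al * bh; /* cross terms */
    uint64_t hi  = (uint64_t)ah * bh;                     /* top term */
    /* al*bl is discarded; mid's low 16 bits are also truncated away */
    return (uint32_t)(hi + (mid >> 16));
}
```

Defining the format against this exact "jagged" behavior (rather than against an infinitely precise intermediate) is what makes it bit-exactly reproducible both in cheap hardware and in 64-bit integer software emulation.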


>> Presumably these missions have some sort of closed-loop control, such as
>> course-correction, ability to detect the distance from the target, ...
>
> Having a spacecraft travel around in space is an engieering problem.
> You always have error bars on every measurement and assumption -
> mass of your spacecraft, duration and efficiency of a burn, actual
> mass expended on a burn, orientation, ...
>
> Any of the above uncertainties is _much_ higher than the floating
> point precision of a 32-bit real, let alone of a 64-bit real.
>
> Of course, if you subtract two numbers of almost equal magnigude,
> that could be much different...
>
> This is why space missions have course corrections somehwere in
> the middle of their trajectory. After sufficient time has elapsed
> to gather data on the velocity and position of the spacecraft, but
> not yet enough that a course correction would be too expensive,
> you alter the spacecraft's velocity by a relatively small amount
> of delta v.
>

Makes sense.


>> Granted, this is assuming it is using sufficient numerical precision.
>> Trying to do long range navigation using binary32 or similar is probably
>> just asking for it to crash into the surface (or miss the moon entirely,
>> say because jitter in the math miscalculated the location of the moon by
>> like 150km or something, ...).
>
> A few years ago (pre-Corona) I visited the Space Center in Houston.
> A tour guide for the Saturn-V on exhibition there told the group
> that he had worked on the courses for Apollo, and that they had
> actually mostly used analytical methods. I was a bit surprised
> at that, I would have thought numerical solution of ODEs would
> have been employed more.
>

IME:

Most of what one might want to calculate in practice, can be done using
algebra.

If it can't be done directly, one can subdivide it into smaller
timesteps and do it incrementally. At small enough timesteps, pretty
much everything becomes linear.

Similarly, a lot of stuff one could do with ODE's could instead be done
using B-splines or similar, ...


Also, divide is one of those operations one generally wants to avoid
when possible, not just for speed reasons, but because it is more prone
to adding instability: if you divide by a number which happens to
approach 0, then the numbers involved can get huge. Calculations which
give wonky results, or spit out Inf, NaN, or raises a fault when given
certain inputs, are not desirable.

Usually better when possible to find an alternative which avoids the use
of division, or at least eliminate cases where divide-by-zero exists as
a possibility.

Though, this does seem to be one of those points of division between
doing math on a computer relative to traditional mathematics, which
likes to throw in divide operators all over the place.

MitchAlsup

unread,
Jul 24, 2021, 2:26:06 PM7/24/21
to
On Saturday, July 24, 2021 at 1:03:35 PM UTC-5, BGB wrote:
> On 7/24/2021 3:36 AM, Thomas Koenig wrote:
> > BGB <cr8...@gmail.com> schrieb:
<snip>
>
> Also, divide is one of those operations one generally wants to avoid
> when possible, not just for speed reasons, but because it is more prone
> to adding instability: if you divide by a number which happens to
> approach 0, then the numbers involved can get huge. Calculations which
> give wonky results, or spit out Inf, NaN, or raises a fault when given
> certain inputs, are not desirable.
<
There were several cases in my transcendental studies where one could
write a Newton-Raphson iteration using SQRT(x) or 1/SQRT(x); and in every
case I looked at, the one using 1/SQRT(x) converged faster.
>
> Usually better when possible to find an alternative which avoids the use
> of division, or at least eliminate cases where divide-by-zero exists as
> a possibility.
>
> Though, this does seem to be one of those points of division between
> doing math on a computer relative to traditional mathematics, which
> likes to throw in divide operators all over the place.
<
In traditional math, DIV is "just another operator" with exactly the same
domain and range as any other operator (just like differentiate and
integrate are other operators along with DIV and CURL.) all being of
infinite precision and perfect arithmetic.
<
Computer arithmetics don't have those properties
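
The division-free Newton-Raphson step Mitch mentions for r = 1/SQRT(x) can be sketched as follows (my example; the seed here is just a hand-picked constant, a real implementation would use a table or bit-level estimate):

```c
#include <math.h>
#include <assert.h>

/* Newton-Raphson for r = 1/sqrt(x): r <- r * (1.5 - 0.5 * x * r * r).
   No divides anywhere in the iteration; convergence is quadratic,
   roughly doubling the correct digits per step given a decent seed. */
static double rsqrt_newton(double x, double seed, int iters) {
    double r = seed;
    for (int i = 0; i < iters; i++)
        r = r * (1.5 - 0.5 * x * r * r);
    return r;
}
```

Once 1/sqrt(x) is in hand, sqrt(x) is just x * r and 1/x is r * r, which is one reason the reciprocal-square-root form is the more useful primitive.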

Thomas Koenig

unread,
Jul 24, 2021, 3:17:13 PM7/24/21
to
BGB <cr8...@gmail.com> schrieb:

> IME:
>
> Most of what one might want to calculate in practice, can be done using
> algebra.
>
> If it can't be done directly, one can subdivide it into smaller
> timesteps and do it incrementally. At small enough timesteps, pretty
> much everything becomes linear.

You just described the first-order Euler method of solving ODEs :-)
>
> Similarly, a lot of stuff one could do with ODE's could instead be done
> using B-splines or similar, ...

An ordinary differential equation is a relation between different
quantities that you can solve given starting (and/or boundary)
conditions. There are many methods - Euler, higher-order
Runge-Kutta (hopefully with adaptive stepsize control),
predictor-corrector methods, Richardson extrapolation, Adams,
implicit vs. explicit, ...
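
The first-order explicit Euler method -- the "small enough timesteps, everything becomes linear" approach described above -- is just y += h * f(t, y), as in this minimal sketch (my example, applied to dy/dt = -y, whose exact solution is exp(-t)):

```c
#include <math.h>
#include <assert.h>

/* Integrate dy/dt = f(t, y) from t0 to t1 with fixed-step
   explicit Euler, starting from y(t0) = y0. */
static double euler(double (*f)(double, double),
                    double y0, double t0, double t1, int steps) {
    double h = (t1 - t0) / steps;
    double y = y0, t = t0;
    for (int i = 0; i < steps; i++) {
        y += h * f(t, y);   /* the linear step */
        t += h;
    }
    return y;
}

static double decay(double t, double y) { (void)t; return -y; }
```

The global error is O(h), which is why the fancier methods in the list exist: they buy far more accuracy per function evaluation.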

Solving an ODE via B-splines could be done via error minimization,
which would lead you towards the field of finite elements.

> Also, divide is one of those operations one generally wants to avoid
> when possible, not just for speed reasons, but because it is more prone
> to adding instability: if you divide by a number which happens to
> approach 0, then the numbers involved can get huge.

You can not reasonably avoid division, nor should you - the problem
is usually not the division itself, but the subtraction beforehand.

However, if you have something like

dfdx(x) = (f(xn) - f(x)) / (xn - x)

you can play some games of adjusting your stepsize so that xn-x
can be represented exactly as a floating-point number, but
frankly, it usually isn't worth the bother.


>Calculations which
> give wonky results, or spit out Inf, NaN, or raises a fault when given
> certain inputs, are not desirable.
>
> Usually better when possible to find an alternative which avoids the use
> of division, or at least eliminate cases where divide-by-zero exists as
> a possibility.

If your stepsize becomes zero, you're in a heap of trouble
already and probably should have errored out long before.

Chris M. Thomasson

unread,
Jul 24, 2021, 3:44:45 PM7/24/21
to
Fwiw, I did a fractal encoding thing for pure fun where I map symbols to
complex roots. The code only encodes/decodes data in a Julia set with a
power of 16 to conveniently map into hexbytes. This is a pure
experiment, but it works. Rounding errors and other floating point
issues can make it crap out by decrypting part of the plaintext with
some "junk" on it. So, I found an arbitrary precision lib in JavaScript,
Decimal.js, and put it up online. Here it is:

http://fractallife247.com/test/rifc_cipher

A ciphertext is a single complex number. For instance here is one that
contains my name:

real:
-0.70928383564905214400492596591643890200098992164665980782966227733203960188288097070737389345985516069300117982413622497654113697

imag:
0.75006448767684071252250616852543657203512420946592887596427538863664520584158627985390890157772176873489867565028334553930789721

It is hardcoded to use 128 points of precision.

To decrypt it, you copy and paste the real and imaginary parts into
their respective textboxes and click decrypt. Can you get it to work?

This is sensitive to floating point issues. Also, it can take a long
time to compute for larger plaintexts. I really need to put the
processing into a WebWorker, or even in an animation loop where each
frame decodes a bit of the work. Right now, it "freezes" the ui during
long computations.

Oh, btw, here is an implementation of it using C++ and doubles:

https://github.com/ChrisMThomasson/fractal_cipher/blob/master/RIFC/cpp/ct_rifc_sample.cpp

Iirc the C++ code can handle different powers, even negative ones. It's
not hardcoded to 16 symbols.

Branimir Maksimovic

unread,
Jul 24, 2021, 4:10:32 PM7/24/21
to
On 2021-07-24, Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> Branimir Maksimovic <branimir....@gmail.com> writes:
>>> Both loops execute at 2 cycles/iteration (IPC=4) on a Skylake. I
> ...
>>Could you provide test program?
>
> http://www.complang.tuwien.ac.at/anton/tmp/axpy.zip
>
> Today I measure 1.8 cycles per iteration (IPC=4.44) from this program
> (both compiled for 387 and SSE2) on a Skylake. Strange.
>
> - anton
Heh, I just need timing code now:
in gas, ARMv8 equivalent of rdtsc:
init_time:
mrs x0,CNTPCT_EL0 ; counter
adrp x8,elapsed@PAGE
str x0, [x8,elapsed@PAGEOFF]
ret
time_me:
mrs x8,cntfrq_el0 ; clock
ucvtf d1,x8
mrs x8,CNTPCT_EL0 ; counter
adrp x9,elapsed@PAGE
ldr x9,[x9,elapsed@PAGEOFF]
sub x8,x8,x9
ucvtf d0,x8
fdiv d0,d0,d1
str d0,[sp]
b _printf
Just I dunno how to measure ticks in C :P
modified also flags for gcc:
bmaxa@Branimirs-Air axpy % cat Makefile
all: axpy-sse axpy-387

axpy-sse: axpy-sse.o axpy-main-sse.o
gcc axpy-sse.o axpy-main-sse.o -o $@

axpy-387: axpy-387.o axpy-main-387.o
gcc axpy-387.o axpy-main-387.o -o $@


axpy-sse.o: axpy.c
gcc-11 -c -O -march=armv8.4-a+simd -D"Float=double" axpy.c -o $@

axpy-387.o: axpy.c
gcc-11 -c -O -march=armv8.4-a -D"Float=double" axpy.c -o $@

axpy-main-sse.o: axpy-main.c
gcc-11 -c -O -march=armv8.4-a+simd -D"Float=double" axpy-main.c -o $@

axpy-main-387.o: axpy-main.c
gcc-11 -c -O -march=armv8.4-a -D"Float=double" axpy-main.c -o $@


--
bmaxa now listens Sex & Violence by Exploited from Totally Exploited

Quadibloc

unread,
Jul 24, 2021, 5:24:21 PM7/24/21
to
On Friday, July 23, 2021 at 2:13:02 PM UTC-6, MitchAlsup wrote:

> IBM's was worse than simply radix 16, it was also a truncation system
> with a guard digit (½ a byte) in calculations. To make matters worse, its
> competitors were 36-bit, 48-bit, and 60-bit (single precision).

And it originally went out the door without even the guard digit: the
results of arithmetic done that way were so bad, they had to fix that
in every computer they sold as well as in all future ones.

I hadn't looked at things that way, though, as far as precision went.

Since, unlike the STRETCH, the IBM 360 never went in for *bit*
addressing, they could have designed it with a 36-bit word.

Packed decimal would be less efficient, with one wasted bit in
each byte; but at least in one of those bytes, that bit could be
used for the sign, instead of using a whole digit for that.

If one wanted lower-case, putting characters in 9-bit bytes would
be just fine. The punched card code, though, would now be more
complicated, since with 12, 11, 0, 8, and 9 used independently,
there were seven bits left to encode the remaining three bits with
one punch or none.

Using another punch for an additional binary bit means that only six
are left, so at least one combination with two of them punched would
also have to be allowed.

One could use 4 as the additional punch, and 2 and 6 as the additional
combination of the rest, this would help keep the card strong.

But with a 36-bit word, obviously an additional complication to the
instruction set would have been needed: now the computer would
have to be able to handle not just character instructions for 9-bit
upper-case only characters, but also for 6-bit characters, three to
a halfword, since upper-case only characters were what was usually
used, and if they could be handled efficiently, it would be insisted
upon.

As individual packed decimal digits couldn't be addressed, the fact that
one could only address these characters in multiples of three shouldn't
be too much of an issue; there would be a way to unpack them into
nine-bit characters when required.

John Savard

Branimir Maksimovic

unread,
Jul 24, 2021, 7:25:27 PM7/24/21
to
Here it is on M1:
bmaxa@Branimirs-Air axpy % make
gcc-11 -c -O -march=armv8.4-a+simd -D"Float=double" axpy.c -o axpy-sse.o
gcc-11 -c -O -march=armv8.4-a+simd -D"Float=double" axpy-main.c -o axpy-main-sse.o
as timing.gas -o timing.o
gcc axpy-sse.o axpy-main-sse.o timing.o -o axpy-sse
gcc-11 -c -O -march=armv8.4-a -D"Float=double" axpy.c -o axpy-387.o
gcc-11 -c -O -march=armv8.4-a -D"Float=double" axpy-main.c -o axpy-main-387.o
gcc axpy-387.o axpy-main-387.o timing.o -o axpy-387
bmaxa@Branimirs-Air axpy % ./axpy-387
stride: 0.000004 secs
axpy: 0.356832 secs
bmaxa@Branimirs-Air axpy % ./axpy-sse
stride: 0.000004 secs
axpy: 0.357641 secs
bmaxa@Branimirs-Air axpy % cat axpy-main.c
#include <stdlib.h>
void axpy(Float ra, Float *f_x, Float *f_y, long stride, unsigned long ucount);
extern void init_time(void);
extern void time_me(const char* format);
int main()
{
long stride=10;
long i;
char *x=malloc(16000);
char *y=malloc(16000);
char *px=x, *py=y;
init_time();
for (i=0; i<1000; i++) {
*(Float *)px=1.0;
*(Float *)py=0.0;
px+=stride;
py+=stride;
}
time_me("stride: %f secs\n");
init_time();
for (i=0; i<1000000; i++)
axpy(1.000001, (Float *)x, (Float *)y, stride, 1000);
time_me("axpy: %f secs\n");
}

bmaxa@Branimirs-Air axpy % cat timing.gas
.text
.globl _init_time
.globl _time_me
.align 4
_init_time:
mrs x0,CNTPCT_EL0 ; counter
adrp x8,elapsed@PAGE
str x0, [x8,elapsed@PAGEOFF]
ret
_time_me:
mrs x8,cntfrq_el0 ; clock
ucvtf d1,x8
mrs x8,CNTPCT_EL0 ; counter
adrp x9,elapsed@PAGE
ldr x9,[x9,elapsed@PAGEOFF]
sub x8,x8,x9
ucvtf d0,x8
fdiv d0,d0,d1
str d0,[sp]
b _printf
.data
.bss
.align 8
elapsed: .space 8


--
bmaxa now listens 03. Yorgos Kazantzis - Sorocos

Quadibloc

unread,
Jul 24, 2021, 11:40:14 PM7/24/21
to
On Saturday, July 24, 2021 at 3:24:21 PM UTC-6, Quadibloc wrote:

> Since, unlike the STRETCH, the IBM 360 never went in for *bit*
> addressing, they could have designed it with a 36-bit word.

I've tried to imagine what a 360 with a nine-bit byte might be like.

I felt that it might be possible to design the RX format so that
displacements could grow from 12 bits to 15 bits:

opcode: 9 bits
destination register: 4 bits
index register: 4 bits
base register: 4 bits
displacement: 15 bits

But that would mean the RR format would have to look
like this:

opcode: 9 bits
destination register: 4 bits
source register: 4 bits
opcode: 1 bit

And the SS format instructions would be a mess:

opcode: 8 bits
source base register: 1 bit
length: 8 bits
destination base register: 4 bits
destination address: 15 bits
source base register (continued): 3 bits
source address: 15 bits

so I might just have to resign myself to 14-bit
displacements instead.

John Savard

Quadibloc

unread,
Jul 25, 2021, 2:56:44 AM7/25/21
to
On Saturday, July 24, 2021 at 9:40:14 PM UTC-6, Quadibloc wrote:

> so I might just have to resign myself to 14-bit
> displacements instead.

But, on the other hand, if I do _that_, I end up with
opcode fields that are far larger than necessary.

John Savard

Terje Mathisen

unread,
Jul 25, 2021, 4:19:43 AM7/25/21
to
Ivan Godard wrote:
> On 7/23/2021 5:19 AM, Quadibloc wrote:
>> On Friday, July 23, 2021 at 4:54:17 AM UTC-6, Ivan Godard wrote:
>>> Some people care if the moon lander
>>> lands on the moon surface, or ten meters above or below it.
>>
>> That's definitely a good thing to care about.
>>
>> Does the fancy stuff in IEEE 754 really help with that?
>>
>> Or would doing everything in double precision, even if one
>> were using something like the old System/360 floating
>> format, do a better job?
>>
>> Accurate and reliable calculations by computers are a very
>> important thing. How much the approach taken by IEEE 754
>> contributes to that goal, let alone the more elaborate notions
>> presented by people like John Gustavson, is, however, an open
>> question, I would think.
>>
>> John Savard
>>
>
>
> Back when I was active on the IEEE committee, I once asked Kahan
> whether, if quad (128-bit) were as fast as double, would he still have
> denorms. He answered an unequivocal "No!".

The funny part is of course that since then, due to the universal
inclusion of FMAC, denorms no longer have any speed penalty, just a
single-digit percentage gate increase.

I.e. no reason to skip it even on quad where I'm guessing the huge FMAC
post-normalization network would be one of the largest single features.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Branimir Maksimovic

unread,
Jul 25, 2021, 4:37:21 AM7/25/21
to
Hm, seems that on ARMv8 simd can't be switched off, compiler produces identical code
for both cases.. -mfpu option is also not present on aarch64...

--
bmaxa now listens Volim Te by Lollobrigida from Lollobrigida Inc.

Thomas Koenig

unread,
Jul 25, 2021, 5:11:25 AM7/25/21
to
Quadibloc <jsa...@ecn.ab.ca> schrieb:
> On Friday, July 23, 2021 at 2:13:02 PM UTC-6, MitchAlsup wrote:
>
>> IBM's was worse than simply radix 16, it was also a truncation system
>> with a guard digit (½ a byte) in calculations. To make matters worse, its
>> competitors were 36-bit, 48-bit, and 60-bit (single precision).
>
> And it originally went out the door without even the guard digit: the
> results of arithmetic done that way were so bad, they had to fix that
> in every computer they sold as well as in all future ones.
>
> I hadn't looked at things that way, though, as far as precision went.
>
> Since, unlike the STRETCH, the IBM 360 never went in for *bit*
> addressing, they could have designed it with a 36-bit word.

That was not in the cards.

Gene Amdahl wanted a 24-bit machine, but got overruled by
management because they insisted on a power of two for the number
of bits, and because the 7-bit ASCII standard was already on
the horizon.

If he had had his way, the /360 would have gone the way of the
CDC 3000 series and other 24-bit systems a long time ago.

There are ample examples how to deal with a 36-bit words:
The IBM 701 and following and the PDP-10, for example.

Terje Mathisen

unread,
Jul 25, 2021, 7:05:20 AM7/25/21
to
MitchAlsup wrote:
> Having watched this from inside:
> a) HW designers know a lot more about this today than in 1980
> b) even systems that started out as IEEE-format gradually went
> closer and closer to full IEEE-compliant (GPUs) until there is no
> useful difference in the quality of the arithmetic.
> c) once 754-2009 came out the overhead to do denorms went to
> zero, and there is no reason to avoid full speed denorms in practice.
> (BGB's small FPGA prototyping environment aside.)

I agree.

> d) HW designers have learned how to perform all of the rounding
> modes at no overhead compared to RNE.

This is actually dead easy since all the other modes are easier than
RNE: As soon as you have all four bits required for RNE (i.e.
sign/ulp/guard/sticky) then the remaining rounding modes only need
various subsets of these, so you use the rounding mode to route one of 5
or 6 possible 16-entry one-bit lookup tables into the rounding circuit
where it becomes the input to be added into the ulp position of the
final packed (sign/exp/mantissa) fp result.

Since the hidden bit is already hidden at this point, any rounding
overflow of the mantissa from 0xfff.. to 0x000.. will cause the exponent
term to be incremented, possibly all the way to Inf. In all cases, this
is exactly the correct behaviour.

Stefan Monnier

unread,
Jul 25, 2021, 9:19:51 AM7/25/21
to
> Gene Amdahl wanted a 24 - bit machine, but got overruled by
> management because they insisted on a power of two for the number
> of bits, and because the 7-bit ASCII standard was already on
> the horizon.

A good reminder that management's decisions can be quite sane, even
when they displease the engineers.


Stefan

Quadibloc

unread,
Jul 25, 2021, 10:21:29 AM7/25/21
to
That all depends. If Gene Amdahl wanted to build something like,
say, an SDS 9300, yes, management was right, I will agree.

However, IBM also built the AN/FSQ-31 and AN/FSQ-32. These were
48-bit machines, and if the IBM 360 had looked something like them,
it could well have been just as successful.

John Savard

Anton Ertl

unread,
Jul 25, 2021, 11:35:09 AM7/25/21
to
Branimir Maksimovic <branimir....@gmail.com> writes:
>Hm, seems that on ARMv8 simd can't be switched off, compiler produces indentical code
>for both cases.. -mfpu option is also not present on aarch64...

It will be once you can choose between Neon and SVE (and maybe
Helium).

Even if the code for axpy contains a vectorized variant for stride=8,
this variant will not run, because stride=10 (because this test has
originally been written to test the speed difference between 80-bit
387 FP and 64-bit 387 FP). I very much doubt that they perform
autovectorization for stride=10.

Testing this with gcc-10.2 and clang-11.0 on AMD64 with -O3 -mavx2,
gcc produces a simple scalar loop, while clang produces an unrolled
scalar loop (no vectorized variants to be seen).

Michael S

unread,
Jul 25, 2021, 11:47:59 AM7/25/21
to
IIRC, clang supports -fno-vectorize

BGB

unread,
Jul 25, 2021, 12:14:08 PM7/25/21
to
On 7/25/2021 6:05 AM, Terje Mathisen wrote:
> MitchAlsup wrote:
>> Having watched this from inside:
>> a) HW designers know a lot more about this today than in 1980
>> b) even systems that started out as IEEE-format gradually went
>> closer and closer to full IEEE-compliant (GPUs) until there is no
>> useful difference in the quality of the arithmetic.
>> c) once 754-2009 came out the overhead to do denorms went to
>> zero, and there is no reason to avoid full speed denorms in practice.
>> (BGB's small FPGA prototyping environment aside.)
>
> I agree.
>
>> d) HW designers have learned how to perform all of the rounding
>> modes at no overhead compared to RNE.
>
> This is actually dead easy since all the other modes are easier than
> RNE: As soon as you have all four bits required for RNE (i.e.
> sign/ulp/guard/sticky) then the remaining rounding modes only need
> various subsets of these, so you use the rounding mode to route one of 5
> or 6 possible 16-entry one-bit lookup tables into the rounding circuit
> where it becomes the input to be added into the ulp position of the
> final packed (sign/exp/mantissa) fp result.
>

Oddly enough, the extra cost to rounding itself is not the main issue
with multiple rounding modes, but more the question of how the bits get
there (if one doesn't already have an FPU status register or similar).

Granted, could in theory put these bits in SR or similar, but, yeah...

It would be better IMO if it were part of the instruction, but there
isn't really any good / non-annoying way to encode this. Probably the
"least awful" would probably be to use an Op64 encoding, which then uses
some of the Immed extension bits to encode a rounding mode.


* FFw0_00ii_F0nm_5eo8 FADD Rm, Ro, Rn, Imm8
* FFw0_00ii_F0nm_5eo9 FSUB Rm, Ro, Rn, Imm8
* FFw0_00ii_F0nm_5eoA FMUL Rm, Ro, Rn, Imm8

Where the Imm8 field encodes the rounding mode, say:
00 = Round to Nearest.
01 = Truncate.

Or could go the SR route, but I don't want FPU behavior to depend on SR.


> Since the hidden bit is already hidden at this point, andy rounding
> overflow of the mantissa from 0xfff.. to 0x000.. will cause the exponent
> term to be incremented, possibly all the way to Inf. In all cases, this
> is the exactly correct behaviour.
>

Yep.

Main limiting factor though is that for bigger formats (Double or FP96),
propagating the carry that far can be an issue.

In the vast majority of cases, the carry gets absorbed within the low 8
or 16 bits or so (or if it doesn't, leave these bits as-is).

For narrowing conversions to Binary16 or Binary32, full width rounding
is both easier and more useful.



For FADD/FSUB, the vast majority of cases where a very long stream of
1's would have occurred can be avoided by doing the math internally in
twos complement form.

Though, in this case, one can save a little cost by implementing the
"twos complement" as essentially ones' complement with a carry bit input
to the adder (one can't arrive at a case where both inputs are negative
with FADD).


Cases can occur, though, where the result mantissa comes up negative,
which can itself require a sign inversion. The only alternative
is to compare mantissa input values by value if the exponents are equal,
which is also fairly expensive.

Though, potentially one could use the rounding step to "absorb" part of
the cost of the second sign inversion.

Another possibility here could be to have an adder which produces two
outputs, namely both ((A+B)+Cin) and (~(A+B)+(!Cin)), and then using the
second output if the first came up negative.

...

MitchAlsup

unread,
Jul 25, 2021, 1:23:00 PM7/25/21
to
And this is why they are put in control/status registers.
<
< Probably the
> "least awful" would probably be to use an Op64 encoding, which then uses
> some of the Immed extension bits to encode a rounding mode.
<
The argument against having them in instructions is that this prevents
someone from running the code several times with different rounding
modes set to detect any sensitivity to the actually chosen rounding mode.
Kahan said he uses this a lot.
>
>
> * FFw0_00ii_F0nm_5eo8 FADD Rm, Ro, Rn, Imm8
> * FFw0_00ii_F0nm_5eo9 FSUB Rm, Ro, Rn, Imm8
> * FFw0_00ii_F0nm_5eoA FMUL Rm, Ro, Rn, Imm8
>
> Where the Imm8 field encodes the rounding mode, say:
> 00 = Round to Nearest.
> 01 = Truncate.
>
> Or could go the SR route, but I don't want FPU behavior to depend on SR.
<
When one has multi-threading and control/status register, one simply
reads the RM field and delivers it to the FU as an operand. A couple
of interlock checks mean you don't really have to stall the pipeline
because these modes don't change all that often.
<
> > Since the hidden bit is already hidden at this point, any rounding
> > overflow of the mantissa from 0xfff.. to 0x000.. will cause the exponent
> > term to be incremented, possibly all the way to Inf. In all cases, this
> > is exactly the correct behaviour.
> >
> Yep.
>
> Main limiting factor though is that for bigger formats (Double or FP96),
> propagating the carry that far can be an issue.
<
Koogie-Stone adders !
>
> In the vast majority of cases, the carry gets absorbed within the low 8
> or 16 bits or so (or if it doesn't, leave these bits as-is).
>
> For narrowing conversions to Binary16 or Binary32, full width rounding
> is both easier and more useful.
>
>
>
> For FADD/FSUB, the vast majority of cases where a very long stream of
> 1's would have occured can be avoided by doing the math internally in
> twos complement form.
>
> Though, in this case, one can save a little cost by implementing the
> "twos complement" as essentially ones' complement with a carry bit input
> to the adder (one can't arrive at a case where both inputs are negative
> with FADD).
<
This is a standard trick that everyone should know--I first saw it in the
PDP-8 in the Complement and increment instruction--but it has come in
handy several times and is the way operands are negated and complemented
in My 66000. The operand is conditionally complemented with a carry in
conditionally asserted. IF the operand is being processed is integer there
is an adder that deals with the carry in. If the operand is logical, there is
no adder and the carry in is ignored.

Ivan Godard

unread,
Jul 25, 2021, 1:36:03 PM7/25/21
to
However putting them in status regs mucks up any code that actually does
care about mode; interval arithmetic for example. Especially because
changing the mode commonly costs a pipe flush (yes, you can put the
status in the decoder and decorate the op in the pipe with it, but that
adds five bits to the op state). And then there's save/restore of the
mode across calls.

Status reg and ignoring the software is a good hardware solution. :-(

Branimir Maksimovic

unread,
Jul 25, 2021, 1:50:40 PM7/25/21
to
On 2021-07-25, Anton Ertl <an...@mips.complang.tuwien.ac.at> wrote:
> Branimir Maksimovic <branimir....@gmail.com> writes:
>>Hm, seems that on ARMv8 simd can't be switched off, compiler produces identical code
>>for both cases.. -mfpu option is also not present on aarch64...
>
> It will be once you can choose between Neon and SVE (and maybe
> Helium).
>
> Even if the code for axpy contains a vectorized variant for stride=8,
> this variant will not run, because stride=10 (because this test has
> originally been written to test the speed difference between 80-bit
> 387 FP and 64-bit 387 FP). I very much doubt that they perform
> autovectorization for stride=10.
>
> Testing this with gcc-10.2 and clang-11.0 on AMD64 with -O3 -mavx2,
> gcc produces a simple scalar loop, while clang produces an unrolled
> scalar loop (no vectorized variants to be seen).
>
> - anton
You are right this is what gcc-11 produces @Branimirs-Air axpy % cat axpysimd.s
.arch armv8.4-a+crc
.text
.align 2
.globl _axpy
_axpy:
LFB0:
cbz x3, L1
mov x4, 0
L3:
ldr d2, [x0, x4]
fmul d2, d0, d2
ldr d1, [x1, x4]
fadd d1, d1, d2
str d1, [x1, x4]
add x4, x4, x2
subs x3, x3, #1
bne L3
L1:
ret
pure scalar code...



--
bmaxa now listens Sick Muse by Metric from Fantasies

Branimir Maksimovic

unread,
Jul 25, 2021, 1:57:13 PM7/25/21
to
thanks!
>> --
>> bmaxa now listens Volim Te by Lollobrigida from Lollobrigida Inc.


Thomas Koenig

unread,
Jul 25, 2021, 2:05:33 PM7/25/21
to
Branimir Maksimovic <branimir....@gmail.com> schrieb:

> You are right this is what gcc-11 produces @Branimirs-Air axpy % cat axpysimd.s

I don't find the source code, so...

> pure scalar code...

Did you use restrict on the pointers? If not, the compiler
has to assume all sorts of aliasing issues, which usually
preclude vectorization.
>
>
>

Branimir Maksimovic

unread,
Jul 25, 2021, 2:29:45 PM7/25/21
to
bmaxa@Branimirs-Air axpy % cat axpy.c
void axpy(Float ra, Float *f_x, Float *f_y, long stride, unsigned long ucount)
{
for (; ucount>0; ucount--) {
*f_y += ra * *f_x;
f_x = (Float *)(((char *)f_x)+stride);
f_y = (Float *)(((char *)f_y)+stride);
}
}
bmaxa@Branimirs-Air axpy % gcc-11 -c -O -march=armv8.4-a+simd -D"Float=double" -S axpy.c -o axpysimd.s
bmaxa@Branimirs-Air axpy %

>>
>>
>>


--
bmaxa now listens Knights of Cydonia by Muse from Black Holes and Revelations

Thomas Koenig

unread,
Jul 25, 2021, 3:42:16 PM7/25/21
to
Branimir Maksimovic <branimir....@gmail.com> schrieb:
> On 2021-07-25, Thomas Koenig <tko...@netcologne.de> wrote:
>> Branimir Maksimovic <branimir....@gmail.com> schrieb:
>>
>>> You are right this is what gcc-11 produces @Branimirs-Air axpy % cat axpysimd.s
>>
>> I don't find the source code, so...
>>
>>> pure scalar code...
>>
>> Did you use restrict on the pointers? If not, the compiler
>> has to assume all sorts of aliasing issues, which usually
>> preclude vectorization.
> bmaxa@Branimirs-Air axpy % cat axpy.c
> void axpy(Float ra, Float *f_x, Float *f_y, long stride, unsigned long ucount)
> {
> for (; ucount>0; ucount--) {
> *f_y += ra * *f_x;
> f_x = (Float *)(((char *)f_x)+stride);
> f_y = (Float *)(((char *)f_y)+stride);
> }
> }

Try

$ cat a.c
void axpy(Float ra, Float const * restrict f_x,
Float * restrict f_y,
long stride, unsigned long ucount)
{
for (; ucount>0; ucount--) {
*f_y += ra * *f_x;
f_x = (Float *)(((char *)f_x)+stride);
f_y = (Float *)(((char *)f_y)+stride);
}
}
$ gcc -DFloat=double -march=native -O3 -S a.c

and on my home system you get something like

.L3:
vmovsd (%rdi), %xmm1
addq %rdx, %rdi
vfmadd213sd (%rsi), %xmm0, %xmm1
vmovsd %xmm1, (%rsi)
addq %rdx, %rsi
decq %rcx
jne .L3

so it's at least a bit better.

Vectorization with strides which are unknown at compile time is
difficult, which is why you need either LTO or VVM :-)

Branimir Maksimovic

unread,
Jul 25, 2021, 4:01:46 PM7/25/21
to
same on arm:
_axpy:
LFB0:
mov x4, 0
cbz x3, L1
.p2align 3,,7
L3:
ldr d2, [x0, x4]
subs x3, x3, #1
ldr d1, [x1, x4]
fmadd d1, d2, d0, d1
str d1, [x1, x4]
add x4, x4, x2
bne L3
> Vectorization with strides which are unknown at compile time is
> difficult, which is why you need either LTO or VVM :-)

oh lto :P


--
bmaxa now listens The Skank Heads by Skunk Anansie from Post Orgasmic Chill

Quadibloc

unread,
Jul 25, 2021, 5:39:17 PM7/25/21
to
On Sunday, July 25, 2021 at 11:23:00 AM UTC-6, MitchAlsup wrote:

> Koogie-Stone adders !

Kogge-Stone adders, please!

John Savard

BGB

unread,
Jul 25, 2021, 6:40:53 PM7/25/21
to
Yeah, to be useful, it kinda needs to be per-instruction.
This means putting it in the encoding, as a register-based mode is a bit
too coarse-grained to be particularly useful.


> Status reg and ignoring the software is a good hardware solution. :-(


Unless one adds logic to save/restore the FPU control state as part of
the ABI, then it effectively becomes global state.

Unless the register needs to be regularly reloaded for some other
reason, then one has the issue that, if some random piece of code in
some library somewhere decides to change the FPU rounding mode, then
everything else in the program may quietly start producing subtly
different results, which I personally feel is a *worse* scenario than
not having an option to change the rounding mode in the first place.

Having it either fixed in the implementation, or encoded as part of the
instruction, avoids this scenario in that the same instruction sequence
with the same inputs will always produce the same results.


Also don't want to add another register just to mandate that it be
saved/restored via the C ABI, as this would add a lot of extra cost for
a fairly obscure use-case.

In this case, it would almost make more sense to add the bits into
GBR(63:48), along with an intrinsic to modify them, and the compiler
would force the function to do a GBR save/restore if this intrinsic is used:
GBR(47: 0): Global Base Register (used to access .data/.bss);
GBR(63:48): Repurposed as FPSCR State.

Would behave as dynamic state within a given program or DLL, but revert
to defaults across DLL boundaries. Likewise, returning from the function
which updated the rounding mode would automatically revert it to
whatever value it held previously.

Say, FPSCR:
(3:0): Rounding Mode
0=Nearest
1=Truncate
2=+Inf
3=-Inf
4=(?) Nearest via Frac(2), Frac(1:0)=Status
...
(7:4): Sticky Bits (Inexact, Underflow, Overflow, Inv-Op)


This would be kind of an ugly hack though...


Mode 4 would allow using Binary64 to hold a 50-bit integer exactly; the
low 2 bits could serve as a result status (00=Exact,
01=Inexact/Underflow, 10/11=Reserved).

It also effectively moves the ULP over by 2 bits, rounding the number in
a way which is more appropriate for flonum operations. The inexact
status would be sticky, such that inexact inputs may not yield an exact
output. This distinction would be N/A for flonums.


Note that the high 16 bits of LR are already used for saving restoring
some SR state bits (WEX mode and predicate flags and similar).



Though, could be better to just add rounding modes via an Op64 encoding,
and skip the use of any register bits.

Personally I feel any sticky bits are also borderline useless unless one
can tell which value they apply to.

OTOH:
Overflow -> Inf
Invalid Operation -> NaN
Underflow -> Inexact + Zero (Mode 4)
Inexact -> Inexact (Mode 4)

Modes 0..3 would lose both Inexact and Underflow status, but these are
likely to be niche cases anyways.

...