My MRISC32 ISA

Marcus

unread,

Feb 1, 2021, 3:55:48 PM2/1/21

to

Hi all!

I'm new here on comp.arch, and if you don't mind I would like to share
some of my work and ideas with you.

For some time now I have been working on a hobby project. It started
out with the desire to learn more about CPU design, but it has evolved
into something bigger than I had originally anticipated.

It's now a rather stable user space 32-bit RISC ISA (I call it MRISC32),
a simulator (written in C++), a simple in-order single issue CPU design
in VHDL (I call it MRISC32-A1), a GNU toolchain (GCC + binutils +
newlib) and a simple SoC computer for FPGA:s with an MRISC32-A1 core
(I call it MC1).

The design is inspired by the early Cray architectures as well as
early RISC designs such as MIPS. I've tried to make is as simple as
possible yet powerful and scalable. One of the key features is the
tightly integrated support for vector operations and floating-point
(it's much more like Cray than most contemporary solutions, and it's
much simpler than ARM SVE or RISC-V V).

It bears some resemblance to RISC-V and ForwardCom (by Agner Fog), but
with some important differences. Actually, if I had known about RISC-V
when I started the project I would probably not have created my own ISA.
In retrospect I'm glad I didn't know about it.

My background is mainly in software development, and in the past I have
programmed for 8-, 16-, 32- and 64-bit architectures (e.g. 6502,
68000/30/60, ARMv7 and AArch64), so my approach has been to make an ISA
that is fun to program for, yet easy to implement in hardware. So far
I feel that I have succeeded, since programming for the MRISC32 ISA so
far has been a quite enjoyable experience.

The VHDL implementation has mainly served as a way for me to better
understand and identify good and bad ISA design decisions. Also, I see
this 32-bit ISA as a prototype for a more mature 64-bit ISA. I selected
32 bits since I wanted to make things easy (it's my first CPU/ISA
design) and I wanted to be able to a fit the design into a low end
FPGA.

I added some links below with more info. Feedback and comments are
welcome!

Regards,

Marcus

Some resources:

- https://mrisc32.bitsnbites.eu (a "home page" if you will)
- https://github.com/mrisc32 (GitHub organization with related repos)

The ISA manual (incomplete):

- https://mrisc32.bitsnbites.eu/doc/mrisc32-instruction-set-manual.pdf

Some blog posts:

- https://www.bitsnbites.eu/category/hardware-development/mrisc32/

I keep notes for the 64-bit ISA (MRISC64) here:

- https://github.com/mbitsnbites/mrisc64

A live demo (video capture) of the CPU in action in an FPGA:

- https://vimeo.com/494653227

DOOM running in the simulator (slower than the FPGA CPU, but with
enough RAM to make it possible):

- https://vimeo.com/472547489

MitchAlsup

unread,

Feb 1, 2021, 6:13:13 PM2/1/21

to

On Monday, February 1, 2021 at 2:55:48 PM UTC-6, Marcus wrote:
> Hi all!
>
> I'm new here on comp.arch, and if you don't mind I would like to share
> some of my work and ideas with you.
>
> For some time now I have been working on a hobby project. It started
> out with the desire to learn more about CPU design, but it has evolved
> into something bigger than I had originally anticipated.
>
> It's now a rather stable user space 32-bit RISC ISA (I call it MRISC32),
> a simulator (written in C++), a simple in-order single issue CPU design
> in VHDL (I call it MRISC32-A1), a GNU toolchain (GCC + binutils +
> newlib) and a simple SoC computer for FPGA:s with an MRISC32-A1 core
> (I call it MC1).
>
> The design is inspired by the early Cray architectures as well as
> early RISC designs such as MIPS. I've tried to make is as simple as
> possible yet powerful and scalable. One of the key features is the
> tightly integrated support for vector operations and floating-point
> (it's much more like Cray than most contemporary solutions, and it's
> much simpler than ARM SVE or RISC-V V).

Congradulations in getting vectors included.

Did you end up with a vector regiseter file ?

I got a reg-reg vector ISA put together without a vector register file--
which allow me to vectorize things like C-string-library, C-mem-library,
most everyday loops can be vectorized, essentially all smaller FP
loops can be vectorized,..... and the compiler almost does not need
to know about vectorizations (no alias analysis, no loop caried
dependency analysis,...).

Be happy to send you a copy of my ISA for your reading pleasure.

>
> It bears some resemblance to RISC-V and ForwardCom (by Agner Fog), but
> with some important differences. Actually, if I had known about RISC-V
> when I started the project I would probably not have created my own ISA.
> In retrospect I'm glad I didn't know about it.
>
> My background is mainly in software development, and in the past I have
> programmed for 8-, 16-, 32- and 64-bit architectures (e.g. 6502,
> 68000/30/60, ARMv7 and AArch64), so my approach has been to make an ISA
> that is fun to program for, yet easy to implement in hardware. So far
> I feel that I have succeeded, since programming for the MRISC32 ISA so
> far has been a quite enjoyable experience.
>
> The VHDL implementation has mainly served as a way for me to better
> understand and identify good and bad ISA design decisions. Also, I see
> this 32-bit ISA as a prototype for a more mature 64-bit ISA. I selected
> 32 bits since I wanted to make things easy (it's my first CPU/ISA
> design) and I wanted to be able to a fit the design into a low end
> FPGA.

Yes, indeed, understanding the HW design at those levels will make you a
better architect in the future--being exposed to more of the disease is
actually a blessing.

>
> I added some links below with more info. Feedback and comments are
> welcome!
>
> Regards,
>
> Marcus
>

Thanks...........

Marcus

unread,

Feb 2, 2021, 3:09:36 AM2/2/21

to

On 2021-02-02 00:13, MitchAlsup wrote:
> On Monday, February 1, 2021 at 2:55:48 PM UTC-6, Marcus wrote:
>> Hi all!
>>
>> I'm new here on comp.arch, and if you don't mind I would like to share
>> some of my work and ideas with you.
>>
>> For some time now I have been working on a hobby project. It started
>> out with the desire to learn more about CPU design, but it has evolved
>> into something bigger than I had originally anticipated.
>>
>> It's now a rather stable user space 32-bit RISC ISA (I call it MRISC32),
>> a simulator (written in C++), a simple in-order single issue CPU design
>> in VHDL (I call it MRISC32-A1), a GNU toolchain (GCC + binutils +
>> newlib) and a simple SoC computer for FPGA:s with an MRISC32-A1 core
>> (I call it MC1).
>>
>> The design is inspired by the early Cray architectures as well as
>> early RISC designs such as MIPS. I've tried to make is as simple as
>> possible yet powerful and scalable. One of the key features is the
>> tightly integrated support for vector operations and floating-point
>> (it's much more like Cray than most contemporary solutions, and it's
>> much simpler than ARM SVE or RISC-V V).
>
> Congradulations in getting vectors included.

Thanks!

> Did you end up with a vector regiseter file ?

Yes I did. For the vector model that I'm used to reason about it makes
a lot of sense to keep (relatively) big chunks of hot data in register
state, and using a vector register file feels like the obvious choice
for that.

I did follow the Cray model, though, and allow mixing of scalar and
vector operands, and technically it's possible to use the same EU:s
for scalar and for vector operations (the same instructions are used
for both scalar and vector operands).

> I got a reg-reg vector ISA put together without a vector register file--
> which allow me to vectorize things like C-string-library, C-mem-library,
> most everyday loops can be vectorized, essentially all smaller FP
> loops can be vectorized,..... and the compiler almost does not need
> to know about vectorizations (no alias analysis, no loop caried
> dependency analysis,...).

So I've heard, and it sounds very interesting indeed. I'm curious about
how that scales to wide operations. E.g. is it feasible to implement
an 8-wide FMA and have it utilized efficiently? Or is it more like you
can have any number of FMA:s (e.g. six) and dynamically schedule
individual vector elements to those EU:s?

> Be happy to send you a copy of my ISA for your reading pleasure.

Please do - it would be very interesting.

>> It bears some resemblance to RISC-V and ForwardCom (by Agner Fog), but
>> with some important differences. Actually, if I had known about RISC-V
>> when I started the project I would probably not have created my own ISA.
>> In retrospect I'm glad I didn't know about it.
>>
>> My background is mainly in software development, and in the past I have
>> programmed for 8-, 16-, 32- and 64-bit architectures (e.g. 6502,
>> 68000/30/60, ARMv7 and AArch64), so my approach has been to make an ISA
>> that is fun to program for, yet easy to implement in hardware. So far
>> I feel that I have succeeded, since programming for the MRISC32 ISA so
>> far has been a quite enjoyable experience.
>>
>> The VHDL implementation has mainly served as a way for me to better
>> understand and identify good and bad ISA design decisions. Also, I see
>> this 32-bit ISA as a prototype for a more mature 64-bit ISA. I selected
>> 32 bits since I wanted to make things easy (it's my first CPU/ISA
>> design) and I wanted to be able to a fit the design into a low end
>> FPGA.
>
> Yes, indeed, understanding the HW design at those levels will make you a
> better architect in the future--being exposed to more of the disease is
> actually a blessing.

What I need to learn now is more advanced OoO design. I have a vague
idea based on old university courses, but I guess that it's more like
the tip of the iceberg.

Many design decisions so far have proven quite successful in an
in-order design, but I fear that some decisions may be bad for an OoO
design.

One thing that people seem do disagree about is the choice to keep
integers and floats in the same register file. From a software model
and programmer's perspective it's much more convenient to use a
single register file, which is why I like it. Some claim that it may
pose a problem for wide OoO designs, mainly due to register file
read/write port pressure. I assume that you have some experience
with the pros & cons of this solution?

BTW, I assume that you don't actually need N register read ports to
be able to issue M instructions with N/M operands (on average), as
a sizeable fraction of the operands will come from forwarding or
a result bus for instance.

/Marcus

Thomas Koenig

unread,

Feb 2, 2021, 8:47:18 AM2/2/21

to

Marcus <m.de...@this.bitsnbites.eu> schrieb:

> On 2021-02-02 00:13, MitchAlsup wrote:

>> Congradulations in getting vectors included.

> Thanks!

What is the motivation for Cray-style (or Fujitsu-style) vectors
these days?

At the time, the big advantage was that you could do an arithmetic
instruction per cycle and pipe once the latency was over.
Superscalar execution can do pretty much the same thing these days.

What was the workload you had in mind for your design?

Marcus

unread,

Feb 2, 2021, 9:41:07 AM2/2/21

to

There are three important principles of the design:

1. The ISA is vector register size agnostic.
2. Vector operations can either be executed in parallel or in series.
3. The same instructions work with both scalar and vector registers.

(The last point does not have much to do with your question).

The first point aims to solve the problem that x86 is in right now,
namely gradually evolving and incompatible SIMD ISA:s with wider and
wider registers. Pretty much every modern ISA tries to avoid that
situation (e.g. see ARM SVE, RISC-V V and ForwardCom).

The second point is kind of a natural byproduct of the first point,
but it also makes it easier to scale the ISA to different hardware
implementations if it does not make any assumptions about how the
vector elements are processed.

For instance, in a low-end implementation it can make sense to have
in-order execution of vector operations, and hide result latencies
by splitting a vector operation into a few serial operations.
Likewise a high-end implementation can use very wide EUs. Both
implementations would still be able to execute the same software.

/Marcus

MitchAlsup

unread,

Feb 22, 2021, 2:46:51 PM2/22/21

to

On Tuesday, February 2, 2021 at 8:41:07 AM UTC-6, Marcus wrote:
> On 2021-02-02 14:47, Thomas Koenig wrote:
> > Marcus <m.de...@this.bitsnbites.eu> schrieb:
> >> On 2021-02-02 00:13, MitchAlsup wrote:
> >
> >>> Congradulations in getting vectors included.
> >
> >> Thanks!
> >
> > What is the motivation for Cray-style (or Fujitsu-style) vectors
> > these days?
> >
> > At the time, the big advantage was that you could do an arithmetic
> > instruction per cycle and pipe once the latency was over.
> > Superscalar execution can do pretty much the same thing these days.
> >
> > What was the workload you had in mind for your design?
> There are three important principles of the design:
>
> 1. The ISA is vector register size agnostic.

My 66000 vectorizes loops not calculations

> 2. Vector operations can either be executed in parallel or in series.

dito

> 3. The same instructions work with both scalar and vector registers.

dito
But My 66000 contains no visible extra state to support vectorizing of loops.

>
> (The last point does not have much to do with your question).
>
> The first point aims to solve the problem that x86 is in right now,
> namely gradually evolving and incompatible SIMD ISA:s with wider and
> wider registers. Pretty much every modern ISA tries to avoid that
> situation (e.g. see ARM SVE, RISC-V V and ForwardCom).

This is the curse of SIMD--also addressed (without adding state) in
My 66000 through the Virtual Vector Method. Each implementation
can decide for itself how wide to make the various paths.

>
> The second point is kind of a natural byproduct of the first point,
> but it also makes it easier to scale the ISA to different hardware
> implementations if it does not make any assumptions about how the
> vector elements are processed.

Exactly why no VRF is desirable.

Marcus

unread,

Feb 24, 2021, 4:06:06 PM2/24/21

to

On 2021-02-22 20:46, MitchAlsup wrote:
> On Tuesday, February 2, 2021 at 8:41:07 AM UTC-6, Marcus wrote:
>> On 2021-02-02 14:47, Thomas Koenig wrote:
>>> Marcus <m.de...@this.bitsnbites.eu> schrieb:
>>>> On 2021-02-02 00:13, MitchAlsup wrote:
>>>
>>>>> Congradulations in getting vectors included.
>>>
>>>> Thanks!
>>>
>>> What is the motivation for Cray-style (or Fujitsu-style) vectors
>>> these days?
>>>
>>> At the time, the big advantage was that you could do an arithmetic
>>> instruction per cycle and pipe once the latency was over.
>>> Superscalar execution can do pretty much the same thing these days.
>>>
>>> What was the workload you had in mind for your design?
>> There are three important principles of the design:
>>
>> 1. The ISA is vector register size agnostic.
>
> My 66000 vectorizes loops not calculations
>> 2. Vector operations can either be executed in parallel or in series.
> dito
>> 3. The same instructions work with both scalar and vector registers.
> dito
> But My 66000 contains no visible extra state to support vectorizing of loops.

That is a very attractive property of your ISA. I think that in
many ways it is probably the right thing (TM), especially as it
unleashes a lot of vectorization potential in regular code without
the need to target a special SIMD ISA (that often has different
behavior than the scalar ISA).

I do not have enough experience with CPU design to fully envision
the potential pitfalls etc of such a design, so I would not know how
to approach it - hence my explicit vector register file.

>> (The last point does not have much to do with your question).
>>
>> The first point aims to solve the problem that x86 is in right now,
>> namely gradually evolving and incompatible SIMD ISA:s with wider and
>> wider registers. Pretty much every modern ISA tries to avoid that
>> situation (e.g. see ARM SVE, RISC-V V and ForwardCom).
>
> This is the curse of SIMD--also addressed (without adding state) in
> My 66000 through the Virtual Vector Method. Each implementation
> can decide for itself how wide to make the various paths.

Exactly. I think that this is very important.

>>
>> The second point is kind of a natural byproduct of the first point,
>> but it also makes it easier to scale the ISA to different hardware
>> implementations if it does not make any assumptions about how the
>> vector elements are processed.
>
> Exactly why no VRF is desirable.

In my ISA the vector length (i.e. the number of vector elements to
process by each instruction) is controlled by the software. There is
an upper limit to the vector length that is dictated by the hardware
implementation (i.e. the size of each vector register), but this can
(and usually must) be queried by the software. The extra overhead for
this is usually one instruction before the loop (query the max vector
length) and one instruction inside the loop (a min instruction to
decide the vector length for each iteration).

Thus I think that the extra visible state that the VRF introduces does
not limit hardware implementations to freely select the actual vector
width.

Regarding VVM, there are a couple of things that I'm curious about.

First, it seems to me that it is designed to vectorize loops where
the loop count is known before entering the loop (which, I'm sure,
is the most important use case for vectorization). Is there any
provision for terminating a loop based on other conditions? (E.g.
consider a Mandelbrot iteration loop where there are typically two
different termination criteria: a loop count and a distance
threshold), or is the idea that such loops are handled by regular
super-scalar OoO scheduling?

Second, is there a way to do "horizontal" operations (e.g. common
when calculating the sum or min/max of an array or for logical
operations)? In traditional SIMD ISA:s this is typically implemented
as successive pairwise operations (as in working your way through a
binary tree), so that you get a dependency chain of log2(N)
operations instead of N operations (and usually better FP rounding
behavior).

For integer operations there are usually no observable differences
between sequential and pairwise operations, but for floating-point
there can be (and usually is) a real difference - which means that
you can't automatically do such transformations without having a hint
from the programmer (so that the CPU will produce the expected
result).

In my ISA I have implemented this with successive folding (combine
the upper and lower halves of two vector operands - usually the same
vector register). Since you have to explicitly specify that you are
doing a folding operation, it is clearly defined what the result will
be.

/Marcus

MitchAlsup

unread,

Feb 24, 2021, 5:41:28 PM2/24/21

to

Thank you !

One specific aspect of VVM is that if the code has a memory alias
the hardware will process the loop correctly if the alias is real
or if the alias is fake (fake runniing faster than real). So the
same source code works both ways, the HW adjusting to the
performance it can achieve obeying all the rules.

This means the compiler does not have to perform alias analysis to
the nth degree; and codes like::

for( unsigned i = k; i < MAX; i++ )
y[i] = y[i-5]*a + y[i-7]*b + y[i-j];

vectorize, but run based on the actual memory address aliasing.

>
> I do not have enough experience with CPU design to fully envision
> the potential pitfalls etc of such a design, so I would not know how
> to approach it - hence my explicit vector register file

> >> (The last point does not have much to do with your question).
> >>
> >> The first point aims to solve the problem that x86 is in right now,
> >> namely gradually evolving and incompatible SIMD ISA:s with wider and
> >> wider registers. Pretty much every modern ISA tries to avoid that
> >> situation (e.g. see ARM SVE, RISC-V V and ForwardCom).
> >
> > This is the curse of SIMD--also addressed (without adding state) in
> > My 66000 through the Virtual Vector Method. Each implementation
> > can decide for itself how wide to make the various paths.
> Exactly. I think that this is very important.
> >>
> >> The second point is kind of a natural byproduct of the first point,
> >> but it also makes it easier to scale the ISA to different hardware
> >> implementations if it does not make any assumptions about how the
> >> vector elements are processed.
> >
> > Exactly why no VRF is desirable.

> In my ISA the vector length (i.e. the number of vector elements to
> process by each instruction) is controlled by the software. There is
> an upper limit to the vector length that is dictated by the hardware
> implementation (i.e. the size of each vector register), but this can
> (and usually must) be queried by the software. The extra overhead for
> this is usually one instruction before the loop (query the max vector
> length) and one instruction inside the loop (a min instruction to
> decide the vector length for each iteration).

Yes this is the path CRAY and NEC went down (I believe RISC-V is going
down this route also) and this was a big advance in 1975 ! where 1 FP
multiplier, 1 memory port, 1 FP adder,... could be assembled into a several
calculations per cycle for long duration's in time.

>
> Thus I think that the extra visible state that the VRF introduces does
> not limit hardware implementations to freely select the actual vector
> width.

No, but the VRF is as big as the FP function units, and has significant
cost at context switch time. CRAY 1 (et. al.) had 4KBytes of VRF.

>
> Regarding VVM, there are a couple of things that I'm curious about.
>
> First, it seems to me that it is designed to vectorize loops where
> the loop count is known before entering the loop (which, I'm sure,
> is the most important use case for vectorization).

This is not true. VVM vectorizes Loops rather than vectorizing
instructions. This gives VVM the opportunity to see the memory
aliasing and adjust its throughput accordingly. Almost all of
the inner loops of Livermore loops get vectorized in Brian's compiler.

> Is there any
> provision for terminating a loop based on other conditions? (E.g.
> consider a Mandelbrot iteration loop where there are typically two
> different termination criteria: a loop count and a distance
> threshold), or is the idea that such loops are handled by regular
> super-scalar OoO scheduling?

Almost all the entire C str* and mem* library functions are vectorized!

There three kinds of LOOP behavior, straight forward counted loops,
counted loops with termination clause, incremented loops with
comparison termination.

>
> Second, is there a way to do "horizontal" operations (e.g. common
> when calculating the sum or min/max of an array or for logical
> operations)? In traditional SIMD ISA:s this is typically implemented
> as successive pairwise operations (as in working your way through a
> binary tree), so that you get a dependency chain of log2(N)
> operations instead of N operations (and usually better FP rounding
> behavior).

Yes, and it is all choice of the implementation. Small lowest power
machines may simply iterate via the FETCH unit. Slightly bigger
machines may iterate via the instructions scheduler. GBOoO
machines can bring as much memory resources and calculation
resources as they can figure out.

For example, the memmove subroutine, when vectorized, can process
a cache-port access width per cycle on small or large machines--equi-
valent to ~32-64 instructions per cycle. I suspect the small machines
will do 1/2 a cache line per cycle, while larger machines may do 2-3 cache
lines per cycle.

Given an implementation that can afford 4 FMAC units and a Loop
with 1 FMAC in it, that loop can be performed at 4 iterations per cycle.

Given a strcmp subroutine, you can do string comparison (byte by byte)
cache-access width per cycle, or 4-to-16 iterations per cycle.

And the code doing 16 iterations per cycle is the exact same code as
that doing 1 on the lowest implementation. You do not have to special-
ize the code to the machine you are on. You specify what needs to happen
and the HW chooses how to pull it all off.

>
> For integer operations there are usually no observable differences
> between sequential and pairwise operations, but for floating-point
> there can be (and usually is) a real difference - which means that
> you can't automatically do such transformations without having a hint
> from the programmer (so that the CPU will produce the expected
> result).

VVM gets rid of the need to hint:: basically if the compiler can find
a loop and if the loop has "reasonable" iteration and termination
properties, the loop can be vectorized! Even when vectorized, all
memory and data ordering are obeyed !

>
> In my ISA I have implemented this with successive folding (combine
> the upper and lower halves of two vector operands - usually the same
> vector register). Since you have to explicitly specify that you are
> doing a folding operation, it is clearly defined what the result will
> be.

I wish you luck.....
>
> /Marcus

Stephen Fuld

unread,

Feb 25, 2021, 12:10:31 AM2/25/21

to

On 2/24/2021 1:06 PM, Marcus wrote:

snip

> Second, is there a way to do "horizontal" operations (e.g. common
> when calculating the sum or min/max of an array or for logical
> operations)? In traditional SIMD ISA:s this is typically implemented
> as successive pairwise operations (as in working your way through a
> binary tree), so that you get a dependency chain of log2(N)
> operations instead of N operations (and usually better FP rounding
> behavior).

This seems related to the earlier discussion between Mitch and me, and
others regarding reductions. While there are huge advantages to VVM,
one consequence of it is that, since intermediate values are kept in the
functional units, and not written to the register until loop termination
or interrupt, it is difficult to take advantage of superscaler
capabilities for algorithms like these, due to the inability for the
hardware to have multiple places to write "temporary" intermediate
results. While you could code say an array max to do two compares
simultaneously (or more depending upon the number of FUs you have), and
add code to finish up the partial values after the loop, it can't be
done transparently by the hardware.

Mitch, please feel free to correct me.

> For integer operations there are usually no observable differences
> between sequential and pairwise operations,

I think it works fine if the integers are unsigned, and for signed only
if overflow/underflow isn't a possibility.

> but for floating-point
> there can be (and usually is) a real difference - which means that
> you can't automatically do such transformations without having a hint
> from the programmer (so that the CPU will produce the expected
> result).

Logical operations, such as Max will work for FP, but you are right
about arithmetic operations.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Terje Mathisen

unread,

Feb 25, 2021, 1:56:37 AM2/25/21

to

Stephen Fuld wrote:
> On 2/24/2021 1:06 PM, Marcus wrote:
>
> snip
>
>
>> Second, is there a way to do "horizontal" operations (e.g. common
>> when calculating the sum or min/max of an array or for logical
>> operations)? In traditional SIMD ISA:s this is typically implemented
>> as successive pairwise operations (as in working your way through a
>> binary tree), so that you get a dependency chain of log2(N)
>> operations instead of N operations (and usually better FP rounding
>> behavior).
>
>
> This seems related to the earlier discussion between Mitch and me, and
> others regarding reductions. While there are huge advantages to VVM,
> one consequence of it is that, since intermediate values are kept in the
> functional units, and not written to the register until loop termination
> or interrupt, it is difficult to take advantage of superscaler
> capabilities for algorithms like these, due to the inability for the
> hardware to have multiple places to write "temporary" intermediate
> results. While you could code say an array max to do two compares
> simultaneously (or more depending upon the number of FUs you have), and
> add code to finish up the partial values after the loop, it can't be
> done transparently by the hardware.
>
> Mitch, please feel free to correct me.

I am pretty sure you are correct, which means that you will need to
assign N accumulators, with N high enough to saturate the largest GBOO
model you plan to target.

This is simply due to the fact that an N-way reduction, with final merge
of the N accumulators will behave differently from straight forward FADD
(or FMAC) operations, so the HW cannot do this transparently from a
scalar loop.

The only exception to this rule is of course if you have
superaccumulators, in which case the reduction order doesn't matter
anymore, but at this point you won't have more than one (or two?) of
them, so it doesn't matter.

OTOH, a superaccumulator using carry-save storage can easily absorb as
many additions/cycle as you have the resources (basically alignment
networks) to throw at it.

BTW, the most obvious way to implement such an accumulator in SW is to
have two of them, one for positive and one for negative inputs, both
with several extra bits for extended range, and a trap when either of
them is about to overflow: At this point you could set the smaller (in
magnitude) to zero and subtract its value from the larger.

>
>
>> For integer operations there are usually no observable differences
>> between sequential and pairwise operations,
>
> I think it works fine if the integers are unsigned, and for signed only
> if overflow/underflow isn't a possibility.

Right.

>
>
>> but for floating-point
>> there can be (and usually is) a real difference - which means that
>> you can't automatically do such transformations without having a hint
>> from the programmer (so that the CPU will produce the expected
>> result).
>
> Logical operations, such as Max will work for FP, but you are right
> about arithmetic operations.

See above, having a multiple FADDs/cycle unit helps out here. It is also
possible that Mitch is capable of feeding multiple inputs/cycle into a
largish adder as long as each of the chunks are already aligned, with hw
delays (down to one/cycle) only when ranges overlap badly, but doing
full FADD semantics, including final rounding of each result seems like
a hard problem.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

MitchAlsup

unread,

Feb 25, 2021, 12:43:02 PM2/25/21

to

On Wednesday, February 24, 2021 at 11:10:31 PM UTC-6, Stephen Fuld wrote:
> On 2/24/2021 1:06 PM, Marcus wrote:
>
> snip
> > Second, is there a way to do "horizontal" operations (e.g. common
> > when calculating the sum or min/max of an array or for logical
> > operations)? In traditional SIMD ISA:s this is typically implemented
> > as successive pairwise operations (as in working your way through a
> > binary tree), so that you get a dependency chain of log2(N)
> > operations instead of N operations (and usually better FP rounding
> > behavior).
> This seems related to the earlier discussion between Mitch and me, and
> others regarding reductions. While there are huge advantages to VVM,
> one consequence of it is that, since intermediate values are kept in the
> functional units, and not written to the register until loop termination
> or interrupt, it is difficult to take advantage of superscaler
> capabilities for algorithms like these, due to the inability for the
> hardware to have multiple places to write "temporary" intermediate
> results.

I got rid of this due to bad-semantics. In its place is a single instruction
(with the included CARRY) which performs Kahan-Babuška summation.

> While you could code say an array max to do two compares
> simultaneously (or more depending upon the number of FUs you have), and
> add code to finish up the partial values after the loop, it can't be
> done transparently by the hardware.
>
> Mitch, please feel free to correct me.

I did

Stephen Fuld

unread,

Feb 25, 2021, 1:21:59 PM2/25/21

to

On 2/25/2021 9:43 AM, MitchAlsup wrote:
> On Wednesday, February 24, 2021 at 11:10:31 PM UTC-6, Stephen Fuld wrote:
>> On 2/24/2021 1:06 PM, Marcus wrote:
>>
>> snip
>>> Second, is there a way to do "horizontal" operations (e.g. common
>>> when calculating the sum or min/max of an array or for logical
>>> operations)? In traditional SIMD ISA:s this is typically implemented
>>> as successive pairwise operations (as in working your way through a
>>> binary tree), so that you get a dependency chain of log2(N)
>>> operations instead of N operations (and usually better FP rounding
>>> behavior).
>> This seems related to the earlier discussion between Mitch and me, and
>> others regarding reductions. While there are huge advantages to VVM,
>> one consequence of it is that, since intermediate values are kept in the
>> functional units, and not written to the register until loop termination
>> or interrupt, it is difficult to take advantage of superscaler
>> capabilities for algorithms like these, due to the inability for the
>> hardware to have multiple places to write "temporary" intermediate
>> results.
>
>
> I got rid of this due to bad-semantics. In its place is a single instruction
> (with the included CARRY) which performs Kahan-Babuška summation.

I think you may have missed the idea of the Marcus's question. It was
about taking advantage of superscaler capabilities, *where appropriate*.
I think all three of us agree that for FP add, it is inappropriate,
you shouldn't do it, and that your solution is excellent. But for
logical operations, max/min, and sometimes integer addition, it would be
nice to have. But it seems expensive in the VVM context and you deemed
it not worth the effort. I am not disagreeing with that choice, just
saying that it should be clear.

MitchAlsup

unread,

Feb 25, 2021, 3:13:47 PM2/25/21

to

At one point I posited that I could (COULD) make an FP ADDER hold onto
an extended accumulator allowing a fairly large number of FP additions
to accumulate into a sum with a single rounding. {This failed for more
than a few reasons--so It got removed and something different got put
in that addresses the root of the problem better.}

I was addressing your statement of thinking this semantic still lives on.
It does not.

I was not addressing Marcus' question at all.

As far as reductions are concerned--these are "popular" calculations,
but rarely supported. CRAY had a rather crazy (but admittedly simple)
way to perform--if YOUR application can deal with the resulting
semantic. It seems apparent that 754-2019 could not as they now
require something akin to Kahan-Babuška summation (by hook or by
crook) for these reductions.

I just hope Marcus understands why he should not do CRAY-like
reductions.

Marcus

unread,

Feb 25, 2021, 3:38:54 PM2/25/21

to

Exactly. Regardless of which semantics you use, it really boils down
to that - i.e. you need to explicitly describe how to partition the
operation into several accumulators. For a SIMD ISA like SSE or NEON
this comes natural, as the width is fixed (e.g. four single-precision
lanes mean four accumulators).

The same is true for the MRISC32 ISA - even though the vector registers
can have any POT size, you typically have to decide how many folding
operations you want to do, and thus how many "lanes" (i.e. accumulators)
to use during the loop (you may not necessarily want to use the full
vector register length as that would be sub-optimal when the array
is not much longer than the vector length).

It occurred to me that the same can probably be described quite well
for VVA. E.g. what would the VVA solution for the following C code
look like (here I assume that n is a multiple of 4)?

float sum(const float *x, unsigned n)
{
float s1 = 0.0f;
float s2 = 0.0f;
float s3 = 0.0f;
float s4 = 0.0f;
for (unsigned i = 0; i < n / 4; ++i)
{
s1 += x[4*i];
s2 += x[4*i+1];
s3 += x[4*i+2];
s4 += x[4*i+3];
}
return (s1 + s2) + (s3 + s4);
}

For reference, the corresponding hand written assembly for MRISC32
would go something like this (provided that some of my ideas are
implemented):

; S1 = x, S2 = n
sum:
ldi vl, #4 ; Vector length = 4
or v1, vz, #0 ; Clear v1
1:
ldw v2, s1, #4 ; Load (stride = 4 bytes)
fadd v1, v1, v2 ; Accumulate
add s1, s1, #16
add s2, s2, #-4
bnz s2, 1b

fadd/f v1, v1, v1 ; First fold: 2 additions
fadd/f v1, v1, v1 ; Second fold: 1 addition
vmov s1, v1, #0 ; Return v1[0]
ret

/Marcus

Marcus

unread,

Feb 25, 2021, 4:01:26 PM2/25/21

to

On 2021-02-25 21:13, MitchAlsup wrote:

<snip>

> As far as reductions are concerned--these are "popular" calculations,
> but rarely supported. CRAY had a rather crazy (but admittedly simple)
> way to perform--if YOUR application can deal with the resulting
> semantic. It seems apparent that 754-2019 could not as they now
> require something akin to Kahan-Babuška summation (by hook or by
> crook) for these reductions.
>
> I just hope Marcus understands why he should not do CRAY-like
> reductions.

I would not bet on it ;-) I'm still learning most of this stuff, and
I haven't actually come in contact with a Cray (I was born the same
year that the Cray 1 made its debut). I have not yet figured out how
Cray did reductions, but if it's the method described under "Recursive
characteristic of vector functional units" (3-14) in the CRAY-1
hardware reference manual, it does sound very funky.

My current idea is to store an internal "length" state with each vector
register (so that the length of a vector operation is controlled by
the source vector operands). Reductions would be performed by means of
"folding", where the length is cut in half, and the upper and lower
parts of the first and second vector operand, respectively, are combined
into the destination vector operand, which is thus given a length
property that is half that of the source operand. Successive folding
can thus be used for reducing a vector into a single scalar element in
log2(N) passes.

The main advantage of this method, as I see it, is that it is feasible
to implement it either as a serial operation or a parallel operation (or
a combination), without changing the semantics.

/Marcus

Stephen Fuld

unread,

Feb 25, 2021, 4:26:25 PM2/25/21

to

On 2/25/2021 1:01 PM, Marcus wrote:
> On 2021-02-25 21:13, MitchAlsup wrote:
>
> <snip>
>
>> As far as reductions are concerned--these are "popular" calculations,
>> but rarely supported. CRAY had a rather crazy (but admittedly simple)
>> way to perform--if YOUR application can deal with the resulting
>> semantic. It seems apparent that 754-2019 could not as they now
>> require something akin to Kahan-Babuška summation (by hook or by
>> crook) for these reductions.
>>
>> I just hope Marcus understands why he should not do CRAY-like
>> reductions.
>
> I would not bet on it ;-) I'm still learning most of this stuff,

Me too!

The issue, as was pointed out to me by someone in this group (Anton?) is
that floating point addition is not associative. That is

(A+B)+C is not necessarily equal to A+(B+C).

So by changing the order of the additions, no mater whether it is by
recursive folding or even just multiple accumulators, you have
potentially changed the result of the calculation, which, of course, is
a no-no.

MitchAlsup

unread,

Feb 25, 2021, 5:15:46 PM2/25/21

to

On Thursday, February 25, 2021 at 3:01:26 PM UTC-6, Marcus wrote:
> On 2021-02-25 21:13, MitchAlsup wrote:
>
> <snip>
> > As far as reductions are concerned--these are "popular" calculations,
> > but rarely supported. CRAY had a rather crazy (but admittedly simple)
> > way to perform--if YOUR application can deal with the resulting
> > semantic. It seems apparent that 754-2019 could not as they now
> > require something akin to Kahan-Babuška summation (by hook or by
> > crook) for these reductions.
> >
> > I just hope Marcus understands why he should not do CRAY-like
> > reductions.
> I would not bet on it ;-) I'm still learning most of this stuff, and
> I haven't actually come in contact with a Cray (I was born the same
> year that the Cray 1 made its debut). I have not yet figured out how
> Cray did reductions, but if it's the method described under "Recursive
> characteristic of vector functional units" (3-14) in the CRAY-1
> hardware reference manual, it does sound very funky.

Since apparently nobody knows how CRAY performed reductions,
I will address the algorithm.

A reduction is a vector chained operation--that is a result from some
calculation is chained (forwarded) into the reduction unit (FADD).
The instruction decoder has recognized that an operand and the result
are the same register. Instead of incrementing the result register,
it is held at the "front" of the vector until the first result is generated,
Then it is allowed to increment normally. k happens to be the latency
of the FU feeding the FADD (doing the reduction). So after k steps,
instead of adding what was in the vector register, one is adding the
result from the FADD and using k elements of the vector register as
k individual accumulators.

Presto, recognize that a 3-bit field is equal to another 3-bit field and
you are performing an "add" {FADD, IADD, OR, XOR} and HW will vector
calculate your reduction. After the vector reduction, there was some
other trick used to perform the final reduction.

Please don't do this !!

Thomas Koenig

unread,

Feb 26, 2021, 2:48:37 AM2/26/21

to

Stephen Fuld <sf...@alumni.cmu.edu.invalid> schrieb:

> The issue, as was pointed out to me by someone in this group (Anton?) is
> that floating point addition is not associative. That is
>
> (A+B)+C is not necessarily equal to A+(B+C).

If you're into this stuff, the Goldberg paper
https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html
is certainly very good reference.

> So by changing the order of the additions, no mater whether it is by
> recursive folding or even just multiple accumulators, you have
> potentially changed the result of the calculation, which, of course, is
> a no-no.

That depends on the language semantics. If somebody codes a loop
for matrix multiplication, for example, the semantics of IEEE
combined with those of many languages will tell you that loops
could not be blocked, vectorized or even interchanged for better
efficiency by the compiler.

If, on the other hand, you call an intrinsic, like Fortran's

c = matmul(a,b)

or call DGEMM, you are in fact telling the compiler to go wild and
to whatever it choses to do.

If you want something intermediate, that's hard to express in
the languages that we have.

Terje Mathisen

unread,

Feb 26, 2021, 2:59:12 AM2/26/21

to

Which is why we _need_ a "superaccumulator", either exact or with just
2-3 words via AugmentedAddition().

This is clearly superior to any current setup, which means that people
might even accept that the results will not be exactly identical.

Marcus

unread,

Feb 26, 2021, 3:44:40 AM2/26/21

to

Exactly. That's why, in order to parallelize accumulation, the
programmer typically has to be explicit about /how/ to do it. Usually
this takes the form of intrinsics, pragmas or assembler language.

Unless, as Terje suggests, the hardware has the capability to do the
arithmetic with increased precision so that the order does not matter
(as much).

/Marcus

Marcus

unread,

Feb 26, 2021, 3:48:09 AM2/26/21

to

Thanks for the clarification.

No that does not sound like a good idea, and that's not what I'm doing.
See description below:

Stefan Monnier

unread,

Feb 26, 2021, 10:31:28 AM2/26/21

to

> A reduction is a vector chained operation--that is a result from some
> calculation is chained (forwarded) into the reduction unit (FADD).
> The instruction decoder has recognized that an operand and the result
> are the same register. Instead of incrementing the result register,
> it is held at the "front" of the vector until the first result is generated,
> Then it is allowed to increment normally. k happens to be the latency
> of the FU feeding the FADD (doing the reduction). So after k steps

> instead of adding what was in the vector register, one is adding the
> result from the FADD and using k elements of the vector register as
> k individual accumulators.

If you overlook the lack of associativity of float addition, this is
a rather neat solution ;-)

I guess for multiplication one could say that it is "associative
enough", but it's probably not worth it because it's rather rare to
reduce vectors of non-trivial length with a multiplication (for one
thing, it only makes sense if those numbers are fundamentally
unit-less).

Still, there are all kinds of situation where reductions are performed
and where we can do better than "one at a time" because the reductions
*are* associative (e.g. because they operate on integers/fixedpoint).

Stefan

Stefan Monnier

unread,

Feb 26, 2021, 10:36:57 AM2/26/21

to

> This is clearly superior to any current setup, which means that people might
> even accept that the results will not be exactly identical.

Of course, the difference between "almost always identical" and "always
identical" remains substantial. I wonder if there could be a way to
impose some artificial rounding at the end to make sure that the results
are always identical (at the cost of losing a bit of precision).
This would give extra choice in the tradeoffs between bit-for-bit
reproducibility, speed, and precision.

Stefan

MitchAlsup

unread,

Feb 26, 2021, 11:24:40 AM2/26/21

to

Loop excerpt from LAPACK[ DAXPY.F ]

*
m = mod(n,4)
IF (m.NE.0) THEN
DO i = 1,m
dy(i) = dy(i) + da*dx(i)
END DO
END IF
IF (n.LT.4) RETURN
mp1 = m + 1
DO i = mp1,n,4
dy(i) = dy(i) + da*dx(i)
dy(i+1) = dy(i+1) + da*dx(i+1)
dy(i+2) = dy(i+2) + da*dx(i+2)
dy(i+3) = dy(i+3) + da*dx(i+3)
END DO
ELSE
*

I would say that is rather explicit. But, there are 4 different triple
loops in DAXPY, one for each way memory is referenced, and
optimized for that access pattern. One should read all the code
in LAPACK in order to get a feel for what some are saying compilers
should be doing, if only to realize why you don't want compilers
doing those things.

>
> Unless, as Terje suggests, the hardware has the capability to do the
> arithmetic with increased precision so that the order does not matter
> (as much).

Then someone would complain that too much precision is just as bad
as not enough.

>
> /Marcus

MitchAlsup

unread,

Feb 26, 2021, 11:27:47 AM2/26/21

to

On Friday, February 26, 2021 at 9:36:57 AM UTC-6, Stefan Monnier wrote:
> > This is clearly superior to any current setup, which means that people might
> > even accept that the results will not be exactly identical.
> Of course, the difference between "almost always identical" and "always
> identical" remains substantial. I wonder if there could be a way to
> impose some artificial rounding at the end to make sure that the results

The problem is not rounding, the problem is the container size and that
if you want everyone to achieve the same result, there can be only 1
rounding applied to the whole reduction !

> are always identical (at the cost of losing a bit of precision).
> This would give extra choice in the tradeoffs between bit-for-bit
> reproducibility, speed, and precision.

Choice to whom::
a) the compiler,
b) the programmer,
c) the hardware,
d) the numerical analyst ?
>
>
> Stefan

Stefan Monnier

unread,

Feb 26, 2021, 11:32:54 AM2/26/21

to

>> > This is clearly superior to any current setup, which means that people might
>> > even accept that the results will not be exactly identical.
>> Of course, the difference between "almost always identical" and "always
>> identical" remains substantial. I wonder if there could be a way to
>> impose some artificial rounding at the end to make sure that the results
> The problem is not rounding, the problem is the container size and that
> if you want everyone to achieve the same result, there can be only 1
> rounding applied to the whole reduction !

Call it "fuzzying", then.

>> are always identical (at the cost of losing a bit of precision).
>> This would give extra choice in the tradeoffs between bit-for-bit
>> reproducibility, speed, and precision.
>
> Choice to whom::
> a) the compiler,
> b) the programmer,
> c) the hardware,
> d) the numerical analyst ?

b and d.

Stefan

MitchAlsup

unread,

Feb 26, 2021, 1:13:49 PM2/26/21

to

Why is it that when a compiler sees::

sum=0.0;
for( i = 0; i < MAX; i++ )
sum += a[i]*b[i];

The compiler does not simply convert this into::

sum = LAPACK.DDOT( MAX, a, 1, b, 1);

??

And let the numerical analyst from 50 years ago use all
their powers to achieve good results! I bet more than
1/2 of all such loops would be numerically better off.

That is,, rather than fighting the same problem over and over
again, let the experience of the past provide a solution for
the future. And if by chance, the system you are working on
happens to have hundreds of thousands of processors,
then that system will have a LAPACK.DDOT function
optimized for the system as it is.

Stefan Monnier

unread,

Feb 26, 2021, 1:45:30 PM2/26/21

to

> Why is it that when a compiler sees::
>
> sum=0.0;
> for( i = 0; i < MAX; i++ )
> sum += a[i]*b[i];
>
> The compiler does not simply convert this into::
>
> sum = LAPACK.DDOT( MAX, a, 1, b, 1);
>
> ??

Because it doesn't give the same result in all cases?
To me the question is rather why do programmers write the first instead
of the second, and what can we do so they stop doing it.

Stefan

Thomas Koenig

unread,

Feb 26, 2021, 2:01:10 PM2/26/21

to

MitchAlsup <Mitch...@aol.com> schrieb:

> Why is it that when a compiler sees::
>
> sum=0.0;
> for( i = 0; i < MAX; i++ )
> sum += a[i]*b[i];
>
> The compiler does not simply convert this into::
>
> sum = LAPACK.DDOT( MAX, a, 1, b, 1);
>
> ??

Usually, the language definition forbids this.

Fortran has the DOT_PRODUCT intrinsic function, so people can
just write

sum = dot_product (a,b)

and the compiler can then generate highly vectorized code,
call a library function, or just generally do whatever.

Fortran also has a built-in IEEE module where you can set and
enqurie about IEEE-related stuff.

Fortran is far from perfect, but much better than C or C++
for numerical work.

Marcus

unread,

Feb 26, 2021, 2:03:01 PM2/26/21

to

I believe the biggest problem lies in the language(s) that people
use. In particular C and C++ are very popular, and they encourage
for-loops with a specific evaluation order etc.

There are other alternatives. For instance Python + NumPy, where the
latter essentially is LAPACK under the hood (the combination is kind
of similar to Matlab - a dynamically translated language with
heavily optimized numerical core routines).

Anyone coding against NumPy would avoid Python for loops for
numeric processing.

/Marcus

MitchAlsup

unread,

Feb 26, 2021, 2:14:31 PM2/26/21

to

On Friday, February 26, 2021 at 1:01:10 PM UTC-6, Thomas Koenig wrote:
> MitchAlsup <Mitch...@aol.com> schrieb:
> > Why is it that when a compiler sees::
> >
> > sum=0.0;
> > for( i = 0; i < MAX; i++ )
> > sum += a[i]*b[i];
> >
> > The compiler does not simply convert this into::
> >
> > sum = LAPACK.DDOT( MAX, a, 1, b, 1);
> >
> > ??
> Usually, the language definition forbids this.

Would this not fall under the "as if" rule ??

MitchAlsup

unread,

Feb 26, 2021, 2:17:15 PM2/26/21

to

I suspect that mostly, it is because they don't know a better solution
is available at less cost than doing it themselves.

>
>
> Stefan

MitchAlsup

unread,

Feb 26, 2021, 2:20:06 PM2/26/21

to

On Friday, February 26, 2021 at 1:01:10 PM UTC-6, Thomas Koenig wrote:

> MitchAlsup <Mitch...@aol.com> schrieb:
> > Why is it that when a compiler sees::
> >
> > sum=0.0;
> > for( i = 0; i < MAX; i++ )
> > sum += a[i]*b[i];
> >
> > The compiler does not simply convert this into::
> >
> > sum = LAPACK.DDOT( MAX, a, 1, b, 1);
> >
> > ??
> Usually, the language definition forbids this.
>
> Fortran has the DOT_PRODUCT intrinsic function, so people can
> just write
>
> sum = dot_product (a,b)

Does FORTRAN allow the compiler to "call dot_product" if it
sees the Fortran version of the C code above ?

Anton Ertl

unread,

Feb 26, 2021, 2:38:28 PM2/26/21

to

Stefan Monnier <mon...@iro.umontreal.ca> writes:
[Mitch Alsup on Cray reduction]

>> A reduction is a vector chained operation--that is a result from some
>> calculation is chained (forwarded) into the reduction unit (FADD).
>> The instruction decoder has recognized that an operand and the result
>> are the same register. Instead of incrementing the result register,
>> it is held at the "front" of the vector until the first result is generated,
>> Then it is allowed to increment normally. k happens to be the latency
>> of the FU feeding the FADD (doing the reduction). So after k steps
>> instead of adding what was in the vector register, one is adding the
>> result from the FADD and using k elements of the vector register as
>> k individual accumulators.
>
>If you overlook the lack of associativity of float addition, this is
>a rather neat solution ;-)

My guess is that Cray FP is so approximate that users don't care for
associativity. Even if you care, if the architecture defines it
appropriately, it's architecturally no problem.

It's only when you want to have an auto-vectorizing compiler (rather
than the user writing "sum(a,len)" or somesuch) that you get a
problem. I expect that VVM will also have a problem with that.

Tree reduction is, on average, nicer than linear reduction for
precision, and nicely parallelizable, but needs half a vector as extra
storage.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

MitchAlsup

unread,

Feb 26, 2021, 3:33:11 PM2/26/21

to

We all know that the least error prone way to add up a pile of numbers is
to sort the list and add them up from small to large using two accumulators
a negative one and a positive one. This has been known since the early 1950s.
The only problem with this method is that sort has a rather large cost.

I am beginning to think that a programmer needs to sign a waver before the
computer allows him/her to use floating point at all..............or spend a few
months with Kahan learning all the bad ways to code algorithms.....

Thomas Koenig

unread,

Feb 26, 2021, 5:17:31 PM2/26/21

to

Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:

> My guess is that Cray FP is so approximate that users don't care for
> associativity.

I once read that Seymour Cray would have built a machine where 2.0 +
2.0 did not equal 4.0 if it made it a little faster.

Probably apocryphal.

JimBrakefield

unread,

Feb 26, 2021, 5:24:04 PM2/26/21

to

Kahan should be respected, not revered: https://www.youtube.com/watch?v=KEAKYDyUua4
My slides on a talk on Posits have many web links and may help the confused:
https://github.com/jimbrake/alt_posit
In particular, the "quire" is an extended accumulator for posits designed to eliminate round-off
error for dot products (also works for IEEE-754's with limited exponent range). Hardware
implementation of extended dot-product accumulator can be subjected to optimization
including pipelining.

Terje Mathisen

unread,

Feb 26, 2021, 6:29:46 PM2/26/21

to

Even that isn't the best:

You can do slightly better by starting with the sort, then locate the
zero crossing point and go out in both directions from that point,
always adding/subtracting the value which will leave the smallest
magnitude partial sum. This means doing two trial fadd/fsub ops in each
iteration and throw away the one that lost the comparison.

// sort the input, set pos to the index of the first value >= 0.0
// and neg = pos-1, then ncnt = total_cnt - pos and
// pcnt = total_cnt - ncnt

while (pcnt > 0 && ncnt > 0) {
psum = acc + *pos;
nsum = acc + *neg; // Really a subtraction!
if (abs(psum) < abs(nsum)) {
acc = psum;
pos++;
pcnt--;
}
else {
acc = nsum;
neg--;
ncnt--;
}
}
while (pcnt > 0) {
acc += *pos++;
pcnt--;
}
while (ncnt > 0) {
acc += *neg--;
ncnt--;
}

This reduction stage with all the branching is still almost certainly
faster than the sorting process, and it only works as long as there are
both positive and negative values left: Past that point we're left with
just taking them in order.

OTOH, if we can do AugmentedAddition()/Kahan pair results in less time
than 5-10 regular fp ops, then that is almost certainly both faster and
more accurate.

>
> I am beginning to think that a programmer needs to sign a waver before the
> computer allows him/her to use floating point at all..............or spend a few
> months with Kahan learning all the bad ways to code algorithms.....

"all the bad ways"?
:-)

Terje Mathisen

unread,

Feb 26, 2021, 6:33:13 PM2/26/21

to

NO!

You will always have the possibility of a true sum which is arbitrarily
close to the round-off point for that final bit.

OTOH, it is relatively easy to design something which will give you
guaranteed less than 0.501 ulp final error, i.e. similar to some of
Mitch's best/fast hw trig functions.

MitchAlsup

unread,

Feb 26, 2021, 7:40:39 PM2/26/21

to

How about "a goodly large number of bad ways".

>

:-):-):-)

Stefan Monnier

unread,

Feb 26, 2021, 8:33:04 PM2/26/21

to

>>> This is clearly superior to any current setup, which means that people might
>>> even accept that the results will not be exactly identical.
>> Of course, the difference between "almost always identical" and "always
>> identical" remains substantial. I wonder if there could be a way to
>> impose some artificial rounding at the end to make sure that the results
>> are always identical (at the cost of losing a bit of precision).
>> This would give extra choice in the tradeoffs between bit-for-bit
>> reproducibility, speed, and precision.
>
> NO!
>
> You will always have the possibility of a true sum which is arbitrarily
> close to the round-off point for that final bit.

I'm not sure what it is you're saying "NO" to. I was not talking about
proximity to the ideal value but about bit-for-bit reproducibility of
the result, which is another desirable behavior.

Stefan

MitchAlsup

unread,

Feb 26, 2021, 9:11:24 PM2/26/21

to

On Friday, February 26, 2021 at 5:33:13 PM UTC-6, Terje Mathisen wrote:
> Stefan Monnier wrote:
> >> This is clearly superior to any current setup, which means that people might
> >> even accept that the results will not be exactly identical.
> >
> > Of course, the difference between "almost always identical" and "always
> > identical" remains substantial. I wonder if there could be a way to
> > impose some artificial rounding at the end to make sure that the results
> > are always identical (at the cost of losing a bit of precision).
> > This would give extra choice in the tradeoffs between bit-for-bit
> > reproducibility, speed, and precision.
> NO!
>
> You will always have the possibility of a true sum which is arbitrarily
> close to the round-off point for that final bit.
>
> OTOH, it is relatively easy to design something which will give you
> guaranteed less than 0.501 ulp final error, i.e. similar to some of
> Mitch's best/fast hw trig functions.

If it were so easy, people decades before my time would have done it.

The trick is to leave the world of floating point behind, and recompose
the problem as an integer calculation, using appropriate data widths,
and algorithms that would seem 'foolish' to the SW FP programmer.....

Thomas Koenig

unread,

Feb 27, 2021, 3:59:53 AM2/27/21

to

MitchAlsup <Mitch...@aol.com> schrieb:

> On Friday, February 26, 2021 at 1:01:10 PM UTC-6, Thomas Koenig wrote:
>> MitchAlsup <Mitch...@aol.com> schrieb:
>> > Why is it that when a compiler sees::
>> >
>> > sum=0.0;
>> > for( i = 0; i < MAX; i++ )
>> > sum += a[i]*b[i];
>> >
>> > The compiler does not simply convert this into::
>> >
>> > sum = LAPACK.DDOT( MAX, a, 1, b, 1);
>> >
>> > ??
>> Usually, the language definition forbids this.
>>
>> Fortran has the DOT_PRODUCT intrinsic function, so people can
>> just write
>>
>> sum = dot_product (a,b)
>
> Does FORTRAN allow the compiler to "call dot_product" if it
> sees the Fortran version of the C code above ?

No, it doesn't (not if you use a standard DO loop). Compilers could
allow this via non-standard flags.

What Fortran will have in the upcoming standard is DO CONCURRENT
with REDUCE, which means that the processor can sum up the values
from a loop (DO CONCURRENT has unspecified ordering) in any order.

That will fit the bill perfectly, but will take some time to
actually appear in compilers.

Anton Ertl

unread,

Feb 27, 2021, 4:15:40 AM2/27/21

to

MitchAlsup <Mitch...@aol.com> writes:
>On Friday, February 26, 2021 at 1:01:10 PM UTC-6, Thomas Koenig wrote:
>> MitchAlsup <Mitch...@aol.com> schrieb:
>> > Why is it that when a compiler sees::
>> >
>> > sum=0.0;
>> > for( i = 0; i < MAX; i++ )
>> > sum += a[i]*b[i];
>> >
>> > The compiler does not simply convert this into::
>> >
>> > sum = LAPACK.DDOT( MAX, a, 1, b, 1);
>> >
>> > ??
>> Usually, the language definition forbids this.
>
>Would this not fall under the "as if" rule ??

No, because it produces a different result.

Anton Ertl

unread,

Feb 27, 2021, 4:30:32 AM2/27/21

to

MitchAlsup <Mitch...@aol.com> writes:
>On Friday, February 26, 2021 at 1:38:28 PM UTC-6, Anton Ertl wrote:
>> Tree reduction is, on average, nicer than linear reduction for
>> precision, and nicely parallelizable, but needs half a vector as extra
>> storage.
>
>We all know that the least error prone way to add up a pile of numbers is
>to sort the list and add them up from small to large using two accumulators
>a negative one and a positive one.

Tree reduction <https://en.wikipedia.org/wiki/Pairwise_summation> is
not be the most precise way (I did not claim so), but it's a fast and
parallelizable way. If you want more precision, use something else;
or maybe combine tree reduction with the basic step of Kahan
summation.

Anton Ertl

unread,

Feb 27, 2021, 5:08:06 AM2/27/21

to

Thomas Koenig <tko...@netcologne.de> writes:
>If somebody codes a loop
>for matrix multiplication, for example, the semantics of IEEE
>combined with those of many languages will tell you that loops
>could not be blocked, vectorized or even interchanged for better
>efficiency by the compiler.

That's an interesting claim. Let's check it.

Say, we start with a naive hand-coded implementation in C:

for (i=0; i<n; i++)
for (j=0; j<p; j++) {
for (k=0, r=0.0; k<m; k++)
r += a[i*m+k]*b[k*p+j];
c[i*p+j]=r;

And we want to maintain the associativity.

We can reformulate this as:

for (i=0; i<n; i++)
for (j=0; j<p; j++)
c[i*p+j] = 0.0;
for (i=0; i<n; i++)
for (j=0; j<p; j++)
for (k=0; k<m; k++)
c[i*p+j] += a[i*m+k]*b[k*p+j];

Ok, we introduced 0+x instead of x for some computations, and a lot of
loads and stores, but this does not change the FP results.

The following loop interchange is much faster (for m=n=p=700, with gcc
-O3 and without huge pages 1.4cycles per inner loop iteration instead
of 23.1 cycles on an Ivy Bridge), and vectorizable (even
auto-vectorizable, which is reflected in the result above):

for (i=0; i<n; i++)
for (j=0; j<p; j++)
c[i*p+j] = 0.0;
for (i=0; i<n; i++)
for (k=0; k<m; k++)
for (j=0; j<p; j++)
c[i*p+j] += a[i*m+k]*b[k*p+j];

But does it change the FP results? No, because, looking at every
c[i*p+j], the summands a[i*m+k]*b[k*p+j] are still added in the order
k rising from 0 to m-1. No reassociation happens, only independent
computations are reordered.

And additional cache blocking and register blocking optimizations are
possible without changing the FP results.

The reason why this works for matrix multiply is that the computations
of all elements of c are independent, so which allows a lot of
reordering without reassociation. Other problems are not so nice.

However, a compiler (at least for a system with signal handlers) can
still not perform these transformations, unless it knows that none of
the memory accesses results in a SIGSEGV. Or maybe if it analyses the
signal handlers enough, it can.

Anton Ertl

unread,

Feb 27, 2021, 5:12:19 AM2/27/21

to

MitchAlsup <Mitch...@aol.com> writes:
>Since apparently nobody knows how CRAY performed reductions,
>I will address the algorithm.

>
>A reduction is a vector chained operation--that is a result from some
>calculation is chained (forwarded) into the reduction unit (FADD).
>The instruction decoder has recognized that an operand and the result
>are the same register. Instead of incrementing the result register,

>it is held at the "front" of the vector until the first result is generated=

>,
>Then it is allowed to increment normally. k happens to be the latency

>of the FU feeding the FADD (doing the reduction). So after k steps,
>instead of adding what was in the vector register, one is adding the=20
>result from the FADD and using k elements of the vector register as=20
>k individual accumulators.

I would expect that you would use the latency of the FADD as k: only
after k cycles do you have the result of the first FADD and can
perform another FADD.

>Presto, recognize that a 3-bit field is equal to another 3-bit field and=20
>you are performing an "add" {FADD, IADD, OR, XOR} and HW will vector
>calculate your reduction. After the vector reduction, there was some=20
>other trick used to perform the final reduction.
>
>Please don't do this !!

Why not?

Terje Mathisen

unread,

Feb 27, 2021, 6:04:09 AM2/27/21

to

That is exactly what I'm saying NO to: I.e. there is no way to guarantee
exactly the same final rounding unless you do every previous operation
always in the same order, or with infinite precision.

For both speed and accuracy I would much rather have
AugmentedAddition/TwoSum instead of an absolute evaluation order
requirement.

Terje Mathisen

unread,

Feb 27, 2021, 6:15:33 AM2/27/21

to

We're in violent agreement I think:

We both want either a full superaccumulator, or a more limited version
that carries maybe 150-200 bits, right?

The latter give you that 0.500x ulp final accuracy with very high
confidence, while only the full version (including lots of guard bits
for spurious overflow handling?) can really guarantee repeatable final
sums independent of evaluation order.

OTOH, I just came up with a somewhat contrieved example which would fail
even on my balanced positive/negative accumulator, but where the final
result is exact:

succ(succ(NegativeInf)), x, y, z,..., pred(PositiveInf)

I.e. we have input values which are very close to both positive and
negative infinity, and so close in absolute value that the sum of each
such pair is exact: In that case we need to match them up before adding
the residuals together. In the generic case we end up with the knapsack
problem trying to come up with all combinations of positive and negative
values which can added together exactly.

All of this goes away if we can just throw a few more unsigned integer
bits at the problem.
:-)

Thomas Koenig

unread,

Feb 27, 2021, 7:25:52 AM2/27/21

to

Anton Ertl <an...@mips.complang.tuwien.ac.at> schrieb:
[...]

> But does it change the FP results? No, because, looking at every
> c[i*p+j], the summands a[i*m+k]*b[k*p+j] are still added in the order
> k rising from 0 to m-1. No reassociation happens, only independent
> computations are reordered.

You're right.

Stefan Monnier

unread,

Feb 27, 2021, 10:20:22 AM2/27/21

to

> That is exactly what I'm saying NO to: I.e. there is no way to guarantee
> exactly the same final rounding unless you do every previous operation
> always in the same order, or with infinite precision.

How do you know there is no way (given the extra flexibility of being
allowed to do additional rounding)?

IOW, I imagine an algorithm which estimates the worst case error that
could accrue if evaluated in another order, and at the end performs an
extra rounding that "hides" that worst case error (something along the
lines of keeping the shared prefix of the mantissas of `result+error`
and `result-error`).

As long as the estimate itself is conservative and does not depend on
the order in which *it* is a computed (both are big ifs, admittedly),
then the result should indeed be independent on the order in which
the additions were performed (which is not the case with my "shared
prefix", obviously).
[ A final big if is whether these estimated errors can be computed
efficiently, of course. ]

I do find it likely that there's no way to do that (as you suggest), but
I wonder if it is really known to be impossible, or whether it's just
not known (and presumed unlikely) to be possible.

Stefan

Terje Mathisen

unread,

Feb 27, 2021, 11:50:10 AM2/27/21

to

Here is an example with just three numbers where the final result
depends on evaluation order:

2^53+2, 1, -2^-52

Starting from the high end, the first addition will end up with .5
(beyond the mantissa) and round up due to "nearest_or_even" to 2^53+4,
then adding (subtracting) the third value does nothing.

Starting at the other end, we get 0.9999... which we then add to the
first/largest value, and now the guard+sticky bits are 01 so we don't
round up.

What I'm trying to say is that it will always be possible to come up
with an example which will differ in the ulp bit due to evaluation order
in combination with the rounding rules.

MitchAlsup

unread,

Feb 27, 2021, 4:03:28 PM2/27/21

to

The problem requires 53+52+1 bits of precision to get the right answer,
53+52+4 to be properly rounded !

I suspect that even Kahan-Babuška summation will fail to get the right
answer.

On the other hand, this is NOT what FP was designed to manage.

Stephen Fuld

unread,

Feb 28, 2021, 12:10:56 AM2/28/21

to

On 2/25/2021 11:48 PM, Thomas Koenig wrote:
> Stephen Fuld <sf...@alumni.cmu.edu.invalid> schrieb:
>
>> The issue, as was pointed out to me by someone in this group (Anton?) is
>> that floating point addition is not associative. That is
>>
>> (A+B)+C is not necessarily equal to A+(B+C).
>
> If you're into this stuff, the Goldberg paper
> https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html
> is certainly very good reference.

Thank you. I had read Goldberg's appendix to H&P, but for some reason,
I hadn't seen this. It may be that when I first saw the reference, I
mistook it for the H&P appendix. :-(

It will take me some time to go through it, but I suspect it will be
time well spent!

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Stephen Fuld

unread,

Feb 28, 2021, 12:14:13 AM2/28/21

to

On 2/26/2021 7:31 AM, Stefan Monnier wrote:
>> A reduction is a vector chained operation--that is a result from some
>> calculation is chained (forwarded) into the reduction unit (FADD).
>> The instruction decoder has recognized that an operand and the result
>> are the same register. Instead of incrementing the result register,

>> it is held at the "front" of the vector until the first result is generated,

>> Then it is allowed to increment normally. k happens to be the latency
>> of the FU feeding the FADD (doing the reduction). So after k steps

>> instead of adding what was in the vector register, one is adding the

>> result from the FADD and using k elements of the vector register as
>> k individual accumulators.
>
> If you overlook the lack of associativity of float addition, this is
> a rather neat solution ;-)
>

> I guess for multiplication one could say that it is "associative
> enough", but it's probably not worth it because it's rather rare to
> reduce vectors of non-trivial length with a multiplication (for one
> thing, it only makes sense if those numbers are fundamentally
> unit-less).
>
> Still, there are all kinds of situation where reductions are performed
> and where we can do better than "one at a time" because the reductions
> *are* associative (e.g. because they operate on integers/fixedpoint).

Or because the operation performed isn't an arithmetic one. Examples
are Max and Min, and logical operations such as XOR for creating checksums.

I don't know how important all of these are, when compared to all the
floating point arithmetic ones.

Stephen Fuld

unread,

Feb 28, 2021, 12:26:53 AM2/28/21

to

A problem is that there are often time when you need "minimal floating
point" operations, that is, you don't want to do integers, but the
things aren't complex enough to require the kind of numerical analysis
knowledge that your solution implies. Some examples are computing sales
totals for a store, or computing an average (mean) to (say) two decimal
places of several hundred integers each less than 10,000.

Anton Ertl

unread,

Feb 28, 2021, 12:05:35 PM2/28/21

to

Stephen Fuld <sf...@alumni.cmu.edu.invalid> writes:
>A problem is that there are often time when you need "minimal floating
>point" operations, that is, you don't want to do integers, but the
>things aren't complex enough to require the kind of numerical analysis
>knowledge that your solution implies. Some examples are computing sales
>totals for a store,

Why would you need FP for that?

>or computing an average (mean) to (say) two decimal
>places of several hundred integers each less than 10,000.

Depending on your precision requirements, you can do that with FP, or
relatively easily with integers. E.g., in C:

long meanx100=(sum*100L+n/2)/n;

The "+n/2" is for rounding. Outputting this in C is a little
cumbersome:

printf("%d.%02d",meanx100/100,meanx100%100);

Probably does not work properly with negative numbers.

MitchAlsup

unread,

Feb 28, 2021, 1:53:09 PM2/28/21

to

On Sunday, February 28, 2021 at 11:05:35 AM UTC-6, Anton Ertl wrote:
> Stephen Fuld <sf...@alumni.cmu.edu.invalid> writes:
> >A problem is that there are often time when you need "minimal floating
> >point" operations, that is, you don't want to do integers, but the
> >things aren't complex enough to require the kind of numerical analysis
> >knowledge that your solution implies. Some examples are computing sales
> >totals for a store,
> Why would you need FP for that?
> >or computing an average (mean) to (say) two decimal
> >places of several hundred integers each less than 10,000.
> Depending on your precision requirements, you can do that with FP, or
> relatively easily with integers. E.g., in C:
>
> long meanx100=(sum*100L+n/2)/n;
>
> The "+n/2" is for rounding. Outputting this in C is a little
> cumbersome:
>
> printf("%d.%02d",meanx100/100,meanx100%100);
>
> Probably does not work properly with negative numbers.

Unsigned solves that

Stephen Fuld

unread,

Mar 15, 2021, 11:46:47 AM3/15/21

to

On 2/28/2021 8:49 AM, Anton Ertl wrote:
> Stephen Fuld <sf...@alumni.cmu.edu.invalid> writes:
>> A problem is that there are often time when you need "minimal floating
>> point" operations, that is, you don't want to do integers, but the
>> things aren't complex enough to require the kind of numerical analysis
>> knowledge that your solution implies. Some examples are computing sales
>> totals for a store,
>
> Why would you need FP for that?

In the US, at least, you may want dollars and cents. Or fractions of a
Euro for those in Europe. Yes, you could use scaled integer, but that
is not "really" integer, and, as you demonstrate later takes extra code.
Most people, including probably me, would simply use float (presuming
the numbers are such that it wouldn't overflow).

>
>> or computing an average (mean) to (say) two decimal
>> places of several hundred integers each less than 10,000.
>
> Depending on your precision requirements, you can do that with FP, or
> relatively easily with integers. E.g., in C:
>
> long meanx100=(sum*100L+n/2)/n;
>
> The "+n/2" is for rounding. Outputting this in C is a little
> cumbersome:
>
> printf("%d.%02d",meanx100/100,meanx100%100);
>
> Probably does not work properly with negative numbers.

As I said above, you are right, you can do it that way. But it is much
easier, and probably less error prone, to use floating point.

These are examples of what I mean by the term "minmal floating point".
You could use scaled integer, but it is so much easier to just code
"float" for the appropriate data definitions.

MitchAlsup

unread,

Mar 15, 2021, 12:04:14 PM3/15/21

to

On Monday, March 15, 2021 at 10:46:47 AM UTC-5, Stephen Fuld wrote:
> On 2/28/2021 8:49 AM, Anton Ertl wrote:
> > Stephen Fuld <sf...@alumni.cmu.edu.invalid> writes:
> >> A problem is that there are often time when you need "minimal floating
> >> point" operations, that is, you don't want to do integers, but the
> >> things aren't complex enough to require the kind of numerical analysis
> >> knowledge that your solution implies. Some examples are computing sales
> >> totals for a store,
> >
> > Why would you need FP for that?
> In the US, at least, you may want dollars and cents. Or fractions of a
> Euro for those in Europe. Yes, you could use scaled integer, but that
> is not "really" integer, and, as you demonstrate later takes extra code.
> Most people, including probably me, would simply use float (presuming
> the numbers are such that it wouldn't overflow).

Why not simply use a language where it is straightforward to specify
fixed point integers ? That way the compiler has the burden of keeping
track of the binary point and most of the rest of the horseplay.

Ivan Godard

unread,

Mar 15, 2021, 12:12:59 PM3/15/21

to

Unfortunately the FP calculation would be illegal in most commercial
applications. By statute, contract and commercial practice you are
required to use bounded scaled integer, must preserve scaling, and are
not permitted to use arbitrary precision, even when that is more
"accurate". And then there's "bankers rounding", also required sometimes.

Stephen Fuld

unread,

Mar 15, 2021, 12:38:41 PM3/15/21

to

On 3/15/2021 9:04 AM, MitchAlsup wrote:
> On Monday, March 15, 2021 at 10:46:47 AM UTC-5, Stephen Fuld wrote:
>> On 2/28/2021 8:49 AM, Anton Ertl wrote:
>>> Stephen Fuld <sf...@alumni.cmu.edu.invalid> writes:
>>>> A problem is that there are often time when you need "minimal floating
>>>> point" operations, that is, you don't want to do integers, but the
>>>> things aren't complex enough to require the kind of numerical analysis
>>>> knowledge that your solution implies. Some examples are computing sales
>>>> totals for a store,
>>>
>>> Why would you need FP for that?
>> In the US, at least, you may want dollars and cents. Or fractions of a
>> Euro for those in Europe. Yes, you could use scaled integer, but that
>> is not "really" integer, and, as you demonstrate later takes extra code.
>> Most people, including probably me, would simply use float (presuming
>> the numbers are such that it wouldn't overflow).
>
> Why not simply use a language where it is straightforward to specify
> fixed point integers ? That way the compiler has the burden of keeping
> track of the binary point and most of the rest of the horseplay.

That is a great solution if you have the choice available. But in lots
of real world situations, you don't get to choose the implementation
language. Things like company policies, compiler availability, etc.
often limit those choices.

Stephen Fuld

unread,

Mar 15, 2021, 12:41:23 PM3/15/21

to

Are you saying that IBM (whose "middle name", after all is "Business")
spent a lot of money implementing something that is basically useless?

EricP

unread,

Mar 15, 2021, 12:44:17 PM3/15/21

to

MitchAlsup wrote:
> On Monday, March 15, 2021 at 10:46:47 AM UTC-5, Stephen Fuld wrote:
>> On 2/28/2021 8:49 AM, Anton Ertl wrote:
>>> Why would you need FP for that?
>> In the US, at least, you may want dollars and cents. Or fractions of a
>> Euro for those in Europe. Yes, you could use scaled integer, but that
>> is not "really" integer, and, as you demonstrate later takes extra code.
>> Most people, including probably me, would simply use float (presuming
>> the numbers are such that it wouldn't overflow).
>
> Why not simply use a language where it is straightforward to specify
> fixed point integers ? That way the compiler has the burden of keeping
> track of the binary point and most of the rest of the horseplay.

A language with decimal fixed point.
The original Ada-1983 specified binary fixed point.
So you still could have wound up trying to explain to
your grandmother why her phone bill didn't add up.

EricP

unread,

Mar 15, 2021, 12:58:58 PM3/15/21

to

There are a plethora of decimal rounding modes.
Conventional, Argentina, Swiss, Swedish...
They vary in how many digits they look at, what digit values they round,
and what intervals they round to.

https://www.syspro.com/blog/applying-and-operating-erp/rounding-for-accounting-accuracy/

https://en.wikipedia.org/wiki/Swedish_rounding

Ivan Godard

unread,

Mar 15, 2021, 1:28:04 PM3/15/21

to

No, they implemented packed decimal, and later DFP, for the business
user, and binary FP for the scientific user. There is no one size to fit
all.

Stephen Fuld

unread,

Mar 15, 2021, 1:36:41 PM3/15/21

to

Right. But I thought you you were saying that the DFP stuff would be
illegal in most places, hence not useful.

Terje Mathisen

unread,

Mar 15, 2021, 4:45:34 PM3/15/21

to

I'm pretty sure you have to employ effectively arbitrary precision (i.e.
exact DFADD & DFMUL) followed by manual rounding according to whatever
the legal requirements might be for the particular jurisdiction.

You can probably use DFP as-is for quite a few stages but you will need
extra operations in some places.

Just the little snag that USD ops were carried to 4 decimals but Euro
decided to require 5 broke quite a few financial/accounting packages.

Terje Mathisen

unread,

Mar 15, 2021, 4:53:55 PM3/15/21

to

That just means that you have to implement the core operations as
callable functions instead of inline operators, unless that company
policy prescribes a language like C++ where you can overload those
operations.

Without overloading the resulting program becomes somewhat less
readable, otoh you can be pretty sure that you understand what it is
doing. I have seen too many C++ programs where overloading had been
carried to stupid levels.

OTOH, even in a modern language like Rust you can write perfectly
readable (to a C guy like myself) and legal code which does something
quite different from what you'd expect.

A guy at work asked me what this would do:

Given that a = 5 and b = 3, what will result here:

diff = a++ - --b;

(This line will not result in any warnings or errors.)

MitchAlsup

unread,

Mar 15, 2021, 5:07:07 PM3/15/21

to

In C::
(a++) delivers the value of 5 and 'a' gets the value of 6
(--b) delivers the value of 2 and 'b' gets the value of 2
6-2 delivers the value of 4
So 'diff' = 4, 'a'=6, 'b'=2

I don't have any Idea about Rust.

Terje Mathisen

unread,

Mar 15, 2021, 5:11:11 PM3/15/21

to

I wrote the SW used for Universal Presentkort starting in 1982, this
is/was the largest gift card company in Norway (created by my
father-in-law). The rounding rule I had to implement was to calculate
and deduct the fee percentage (which could vary from business to
business) for each transaction, instead of doing it to the total sum,
otherwise their books would not balance when a large chain collected and
returned our gift cards from multiple chain outlets.

I did however do all of this using plain 64-bit double, just with
careful consideration of maximal errors due to using binary instead of
decimal, and then adding a corresponding epsilon (typically 0.5000000001
or similar) after I had temporarily multiplied the current value by a
scale factor so that I could take the integer result and then scale down
again.

Working in fixed point would have been easier. :-)
>
>
> https://en.wikipedia.org/wiki/Swedish_rounding
>

AS the article shows, we have been using similar approaches here for
few decades.

The most common and simultaneously largest rounding offset happened when
you returned a 1.5l PET soda bottle: This had a value of 2.50 NOK, which
the cashier would always have to round up to 3.00 after the 50 øre coin
was removed.

However, if you brought in a sack of bottles and returned them
one-by-one the checkout clerk and/or store owner would probably be a bit
irritated.

David Brown

unread,

Mar 15, 2021, 6:08:00 PM3/15/21

to

Note, however, that C does not specify the ordering of the assignments
here. It is quite possible for the value 4 to be stored to "diff", then
the value 2 is stored to b, and the value 6 is stored to a. And expect
nasal demons if someone writes "#define diff a" before that line!

MitchAlsup

unread,

Mar 15, 2021, 6:42:13 PM3/15/21

to

On Monday, March 15, 2021 at 5:08:00 PM UTC-5, David Brown wrote:
> On 15/03/2021 22:07, MitchAlsup wrote:
> > On Monday, March 15, 2021 at 3:53:55 PM UTC-5, Terje Mathisen wrote:
>
> >> OTOH, even in a modern language like Rust you can write perfectly
> >> readable (to a C guy like myself) and legal code which does something
> >> quite different from what you'd expect.
> >>
> >> A guy at work asked me what this would do:
> >>
> >> Given that a = 5 and b = 3, what will result here:
> >>
> >> diff = a++ - --b;
> >
> > In C::
> > (a++) delivers the value of 5 and 'a' gets the value of 6
> > (--b) delivers the value of 2 and 'b' gets the value of 2
> > 6-2 delivers the value of 4
> > So 'diff' = 4, 'a'=6, 'b'=2
> Note, however, that C does not specify the ordering of the assignments

Yes, I agree that until you run into a sequence point (or comma) evaluation
order is not fixed.

However given the original text and a sequence point after the line,
the various containers have the values given.

Ivan Godard

unread,

Mar 15, 2021, 7:23:38 PM3/15/21

to

I was saying that binary FP, which is what I understood was being
proposed for commercial work, is illegal. It won't overflow, but it
won't get the required answer either.

David Brown

unread,

Mar 15, 2021, 8:09:38 PM3/15/21

to

On 15/03/2021 23:42, MitchAlsup wrote:
> On Monday, March 15, 2021 at 5:08:00 PM UTC-5, David Brown wrote:
>> On 15/03/2021 22:07, MitchAlsup wrote:
>>> On Monday, March 15, 2021 at 3:53:55 PM UTC-5, Terje Mathisen wrote:
>>
>>>> OTOH, even in a modern language like Rust you can write perfectly
>>>> readable (to a C guy like myself) and legal code which does something
>>>> quite different from what you'd expect.
>>>>
>>>> A guy at work asked me what this would do:
>>>>
>>>> Given that a = 5 and b = 3, what will result here:
>>>>
>>>> diff = a++ - --b;
>>>
>>> In C::
>>> (a++) delivers the value of 5 and 'a' gets the value of 6
>>> (--b) delivers the value of 2 and 'b' gets the value of 2
>>> 6-2 delivers the value of 4
>>> So 'diff' = 4, 'a'=6, 'b'=2
>> Note, however, that C does not specify the ordering of the assignments
>
> Yes, I agree that until you run into a sequence point (or comma) evaluation
> order is not fixed.

Yes, the evaluation order of subexpressions is unsequenced (except for
special cases, like the comma operator, logical operators && and ||, and
the tertiary operator ?: ). (That last one might appear as a weird
smiley...)

The evaluations can even be interleaved. There is not a lot to
interleave in this case (unless "a" and "b" are big types and the
processor is too small to handle the operations in a single instruction).

>
> However given the original text and a sequence point after the line,
> the various containers have the values given.
>

Yes, the results are the same regardless of the sequencing.

If the results could be different for different orders of execution,
that generally means the behaviour is undefined. Usually you only get
that if you try to modify the same object twice in the same expression -
a bad idea!

MitchAlsup

unread,

Mar 15, 2021, 8:20:03 PM3/15/21

to

It is almost as if one would want to do::

# define ; ;,

!!

David Brown

unread,

Mar 15, 2021, 8:28:17 PM3/15/21

to

Don't worry, a semicolon already acts as a sequence point.

Stephen Fuld

unread,

Mar 15, 2021, 10:29:58 PM3/15/21

to

Oh. OK. Sorry I misunderstood.

James Van Buskirk

unread,

Mar 15, 2021, 10:39:56 PM3/15/21

to

"MitchAlsup" wrote in message
news:af31fd86-154f-44ca...@googlegroups.com...

It seemed to me that you had encountered the foot which Terje
so politely extended, and indeed:

D:\gfortran\clf\MRISC32>type MRISC32.c
#include <stdio.h>

int main()
{
int a = 5, b = 3, diff;

diff = a++ - --b;

printf("a = %d, b = %d, diff = %d\n", a, b, diff);
return 0;
}

D:\gfortran\clf\MRISC32>gcc MRISC32.c -o MRISC32

D:\gfortran\clf\MRISC32>MRISC32
a = 6, b = 2, diff = 3

Stefan Monnier

unread,

Mar 15, 2021, 11:05:14 PM3/15/21

to

>> > diff = a++ - --b;
>
>> In C::
>> (a++) delivers the value of 5 and 'a' gets the value of 6
>> (--b) delivers the value of 2 and 'b' gets the value of 2
>> 6-2 delivers the value of 4
>> So 'diff' = 4, 'a'=6, 'b'=2
>
> It seemed to me that you had encountered the foot which Terje
> so politely extended, and indeed:

The first two lines were right: "(a++) delivers the value of 5" and "(--b)
delivers the value of 2" so the subtraction returns 3.

Stefan

David Brown

unread,

Mar 16, 2021, 3:29:59 AM3/16/21

to

On 16/03/2021 03:39, James Van Buskirk wrote:
> "MitchAlsup" wrote in message
> news:af31fd86-154f-44ca...@googlegroups.com...
>
>> On Monday, March 15, 2021 at 3:53:55 PM UTC-5, Terje Mathisen wrote:
>
>> > diff = a++ - --b;
>
>> In C::
>> (a++) delivers the value of 5 and 'a' gets the value of 6
>> (--b) delivers the value of 2 and 'b' gets the value of 2
>> 6-2 delivers the value of 4
>> So 'diff' = 4, 'a'=6, 'b'=2
>
> It seemed to me that you had encountered the foot which Terje
> so politely extended, and indeed:
>

> a = 6, b = 2, diff = 3
>

I for one simply didn't check his arithmetic! The "difficult" bit is
knowing that (a++) and (--b) evaluate to 5 and 2 respectively. I didn't
even bother noticing that the value he wrote for "diff" was wrong. I
don't know if this was a silly mistake on Terje's part, but it was
certainly a silly mistake on /my/ part not to notice it.

Terje Mathisen

unread,

Mar 16, 2021, 5:35:47 AM3/16/21

to

You all very helpfully fell into the same manhole as I did:

In Rust neither ++ nor -- is defined, so in reality what the line does is

diff = a + + - (-(-b))

which (silently!) simplifies to

diff = a - b

Oops!