FP transcendentals (trigonometry, root/exp/log) proposal


lkcl

Aug 6, 2019, 8:33:46 AM
to RISC-V ISA Dev, Libre-RISCV General Development
https://libre-riscv.org/ztrans_proposal/

As part of developing a Libre GPU that is intended for 3D, specialist Compute and Machine Learning, standard operations used in OpenCL are pretty much mandatory [1].

As they will end up in common public usage - upstream compilers with high volumes of downloads - it does not make sense for these opcodes to be relegated to "custom" status ["custom" status is suitable only for embedded proprietary usage that will never see the public light of day].

Also, they are not being proposed as part of RVV for the simple reason that as "scalar" opcodes, they can be used with *scalar* designs. It makes more sense that they be deployable in "embedded" designs (that may not have room for RVV, particularly as CORDIC seems to cover the vast majority of trigonometric algorithms and more [2]), or in NUMA parallel designs, where a cluster of shaders makes up for a lack of "vectorisation".

In addition: as scalar opcodes, they fit into the (huge, sparsely populated) FP opcode brownfield, whereas the RVV major opcode is much more under pressure.

The list of opcodes is at an early stage, and participation in its development is open and welcome to anyone involved in 3D and OpenCL Compute applications.

Context, research, links and discussion are being tracked on the libre riscv bugtracker [3].

L.

[1] https://www.khronos.org/registry/spir-v/specs/unified1/OpenCL.ExtendedInstructionSet.100.html
[2] http://www.andraka.com/files/crdcsrvy.pdf
[3] http://bugs.libre-riscv.org/show_bug.cgi?id=127

MitchAlsup

Aug 7, 2019, 6:36:17 PM
to RISC-V ISA Dev, libre-r...@lists.libre-riscv.org
Is this proposal going to <eventually> include::

a) statement on required/delivered numeric accuracy per transcendental ?
b) a reserve on the OpCode space for the double precision equivalents ?
c) a statement on <approximate> execution time ?

You may have more transcendentals than necessary::
1) for example all of the inverse hyperbolic can be calculated to GRAPHICs numeric quality with short sequences of already existing transcendentals
..... ASINH( x ) = ln( x + SQRT(x**2+1) )

2) LOG(x) = LOGP1(x - 1.0)
... EXP(x) = EXPM1(x) + 1.0

That is:: LOGP1 and EXPM1 provide greater precision (especially when the result is near zero) than their sister functions, and the compiler can easily add the additional instruction to the instruction stream where appropriate.
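[A quick illustrative sketch in C of the recipes above, using libm's names (asinh, log1p, expm1); not part of the original post:

    #include <math.h>
    #include <stdio.h>

    /* asinh composed from existing transcendentals: GRAPHICs-quality only
       (it loses accuracy for negative and very large x). */
    static double asinh_recipe(double x) { return log(x + sqrt(x * x + 1.0)); }

    /* log/exp recovered from the higher-precision log1p/expm1: the compiler
       just adds one extra subtract or add around the instruction. */
    static double log_recipe(double x) { return log1p(x - 1.0); }
    static double exp_recipe(double x) { return expm1(x) + 1.0; }

    int main(void)
    {
        printf("%.17g %.17g\n", asinh_recipe(0.5), asinh(0.5));
        printf("%.17g %.17g\n", log_recipe(2.0), log(2.0));
        printf("%.17g %.17g\n", exp_recipe(1.0), exp(1.0));
        return 0;
    }
]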

Jacob Lifshay

Aug 7, 2019, 7:43:21 PM
to MitchAlsup, RISC-V ISA Dev, Libre-RISCV General Development
On Wed, Aug 7, 2019, 15:36 'MitchAlsup' via RISC-V ISA Dev <isa...@groups.riscv.org> wrote:
Is this proposal going to <eventually> include::

a) statement on required/delivered numeric accuracy per transcendental ?
From what I understand, they require correctly rounded results. We should eventually state that somewhere. The requirement for correctly rounded results is so the instructions can replace the corresponding functions in libm (they're not just for GPUs) and for reproducibility across implementations.

b) a reserve on the OpCode space for the double precision equivalents ?
the 2 bits right below the funct5 field select from:
00: f32
01: f64
10: f16
11: f128

so f64 is definitely included.

see table 11.3 in Volume I: RISC-V Unprivileged ISA V20190608-Base-Ratified
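[For illustration, a sketch of how the fmt field is decoded; the enum and helper names below are invented, while the bit positions are those of the existing OP-FP encodings:

    /* fmt occupies instruction bits 26:25, directly below funct5
       (bits 31:27), per table 11.3 of the RISC-V Unprivileged ISA. */
    enum fp_fmt { FMT_F32 = 0, FMT_F64 = 1, FMT_F16 = 2, FMT_F128 = 3 };

    static inline enum fp_fmt decode_fp_fmt(unsigned int insn)
    {
        return (enum fp_fmt)((insn >> 25) & 3u);
    }
]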

it would probably be a good idea to split the transcendental extensions into separate f32, f64, f16, and f128 extensions, since some implementations may want to only implement them for f32 while still implementing the D (f64 arithmetic) extension.

c) a statement on <approximate> execution time ?
that would be microarchitecture specific. since this is supposed to be an inter-vendor (icr the right term) specification, that would be up to the implementers. I would assume that they are at least faster than a soft-float implementation (since that's usually the whole point of implementing them).

For our implementation, I'd imagine something between 8 and 40 clock cycles for most of the operations. sin, cos, and tan (but not sinpi and friends) may require much more than that for large inputs, where range reduction must accurately calculate x mod 2*pi; hence we are thinking of implementing sinpi, cospi, and tanpi instead (they only require calculating x mod 2, which is much faster and simpler).
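[A minimal sketch of why the sinpi reduction is cheap; illustrative only, since the final sin(r * PI) stands in for a real polynomial kernel and reintroduces a rounding of pi that hardware would avoid:

    #include <math.h>

    static const double PI = 3.14159265358979323846;

    /* sinpi(x) = sin(pi * x). remainder(x, 2.0) is exact because 2.0 is
       a power of two, so no multi-word reduction is needed; contrast
       with sin(x), where x mod 2*pi needs Payne-Hanek-style reduction. */
    static double sinpi_sketch(double x)
    {
        double r = remainder(x, 2.0);   /* exact; r is in [-1, 1] */
        return sin(r * PI);
    }
]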

You may have more transcendentals than necessary::
1) for example all of the inverse hyperbolic can be calculated to GRAPHICs numeric quality with short sequences of already existing transcendentals
..... ASINH( x ) = ln( x + SQRT(x**2+1) )
That's why the hyperbolics extension is split out into a separate extension. Also, a single instruction may be much faster, since it can calculate everything as one operation (CORDIC will work) rather than requiring several slow operations: sqrt, div, and log.

2) LOG(x) = LOGP1(x - 1.0)
... EXP(x) = EXPM1(x) + 1.0

That is:: LOGP1 and EXPM1 provide greater precision (especially when the result is near zero) than their sister functions, and the compiler can easily add the additional instruction to the instruction stream where appropriate.
for the implementation techniques I know for log/exp, implementing both log/exp and logp1/expm1 is a slight increase in complexity compared to only one or the other (changing constants for polynomial/LUT-based implementations and for CORDIC). I think it's worth keeping both sets: log/exp for the common case of implementing pow, and logp1/expm1 for the additional accuracy obtained at a small additional cost.
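[For reference, the "common case" mentioned above, pow composed from log and exp; a naive sketch, since the error of log(x) is amplified by y and a production pow needs extra precision in the log step:

    #include <math.h>

    /* pow built from log/exp; assumes x > 0. */
    static double pow_sketch(double x, double y)
    {
        return exp(y * log(x));
    }
]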

Jacob Lifshay

lkcl

Aug 7, 2019, 8:27:08 PM
to RISC-V ISA Dev, libre-r...@lists.libre-riscv.org
[some overlap with what jacob wrote, reviewing/removing redundant replies]


On Wednesday, August 7, 2019 at 11:36:17 PM UTC+1, MitchAlsup wrote:
Is this proposal going to <eventually> include:: 
a) statement on required/delivered numeric accuracy per transcendental ?


jacob makes and emphasises the point that these are intended to be *scalar* operations, for direct use in libm.

b) a reserve on the OpCode space for the double precision equivalents ?


reservations do not have an IANA-style contact/proposal procedure, even where the case has been made clear that the lack of a reservation will cause severe, ongoing detriment to the wider RISC-V community.  i've repeatedly requested an official reservation, for this and many other proposals.

i have not received a response.

Jacob wrote:

> it would probably be a good idea to split the transcendental extensions
> into separate f32, f64, f16, and f128 extensions, since some implementations 
> may want to only implement them for f32 while still implementing the D
> (f64 arithmetic) extension.

oh, of course. Ztrans.F/Q/S/H is a really good point.

c) a statement on <approximate> execution time ?

what jacob said.

as a Standard, we can't limit the proposal in ways that would restrict or exclude implementors.  accuracy on the other hand *is* important, because it could potentially cause catastrophic failures if an algorithm is written to critically rely on a given accuracy.

You may have more transcendentals than necessary::
1) for example all of the inverse hyperbolic can be calculated to GRAPHICs numeric quality with short sequences of already existing transcendentals
..... ASINH( x ) = ln( x + SQRT(x**2+1) )


ah, excellent - i'll add that recipe to the document.   Zfhyp, separate extension.

2) LOG(x) = LOGP1(x - 1.0)
... EXP(x) = EXPM1(x) + 1.0

That is:: LOGP1 and EXPM1 provide greater precision (especially when the result is near zero) than their sister functions, and the compiler can easily add the additional instruction to the instruction stream where appropriate.

oo that's very interesting.   of course.  i like it.

the only thing: as a Standard, some implementors may find it more efficient to implement LOG than LOGP1 (likewise with EXP).  in particular, if CORDIC is used (which i have just recently found, and am absolutely amazed by - https://en.wikipedia.org/wiki/CORDIC), i cannot find a LOGP1/EXPM1 version of it.

CORDIC would be the most sensible "efficient" choice of hardware algorithm, simply because of the sheer overwhelming number of transcendentals that it covers.  if there isn't a way to implement LOGP1 using CORDIC, and one but not the other is chosen, some implementation options will be limited / penalised.

this is one of the really tricky things about Standards.  if we were doing a single implementation, not intended in any way to be Standards-compliant, we could make the decision, best optimised option, according to our requirements, and to hell with everyone else.  take that approach with a Standard, and it results in... other teams creating their own Standard.

having two near-identical opcodes where one may be implemented in terms of the other is however rather unfortunately against the principle of RISC.  in this particular case, though, the hardware implementation actually matters.

does anyone know if CORDIC can be adapted to do LOGP1 as well as LOG?  ha, funny: i found a reference, but the original Dr Dobbs article, which has "example 4(d)" as a hyperlink, redirects to a 404 not found.

l.

MitchAlsup

Aug 7, 2019, 8:29:29 PM
to RISC-V ISA Dev, Mitch...@aol.com, libre-r...@lists.libre-riscv.org


On Wednesday, August 7, 2019 at 6:43:21 PM UTC-5, Jacob Lifshay wrote:
On Wed, Aug 7, 2019, 15:36 'MitchAlsup' via RISC-V ISA Dev <isa...@groups.riscv.org> wrote:
Is this proposal going to <eventually> include::

a) statement on required/delivered numeric accuracy per transcendental ?
From what I understand, they require correctly rounded results. We should eventually state that somewhere. The requirement for correctly rounded results is so the instructions can replace the corresponding functions in libm (they're not just for GPUs) and for reproducibility across implementations.

Correctly rounded results will require a lot more difficult hardware and more cycles of execution.
Standard GPUs today allow 1-2 bits (ULPs) of error for simple transcendentals and 3-4 bits for some of the harder functions.
Standard GPUs today are producing fully pipelined results with 5 cycle latency for F32 (with 1-4 bits of imprecision)
Based on my knowledge of the situation, requiring IEEE 754 correct rounding will double the area of the transcendental unit, triple the area used for coefficients, and come close to doubling the latency.

b) a reserve on the OpCode space for the double precision equivalents ?
the 2 bits right below the funct5 field select from:
00: f32
01: f64
10: f16
11: f128

so f64 is definitely included.

see table 11.3 in Volume I: RISC-V Unprivileged ISA V20190608-Base-Ratified

it would probably be a good idea to split the transcendental extensions into separate f32, f64, f16, and f128 extensions, since some implementations may want to only implement them for f32 while still implementing the D (f64 arithmetic) extension.

c) a statement on <approximate> execution time ?
that would be microarchitecture specific. since this is supposed to be an inter-vendor (icr the right term) specification, that would be up to the implementers. I would assume that they are at least faster than a soft-float implementation (since that's usually the whole point of implementing them).

For our implementation, I'd imagine something between 8 and 40 clock cycles for most of the operations. sin, cos, and tan (but not sinpi and friends) may require much more than that for large inputs, where range reduction must accurately calculate x mod 2*pi; hence we are thinking of implementing sinpi, cospi, and tanpi instead (they only require calculating x mod 2, which is much faster and simpler).

I can point you at (and have) the technology to perform most of these to the accuracy stated above in 5 cycles F32.

I have the technology to perform LN2P1 and EXP1M in 14 cycles, SIN and COS (including argument reduction) in 19 cycles, and POW in 34 cycles, while achieving "faithful rounding" of the result in any of the IEEE 754-2008 rounding modes, using a floating point unit essentially the same size as an FMAC unit that can also do FDIV and FSQRT. SIN and COS have full Payne and Hanek argument reduction, which costs 4 cycles and allows "silly" arguments to be properly processed:: COS(6381956970095103×2^797) = -4.68716592425462761112×10^-19

Faithful rounding is not IEEE 754-correct rounding. The unit I have designed makes an IEEE rounding error about once every 171 calculations.

MitchAlsup

Aug 7, 2019, 8:32:57 PM
to RISC-V ISA Dev, libre-r...@lists.libre-riscv.org
Both Motorola CORDIC and Intel CORDIC specified the LOGP1 and EXP1M instead of LOG and EXP. 

lkcl

Aug 7, 2019, 8:45:23 PM
to RISC-V ISA Dev, libre-r...@lists.libre-riscv.org
i think i managed to interpret the paper, below - it tends to suggest that LOG is possible with the standard hyperbolic CORDIC.  the thing is: the add 1 is done *outside* the LOG(a), which tends to suggest that the iterative algorithm needs modifying...

... unless it's as simple as setting Z0=1

does that look reasonable?

[i really don't like deriving algorithms like this from scratch: someone somewhere has done this, it's so ubiquitous.  i'd be much happier - much more comfortable - when i can see (and execute) a software algorithm that shows how it's done.]

---

https://www.researchgate.net/publication/230668515_A_fixed-point_implementation_of_the_natural_logarithm_based_on_a_expanded_hyperbolic_CORDIC_algorithm

Since: ln(a) = 2*atanh( (a-1) / (a+1) )

The function ln(a) is obtained by multiplying the final result ZN by 2 (Equation (4)), provided that Z0 = 0, X0 = a+1, and Y0 = a-1.
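[Translating that setup into an executable sketch; this code is an interpretation, not the paper's. It runs in double for clarity, where hardware would use fixed point with a small table of atanh(2^-i) constants. Note hyperbolic CORDIC only converges for a within roughly [0.1, 9.4], so a real unit would first scale a by a power of two and add k*ln(2):

    #include <math.h>
    #include <stdio.h>

    /* Natural log via hyperbolic CORDIC in vectoring mode:
       X0 = a+1, Y0 = a-1, Z0 = 0  =>  ZN -> atanh((a-1)/(a+1)) = ln(a)/2 */
    static double cordic_ln(double a)
    {
        double x = a + 1.0, y = a - 1.0, z = 0.0;
        for (int i = 1; i <= 30; i++) {
            /* hyperbolic CORDIC must repeat iterations 4, 13, 40, ... */
            int passes = (i == 4 || i == 13) ? 2 : 1;
            for (int p = 0; p < passes; p++) {
                double t = ldexp(1.0, -i);          /* 2^-i: a shift in HW */
                double e = atanh(t);                /* ROM constant in HW  */
                double d = (y < 0.0) ? 1.0 : -1.0;  /* drive y towards 0   */
                double xn = x + d * y * t;
                double yn = y + d * x * t;
                x = xn;
                y = yn;
                z -= d * e;
            }
        }
        return 2.0 * z;
    }

    int main(void)
    {
        printf("%.15f %.15f\n", cordic_ln(2.0), log(2.0)); /* should agree */
        return 0;
    }
]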

 

lkcl

Aug 7, 2019, 8:57:38 PM
to RISC-V ISA Dev, Mitch...@aol.com, libre-r...@lists.libre-riscv.org


On Thursday, August 8, 2019 at 1:29:29 AM UTC+1, MitchAlsup wrote:


On Wednesday, August 7, 2019 at 6:43:21 PM UTC-5, Jacob Lifshay wrote:
On Wed, Aug 7, 2019, 15:36 'MitchAlsup' via RISC-V ISA Dev <isa...@groups.riscv.org> wrote:
Is this proposal going to <eventually> include::

a) statement on required/delivered numeric accuracy per transcendental ?
From what I understand, they require correctly rounded results. We should eventually state that somewhere. The requirement for correctly rounded results is so the instructions can replace the corresponding functions in libm (they're not just for GPUs) and for reproducibility across implementations.

Correctly rounded results will require a lot more difficult hardware and more cycles of execution.
Standard GPUs today allow 1-2 bits (ULPs) of error for simple transcendentals and 3-4 bits for some of the harder functions.
Standard GPUs today are producing fully pipelined results with 5 cycle latency for F32 (with 1-4 bits of imprecision)
Based on my knowledge of the situation, requiring IEEE 754 correct rounding will double the area of the transcendental unit, triple the area used for coefficients, and come close to doubling the latency.

hmmm... i don't know what to suggest / recommend here.  there are two separate requirements: accuracy (OpenCL, numerical scenarios), and 3D GPUs, where better accuracy is not essential.

i would be tempted to say that it was reasonable to suggest that if you're going to use FP32, expectations are lower so "what the heck".  however i have absolutely *no* idea what the industry consensus is, here.

i do know that you've an enormous amount of expertise and experience in the 3D GPU area, Mitch.

I can point you at (and have) the technology to perform most of these to the accuracy stated above in 5 cycles F32.

I have the technology to perform LN2P1 and EXP1M in 14 cycles, SIN and COS (including argument reduction) in 19 cycles, and POW in 34 cycles, while achieving "faithful rounding" of the result in any of the IEEE 754-2008 rounding modes, using a floating point unit essentially the same size as an FMAC unit that can also do FDIV and FSQRT. SIN and COS have full Payne and Hanek argument reduction, which costs 4 cycles and allows "silly" arguments to be properly processed:: COS(6381956970095103×2^797) = -4.68716592425462761112×10^-19

yes please.  

there will be other implementors of this Standard that will want to make a different call on which direction to go.

l.

MitchAlsup

Aug 7, 2019, 9:17:37 PM
to RISC-V ISA Dev, Mitch...@aol.com, libre-r...@lists.libre-riscv.org
An old guy at IBM (a Fellow) made a long and impassioned plea in a paper from the late 1970s or early 1980s that whenever something is put "into the instruction set", the result should be as accurate as possible. Look it up, it's a good read.

At the time I was working for a mini-computer company where a new implementation was not giving binary accurate results compared to an older generation. This was traced to an "enhancement" in the F32 and F64 accuracy of the new implementation. The customers all wanted binary equivalence, even if the math was worse.

On the other hand, back when I started doing this (CPU design), the guys using floating point just wanted speed, and they were willing to put up with not only IBM floating point (hex normalization and guard digit) but even CRAY floating point (CDC 6600, CDC 7600, CRAY 1), which was demonstrably WORSE in the numerics department.

In any event, to all but 5 floating point guys in the world, a rounding error (compared to the correctly rounded result) occurring less than 3% of the time and off by no more than 1 ULP is as accurate as they need (caveat: so long as the arithmetic is repeatable). As witness, the FDIV <lack of> instruction in ITANIC had 0.502 ULP accuracy (Markstein) and nobody complained.

My gut feeling tells me that the numericalists are perfectly willing to accept an error of 0.51 ULP RMS on transcendental calculations.
My gut feeling tells me that the numericalists are not willing to accept an error of 0.75 ULP RMS on transcendental calculations.
I have no feeling at all on where to draw the line.

lkcl

Aug 8, 2019, 1:20:03 AM
to RISC-V ISA Dev, Mitch...@aol.com, libre-r...@lists.libre-riscv.org


On Thursday, August 8, 2019 at 2:17:37 AM UTC+1, MitchAlsup wrote:
An old guy at IBM (a Fellow) made a long and impassioned plea in a paper from the late 1970s or early 1980s that whenever something is put "into the instruction set", the result should be as accurate as possible. Look it up, it's a good read.

At the time I was working for a mini-computer company where a new implementation was not giving binary accurate results compared to an older generation. This was traced to an "enhancement" in the F32 and F64 accuracy of the new implementation. The customers all wanted binary equivalence, even if the math was worse.

someone on the libre-riscv-dev list just hilariously pointed out that Amdahl's IBM370-compatible mainframes had FP that was more accurate than the 370: customers *complained*, and they had to provide libraries that *de-accurified* the FP calculations :)

My gut feeling tells me that the numericalists are perfectly willing to accept an error of 0.51 ULP RMS on transcendental calculations.
My gut feeling tells me that the numericalists are not willing to accept an error of 0.75 ULP RMS on transcendental calculations.
I have no feeling at all on where to draw the line.

this tends to suggest that three platform specs are needed:

* Embedded Platform (where it's entirely up to the implementor, as there will be no interaction with public APIs)
* UNIX Platform (which would require strict IEEE754 accuracy, for use in GNU libm, OR repeatable numericalist-acceptable accuracy)
* a *NEW* 3D Platform, where accuracy is defined by strict conformance to a high-profile standard e.g. OpenCL / Vulkan.

l.

Jacob Lifshay

Aug 8, 2019, 1:30:11 AM
to Luke Kenneth Casson Leighton, RISC-V ISA Dev, Mitchalsup, libre-r...@lists.libre-riscv.org
On Wed, Aug 7, 2019, 22:20 lkcl <luke.l...@gmail.com> wrote:
this tends to suggest that three platform specs are needed:

* Embedded Platform (where it's entirely up to the implementor, as there will be no interaction with public APIs)
* UNIX Platform (which would require strict IEEE754 accuracy, for use in GNU libm, OR repeatable numericalist-acceptable accuracy)
* a *NEW* 3D Platform, where accuracy is defined by strict conformance to a high-profile standard e.g. OpenCL / Vulkan.
That wouldn't quite work for our GPU design: since it's supposed to be both a GPU and a CPU that conforms to the UNIX Platform, it would need to meet the requirements of both the UNIX Platform and the 3D Platform, which would still end up with correct rounding being needed.

lkcl

Aug 8, 2019, 1:36:57 AM
to RISC-V ISA Dev, luke.l...@gmail.com, mitch...@aol.com, libre-r...@lists.libre-riscv.org
yes, sorry: forgot to mention (so worth spelling out explicitly) - hybrid systems intended for multiple purposes obviously need to meet the standard of the highest-accuracy purpose for which they are intended.

likewise for a design doing FP64 as well: even that would need to meet the UNIX Platform spec standard.

adding these three options lets other implementors make the choice.  where interoperability matters (due to distribution of precompiled binaries targeted at multiple independent implementations), requirements have to be stricter.

l. 

Luis Vitorio Cargnini

Aug 8, 2019, 2:28:43 AM
to RISC-V ISA Dev, Mitch...@aol.com, libre-r...@lists.libre-riscv.org, lkcl
Hello, 

My $0.02 of contribution 
Regarding the comment of 3 platforms:

> * Embedded Platform (where it's entirely up to the implementor, as there will be no interaction with public APIs)
No: IEEE. ARM is an embedded platform and they implement IEEE in all of them.
> * UNIX Platform (which would require strict IEEE754 accuracy, for use in GNU libm, OR repeatable numericalist-acceptable accuracy)
Standard IEEE, simple: nothing 'new' in this sector.
> * a *NEW* 3D Platform, where accuracy is defined by strict conformance to a high-profile standard e.g. OpenCL / Vulkan.
No, simply use IEEE, that is all; based on the IEEE standard you can measure the deviation in your system.


No, just adopt IEEE-754: it is a standard, and it is a standard for a reason. Anything outside IEEE-754 does not conform with IEEE, and for that you are on your own. However, you can still demonstrate your deviation from the standard.


Best Regards,
Luis Vitorio Cargnini, Ph.D.
Senior Hardware Engineer
OURS Technology Inc., Santa Clara, California, 95054



Andrew Waterman

Aug 8, 2019, 2:30:38 AM
to MitchAlsup, RISC-V ISA Dev, Libre-RISCV General Development
Hi folks,

We would seem to be putting the cart before the horse.  ISA-level support for correctly rounded transcendentals is speciously attractive, but its utility is not clearly evident and is possibly negative.  It does not make sense to allocate opcode space under these circumstances.

Andrew


Jacob Lifshay

Aug 8, 2019, 2:44:40 AM
to Luis Vitorio Cargnini, RISC-V ISA Dev, Mitchalsup, libre-r...@lists.libre-riscv.org, lkcl
On Wed, Aug 7, 2019, 23:28 Luis Vitorio Cargnini <lvcar...@ours-tech.com> wrote:
Hello, 

My $0.02 of contribution 
Regarding the comment of 3 platforms:
> * Embedded Platform (where it's entirely up to the implementor, as there will be no interaction with public APIs)
No: IEEE. ARM is an embedded platform and they implement IEEE in all of them.
> * UNIX Platform (which would require strict IEEE754 accuracy, for use in GNU libm, OR repeatable numericalist-acceptable accuracy)
Standard IEEE, simple: nothing 'new' in this sector.
> * a *NEW* 3D Platform, where accuracy is defined by strict conformance to a high-profile standard e.g. OpenCL / Vulkan.
No, simply use IEEE, that is all; based on the IEEE standard you can measure the deviation in your system.


No, just adopt IEEE-754: it is a standard, and it is a standard for a reason. Anything outside IEEE-754 does not conform with IEEE, and for that you are on your own. However, you can still demonstrate your deviation from the standard.
Note that IEEE-754 specifies correctly rounded results for all the proposed functions.

lkcl

Aug 8, 2019, 3:09:50 AM
to RISC-V ISA Dev, Mitch...@aol.com, libre-r...@lists.libre-riscv.org, luke.l...@gmail.com
On Thursday, August 8, 2019 at 2:28:43 PM UTC+8, Luis Vitorio Cargnini (OURS/RiVAI) wrote:

> > * Embedded Platform (where it's entirely up to the implementor, as there will be no interaction with public APIs)
> No: IEEE. ARM is an embedded platform and they implement IEEE in all of them.

I can see the sense in that one. I just thought that some 3D implementors, particularly say in specialist markets, would want the choice.

Hmmmm, perhaps a 3D Embedded spec, separate from "just" Embedded.

> > * UNIX Platform (which would require strict IEEE754 accuracy, for use in GNU libm, OR repeatable numericalist-acceptable accuracy)
> Standard IEEE, simple: nothing 'new' in this sector.

Makes sense. Cannot risk non-interoperability, even if it means a higher gate count or larger latency.

> > * a *NEW* 3D Platform, where accuracy is defined by strict conformance to a high-profile standard e.g. OpenCL / Vulkan.
>
> No, simply use IEEE, that is all; based on the IEEE standard you can measure the deviation in your system.

Ok, this is where that's not going to fly. As Mitch mentioned, full IEEE754 compliance would result in doubling the gate count and/or increasing latency through longer pipelines.

In speaking with Jeff Bush from Nyuzi I learned that a GPU is insanely dominated by its FP ALU gate count: well over 50% of the entire chip.

If you double the gate count through imposition of unnecessary accuracy (unnecessary because, thanks to 3D Standards compliance, all the shader algorithms are *designed* around lower accuracy requirements), the proposal will be flat-out rejected by adopters, because products based around it will come with a whopping 100% power-performance penalty compared to industry-standard alternatives.

So this is why I floated (ha ha) the idea of a new Platform Spec, to give implementors the space to meet industry-driven requirements and remain competitive.

Ironically our implementation will need to meet UNIX requirements, it is one of the quirks / penalties of a hybrid design.

L.

Jacob Lifshay

Aug 8, 2019, 3:11:01 AM
to Andrew Waterman, Mitchalsup, RISC-V ISA Dev, Libre-RISCV General Development
On Wed, Aug 7, 2019, 23:30 Andrew Waterman <wate...@eecs.berkeley.edu> wrote:
Hi folks,

We would seem to be putting the cart before the horse.  ISA-level support for correctly rounded transcendentals is speciously attractive, but its utility is not clearly evident and is possibly negative.  It does not make sense to allocate opcode space under these circumstances.

Since there are ways to implement transcendental functions in HW that are faster than anything possible in SW (I think Mitch mentioned a 5-cycle sin implementation), I would argue that having instructions for them is beneficial. Since they would be useful on a large number of different implementations (GPUs, HPC, bigger desktop/server processors), it's worth standardizing the instructions: otherwise the custom opcodes used for them would become effectively standardized (as mentioned by Luke) and no longer useful as custom opcodes on implementations that need fast transcendental functions.

I have no problems ending up with different encodings and/or semantics than currently chosen, as long as that's done early enough and in a public manner, so that we can implement the chosen opcodes without undue delay and without being incompatible with the final spec.

Jacob Lifshay

lkcl

Aug 8, 2019, 3:20:23 AM
to RISC-V ISA Dev, Mitch...@aol.com, libre-r...@lists.libre-riscv.org
On Thursday, August 8, 2019 at 2:30:38 PM UTC+8, waterman wrote:
> Hi folks,
>
>
> We would seem to be putting the cart before the horse.  ISA-level support for correctly rounded transcendentals is speciously attractive, but its utility is not clearly evident and is possibly negative.  It does not make sense to allocate opcode space under these circumstances.

There are definitely alternative (conflicting) directions here which are driven by price and performance in radically different markets.

3D graphics is seriously optimised, in completely different directions from those that drove the IEEE754 standard, and unfortunately it has been left up to secretive proprietary companies to lead, as the NREs kept going up and up, driving out Number Nine and Matrox, with ATI bought by AMD, and so on.

A new Open 3D Alliance initiative is in its early stage of being formed and the plan is to get some feedback from members on what they want, here.

This proposal is therefore part of "planning ahead", and there are *going* to be diverse and highly specialist requirements for which IEEE754 compliance is just not going to fly.... *and* there are going to be adopters for whom IEEE754 is absolutely essential.

Recognising this, by creating separate Platform Specs (specially crafted for 3D implementors that distinguish them from the Embedded and UNIX specs) is, realistically, the pragmatic way forward.

L.

Andrew Waterman

Aug 8, 2019, 3:33:25 AM
to Jacob Lifshay, Mitchalsup, RISC-V ISA Dev, Libre-RISCV General Development
On Thu, Aug 8, 2019 at 12:11 AM Jacob Lifshay <program...@gmail.com> wrote:
On Wed, Aug 7, 2019, 23:30 Andrew Waterman <wate...@eecs.berkeley.edu> wrote:
Hi folks,

We would seem to be putting the cart before the horse.  ISA-level support for correctly rounded transcendentals is speciously attractive, but its utility is not clearly evident and is possibly negative.  It does not make sense to allocate opcode space under these circumstances.

Since there are ways to implement transcendental functions in HW that are faster than anything possible in SW (I think Mitch mentioned a 5-cycle sin implementation), I would argue that having instructions for them is beneficial. Since they would be useful on a large number of different implementations (GPUs, HPC, bigger desktop/server processors), it's worth standardizing the instructions: otherwise the custom opcodes used for them would become effectively standardized (as mentioned by Luke) and no longer useful as custom opcodes on implementations that need fast transcendental functions.

That is not a quantitative approach to computer architecture.  We don't add nontrivial features on the basis that they are useful; we add them on the basis that their measured utility justifies their cost.


I have no problems ending up with different encodings and/or semantics than currently chosen, as long as that's done early enough and in a public manner, so that we can implement the chosen opcodes without undue delay and without being incompatible with the final spec.

Yeah, this is the cart leading the horse.  It's not obvious that the proposed opcodes justify standardization.


Jacob Lifshay

lkcl

Aug 8, 2019, 3:50:17 AM
to RISC-V ISA Dev, wate...@eecs.berkeley.edu, mitch...@aol.com, libre-r...@lists.libre-riscv.org
On Thursday, August 8, 2019 at 3:11:01 PM UTC+8, Jacob Lifshay wrote:
> On Wed, Aug 7, 2019, 23:30 Andrew Waterman <wate...@eecs.berkeley.edu> wrote:
>
> Hi folks,
>
>
> We would seem to be putting the cart before the horse.  ISA-level support for correctly rounded transcendentals is speciously attractive, but its utility is not clearly evident and is possibly negative.  It does not make sense to allocate opcode space under these circumstances.
>
>
>
> Since there are ways to implement transcendental functions in HW that are faster than anything possible in SW (I think Mitch mentioned a 5-cycle sin implementation),

https://patents.google.com/patent/US9471305B2/en

This is really cool. Like CORDIC, it covers a huge range of operations. Mitch described it in the R-Sqrt thread.

> I would argue that having instructions for them is beneficial. Since they would be useful on a large number of different implementations (GPUs, HPC, bigger desktop/server processors), it's worth standardizing the instructions: otherwise the custom opcodes used for them would become effectively standardized (as mentioned by Luke) and no longer useful as custom opcodes on implementations that need fast transcendental functions.

If we were talking about an embedded-only product, or a co-processor whose firmware requires hard-forked or specialist dedicated compilers (like how NVIDIA and AMD do it), we would neither be having this discussion publicly nor putting forward a common Zftrans / Ztrig* spec.

This proposal is for *multiple* use cases *including* hybrid CPU/GPU, low power embedded specialist 3D, *and* standard UNIX (GNU libm).

In talking with Atif from Pixilica a few days ago, he relayed to me the responses he got:

https://www.pixilica.com/forum/event/risc-v-graphical-isa-at-siggraph-2019/p-1/dl-5d4322170924340017bfeeab

The attendance was *50* people at the BoF! He was expecting maybe two or three :) Some 3D engineers were doing transparent polygons, which require checking the hits from both sides: using *proprietary* GPUs they have a 100% performance penalty, as it is a 2-pass operation.

Others have non-standard projection surfaces (spherical, not flat). No *way* proprietary hardware/software is going to cope with that.

Think Silicon has some stringent low power requirements for their embedded GPUs.

Machine Learning has another set of accuracy requirements (way laxer), where Jacob, I think, mentioned that atan in FP16 can be adequately implemented with a single-cycle lookup table (something like that).

OpenCL even has specialist "fast inaccurate" SPIR-V opcodes for some functions (SPIR-V is used by Vulkan and OpenCL; its predecessor SPIR was based on LLVM IR). Search this page for "fast_" for examples:

https://www.khronos.org/registry/spir-v/specs/unified1/OpenCL.ExtendedInstructionSet.100.html

The point is: 3D, ML and OpenCL are *nothing* like the Embedded Platform or UNIX Platform world. Everything that we think we know about how it should be done is completely wrong when it comes to this highly specialist, extremely diverse, and unfortunately secretive market.

>
> I have no problems ending up with different encodings and/or semantics than currently chosen, as long as that's done early enough and in a public manner, so that we can implement the chosen opcodes without undue delay and without being incompatible with the final spec.

Otherwise: the Altivec / SSE vector nightmare, and RISCV is toast.

When we reach the layout milestone, the implementation will be frozen. We are not going to waste our sponsors' money: we have to act responsibly and get it right.

Also, NLNet's funding, once allocated, is gone. We are therefore under time pressure to get the implementation done so that we can put in a second application for the layout.

Bottom line we are not going to wait around, the consequences are too severe (loss of access to funding).

L.


lkcl

Aug 8, 2019, 4:36:49 AM
to RISC-V ISA Dev, program...@gmail.com, mitch...@aol.com, libre-r...@lists.libre-riscv.org
[mostly OT for the thread, but still relevant]

On Thursday, August 8, 2019 at 3:33:25 PM UTC+8, waterman wrote:

> It's not obvious that the proposed opcodes justify standardization.

It's outside of your area of expertise, Andrew. Just as for Luis, all the "metrics" that you use will be screaming "this is wrong, this is wrong!"

Both Jacob and I have Asperger's. In my case, I think in such different conceptual ways that I use language a bit differently, such that it needs "interpretation". Rogier demonstrated that really well a few months back, by "interpreting" something on one of the ISAMUX/ISANS threads.

Think of what I write as being a bit like the old coal mine "canaries". You hear "tweet tweet croak", and you don't understand what the bird said before it became daisy-food but you know to run like hell.

There are several aspects to this proposal. It covers multiple areas - multiple platforms, with different (conflicting) implementation requirements.

It should be obvious that this is not going to fit the "custom" RISCV paradigm, as that's reserved for *private* (hard fork) toolchains and scenarios.

It should also be obvious that, as a public high profile open platform, the pressure on the compiler upstream developers could result in the Altivec SSE nightmare.

The RISCV Foundation has to understand that it is in everybody's best interests to think ahead, strategically on this one, despite it being well outside the core experience of the Founders.

Note, again, worth repeating: it is *NOT* intended or designed for EXCLUSIVE use by the Libre RISCV SoC. It is actually inspired by Pixilica's SIGGRAPH slides, where at the BoF there were over 50 attendees. The diversity of requirements of the attendees was incredible, and they're *really* clear about what they want.

Discussing this proposal as being a single platform is counterproductive and misses the point. It covers *MULTIPLE* platforms.

If you try to undermine or dismiss one area, it does *not* invalidate the other platforms's requirements and needs.

Btw some context, as it may be really confusing as to why we are putting forward a *scalar* proposal when working on a *vector* processor.

SV extends scalar operations. By proposing a mixed multi platform Trans / Trigonometric *scalar* proposal (suitable for multiple platforms other than our own), the Libre RISCV hybrid processor automatically gets vectorised [multi issue] versions of those "scalar" opcodes, "for free".

For a 3D GPU we still have yet to add Texture opcodes, Pixel conversion, Matrices, Z Buffer, Tile Buffer, and many more opcodes. My feeling is that RVV's major opcode brownfield space simply will not cope with all of those, and going to 48 bit and 64 bit is just not an option, particularly for embedded low power scenarios, due to the increased I-cache power penalty.

I am happy for *someone else* to do the work necessary to demonstrate otherwise on that one: we have enough to do already, if we are to keep within budget and on track.

L.

lkcl

Aug 8, 2019, 4:53:32 AM
to RISC-V ISA Dev, Mitch...@aol.com, libre-r...@lists.libre-riscv.org, luke.l...@gmail.com
On Thursday, August 8, 2019 at 2:28:43 PM UTC+8, Luis Vitorio Cargnini (OURS/RiVAI) wrote:

> No, just adopt IEEE-754, it is a standard, it is a standard for a reason. Anything out of IEEE-754, it does not conform with IEEE and for such you are on your own.

Just to emphasise, Luis, Andrew: "on their own" is precisely what each of the proprietary 3D GPU Vendors have done, and made literally billions of dollars by doing so.

Saying "we are on our own" and expecting non-conformance with IEEE754 to kill the proposal: this is false logic.

MALI (ARM), Vivante, the hated PowerVR, NVidia, AMD/ATI, Samsung's new GPU (with Mitch's work in it), and many more, they *all* went "their own way", hid the hardware behind a proprietary library, and *still made billions of dollars*.

This should tell you what you need to know, namely that a new 3D GPU Platform Spec which has specialist FP accuracy requirements to meet the specific needs of this *multi BILLION dollar market* is essential to the proposal's successful industry adoption.

If we restrict it to UNIX (IEEE754) it's dead.

If we restrict it to *not* require IEEE754, it's dead.

The way to meet all the different industry needs: new Platform Specs.

That does not affect the actual opcodes. They remain the same, no matter the Platform accuracy requirements.

Thus the software libraries and compilers all remain the same, as well.

L.

lkcl

Aug 8, 2019, 5:41:13 AM
to RISC-V ISA Dev, Mitch...@aol.com, libre-r...@lists.libre-riscv.org


On Thursday, August 8, 2019 at 7:30:38 AM UTC+1, waterman wrote:
 
ISA-level support for correctly rounded transcendentals is speciously attractive, but its utility is not clearly evident and is possibly negative. 

wait... hang on: there are now *four* potential Platforms against which this statement has to be verified.  are you saying that for a *UNIX* platform, correctly-rounded transcendentals are potentially undesirable?
 
It does not make sense to allocate opcode space under these circumstances.

[reminder and emphasis: there are potentially *four* completely separate and distinct Platforms, all of which share / need these exact same opcodes]

l.

Andrew Waterman

Aug 8, 2019, 6:00:16 AM
to lkcl, RISC-V ISA Dev, MitchAlsup, Libre-RISCV General Development
On Thu, Aug 8, 2019 at 2:41 AM lkcl <luke.l...@gmail.com> wrote:


On Thursday, August 8, 2019 at 7:30:38 AM UTC+1, waterman wrote:
 
ISA-level support for correctly rounded transcendentals is speciously attractive, but its utility is not clearly evident and is possibly negative. 

wait... hang on: there are now *four* potential Platforms against which this statement has to be verified.  are you saying that for a *UNIX* platform, correctly-rounded transcendentals are potentially undesirable?

The sentence you quoted began with the adjective "ISA-level".  We happily provide correctly rounded transcendental math on Linux as-is.

 
It does not make sense to allocate opcode space under these circumstances.

[reminder and emphasis: there are potentially *four* completely separate and distinct Platforms, all of which share / need these exact same opcodes]

l.


Andrew Waterman

Aug 8, 2019, 6:07:21 AM
to lkcl, RISC-V ISA Dev, Jacob Lifshay, MitchAlsup, Libre-RISCV General Development
On Thu, Aug 8, 2019 at 1:36 AM lkcl <luke.l...@gmail.com> wrote:
[mostly OT for the thread, but still relevant]

On Thursday, August 8, 2019 at 3:33:25 PM UTC+8, waterman wrote:

> It's not obvious that the proposed opcodes justify standardization.

It's outside of your area of expertise, Andrew. Just as for Luis, all the "metrics" that you use will be screaming "this is wrong, this is wrong!"

Oh, man.  This is great.  "Andrew: outside his element in computer arithmetic" is right up there with "Krste: most feared man in computer architecture".
 

Both Jacob and I have Asperger's. In my case, I think in such different conceptual ways that I use language a bit differently, such that it needs "interpretation". Rogier demonstrated that really well a few months back, by "interpreting" something on one of the ISAMUX/ISANS threads.

Think of what I write as being a bit like the old coal mine "canaries". You hear "tweet tweet croak", and you don't understand what the bird said before it became daisy-food but you know to run like hell.

There are several aspects to this proposal. It covers multiple areas - multiple platforms, with different (conflicting) implementation requirements.

It should be obvious that this is not going to fit the "custom" RISCV paradigm, as that's reserved for *private* (hard fork) toolchains and scenarios.

It should also be obvious that, as a public high profile open platform, the pressure on the compiler upstream developers could result in the Altivec SSE nightmare.

The RISCV Foundation has to understand that it is in everybody's best interests to think ahead, strategically on this one, despite it being well outside the core experience of the Founders.

Note, again, worth repeating: it is *NOT* intended or designed for EXCLUSIVE use by the Libre RISCV SoC. It is actually inspired by Pixilica's SIGGRAPH slides, where at the BoF there were over 50 attendees. The diversity of requirements of the attendees was incredible, and they're *really* clear about what they want.

Discussing this proposal as being a single platform is counterproductive and misses the point. It covers *MULTIPLE* platforms.

If you try to undermine or dismiss one area, it does *not* invalidate the other platforms's requirements and needs.

Btw some context, as it may be really confusing as to why we are putting forward a *scalar* proposal when working on a *vector* processor.

SV extends scalar operations. By proposing a mixed multi platform Trans / Trigonometric *scalar* proposal (suitable for multiple platforms other than our own), the Libre RISCV hybrid processor automatically gets vectorised [multi issue] versions of those "scalar" opcodes, "for free".

For a 3D GPU we still have yet to add Texture opcodes, Pixel conversion, Matrices, Z Buffer, Tile Buffer, and many more opcodes.  My feeling is that RVV's major opcode brownfield space simply will not cope with all of those, and going to 48 bit and 64 bit is just not an option, particularly for embedded low power scenarios, due to the increased I-cache power penalty.

I am happy for *someone else* to do the work necessary to demonstrate otherwise on that one: we have enough to do already, if we are to keep within budget and on track.

L.


Jacob Lifshay

Aug 8, 2019, 6:09:28 AM
to Luke Kenneth Casson Leighton, RISC-V ISA Dev, Mitchalsup, libre-r...@lists.libre-riscv.org
On Thu, Aug 8, 2019, 02:41 lkcl <luke.l...@gmail.com> wrote:


On Thursday, August 8, 2019 at 7:30:38 AM UTC+1, waterman wrote:
 
ISA-level support for correctly rounded transcendentals is speciously attractive, but its utility is not clearly evident and is possibly negative. 

wait... hang on: there are now *four* potential Platforms against which this statement has to be verified.  are you saying that for a *UNIX* platform, correctly-rounded transcendentals are potentially undesirable?

maybe a solution would be to add an extra field to the fp control csr (or isamux?) to allow selecting one of several accurate or fast modes (a rough encoding sketch follows at the end of this message):
- machine-learning-mode: fast as possible
    -- maybe need additional requirements such as monotonicity for atanh?
- GPU-mode: accurate to within a few ULP
    -- see Vulkan, OpenGL, and OpenCL specs for accuracy guidelines
- almost-accurate-mode: accurate to <1 ULP
     (would 0.51 or some other value be better?)
- fully-accurate-mode: correctly rounded in all cases
- maybe more modes?

all modes are required to produce deterministic answers (no random outputs for the same input), depending only on the input values, the mode, and the fp control reg.

the unsupported modes would cause a trap to allow emulation where traps are supported. emulation of unsupported modes would be required for unix platforms.

there would be a mechanism for user mode code to detect which modes are emulated (csr? syscall?) (if the supervisor decides to make the emulation visible) that would allow user code to switch to faster software implementations if it chooses to.
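[To make the idea concrete, a sketch of what such a field could look like. Everything here is hypothetical: the field name, its width, its position above frm in fcsr, and the encodings are all invented for illustration:

    /* Hypothetical accuracy-mode field for transcendental instructions. */
    enum ftrans_acc_mode {
        FTRANS_ACC_FAST  = 0,  /* machine learning: as fast as possible     */
        FTRANS_ACC_GPU   = 1,  /* a few ULP, per Vulkan/OpenGL/OpenCL specs */
        FTRANS_ACC_NEAR  = 2,  /* < 1 ULP (or 0.51? see above)              */
        FTRANS_ACC_EXACT = 3,  /* correctly rounded in all cases            */
    };

    /* e.g. packed into two currently-reserved fcsr bits above frm (7:5). */
    static inline unsigned int fcsr_set_acc_mode(unsigned int fcsr,
                                                 enum ftrans_acc_mode mode)
    {
        return (fcsr & ~(3u << 8)) | ((unsigned int)mode << 8);
    }
]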

Jacob Lifshay

lkcl

Aug 8, 2019, 7:17:41 AM
to RISC-V ISA Dev, luke.l...@gmail.com, program...@gmail.com, mitch...@aol.com, libre-r...@lists.libre-riscv.org


On Thursday, August 8, 2019 at 11:07:21 AM UTC+1, andrew wrote:


On Thu, Aug 8, 2019 at 1:36 AM lkcl <luke.l...@gmail.com> wrote:
[mostly OT for the thread, but still relevant]

On Thursday, August 8, 2019 at 3:33:25 PM UTC+8, waterman wrote:

> It's not obvious that the proposed opcodes justify standardization.

It's outside of your area of expertise, Andrew. Just as for Luis, all the "metrics" that you use will be screaming "this is wrong, this is wrong!"

Oh, man.  This is great.  "Andrew: outside his element in computer arithmetic" is right up there with "Krste: most feared man in computer architecture".

bam bam... baaaaa :)

yes, i realised about half an hour later that we may have been speaking at cross-purposes, due to a misunderstanding: there are four separate potential Platforms here, covering each of the specialist areas.  very few people have *3D* optimisation experience [we're lucky to have Mitch involved].

sorry for the misunderstanding, Andrew.  follow-up question (already posted) seeks clarification.

l.

lkcl

Aug 8, 2019, 7:25:45 AM
to RISC-V ISA Dev, luke.l...@gmail.com, mitch...@aol.com, libre-r...@lists.libre-riscv.org


On Thursday, August 8, 2019 at 11:09:28 AM UTC+1, Jacob Lifshay wrote:

maybe a solution would be to add an extra field to the fp control csr (or isamux?) to allow selecting one of several accurate or fast modes:

*thinks*... *puzzled*... hardware can't be changed, so you'd need to pre-allocate the gates to cope with e.g. UNIX Platform spec (libm interoperability), so why would you need a CSR to switch "modes"?

ah, ok, i think i got it, and it's [potentially] down to the way we're designing the ALU, to enter "recycling" of data through the pipeline to give better accuracy.

are you suggesting that implementors be permitted to *dynamically* alter the accuracy of the results that their hardware produces, in order to comply with *more than one* of the [four so far] proposed Platform Specs, *at runtime*?

thus, for example, our hardware would (purely as an example) be optimised to produce OpenCL-compliant results during "3D GPU Platform mode", and as such would need less gates to do so.  HOWEVER, for when that exact same hardware was used in the GNU libm library, it would set "UNIX Platform FP hardware mode", and consequently produce results that were accurate to UNIX Platform requirements (whatever was decided - IEEE754, 0.5 ULP precision, etc. etc. whatever it was).

in this "more accurate" mode, the latency would be increased... *and we wouldn't care* [other implementors might], because it's not performance-critical: the switch is just to get "compliance".

that would allow us to remain price-performance-watt competitive with other GPUs, yet also meet UNIX Platform requirements.

something like that?

l.

Jacob Lifshay

Aug 8, 2019, 7:47:38 AM
to Luke Kenneth Casson Leighton, RISC-V ISA Dev, Mitchalsup, libre-r...@lists.libre-riscv.org
On Thu, Aug 8, 2019, 04:25 lkcl <luke.l...@gmail.com> wrote:
On Thursday, August 8, 2019 at 11:09:28 AM UTC+1, Jacob Lifshay wrote:

maybe a solution would be to add an extra field to the fp control csr (or isamux?) to allow selecting one of several accurate or fast modes:

*thinks*... *puzzled*... hardware can't be changed, so you'd need to pre-allocate the gates to cope with e.g. UNIX Platform spec (libm interoperability), so why would you need a CSR to switch "modes"?

ah, ok, i think i got it, and it's [potentially] down to the way we're designing the ALU, to enter "recycling" of data through the pipeline to give better accuracy.

are you suggesting that implementors be permitted to *dynamically* alter the accuracy of the results that their hardware produces, in order to comply with *more than one* of the [four so far] proposed Platform Specs, *at runtime*?
yes.

also, having explicit mode bits allows emulating more accurate operations when the HW doesn't actually implement the extra gates needed.
This allows greater software portability: a libm call can be converted into a single instruction without requiring HW that implements the required accuracy.

thus, for example, our hardware would (purely as an example) be optimised to produce OpenCL-compliant results during "3D GPU Platform mode", and as such would need less gates to do so.  HOWEVER, for when that exact same hardware was used in the GNU libm library, it would set "UNIX Platform FP hardware mode", and consequently produce results that were accurate to UNIX Platform requirements (whatever was decided - IEEE754, 0.5 ULP precision, etc. etc. whatever it was).

in this "more accurate" mode, the latency would be increased... *and we wouldn't care* [other implementors might], because it's not performance-critical: the switch is just to get "compliance".

that would allow us to remain price-performance-watt competitive with other GPUs, yet also meet UNIX Platform requirements.

something like that?
yup.

I do think that there should be an exact-rounding mode even if the UNIX platform doesn't require that much accuracy; otherwise, HPC implementations (or others who need exact rounding) will run into the same dilemma of needing more instruction encodings again.

Jacob

Jacob Lifshay

Aug 8, 2019, 7:56:15 AM
to Luke Kenneth Casson Leighton, RISC-V ISA Dev, Mitchalsup, libre-r...@lists.libre-riscv.org
On Thu, Aug 8, 2019, 03:09 Jacob Lifshay <program...@gmail.com> wrote:
maybe a solution would be to add an extra field to the fp control csr (or isamux?) to allow selecting one of several accurate or fast modes:
- machine-learning-mode: fast as possible
    -- maybe need additional requirements such as monotonicity for atanh?
- GPU-mode: accurate to within a few ULP
    -- see Vulkan, OpenGL, and OpenCL specs for accuracy guidelines
- almost-accurate-mode: accurate to <1 ULP
     (would 0.51 or some other value be better?)
- fully-accurate-mode: correctly rounded in all cases
- maybe more modes?