FP transcendentals (trigonometry, root/exp/log) proposal


lkcl

Aug 6, 2019, 8:33:46 AM
to RISC-V ISA Dev, Libre-RISCV General Development
https://libre-riscv.org/ztrans_proposal/

As part of developing a Libre GPU that is intended for 3D, specialist Compute and Machine Learning, standard operations used in OpenCL are pretty much mandatory [1].

As they will end up in common public usage - upstream compilers with high volumes of downloads - it does not make sense for these opcodes to be relegated to "custom" status ["custom" status is suitable only for embedded proprietary usage that will never see the public light of day].

Also, they are not being proposed as part of RVV for the simple reason that as "scalar" opcodes, they can be used with *scalar* designs. It makes more sense that they be deployable in "embedded" designs (that may not have room for RVV, particularly as CORDIC seems to cover the vast majority of trigonometric algorithms and more [2]), or in NUMA parallel designs, where a cluster of shaders makes up for a lack of "vectorisation".

In addition: as scalar opcodes, they fit into the (huge, sparsely populated) FP opcode brownfield, whereas the RVV major opcode space is under much more pressure.

The list of opcodes is at an early stage, and participation in its development is open and welcome to anyone involved in 3D and OpenCL Compute applications.

Context, research, links and discussion are being tracked on the libre riscv bugtracker [3].

L.

[1] https://www.khronos.org/registry/spir-v/specs/unified1/OpenCL.ExtendedInstructionSet.100.html
[2] http://www.andraka.com/files/crdcsrvy.pdf
[3] http://bugs.libre-riscv.org/show_bug.cgi?id=127

MitchAlsup

Aug 7, 2019, 6:36:17 PM
to RISC-V ISA Dev, libre-r...@lists.libre-riscv.org
Is this proposal going to <eventually> include::

a) statement on required/delivered numeric accuracy per transcendental ?
b) a reserve on the OpCode space for the double precision equivalents ?
c) a statement on <approximate> execution time ?

You may have more transcendentals than necessary::
1) for example all of the inverse hyperbolic can be calculated to GRAPHICs numeric quality with short sequences of already existing transcendentals
..... ASINH( x ) = ln( x + SQRT(x**2+1) )

2) LOG(x) = LOGP1(x - 1.0)
... EXP(x) = EXPM1(x) + 1.0

That is:: LOGP1 and EXPM1 provide greater precision (especially when the result is near zero) than their sister functions, and the compiler can easily add the additional instruction to the instruction stream where appropriate.
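
As a quick numerical sanity-check of these identities (a sketch in Python, with the standard math library standing in for the proposed opcodes; the printed values are what a typical IEEE 754 double implementation gives):

import math

x = 0.5

# inverse hyperbolic sine built from already-existing transcendentals
print(math.log(x + math.sqrt(x*x + 1.0)), math.asinh(x))   # both ~0.481211825...

# LOG/EXP recovered from LOGP1/EXPM1 with one extra add/subtract
print(math.log1p(x - 1.0), math.log(x))    # log(x) = log1p(x - 1)
print(math.expm1(x) + 1.0, math.exp(x))    # exp(x) = expm1(x) + 1

# why the *P1/*M1 forms matter near zero: log(1 + eps) loses most of its digits
eps = 1e-12
print(math.log(1.0 + eps))   # ~1.0000889e-12: relative error ~1e-4
print(math.log1p(eps))       # ~9.999999999995e-13: near-full precision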

Jacob Lifshay

Aug 7, 2019, 7:43:21 PM
to MitchAlsup, RISC-V ISA Dev, Libre-RISCV General Development
On Wed, Aug 7, 2019, 15:36 'MitchAlsup' via RISC-V ISA Dev <isa...@groups.riscv.org> wrote:
Is this proposal going to <eventually> include::

a) statement on required/delivered numeric accuracy per transcendental ?
From what I understand, they require correctly rounded results. We should eventually state that somewhere. The requirement for correctly rounded results is so the instructions can replace the corresponding functions in libm (they're not just for GPUs) and for reproducibility across implementations.

b) a reserve on the OpCode space for the double precision equivalents ?
the 2 bits right below the funct5 field select from:
00: f32
01: f64
10: f16
11: f128

so f64 is definitely included.

see table 11.3 in Volume I: RISC-V Unprivileged ISA V20190608-Base-Ratified
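
(For reference, a sketch of how that 2-bit fmt field would decode, assuming the same R-type layout as the existing FP opcodes, i.e. funct5 in bits 31:27 and fmt in bits 26:25; the field positions are an assumption here, not part of the proposal text:)

def decode_fp_fmt(insn):
    # insn is the 32-bit instruction word; fmt sits directly below funct5
    fmt = (insn >> 25) & 0b11
    return {0b00: "f32", 0b01: "f64", 0b10: "f16", 0b11: "f128"}[fmt]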

it would probably be a good idea to split the transcendental extensions into separate f32, f64, f16, and f128 extensions, since some implementations may want to only implement them for f32 while still implementing the D (f64 arithmetic) extension.

c) a statement on <approximate> execution time ?
that would be microarchitecture specific. Since this is supposed to be an inter-vendor (I can't recall the right term) specification, that would be up to the implementers. I would assume that they are at least faster than a soft-float implementation (since that's usually the whole point of implementing them).

For our implementation, I'd imagine something between 8 and 40 clock cycles for most of the operations. sin, cos, and tan (but not sinpi and friends) may require much more than that for large inputs for range reduction to accurately calculate x mod 2*pi, hence why we are thinking of implementing sinpi, cospi, and tanpi instead (since they require calculating x mod 2, which is much faster and simpler).
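
As an illustration of why the range reduction is so much simpler for sinpi (a sketch, with sinpi(x) defined as sin(pi * x), and math.sin standing in for the actual polynomial/CORDIC core):

import math

def sinpi(x):
    # reduction is just x mod 2: exact for every finite double, however large,
    # because fmod by 2.0 introduces no rounding error
    r = math.fmod(x, 2.0)
    return math.sin(math.pi * r)

# contrast: sin(x) for large x needs x mod 2*pi, and since pi is irrational the
# reduction needs hundreds of bits of 2/pi (Payne-Hanek) to get the remainder right
print(sinpi(0.5), sinpi(1e15 + 0.5))   # both ~1.0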

You may have more transcendentals than necessary::
1) for example all of the inverse hyperbolic can be calculated to GRAPHICs numeric quality with short sequences of already existing transcendentals
..... ASINH( x ) = ln( x + SQRT(x**2+1) )
That's why the hyperbolics extension is split out into a separate extension. Also, a single instruction may be much faster, since it can calculate it all as one operation (CORDIC will work) rather than requiring several slow operations (sqrt/div and log).

2) LOG(x) = LOGP1(x - 1.0)
... EXP(x) = EXPM1(x) + 1.0

That is:: LOGP1 and EXPM1 provide greater precision (especially when the result is near zero) than their sister functions, and the compiler can easily add the additional instruction to the instruction stream where appropriate.
for the implementation techniques I know of for log/exp, implementing both log/exp and logp1/expm1 is only a slight increase in complexity compared to implementing just one or the other (it amounts to changing constants in polynomial/LUT-based implementations and in CORDIC). I think having log/exp is worth the extra instructions for the common case of implementing pow (where you need log/exp), and logp1/expm1 is not worth getting rid of given the small additional cost and the additional accuracy obtained.
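
For context on the pow point, the usual decomposition (a sketch only; a real libm pow carries extra precision through the log step to keep the final error down):

import math

def pow_via_exp_log(x, y):
    # this is why LOG/EXP (rather than only LOGP1/EXPM1) help the common case:
    # pow sits directly on top of them
    return math.exp(y * math.log(x))

print(pow_via_exp_log(2.0, 10.0), 2.0 ** 10.0)   # ~1024.0 (not guaranteed correctly rounded)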

Jacob Lifshay

lkcl

Aug 7, 2019, 8:27:08 PM
to RISC-V ISA Dev, libre-r...@lists.libre-riscv.org
[some overlap with what jacob wrote, reviewing/removing redundant replies]


On Wednesday, August 7, 2019 at 11:36:17 PM UTC+1, MitchAlsup wrote:
Is this proposal going to <eventually> include:: 
a) statement on required/delivered numeric accuracy per transcendental ?


jacob makes and emphasises the point that these are intended to be *scalar* operations, for direct use in libm.

b) a reserve on the OpCode space for the double precision equivalents ?


reservations - even where the case has been made clear that the lack of a reservation will cause severe, ongoing detriment to the wider RISC-V community - do not have an IANA-style contact/proposal procedure.  i've repeatedly requested an official reservation, for this and many other proposals.

i have not received a response.

Jacob wrote:

> it would probably be a good idea to split the transcendental extensions
> into separate f32, f64, f16, and f128 extensions, since some implementations 
> may want to only implement them for f32 while still implementing the D
> (f64 arithmetic) extension.

oh, of course. Ztrans.F/Q/S/H is a really good point.

c) a statement on <approximate> execution time ?

what jacob said.

as a Standard, we can't limit the proposal in ways that would restrict or exclude implementors.  accuracy on the other hand *is* important, because it could potentially cause catastrophic failures if an algorithm is written to critically rely on a given accuracy.

You may have more transcendentals than necessary::
1) for example all of the inverse hyperbolic can be calculated to GRAPHICs numeric quality with short sequences of already existing transcendentals
..... ASINH( x ) = ln( x + SQRT(x**2+1) )


ah, excellent - i'll add that recipe to the document.   Zfhyp, separate extension.

2) LOG(x) = LOGP1(x - 1.0)
... EXP(x) = EXPM1(x) + 1.0

That is:: LOGP1 and EXPM1 provide greater precision (especially when the result is near zero) than their sister functions, and the compiler can easily add the additional instruction to the instruction stream where appropriate.

oo that's very interesting.   of course.  i like it.

the only thing: as a Standard, some implementors may find it more efficient to implement LOG than LOGP1 (likewise with exp).  in particular, if CORDIC is used (which i have just recently found, and am absolutely amazed by - https://en.wikipedia.org/wiki/CORDIC) i cannot find a LOGP1/EXPM1 version of that.

CORDIC would be the most sensible "efficient" choice of hardware algorithm, simply because of the sheer overwhelming number of transcendentals that it covers.  if there isn't a way to implement LOGP1 using CORDIC, and one but not the other is chosen, some implementation options will be limited / penalised.

this is one of the really tricky things about Standards.  if we were doing a single implementation, not intended in any way to be Standards-compliant, we could make the decision, best optimised option, according to our requirements, and to hell with everyone else.  take that approach with a Standard, and it results in... other teams creating their own Standard.

having two near-identical opcodes where one may be implemented in terms of the other is however rather unfortunately against the principle of RISC.  in this particular case, though, the hardware implementation actually matters.

does anyone know if CORDIC can be adapted to do LOGP1 as well as LOG?  ha, funny, i found this:

unfortunately, the original dr dobbs article, which has "example 4(d)" as a hyperlink, redirects to a 404 not found.

l.

MitchAlsup

Aug 7, 2019, 8:29:29 PM
to RISC-V ISA Dev, Mitch...@aol.com, libre-r...@lists.libre-riscv.org


On Wednesday, August 7, 2019 at 6:43:21 PM UTC-5, Jacob Lifshay wrote:
On Wed, Aug 7, 2019, 15:36 'MitchAlsup' via RISC-V ISA Dev <isa...@groups.riscv.org> wrote:
Is this proposal going to <eventually> include::

a) statement on required/delivered numeric accuracy per transcendental ?
From what I understand, they require correctly rounded results. We should eventually state that somewhere. The requirement for correctly rounded results is so the instructions can replace the corresponding functions in libm (they're not just for GPUs) and for reproducibility across implementations.

Correctly rounded results will require a lot more difficult hardware and more cycles of execution.
Standard GPUs today use 1-2 bits ULP for simple transcendentals and 3-4 bits for some of the harder functions.
Standard GPUs today are producing fully pipelined results with 5 cycle latency for F32 (with 1-4 bits of imprecision)
Based on my knowledge of the situation, requiring IEEE 754 correct rounding will double the area of the transcendental unit, triple the area used for coefficients, and come close to doubling the latency.

b) a reserve on the OpCode space for the double precision equivalents ?
the 2 bits right below the funct5 field select from:
00: f32
01: f64
10: f16
11: f128

so f64 is definitely included.

see table 11.3 in Volume I: RISC-V Unprivileged ISA V20190608-Base-Ratified

it would probably be a good idea to split the transcendental extensions into separate f32, f64, f16, and f128 extensions, since some implementations may want to only implement them for f32 while still implementing the D (f64 arithmetic) extension.

c) a statement on <approximate> execution time ?
that would be microarchitecture specific. Since this is supposed to be an inter-vendor (I can't recall the right term) specification, that would be up to the implementers. I would assume that they are at least faster than a soft-float implementation (since that's usually the whole point of implementing them).

For our implementation, I'd imagine something between 8 and 40 clock cycles for most of the operations. sin, cos, and tan (but not sinpi and friends) may require much more than that for large inputs for range reduction to accurately calculate x mod 2*pi, hence why we are thinking of implementing sinpi, cospi, and tanpi instead (since they require calculating x mod 2, which is much faster and simpler).

I can point you at (and have) the technology to perform most of these to the accuracy stated above in 5 cycles F32.

I have the technology to perform LN2P1, EXP1M in 14 cycles, SIN, COS including argument reduction in 19 cycles, and POW in 34 cycles, while achieving "faithful rounding" of the result in any of the IEEE 754-2008 rounding modes, and using a floating point unit essentially the same size as an FMAC unit that can also do FDIV and FSQRT. SIN and COS have full Payne and Hanek argument reduction, which costs 4 cycles and allows "silly" arguments to be properly processed:: COS( 6381956970095103×2^797 ) = -4.68716592425462761112×10^-19

Faithful rounding is not IEEE 754 correct. The unit I have designed makes an IEEE rounding error about once every 171 calculations.

MitchAlsup

Aug 7, 2019, 8:32:57 PM
to RISC-V ISA Dev, libre-r...@lists.libre-riscv.org
Both Motorola CORDIC and Intel CORDIC specified the LOGP1 and EXP1M instead of LOG and EXP. 

lkcl

Aug 7, 2019, 8:45:23 PM
to RISC-V ISA Dev, libre-r...@lists.libre-riscv.org
i think i managed to interpret the paper, below - it tends to suggest that LOG is possible with the standard hyperbolic CORDIC.  the thing is: the add 1 is done *outside* the LOG(a), which tends to suggest that the iterative algorithm needs modifying...

... unless it's as simple as setting Z0=1

does that look reasonable?

[i really don't like deriving algorithms like this from scratch: someone somewhere has done this, it's so ubiquitous.  i'd be much happier - much more comfortable - when i can see (and execute) a software algorithm that shows how it's done.]

---

https://www.researchgate.net/publication/230668515_A_fixed-point_implementation_of_the_natural_logarithm_based_on_a_expanded_hyperbolic_CORDIC_algorithm

Since: ln(a) = 2 * tanh^-1( (a-1) / (a+1) )

The function ln(a) is obtained by multiplying by 2 the final result ZN (Equation (4)), provided that Z0 = 0, X0 = a+1, and Y0 = a-1.
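
A small software model of that recipe (hyperbolic CORDIC in vectoring mode, with the usual repeated iterations at i=4 and i=13 for convergence; this is only a sketch to check the maths, not a hardware design):

import math

def cordic_ln(a, iters=30):
    # vectoring-mode hyperbolic CORDIC: drives y towards 0, accumulating
    # atanh(y0/x0) in z.  With x0 = a+1, y0 = a-1, z0 = 0 the accumulated
    # angle is atanh((a-1)/(a+1)) = ln(a)/2, so the result is 2*z.
    # Converges for a roughly in [0.11, 9.4] (pre-scaling extends the range).
    x, y, z = a + 1.0, a - 1.0, 0.0
    i, repeated = 1, set()
    while i <= iters:
        e = 2.0 ** -i
        d = 1.0 if y < 0.0 else -1.0
        x, y, z = x + d * y * e, y + d * x * e, z - d * math.atanh(e)
        if i in (4, 13) and i not in repeated:
            repeated.add(i)          # iterations 4 and 13 are executed twice
        else:
            i += 1
    return 2.0 * z

# LOGP1 would use the same loop with x0 = a+2, y0 = a, since
# ln(1+a) = 2*atanh(a/(a+2)) -- which suggests CORDIC can handle LOGP1 directly.
print(cordic_ln(2.0), math.log(2.0))   # both ~0.693147 (CORDIC converges ~1 bit/iteration)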

 

lkcl

Aug 7, 2019, 8:57:38 PM
to RISC-V ISA Dev, Mitch...@aol.com, libre-r...@lists.libre-riscv.org


On Thursday, August 8, 2019 at 1:29:29 AM UTC+1, MitchAlsup wrote:


On Wednesday, August 7, 2019 at 6:43:21 PM UTC-5, Jacob Lifshay wrote:
On Wed, Aug 7, 2019, 15:36 'MitchAlsup' via RISC-V ISA Dev <isa...@groups.riscv.org> wrote:
Is this proposal going to <eventually> include::

a) statement on required/delivered numeric accuracy per transcendental ?
From what I understand, they require correctly rounded results. We should eventually state that somewhere. The requirement for correctly rounded results is so the instructions can replace the corresponding functions in libm (they're not just for GPUs) and for reproducibility across implementations.

Correctly rounded results will require a lot more difficult hardware and more cycles of execution.
Standard GPUs today use 1-2 bits ULP for simple transcendentals and 3-4 bits for some of the harder functions.
Standard GPUs today are producing fully pipelined results with 5 cycle latency for F32 (with 1-4 bits of imprecision)
Based on my knowledge of the situation, requiring IEEE 754 correct rounding will double the area of the transcendental unit, triple the area used for coefficients, and come close to doubling the latency.

hmmm... i don't know what to suggest / recommend here.  there are two separate sets of requirements: accuracy (OpenCL, numerical scenarios), and 3D GPUs, where better accuracy is not essential.

i would be tempted to say that it was reasonable to suggest that if you're going to use FP32, expectations are lower so "what the heck".  however i have absolutely *no* idea what the industry consensus is, here.

i do know that you've an enormous amount of expertise and experience in the 3D GPU area, Mitch.

I can point you at (and have) the technology to perform most of these to the accuracy stated above in 5 cycles F32.

I have the technology to perform LN2P1, EXP1M in 14 cycles, SIN, COS including argument reduction in 19 cycles, and POW in 34 cycles, while achieving "faithful rounding" of the result in any of the IEEE 754-2008 rounding modes, and using a floating point unit essentially the same size as an FMAC unit that can also do FDIV and FSQRT. SIN and COS have full Payne and Hanek argument reduction, which costs 4 cycles and allows "silly" arguments to be properly processed:: COS( 6381956970095103×2^797 ) = -4.68716592425462761112×10^-19

yes please.  

there will be other implementors of this Standard that will want to make a different call on which direction to go.

l.

MitchAlsup

Aug 7, 2019, 9:17:37 PM
to RISC-V ISA Dev, Mitch...@aol.com, libre-r...@lists.libre-riscv.org
An old guy at IBM (a Fellow) made a long and impassioned plea in a paper from the late 1970s or early 1980s that whenever something is put "into the instruction set" that the result be as accurate as possible. Look it up, it's a good read.

At the time I was working for a mini-computer company where a new implementation was not giving binary accurate results compared to an older generation. This was traced to an "enhancement" in the F32 and F64 accuracy from the new implementation. To a customer, they all wanted binary equivalence, even if the math was worse.

On the other hand, back when I started doing this (CPU design), the guys using floating point just wanted speed, and they were willing to put up with not only IBM floating point (hex normalization, and guard digit) but even CRAY floating point (CDC 6600, CDC 7600, CRAY 1), which was demonstrably WORSE in the numerics department.

In any event; to all but 5 floating point guys in the world, a rounding error (compared to the correctly rounded result) occurring less often than 3% of the time and no more than 1 ULP is as accurate as they need (caveat: so long as the arithmetic is repeatable). As witness, the FDIV <lack of> instruction in ITANIC had 0.502 ULP accuracy (Markstein) and nobody complained.

My gut feeling tells me that the numericalists are perfectly willing to accept an error of 0.51 ULP RMS on transcendental calculations.
My gut feeling tells me that the numericalists are not willing to accept an error of 0.75 ULP RMS on transcendental calculations.
I have no feeling at all on where to draw the line.
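
For anyone wanting to reproduce numbers like these, a sketch of how the error is usually measured, against an extended-precision reference (mpmath is used here purely as a convenient reference and is an assumption, not anything from the proposal; exact ULP conventions vary slightly between papers):

import math
import mpmath

mpmath.mp.prec = 100            # well beyond double precision

def ulp_error(computed, reference):
    # |computed - reference| expressed in units of the ULP of the reference,
    # one common convention for reporting transcendental accuracy
    return float(abs(mpmath.mpf(computed) - reference) / math.ulp(float(reference)))

x = 0.7
print(ulp_error(math.sin(x), mpmath.sin(x)))   # typically well under 1.0 for a good libm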

lkcl

Aug 8, 2019, 1:20:03 AM
to RISC-V ISA Dev, Mitch...@aol.com, libre-r...@lists.libre-riscv.org


On Thursday, August 8, 2019 at 2:17:37 AM UTC+1, MitchAlsup wrote:
An old guy at IBM (a Fellow) made a long and impassioned plea in a paper from the late 1970s or early 1980s that whenever something is put "into the instruction set" that the result be as accurate as possible. Look it up, it's a good read.

At the time I was working for a mini-computer company where a new implementation was not giving binary accurate results compared to an older generation. This was traced to an "enhancement" in the F32 and F64 accuracy from the new implementation. To a customer, they all wanted binary equivalence, even if the math was worse.

someone on the libre-riscv-dev list just hilariously pointed out that the Amdahl IBM370-compatible hardware had FP that was *more* accurate than the real 370: customers *complained*, and Amdahl had to provide libraries that *de-accurified* the FP calculations :)

My gut feeling tells me that the numericalists are perfectly willing to accept an error of 0.51 ULP RMS on transcendental calculations.
My gut feeling tells me that the numericalists are not willing to accept an error of 0.75 ULP RMS on transcendental calculations.
I have no feeling at all on where to draw the line.

this tends to suggest that three platform specs are needed:

* Embedded Platform (where it's entirely up to the implementor, as there will be no interaction with public APIs)
* UNIX Platform (which would require strict IEEE754 accuracy, for use in GNU libm, OR repeatable numericalist-acceptable accuracy)
* a *NEW* 3D Platform, where accuracy is defined by strict conformance to a high-profile standard e.g. OpenCL / Vulkan.

l.

Jacob Lifshay

Aug 8, 2019, 1:30:11 AM
to Luke Kenneth Casson Leighton, RISC-V ISA Dev, Mitchalsup, libre-r...@lists.libre-riscv.org
On Wed, Aug 7, 2019, 22:20 lkcl <luke.l...@gmail.com> wrote:
this tends to suggest that three platform specs are needed:

* Embedded Platform (where it's entirely up to the implementor, as there will be no interaction with public APIs)
* UNIX Platform (which would require strict IEEE754 accuracy, for use in GNU libm, OR repeatable numericalist-acceptable accuracy)
* a *NEW* 3D Platform, where accuracy is defined by strict conformance to a high-profile standard e.g. OpenCL / Vulkan.
That wouldn't quite work for our GPU design: since it's supposed to be both a GPU and a CPU that conforms to the UNIX Platform, it would need to meet the requirements of both the UNIX Platform and the 3D Platform, which would still end up requiring correct rounding.

lkcl

Aug 8, 2019, 1:36:57 AM
to RISC-V ISA Dev, luke.l...@gmail.com, mitch...@aol.com, libre-r...@lists.libre-riscv.org
yes, sorry: forgot to mention (so worth spelling out explicitly) - hybrid systems intended for multiple purposes would obviously need to meet the standard of the highest-accuracy purpose for which they are intended.

although doing FP64 as well, even that would need to meet the UNIX Platform spec standard.

adding these three options is to let other implementors make the choice.  where interoperability matters (due to distribution of precompiled binaries targeted at multiple independent implementations), requirements have to be stricter.

l. 

Luis Vitorio Cargnini

Aug 8, 2019, 2:28:43 AM
to RISC-V ISA Dev, Mitch...@aol.com, libre-r...@lists.libre-riscv.org, lkcl
Hello, 

My $0.02 of contribution 
Regarding the comment of 3 platforms:

> * Embedded Platform (where it's entirely up to the implementor, as there will be no interaction with public APIs)
No, IEEE, ARM is an embedded platform and they implement IEEE in all of them
> * UNIX Platform (which would require strict IEEE754 accuracy, for use in GNU libm, OR repeatable numericalist-acceptable accuracy)
Standard IEEE, simple no 'new' on this sector.
> * a *NEW* 3D Platform, where accuracy is defined by strict conformance to a high-profile standard e.g. OpenCL / Vulkan.
No, simply use IEEE, that is all, and based on the IEEE standard you can measure the deviation in your system.


No, just adopt IEEE-754: it is a standard, and it is a standard for a reason. Anything outside of IEEE-754 does not conform with IEEE, and for that you are on your own. However, you can still demonstrate your deviation from the standard.


Best Regards,
Luis Vitorio Cargnini, Ph.D.
Senior Hardware Engineer
OURS Technology Inc., Santa Clara, California, 95054



Andrew Waterman

Aug 8, 2019, 2:30:38 AM
to MitchAlsup, RISC-V ISA Dev, Libre-RISCV General Development
Hi folks,

We would seem to be putting the cart before the horse.  ISA-level support for correctly rounded transcendentals is speciously attractive, but its utility is not clearly evident and is possibly negative.  It does not make sense to allocate opcode space under these circumstances.

Andrew


Jacob Lifshay

Aug 8, 2019, 2:44:40 AM
to Luis Vitorio Cargnini, RISC-V ISA Dev, Mitchalsup, libre-r...@lists.libre-riscv.org, lkcl
On Wed, Aug 7, 2019, 23:28 Luis Vitorio Cargnini <lvcar...@ours-tech.com> wrote:
Hello, 

My $0.02 of contribution 
Regarding the comment of 3 platforms:
> * Embedded Platform (where it's entirely up to the implementor, as there will be no interaction with public APIs)
No, IEEE, ARM is an embedded platform and they implement IEEE in all of them
> * UNIX Platform (which would require strict IEEE754 accuracy, for use in GNU libm, OR repeatable numericalist-acceptable accuracy)
Standard IEEE, simple no 'new' on this sector.
> * a *NEW* 3D Platform, where accuracy is defined by strict conformance to a high-profile standard e.g. OpenCL / Vulkan.
No, simply use IEEE that it is all, and based on the IEEE standard you can measure the deviation in your system.


No, just adopt IEEE-754, it is a standard, it is a standard for a reason. Anything out of IEEE-754, it does not conform with IEEE and for such you are on your own. However, you still can demonstrate your deviation from the standard.
Note that IEEE-754 specifies correctly rounded results for all the proposed functions (they appear in the standard's list of recommended correctly-rounded operations, clause 9.2).

lkcl

Aug 8, 2019, 3:09:50 AM
to RISC-V ISA Dev, Mitch...@aol.com, libre-r...@lists.libre-riscv.org, luke.l...@gmail.com
On Thursday, August 8, 2019 at 2:28:43 PM UTC+8, Luis Vitorio Cargnini(OURS/RiVAI) wrote:

> > * Embedded Platform (where it's entirely up to the implementor, as there will be no interaction with public APIs)
> No, IEEE, ARM is an embedded platform and they implement IEEE in all of them

I can see the sense in that one. I just thought that some 3D implementors, particularly say in specialist markets, would want the choice.

Hmmmm, perhaps a 3D Embedded spec, separate from "just" Embedded.

> > * UNIX Platform (which would require strict IEEE754 accuracy, for use in GNU libm, OR repeatable numericalist-acceptable accuracy)
> Standard IEEE, simple no 'new' on this sector.

Makes sense. Cannot risk noninteroperability, even if it means a higher gate count or larger latency.

> > * a *NEW* 3D Platform, where accuracy is defined by strict conformance to a high-profile standard e.g. OpenCL / Vulkan.
>
>
>
>
> No, simply use IEEE that it is all, and based on the IEEE standard you can measure the deviation in your system.

Ok, this is where that's not going to fly. As Mitch mentioned, full IEEE754 compliance would result in doubling the gate count and/or increasing latency through longer pipelines.

In speaking with Jeff Bush from Nyuzi I learned that a GPU is insanely dominated by its FP ALU gate count: well over 50% of the entire chip.

If you double the gate count due to the imposition of unnecessary accuracy (unnecessary because, thanks to 3D Standards compliance, all the shader algorithms are *designed* around lower accuracy requirements), the proposal will be flat-out rejected by adopters, because products based around it will come with a whopping 100% power-performance penalty compared to industry standard alternatives.

So this is why I floated (ha ha) the idea of a new Platform Spec, to give implementors the space to meet industry-driven requirements and remain competitive.

Ironically, our own implementation will need to meet UNIX requirements; it is one of the quirks / penalties of a hybrid design.

L.

Jacob Lifshay

Aug 8, 2019, 3:11:01 AM
to Andrew Waterman, Mitchalsup, RISC-V ISA Dev, Libre-RISCV General Development
On Wed, Aug 7, 2019, 23:30 Andrew Waterman <wate...@eecs.berkeley.edu> wrote:
Hi folks,

We would seem to be putting the cart before the horse.  ISA-level support for correctly rounded transcendentals is speciously attractive, but its utility is not clearly evident and is possibly negative.  It does not make sense to allocate opcode space under these circumstances.

Since there are ways to implement transcendental functions in HW that are faster than anything possible in SW (I think Mitch mentioned a 5-cycle sin implementation), I would argue that having instructions for them is beneficial, and, since they would be useful on a large number of different implementations (GPUs, HPC, bigger desktop/server processors), it's worth standardizing the instructions, since otherwise the custom opcodes used for them would become effectively standardized (as mentioned by Luke) and no longer useful as custom opcodes on implementations that need fast transcendental functions.

I have no problem ending up with different encodings and/or semantics than currently chosen, as long as that's done early enough and in a public manner, so that we can implement the chosen opcodes without undue delay and without being incompatible with the final spec.

Jacob Lifshay

lkcl

Aug 8, 2019, 3:20:23 AM
to RISC-V ISA Dev, Mitch...@aol.com, libre-r...@lists.libre-riscv.org
On Thursday, August 8, 2019 at 2:30:38 PM UTC+8, waterman wrote:
> Hi folks,
>
>
> We would seem to be putting the cart before the horse.  ISA-level support for correctly rounded transcendentals is speciously attractive, but its utility is not clearly evident and is possibly negative.  It does not make sense to allocate opcode space under these circumstances.

There are definitely alternative (conflicting) directions here which are driven by price and performance in radically different markets.

3D graphics is seriously optimised, in completely different directions from those that drove the IEEE754 standard, and unfortunately it has been left up to secretive proprietary companies to lead that, as the NREs kept going up and up, driving out Number Nine and Matrox, leading to ATI being bought by AMD, and so on.

A new Open 3D Alliance initiative is in its early stage of being formed and the plan is to get some feedback from members on what they want, here.

This proposal is therefore part of "planning ahead", and there are *going* to be diverse and highly specialist requirements for which IEEE754 compliance is just not going to fly.... *and* there are going to be adopters for whom IEEE754 is absolutely essential.

Recognising this, by creating separate Platform Specs (specially crafted for 3D implementors that distinguish them from the Embedded and UNIX specs) is, realistically, the pragmatic way forward.

L.

Andrew Waterman

Aug 8, 2019, 3:33:25 AM
to Jacob Lifshay, Mitchalsup, RISC-V ISA Dev, Libre-RISCV General Development
On Thu, Aug 8, 2019 at 12:11 AM Jacob Lifshay <program...@gmail.com> wrote:
On Wed, Aug 7, 2019, 23:30 Andrew Waterman <wate...@eecs.berkeley.edu> wrote:
Hi folks,

We would seem to be putting the cart before the horse.  ISA-level support for correctly rounded transcendentals is speciously attractive, but its utility is not clearly evident and is possibly negative.  It does not make sense to allocate opcode space under these circumstances.

Since there are ways to implement transcendental functions in HW that are faster than anything possible in SW (I think Mitch mentioned a 5-cycle sin implementation), I would argue that having instructions for them is beneficial, and, since they would be useful on a large number of different implementations (GPUs, HPC, bigger desktop/server processors), it's worth standardizing the instructions, since otherwise the custom opcodes used for them would become effectively standardized (as mentioned by Luke) and no longer useful as custom opcodes on implementations that need fast transcendental functions.

That is not a quantitative approach to computer architecture.  We don't add nontrivial features on the basis that they are useful; we add them on the basis that their measured utility justifies their cost.


I have no problems ending up with different encodings and/or semantics than currently chosen, as long as that's done early enough and in a public manner so that we can implement without undue delay the chosen opcodes without being incompatible with the final spec.

Yeah, this is the cart leading the horse.  It's not obvious that the proposed opcodes justify standardization.


Jacob Lifshay

lkcl

Aug 8, 2019, 3:50:17 AM
to RISC-V ISA Dev, wate...@eecs.berkeley.edu, mitch...@aol.com, libre-r...@lists.libre-riscv.org
On Thursday, August 8, 2019 at 3:11:01 PM UTC+8, Jacob Lifshay wrote:
> On Wed, Aug 7, 2019, 23:30 Andrew Waterman <wate...@eecs.berkeley.edu> wrote:
>
> Hi folks,
>
>
> We would seem to be putting the cart before the horse.  ISA-level support for correctly rounded transcendentals is speciously attractive, but its utility is not clearly evident and is possibly negative.  It does not make sense to allocate opcode space under these circumstances.
>
>
>
> Since there are ways to implement transcendental functions in HW that are faster than anything possible in SW (I think Mitch mentioned a 5-cycle sin implementation),

https://patents.google.com/patent/US9471305B2/en

This is really cool. Like CORDIC, it covers a huge range of operations. Mitch described it in the R-Sqrt thread.

> I would argue that having instructions for them is beneficial, and, since they would be useful on a large number of different implementations (GPUs, HPC, bigger desktop/server processors), it's worth standardizing the instructions, since otherwise the custom opcodes used for them would become effectively standardized (as mentioned by Luke) and no longer useful as custom opcodes on implementations that need fast transcendental functions.

If we were talking about an embedded-only product, or a co-processor, with firmware requiring hard-forked or specialist dedicated compilers (like how NVIDIA and AMD do it), we would neither be having this discussion publicly nor putting forward a common Zftrans / Ztrig* spec.

This proposal is for *multiple* use cases *including* hybrid CPU/GPU, low power embedded specialist 3D, *and* standard UNIX (GNU libm).

In talking with Atif from Pixilica a few days ago he relayed to me the responses he got

https://www.pixilica.com/forum/event/risc-v-graphical-isa-at-siggraph-2019/p-1/dl-5d4322170924340017bfeeab

The attendance was *50* people at the BoF! He was expecting maybe two or three :) Some 3D engineers were doing transparent polygons which requires checking the hits from both sides. Using *proprietary* GPUs they have a 100% performance penalty as it is a 2 pass operation.

Others have non-standard projection surfaces (spherical, not flat). No *way* proprietary hardware/software is going to cope with that.

Think Silicon has some stringent low power requirements for their embedded GPUs.

Machine Learning has another set of accuracy requirements (way laxer), where Jacob, I think, mentioned that atan in FP16 can be adequately implemented with a single-cycle lookup table (something like that).

OpenCL even has specialist "fast inaccurate" SPIRV opcodes for some functions (SPIRV is part of Vulkan, and was originally based on LLVM IR). Search this page for "fast_" for examples:

https://www.khronos.org/registry/spir-v/specs/unified1/OpenCL.ExtendedInstructionSet.100.html

The point is: 3D, ML and OpenCL are *nothing* like the Embedded Platform or UNIX Platform world. Everything that we think we know about how it should be done is completely wrong when it comes to this highly specialist, extremely diverse, and unfortunately secretive market.

>
> I have no problems ending up with different encodings and/or semantics than currently chosen, as long as that's done early enough and in a public manner so that we can implement without undue delay the chosen opcodes without being incompatible with the final spec.

Altivec SSE / Vector nightmare, and RISCV is toast.

When we reach the layout milestone, the implementation will be frozen. We are not going to waste our sponsors' money: we have to act responsibly and get it right.

Also, NLNet's funding, once allocated, is gone. We are therefore under time pressure to get the implementation done so that we can put in a second application for the layout.

Bottom line we are not going to wait around, the consequences are too severe (loss of access to funding).

L.


lkcl

Aug 8, 2019, 4:36:49 AM
to RISC-V ISA Dev, program...@gmail.com, mitch...@aol.com, libre-r...@lists.libre-riscv.org
[mostly OT for the thread, but still relevant]

On Thursday, August 8, 2019 at 3:33:25 PM UTC+8, waterman wrote:

> It's not obvious that the proposed opcodes justify standardization.

It's outside of your area of expertise, Andrew. Just as for Luis, all the "metrics" that you use will be screaming "this is wrong, this is wrong!"

Both Jacob and I have Asperger's. In my case, I think in such different conceptual ways that I use language that bit differently, such that it needs "interpretation". Rogier demonstrated that really well a few months back, by "interpreting" something on one of the ISAMUX/ISANS threads.

Think of what I write as being a bit like the old coal mine "canaries". You hear "tweet tweet croak", and you don't understand what the bird said before it became daisy-food but you know to run like hell.

There are several aspects to this proposal. It covers multiple areas - multiple platforms, with different (conflicting) implementation requirements.

It should be obvious that this is not going to fit the "custom" RISCV paradigm, as that's reserved for *private* (hard fork) toolchains and scenarios.

It should also be obvious that, as a public high profile open platform, the pressure on the compiler upstream developers could result in the Altivec SSE nightmare.

The RISCV Foundation has to understand that it is in everybody's best interests to think ahead, strategically on this one, despite it being well outside the core experience of the Founders.

Note, again, worth repeating: it is *NOT* intended or designed for EXCLUSIVE use by the Libre RISCV SoC. It is actually inspired by Pixilica's SIGGRAPH slides, where at the BoF there were over 50 attendees. The diversity of requirements of the attendees was incredible, and they're *really* clear about what they want.

Discussing this proposal as being a single platform is counterproductive and misses the point. It covers *MULTIPLE* platforms.

If you try to undermine or dismiss one area, it does *not* invalidate the other platforms's requirements and needs.

Btw some context, as it may be really confusing as to why we are putting forward a *scalar* proposal when working on a *vector* processor.

SV extends scalar operations. By proposing a mixed multi platform Trans / Trigonometric *scalar* proposal (suitable for multiple platforms other than our own), the Libre RISCV hybrid processor automatically gets vectorised [multi issue] versions of those "scalar" opcodes, "for free".

For a 3D GPU we still have yet to add Texture opcodes, Pixel conversion, Matrices, Z Buffer, Tile Buffer, and many more opcodes. My feeling is that RVV's major opcode brownfield space simply will not cope with all of those, and going to 48 bit and 64 bit is just not an option, particularly for embedded low power scenarios, due to the increased I-cache power penalty.

I am happy for *someone else* to do the work necessary to demonstrate otherwise on that one: we have enough to do already, if we are to keep within budget and on track.

L.

lkcl

Aug 8, 2019, 4:53:32 AM
to RISC-V ISA Dev, Mitch...@aol.com, libre-r...@lists.libre-riscv.org, luke.l...@gmail.com
On Thursday, August 8, 2019 at 2:28:43 PM UTC+8, Luis Vitorio Cargnini(OURS/RiVAI) wrote:

> No, just adopt IEEE-754, it is a standard, it is a standard for a reason. Anything out of IEEE-754, it does not conform with IEEE and for such you are on your own.

Just to emphasise, Luis, Andrew: "on their own" is precisely what each of the proprietary 3D GPU Vendors have done, and made literally billions of dollars by doing so.

Saying "we are on our own" and expecting that to mean that not conforming to IEEE754 would kill the proposal, this is false logic.

MALI (ARM), Vivante, the hated PowerVR, NVidia, AMD/ATI, Samsung's new GPU (with Mitch's work in it), and many more, they *all* went "their own way", hid the hardware behind a proprietary library, and *still made billions of dollars*.

This should tell you what you need to know, namely that a new 3D GPU Platform Spec which has specialist FP accuracy requirements to meet the specific needs of this *multi BILLION dollar market* is essential to the proposal's successful industry adoption.

If we restrict it to UNIX (IEEE754) it's dead.

If we restrict it to *not* require IEEE754, it's dead.

The way to meet all the different industry needs: new Platform Specs.

That does not affect the actual opcodes. They remain the same, no matter the Platform accuracy requirements.

Thus the software libraries and compilers all remain the same, as well.

L.

lkcl

Aug 8, 2019, 5:41:13 AM
to RISC-V ISA Dev, Mitch...@aol.com, libre-r...@lists.libre-riscv.org


On Thursday, August 8, 2019 at 7:30:38 AM UTC+1, waterman wrote:
 
ISA-level support for correctly rounded transcendentals is speciously attractive, but its utility is not clearly evident and is possibly negative. 

wait... hang on: there are now *four* potential Platforms against which this statement has to be verified.  are you saying that, for a *UNIX* platform, correctly-rounded transcendentals are potentially undesirable?
 
It does not make sense to allocate opcode space under these circumstances.

[reminder and emphasis: there are potentially *four* completely separate and distinct Platforms, all of which share / need these exact same opcodes]

l.

Andrew Waterman

Aug 8, 2019, 6:00:16 AM
to lkcl, RISC-V ISA Dev, MitchAlsup, Libre-RISCV General Development
On Thu, Aug 8, 2019 at 2:41 AM lkcl <luke.l...@gmail.com> wrote:


On Thursday, August 8, 2019 at 7:30:38 AM UTC+1, waterman wrote:
 
ISA-level support for correctly rounded transcendentals is speciously attractive, but its utility is not clearly evident and is possibly negative. 

wait... hang on: there are now *four* potential Platforms against which this statement has to be verified.  are you saying that for a *UNIX* platform that correctly-rounded transcendentals are potentially undesirable?

The sentence you quoted began with the adjective "ISA-level".  We happily provide correctly rounded transcendental math on Linux as-is.

 
It does not make sense to allocate opcode space under these circumstances.

[reminder and emphasis: there are potentially *four* completely separate and distinct Platforms, all of which share / need these exact same opcodes]

l.


Andrew Waterman

Aug 8, 2019, 6:07:21 AM
to lkcl, RISC-V ISA Dev, Jacob Lifshay, MitchAlsup, Libre-RISCV General Development
On Thu, Aug 8, 2019 at 1:36 AM lkcl <luke.l...@gmail.com> wrote:
[mostly OT for the thread, but still relevant]

On Thursday, August 8, 2019 at 3:33:25 PM UTC+8, waterman wrote:

> It's not obvious that the proposed opcodes justify standardization.

It's outside of your area of expertise, Andrew. Just as for Luis, all the "metrics" that you use will be screaming "this is wrong, this is wrong!"

Oh, man.  This is great.  "Andrew: outside his element in computer arithmetic" is right up there with "Krste: most feared man in computer architecture".
 

Both Jacob and I have Asperger's. In my case, I think in such different conceptual ways that I use language that bit differently, such that it needs "interpretation". Rogier demonstrated that really well a few months back, by "interpreting" something on one of the ISAMUX/ISANS threads.

Think of what I write as being a bit like the old coal mine "canaries". You hear "tweet tweet croak", and you don't understand what the bird said before it became daisy-food but you know to run like hell.

There are several aspects to this proposal. It covers multiple areas - multiple platforms, with different (conflicting) implementation requirements.

It should be obvious that this is not going to fit the "custom" RISCV paradigm, as that's reserved for *private* (hard fork) toolchains and scenarios.

It should also be obvious that, as a public high profile open platform, the pressure on the compiler upstream developers could result in the Altivec SSE nightmare.

The RISCV Foundation has to understand that it is in everybody's best interests to think ahead, strategically on this one, despite it being well outside the core experience of the Founders.

Note, again, worth repeating: it is *NOT* intended or designed for EXCLUSIVE use by the Libre RISCV SoC. It is actually inspired by Pixilar's SIGGRAPH slides, where at the Bof there were over 50 attendees. The diversity of requirements of the attendees was incredible, and they're *really* clear about what they want.

Discussing this proposal as being a single platform is counterproductive and misses the point. It covers *MULTIPLE* platforms.

If you try to undermine or dismiss one area, it does *not* invalidate the other platforms's requirements and needs.

Btw some context, as it may be really confusing as to why we are putting forward a *scalar* proposal when working on a *vector* processor.

SV extends scalar operations. By proposing a mixed multi platform Trans / Trigonometric *scalar* proposal (suitable for multiple platforms other than our own), the Libre RISCV hybrid processor automatically gets vectorised [multi issue] versions of those "scalar" opcodes, "for free".

For a 3D GPU we still have yet to add Texture opcodes, Pixel conversion, Matrices, Z Buffer, Tile Buffer, and many more opcodes.  My feeling is that RVV's major opcode brownfield space simply will not cope with all of those, and going to 48 bit and 64 bit is just not an option, particularly for embedded low power scenarios, due to the increased I-cache power penalty.

I am happy for *someone else* to do the work necessary to demonstrate otherwise on that one: we have enough to do, already, if we are to keep within budget and on track).

L.


Jacob Lifshay

Aug 8, 2019, 6:09:28 AM
to Luke Kenneth Casson Leighton, RISC-V ISA Dev, Mitchalsup, libre-r...@lists.libre-riscv.org
On Thu, Aug 8, 2019, 02:41 lkcl <luke.l...@gmail.com> wrote:


On Thursday, August 8, 2019 at 7:30:38 AM UTC+1, waterman wrote:
 
ISA-level support for correctly rounded transcendentals is speciously attractive, but its utility is not clearly evident and is possibly negative. 

wait... hang on: there are now *four* potential Platforms against which this statement has to be verified.  are you saying that for a *UNIX* platform that correctly-rounded transcendentals are potentially undesirable?

maybe a solution would be to add an extra field to the fp control csr (or isamux?) to allow selecting one of several accurate or fast modes:
- machine-learning-mode: fast as possible
    -- maybe need additional requirements such as monotonicity for atanh?
- GPU-mode: accurate to within a few ULP
    -- see Vulkan, OpenGL, and OpenCL specs for accuracy guidelines
- almost-accurate-mode: accurate to <1 ULP
     (would 0.51 or some other value be better?)
- fully-accurate-mode: correctly rounded in all cases
- maybe more modes?

all modes are required to produce deterministic answers (no random outputs for the same input) only depending on the input values, the mode, and the fp control reg.

the unsupported modes would cause a trap to allow emulation where traps are supported. emulation of unsupported modes would be required for unix platforms.

there would be a mechanism for user mode code to detect which modes are emulated (csr? syscall?) (if the supervisor decides to make the emulation visible) that would allow user code to switch to faster software implementations if it chooses to.
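
To make the idea concrete, a sketch of what such a field and its trap-to-emulate behaviour might look like (the field name, width, encodings and the hw_sin primitive are all made up for illustration; nothing here is in any spec):

import math
from enum import IntEnum

class FpAccuracy(IntEnum):   # hypothetical 2-bit field in the FP control CSR
    ML    = 0   # fast as possible (perhaps still requiring monotonicity)
    GPU   = 1   # within a few ULP, per the Vulkan/OpenGL/OpenCL accuracy tables
    LIBM  = 2   # < 1 ULP ("almost accurate")
    EXACT = 3   # correctly rounded in every case

def hw_sin(x, mode):
    # stand-in for the actual hardware operation; here just the host libm
    return math.sin(x)

def fsin(x, mode, supported_modes):
    # hardware may internally use a *more* accurate mode than the one requested;
    # a mode it does not support traps so the supervisor can emulate it
    # (mandatory on UNIX platforms, optional elsewhere)
    if mode not in supported_modes:
        raise NotImplementedError("trap: emulate the requested accuracy in software")
    return hw_sin(x, mode)

print(fsin(1.0, FpAccuracy.GPU, {FpAccuracy.GPU, FpAccuracy.EXACT}))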

Jacob Lifshay

lkcl

Aug 8, 2019, 7:17:41 AM
to RISC-V ISA Dev, luke.l...@gmail.com, program...@gmail.com, mitch...@aol.com, libre-r...@lists.libre-riscv.org


On Thursday, August 8, 2019 at 11:07:21 AM UTC+1, andrew wrote:


On Thu, Aug 8, 2019 at 1:36 AM lkcl <luke.l...@gmail.com> wrote:
[mostly OT for the thread, but still relevant]

On Thursday, August 8, 2019 at 3:33:25 PM UTC+8, waterman wrote:

> It's not obvious that the proposed opcodes justify standardization.

It's outside of your area of expertise, Andrew. Just as for Luis, all the "metrics" that you use will be screaming "this is wrong, this is wrong!"

Oh, man.  This is great.  "Andrew: outside his element in computer arithmetic" is right up there with "Krste: most feared man in computer architecture".

bam bam... baaaaa :)

yes, i realised about half an hour later that we may have been speaking at cross-purposes, due to a misunderstanding: there are four separate potential Platforms here, covering each of the specialist areas.  very few people have *3D* optimisation experience [we're lucky to have Mitch involved].

sorry for the misunderstanding, Andrew.  follow-up question (already posted) seeks clarification.

l.

lkcl

Aug 8, 2019, 7:25:45 AM
to RISC-V ISA Dev, luke.l...@gmail.com, mitch...@aol.com, libre-r...@lists.libre-riscv.org


On Thursday, August 8, 2019 at 11:09:28 AM UTC+1, Jacob Lifshay wrote:

maybe a solution would be to add an extra field to the fp control csr (or isamux?) to allow selecting one of several accurate or fast modes:

*thinks*... *puzzled*... hardware can't be changed, so you'd need to pre-allocate the gates to cope with e.g. UNIX Platform spec (libm interoperability), so why would you need a CSR to switch "modes"?

ah, ok, i think i got it, and it's [potentially] down to the way we're designing the ALU, to enter "recycling" of data through the pipeline to give better accuracy.

are you suggesting that implementors be permitted to *dynamically* alter the accuracy of the results that their hardware produces, in order to comply with *more than one* of the [four so far] proposed Platform Specs, *at runtime*?

thus, for example, our hardware would (purely as an example) be optimised to produce OpenCL-compliant results during "3D GPU Platform mode", and as such would need less gates to do so.  HOWEVER, for when that exact same hardware was used in the GNU libm library, it would set "UNIX Platform FP hardware mode", and consequently produce results that were accurate to UNIX Platform requirements (whatever was decided - IEEE754, 0.5 ULP precision, etc. etc. whatever it was).

in this "more accurate" mode, the latency would be increased... *and we wouldn't care* [other implementors might], because it's not performance-critical: the switch is just to get "compliance".

that would allow us to remain price-performance-watt competitive with other GPUs, yet also meet UNIX Platform requirements.

something like that?

l.

Jacob Lifshay

Aug 8, 2019, 7:47:38 AM
to Luke Kenneth Casson Leighton, RISC-V ISA Dev, Mitchalsup, libre-r...@lists.libre-riscv.org
On Thu, Aug 8, 2019, 04:25 lkcl <luke.l...@gmail.com> wrote:
On Thursday, August 8, 2019 at 11:09:28 AM UTC+1, Jacob Lifshay wrote:

maybe a solution would be to add an extra field to the fp control csr (or isamux?) to allow selecting one of several accurate or fast modes:

*thinks*... *puzzled*... hardware can't be changed, so you'd need to pre-allocate the gates to cope with e.g. UNIX Platform spec (libm interoperability), so why would you need a CSR to switch "modes"?

ah, ok, i think i got it, and it's [potentially] down to the way we're designing the ALU, to enter "recycling" of data through the pipeline to give better accuracy.

are you suggesting that implementors be permitted to *dynamically* alter the accuracy of the results that their hardware produces, in order to comply with *more than one* of the [four so far] proposed Platform Specs, *at runtime*?
yes.

also, having explicit mode bits allows emulating more accurate operations when the HW doesn't actually implement the extra gates needed.
This allows greater software portability (allows converting a libm call into a single instruction without requiring hw that implements the required accuracy).

thus, for example, our hardware would (purely as an example) be optimised to produce OpenCL-compliant results during "3D GPU Platform mode", and as such would need less gates to do so.  HOWEVER, for when that exact same hardware was used in the GNU libm library, it would set "UNIX Platform FP hardware mode", and consequently produce results that were accurate to UNIX Platform requirements (whatever was decided - IEEE754, 0.5 ULP precision, etc. etc. whatever it was).

in this "more accurate" mode, the latency would be increased... *and we wouldn't care* [other implementors might], because it's not performance-critical: the switch is just to get "compliance".

that would allow us to remain price-performance-watt competitive with other GPUs, yet also meet UNIX Platform requirements.

something like that?
yup.

I do think that there should be an exact-rounding mode even if the UNIX platform doesn't require that much accuracy, otherwise, HPC implementations (or others who need exact rounding) will run into the same dilemma of needing more instruction encodings again.

Jacob

Jacob Lifshay

Aug 8, 2019, 7:56:15 AM
to Luke Kenneth Casson Leighton, RISC-V ISA Dev, Mitchalsup, libre-r...@lists.libre-riscv.org
On Thu, Aug 8, 2019, 03:09 Jacob Lifshay <program...@gmail.com> wrote:
maybe a solution would be to add an extra field to the fp control csr (or isamux?) to allow selecting one of several accurate or fast modes:
- machine-learning-mode: fast as possible
    -- maybe need additional requirements such as monotonicity for atanh?
- GPU-mode: accurate to within a few ULP
    -- see Vulkan, OpenGL, and OpenCL specs for accuracy guidelines
- almost-accurate-mode: accurate to <1 ULP
     (would 0.51 or some other value be better?)
- fully-accurate-mode: correctly rounded in all cases
- maybe more modes?

One more part: hw can implement a less accurate mode as if a more accurate mode was selected, so, for example, hw can implement all modes using hw that produces correctly-rounded results (which will be the most accurate mode defined) and just ignore the mode field since correct-rounding is not less accurate than any of the defined modes.

Jacob

lkcl

Aug 8, 2019, 8:32:43 AM
to RISC-V ISA Dev, luke.l...@gmail.com, mitch...@aol.com, libre-r...@lists.libre-riscv.org
On Thursday, August 8, 2019 at 7:47:38 PM UTC+8, Jacob Lifshay wrote:

> are you suggesting that implementors be permitted to *dynamically* alter the accuracy of the results that their hardware produces, in order to comply with *more than one* of the [four so far] proposed Platform Specs, *at runtime*?
> yes.

Ok. I like it. It's kinda only something that hybrid CPU/GPU combinations would want; however, the level of interest that Pixilica got at SIGGRAPH 2019 in their hybrid CPU/GPU concept says to me that this is on the right track.

Also a dynamic switch stops any fighting over whether one Platform Spec should get priority preference to the exclusion of others.

Will update the page shortly.

>
> also, having explicit mode bits allows emulating more accurate operations when the HW doesn't actually implement the extra gates needed.

Oh, yes, good point, however it would only be mandatory for UNIX* Platforms to provide such traps.

> This allows greater software portability (allows converting a libm call into a single instruction without requiring hw that implements the required accuracy).

and associated performance penalties of doing so (extra conditional tests) if the trap isn't there. The conditional tests which substitute for a lack of a trap adversely impact performance for *both* modes.

>
> I do think that there should be an exact-rounding mode even if the UNIX platform doesn't require that much accuracy, otherwise, HPC implementations (or others who need exact rounding) will run into the same dilemma of needing more instruction encodings again.

Hmm hmm.... well, you know what? If it's behind a CSR Mode flag, and traps activate on unsupported modes, I see no reason why there should not be an extra accuracy mode.

L.


lkcl

Aug 8, 2019, 8:37:49 AM
to RISC-V ISA Dev, luke.l...@gmail.com, mitch...@aol.com, libre-r...@lists.libre-riscv.org
On Thursday, August 8, 2019 at 7:56:15 PM UTC+8, Jacob Lifshay wrote:

>
> One more part: hw can implement a less accurate mode as if a more accurate mode was selected, so, for example, hw can implement all modes using hw that produces correctly-rounded results (which will be the most accurate mode defined) and just ignore the mode field since correct-rounding is not less accurate than any of the defined modes.

Hmm, don't know. Hendrik pointed out the Amdahl / IBM370 mainframe problem that extra accuracy caused.

I don't know if that lesson from history matters [in 2019].

No clue. Don't know enough to offer an opinion either way. Anyone any recommendations?

L.


Jacob Lifshay

unread,
Aug 8, 2019, 8:42:57 AM8/8/19
to Luke Kenneth Casson Leighton, RISC-V ISA Dev, Mitchalsup, libre-r...@lists.libre-riscv.org
On Thu, Aug 8, 2019, 05:37 lkcl <luke.l...@gmail.com> wrote:
On Thursday, August 8, 2019 at 7:56:15 PM UTC+8, Jacob Lifshay wrote:

>
> One more part: hw can implement a less accurate mode as if a more accurate mode was selected, so, for example, hw can implement all modes using hw that produces correctly-rounded results (which will be the most accurate mode defined) and just ignore the mode field since correct-rounding is not less accurate than any of the defined modes.

Hmm, don't know. Hendrik pointed out the Amdahl / IBM 370 mainframe problem that extra accuracy caused.
if portable results are desired, correct rounding produces the same results on all (even non-risc-v) hw, for all implementation algorithms.

less accurate modes produce results that depend on the exact algorithm chosen, which is a choice that should be left for implementers.

lkcl

unread,
Aug 8, 2019, 8:44:50 AM8/8/19
to RISC-V ISA Dev, luke.l...@gmail.com, mitch...@aol.com, libre-r...@lists.libre-riscv.org
On Thursday, August 8, 2019 at 6:09:28 PM UTC+8, Jacob Lifshay wrote:

>
> maybe a solution would be to add an extra field to the fp control csr (or isamux?) to allow selecting one of several accurate or fast modes:

No, definitely not ISAMUX/ISANS: its purpose is for switching (paging in) actual opcodes. (Not quite true: LE/BE kinda flips in the LE variants of LD/ST.)

An FP CSR (dedicated, or fields in an existing one) makes more sense I think, because it's quite a few bits, and I can see some potential value in the same bits being applied to F, D, H and Q as well.

Hmmm

lkcl

unread,
Aug 8, 2019, 8:56:44 AM8/8/19
to RISC-V ISA Dev, luke.l...@gmail.com, mitch...@aol.com, libre-r...@lists.libre-riscv.org

Hendrik's example was that Amdahl hardware had correct (accurate) FP, whereas the IBM 370 did not.

Application writers ran into problems when running on *more accurate* hardware. Amdahl had to patch the OS, with an associated performance penalty, to *downgrade* the FP accuracy and emulate IBM's *inaccurate* hardware precisely.

What I do not know is whether there was something unique about the 370 mainframe and the applications being written for it, or whether, now in 2019, this is sufficiently well understood that all FP application writers have properly taken *better* accuracy (not worse accuracy: *better* accuracy) into consideration in the design of their programs.

Not knowing the answer to that question - not knowing if it is a risky proposition or not - tends to suggest to me that erring on the side of caution and *not* letting implementors provide more accuracy than the FP Accuracy CSR requests is the "safer" albeit more hardware-burdensome option.

Hence why I said I have no clue what the best answer is, here.

L.

lkcl

unread,
Aug 8, 2019, 9:27:50 AM8/8/19
to RISC-V ISA Dev, luke.l...@gmail.com, mitch...@aol.com, libre-r...@lists.libre-riscv.org
On Thursday, August 8, 2019 at 8:56:44 PM UTC+8, lkcl wrote:


> What I do not know is whether there was something unique about the 370 mainframe and the applications being written for it, or whether, now in 2019, this is sufficiently well understood that all FP application writers have properly taken *better* accuracy (not worse accuracy: *better* accuracy) into consideration in the design of their programs.

I *think* this is what Andrew might have been trying to get across.

L.

Andrew Waterman

unread,
Aug 8, 2019, 9:51:57 AM8/8/19
to lkcl, RISC-V ISA Dev, MitchAlsup, Libre-RISCV General Development
That story might have roots in IBM's alternate base-16 floating-point format, but in any case, that wasn't the point I was trying to make.  I stand by my first message to this thread.


L.



lkcl .

unread,
Aug 8, 2019, 10:01:29 AM8/8/19
to Andrew Waterman, RISC-V ISA Dev, MitchAlsup, Libre-RISCV General Development
On Thu, Aug 8, 2019 at 11:00 AM Andrew Waterman <and...@sifive.com> wrote:

>> wait... hang on: there are now *four* potential Platforms against which this statement has to be verified. are you saying that for a *UNIX* platform that correctly-rounded transcendentals are potentially undesirable?
>
>
> The sentence you quoted began with the adjective "ISA-level". We happily provide correctly rounded transcendental math on Linux as-is.

i am very confused. we seem to be talking at cross-purposes, and i
have no idea where the confusion lies.

it appears that you are rejecting the possibility of providing
ISA-level support for transcendental and trigonometric operations for
*four* possible platform scenarios just because "correctly rounded
transcendental math is provided on linux". this has me utterly
confused.

particularly when, even *on* one of those platforms - the standard
UNIX Platform - Jacob pointed out that there may exist
High-Performance Server scenarios that would want the increased
performance - *on linux* - that such ISA-level support would provide.

i apologise: i don't understand what is going on.

why would even *one* argument "we provide accurate SOFTWARE math
libraries on linux" be reasonable cause to reject HARDWARE support for
the same?

why would that argument be relevant for THREE OTHER completely
different Platform Profiles?

i don't understand.

l.

lkcl .

unread,
Aug 8, 2019, 10:02:46 AM8/8/19
to Andrew Waterman, RISC-V ISA Dev, MitchAlsup, Libre-RISCV General Development
On Thu, Aug 8, 2019 at 2:51 PM Andrew Waterman
<wate...@eecs.berkeley.edu> wrote:

> That story might have roots in IBM's alternate base-16 floating-point format, but in any case, that wasn't the point I was trying to make. I stand by my first message to this thread.

what point were you trying to make? i don't understand it. or, i see
the words: i just don't understand the implications, which seem
extreme and illogical. so i must be missing something.

l.

Andrew Waterman

unread,
Aug 8, 2019, 10:14:58 AM8/8/19
to lkcl ., Libre-RISCV General Development, MitchAlsup, RISC-V ISA Dev
I don’t understand the need for all those capital letters.

As I mentioned earlier, an instruction being useful is not by itself a justification for adding it to the ISA. Where’s the quantitative case for these instructions?



l.


Allen Baum

unread,
Aug 8, 2019, 10:55:27 AM8/8/19
to Jacob Lifshay, Luke Kenneth Casson Leighton, RISC-V ISA Dev, Mitchalsup, libre-r...@lists.libre-riscv.org
From my point of view, it needs to match the reference model for any ratified standard, else it won't be labeled compliant. We've talked about something looser, especially for vector reduce, where implementation operation ordering could produce wildly different results - but it's unlikely to happen.

-Allen

On Aug 7, 2019, at 10:29 PM, Jacob Lifshay <program...@gmail.com> wrote:

On Wed, Aug 7, 2019, 22:20 lkcl <luke.l...@gmail.com> wrote:
this tends to suggest that three platform specs are needed:

* Embedded Platform (where it's entirely up to the implementor, as there will be no interaction with public APIs)
* UNIX Platform (which would require strict IEEE754 accuracy, for use in GNU libm, OR repeatable numericalist-acceptable accuracy)
* a *NEW* 3D Platform, where accuracy is defined by strict conformance to a high-profile standard e.g. OpenCL / Vulkan.
That wouldn't quite work on our GPU design, since it's supposed to be both a GPU and a CPU that conforms to the UNIX Platform, it would need to meet the requirements of the UNIX Platform and the 3D Platform, which would still end up with correct rounding being needed.


MitchAlsup

unread,
Aug 8, 2019, 11:58:17 AM8/8/19
to RISC-V ISA Dev, program...@gmail.com, luke.l...@gmail.com, mitch...@aol.com, libre-r...@lists.libre-riscv.org

We are talking about all of this without a point of reference.

Here is what I do know about correctly rounded transcendentals::

My technology for performing transcendentals in an FMAC unit performs a power series polynomial calculation.

I can achieve 14 cycle LN2, EXP2 and 19 cycle SIN, COS faithfully rounded with coefficient tables which are (essentially) the same size as the FDIV/FSQRT seed tables for Newton-Raphson (or Goldschmidt) iterations. FDIV will end up at 17 cycles and FSQRT at 23 cycles. This is exactly what Opteron FDIV/FSQRT performance was (oh so long ago).

If you impose the correctly rounded requirement:: 
a) the size of the coefficient tables grows by 3.5× and 
b) the number of cycles to compute grows by 1.8×
c) the power to compute grows by 2.5×
For a gain of accuracy of about 0.005 ULP
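(For readers not familiar with the technique: the power-series evaluation is essentially Horner's rule chained through the FMA datapath, something like the sketch below - placeholder coefficient table, no range reduction, and nothing claimed about the actual implementation.)

    #include <math.h>

    /* Evaluate p(x) = c[0] + c[1]*x + ... + c[n-1]*x^(n-1) by Horner's rule.
       Each step is one fused multiply-add, so an existing FMAC pipeline can
       host the whole transcendental; the coefficient table (minimax fits on
       a reduced argument range in a real design) is the main extra cost. */
    static double poly_fma(const double *c, int n, double x) {
        double acc = c[n - 1];
        for (int i = n - 2; i >= 0; i--)
            acc = fma(acc, x, c[i]);   /* acc = acc*x + c[i], single rounding */
        return acc;
    }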

Dan Petrisko

unread,
Aug 8, 2019, 2:19:38 PM8/8/19
to MitchAlsup, RISC-V ISA Dev, Jacob Lifshay, Luke Kenneth Casson Leighton, Libre-RISCV General Development
maybe a solution would be to add an extra field to the fp control csr (or isamux?) to allow selecting one of several accurate or fast modes:

Preface: As Andrew points out, any ISA proposal must be associated with a quantitative evaluation to consider tradeoffs.

A natural place for a standard reduced accuracy extension "Zfpacc" would be in the reserved bits of FCSR.  It could be treated very similarly to how dynamic frm is treated now. Currently, there are 5 bits of fflags, 3 bits of frm and 24 Reserved bits. The L (decimal floating-point) extension will presumably use some, but not all of them. I'm unable to find any public proposals for L bit encodings in FCSR.

For reference, frm is treated as follows:
Floating-point operations use either a static rounding mode encoded in the instruction, or a dynamic rounding mode held in frm. Rounding modes are encoded as shown in Table 11.1. A value of 111 in the instruction’s rm field selects the dynamic rounding mode held in frm. If frm is set to an invalid value (101–111), any subsequent attempt to execute a floating-point operation with a dynamic rounding mode will raise an illegal instruction exception.

Let's say that we wish to support up to 4 accuracy modes -- 2 'fam' bits.  Default would be IEEE-compliant, encoded as 00.  This means that all current hardware would be compliant with the default mode.

the unsupported modes would cause a trap to allow emulation where traps are supported. emulation of unsupported modes would be required for unix platforms.

As with frm, an implementation can choose to support any permutation of dynamic fam-instruction pairs. It will illegal-instruction trap upon executing an unsupported fam-instruction pair.  The implementation can then emulate the accuracy mode required.
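A sketch of what flipping the field might look like from user mode, purely to illustrate the frm-like flow (the bit position is an assumption - bits above frm are currently reserved in fcsr, and the real placement and encodings would be for Zfpacc to define):

    /* Hypothetical 2-bit 'fam' field placed at fcsr[9:8] for illustration only. */
    #define FAM_SHIFT 8
    #define FAM_MASK  (3UL << FAM_SHIFT)

    static inline void set_fam(unsigned long mode) {
        unsigned long v;
        asm volatile ("csrr %0, fcsr" : "=r"(v));
        v = (v & ~FAM_MASK) | ((mode & 3UL) << FAM_SHIFT);
        asm volatile ("csrw fcsr, %0" : : "r"(v));
        /* If a subsequent FP instruction forms an unsupported fam-instruction
           pair, it raises an illegal-instruction trap and can be emulated. */
    }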

there would be a mechanism for user mode code to detect which modes are emulated (csr? syscall?) (if the supervisor decides to make the emulation visible) that would allow user code to switch to faster software implementations if it chooses to.
 
If the bits are in FCSR, then the switch itself would be exposed to user mode.  User-mode would not be able to detect emulation vs hardware supported instructions, however (by design).  That would require some platform-specific code.

Now, which accuracy modes should be included is a question outside of my expertise and would require a literature review of instruction frequency in key workloads, PPA analysis of simple and advanced implementations, etc.  (Thanks for the insights, Mitch!)

emulation of unsupported modes would be required for unix platforms.

I don't see why Unix should be required to emulate some arbitrary reduced accuracy ML mode.  My guess would be that Unix Platform Spec requires support for IEEE, whereas arbitrary ML platform requires support for Mode XYZ.  Of course, implementations of either platform would be free to support any/all modes that they find valuable.  Compiling for a specific platform means that support for required accuracy modes is guaranteed (and therefore does not need discovery sequences), while allowing portable code to execute discovery sequences to detect support for alternative accuracy modes.

Best,
Dan Petrisko






Allen Baum

unread,
Aug 8, 2019, 2:36:35 PM8/8/19
to MitchAlsup, RISC-V ISA Dev, Jacob Lifshay, lkcl, Libre-RISCV General Development
For what it's worth, the HP calculator algorithms had Prof. Kahan as a consultant (and HP had exclusive rights for the decimal versions of the algorithm; I think Intel had rights to the binary versions). 
Their accuracy requirements were that the result was accurate to within +/-1 bit of the *input* argument, which gives quite a bit of leeway when the slope of the function is extremely steep. Since many of the trig functions required input reduction of X mod pi (or 2pi or 0.5pi - don't recall), that could be pretty far out without ~99 digits of pi to reduce it, and even if it was perfectly reduced, one LSB of X.xxxxxxxxE99 is not a small number, so accuracy at the end of the scale is a bit nebulous.
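To put a number on that last point: even at moderately large arguments, one LSB of the input spans a huge interval of radians, so "the sine of x" barely constrains the answer (a quick double-precision sketch; the calculators' E99 decimal range makes it far worse):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double x  = 1e22;                       /* large but representable argument */
        double x1 = nextafter(x, INFINITY);     /* x plus one LSB                   */
        printf("ulp(x)     = %g\n", x1 - x);    /* roughly 2 million                */
        printf("sin(x)     = %.17g\n", sin(x));
        printf("sin(x+ulp) = %.17g\n", sin(x1)); /* essentially unrelated value     */
        return 0;
    }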


Allen Baum

unread,
Aug 8, 2019, 3:09:26 PM8/8/19
to MitchAlsup, RISC-V ISA Dev, Jacob Lifshay, lkcl, Libre-RISCV General Development
Regarding the statement:
  As with frm, an implementation can choose to support any permutation of dynamic fam-instruction pairs.
It will illegal-instruction trap upon executing an unsupported fam-instruction pair.
Putting my compliance hat on (I have to do that a lot), this works only if 
 - the reference model is capable of being configured to trap on any permutation of illegal opcodes, OR 
 - the compliance framework can load and properly execute abitrary (vendor supplied) emulation routines 
 -- and they get exactly the same answer as the reference model.

This is all moot if you don't want to use the RISC-V trademark, or the platform doesn't require whatever is non-compliant, of course (which isn't flippant - it's then a custom extension that may work perfectly well for some custom application).

MitchAlsup

unread,
Aug 8, 2019, 4:15:32 PM8/8/19
to RISC-V ISA Dev, Mitch...@aol.com, program...@gmail.com, luke.l...@gmail.com, libre-r...@lists.libre-riscv.org


On Thursday, August 8, 2019 at 1:19:38 PM UTC-5, Dan Petrisko wrote:
maybe a solution would be to add an extra field to the fp control csr (or isamux?) to allow selecting one of several accurate or fast modes:

Preface: As Andrew points out, any ISA proposal must be associated with a quantitative evaluation to consider tradeoffs.

A natural place for a standard reduced accuracy extension "Zfpacc" would be in the reserved bits of FCSR.  

In my patent application concerning double precision transcendentals, I invented another way.

The HW is in the position to KNOW if it is potentially making an incorrect rounding, 
and in the My 66000 architecture, there is an exception that can be enabled to transfer control when the HW is about to deliver a potentially improperly rounded result. Should the application enable said exception, a rounding not KNOWN to be correct will transfer control to some SW that can fix the problem.
The My 66000 ISA can deliver control to the trap handler in about a dozen cycles (complete with register file swapping) and back in about another dozen cycles.
If the trap rate is less than 1% and the trap overhead to deliver a correct result is on the order of 300 cycles, then the user will see a 5% penalty in wall clock time while never seeing an incorrectly rounded result. That 5% penalty is 1 clock when transcendentals only take 20 cycles to complete.

HW skimps in precision, SW takes over when there is potential for error, and the overhead is essentially negligible.
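A back-of-the-envelope way to read those numbers: if a fraction p of operations trap, each trap costs C extra cycles, and the operation itself takes T cycles, the average wall-clock penalty is p*C/T - keeping the product p*C at around one cycle gives the quoted ~5% on a 20-cycle operation.

    /* Rough overhead model (illustrative, not a measurement). */
    double trap_overhead(double p, double C, double T) {
        return p * C / T;   /* fraction of extra wall-clock time */
    }
    /* e.g. trap_overhead(1.0/300, 300.0, 20.0) == 0.05  ->  ~5% penalty */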

But note:: this is root claim 4 in my patent application.

Note 2:: The slower the TRAP/RETT is the better the HW needs to be to have negligible overhead.

lkcl

unread,
Aug 8, 2019, 11:12:47 PM8/8/19
to RISC-V ISA Dev, luke.l...@gmail.com, libre-r...@lists.libre-riscv.org, Mitch...@aol.com
On Thursday, August 8, 2019 at 10:14:58 PM UTC+8, andrew wrote:

> As I mentioned earlier, an instruction being useful is not by itself a justification for adding it to the ISA. Where’s the quantitative case for these instructions?

3D is a billion-dollar mass-volume market; proprietary GPUs are going after mass volume, inherently excluding the needs of (profitable) long-tail markets.

L.

lkcl

unread,
Aug 8, 2019, 11:26:13 PM8/8/19
to RISC-V ISA Dev, program...@gmail.com, luke.l...@gmail.com, mitch...@aol.com, libre-r...@lists.libre-riscv.org
On Thursday, August 8, 2019 at 10:55:27 PM UTC+8, Allen Baum wrote:
> From my point of view, it needs to match the reference model for any ratified standard., else it won’t be labeled compliant.

You mean RV compliance? It also needs to realistically meet customer demand.

In the market dominated by AMD and NVidia that gives one driving factor: compliance with Khronos specifications. Failure to meet these predefined requirements will automatically result in market rejection.

In the embedded GPU market, typically defined as around 1024x768 resolution, sometimes even 14 bit accuracy is completely pointless and just prices the product out of an extremely competitive and lucrative market.

MIPS 3D ASE had a special 12 bit accuracy FPDIV operation that could be called twice in succession. For pixel positions up to around +/-2048 12 bit accuracy works perfectly well.
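That "call it twice" pattern is essentially seeded iterative refinement; the generic version of the idea (plain Newton-Raphson, not the exact MIPS 3D ASE instruction semantics) looks like:

    /* Refine a low-precision reciprocal estimate y0 ~= 1/b with one
       Newton-Raphson step: y1 = y0 * (2 - b*y0).  A ~12-bit seed refined
       once gives ~24 bits; refine again if more accuracy is needed. */
    static inline float recip_refine(float b, float y0) {
        return y0 * (2.0f - b * y0);
    }

    /* a/b using a hypothetical low-precision hardware estimate:
         float y = recip_estimate(b);        // e.g. ~12-bit accurate
         float q = a * recip_refine(b, y);   // good enough for pixel maths */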

So the proposal basically has to be flexible enough to recognise Industry Standard market driven requirements.


I received this comment from an offline discussion:

Yes, embedded systems typically can do with 12, 16 or 32 bit accuracy. Rarely does it require 64 bits. But the idea of making a low power 32 bit FPU/DSP that can accommodate 64 bits is already being done in other designs such as PIC etc I believe. For embedded graphics 16 bit is more than adequate. In fact, Cornell had a very innovative 18-bit floating point format described here (useful for FPGA designs with 18-bit DSPs):

https://people.ece.cornell.edu/land/courses/ece5760/FloatingPoint/index.html

A very interesting GPU using the 18-bit FPU is also described here:

https://people.ece.cornell.edu/land/courses/ece5760/FinalProjects/f2008/ap328_sjp45/website/hardwaredesign.html

lkcl

unread,
Aug 8, 2019, 11:38:43 PM8/8/19
to RISC-V ISA Dev, program...@gmail.com, luke.l...@gmail.com, mitch...@aol.com, libre-r...@lists.libre-riscv.org
On Thursday, August 8, 2019 at 11:58:17 PM UTC+8, MitchAlsup wrote:
> We are talking about all of this without a point of reference.
>
>
> Here is what I do know about correctly rounded transcendentals::

I'd spotted that you mentioned these earlier, thank you for reiterating them.

>
> My technology for performing transcendentals in an FMAC unit performs a power series polynomial calculation.

>
>
> If you impose the correctly rounded requirement:: 
> a) the size of the coefficient tables grows by 3.5× and 
> b) the number of cycles to compute grows by 1.8×
> c) the power to compute grows by 2.5×
> For a gain of accuracy of about 0.005 ULP

To put this into perspective: 3D GPU FP units make up something mad like a staggering 50% of the total silicon.

So in the highly competitive mass-volume 3D GPU market, where full accuracy is non-essential, that would mean entering the market with a product that had the power/performance characteristics of a 3-year-old profile [at the price point of a modern competitor].

It would be stone cold dead long before it entered design, and no sane VC would fund it.

However in the UNIX Platform profile, where the FPU is ratcheted back so as not to completely dominate the chip, the impact of higher accuracy is far less, and the needs of customers are far different anyway.

As you can see from the offline message I received, the 3D Embedded Market is even weirder, and market forces and customer needs drive in *completely* the opposite direction.

3D is just completely different from what people are used to in the [current] RISCV Embedded and Unix platforms.

L.


lkcl

unread,
Aug 9, 2019, 12:16:32 AM8/9/19
to RISC-V ISA Dev, Mitch...@aol.com, program...@gmail.com, luke.l...@gmail.com, libre-r...@lists.libre-riscv.org
On Friday, August 9, 2019 at 2:19:38 AM UTC+8, Dan Petrisko wrote:

> A natural place for a standard reduced accuracy extension "Zfpacc" would be in the reserved bits of FCSR.

I like it [separate extension]


> Let's say that we wish to support up to 4 accuracy modes -- 2 'fam' bits. 

[From the 3D Embedded world, where between 12 and 18 bits are typically used, it may be necessary to have more than 2 bits. We will see what happens when more input/feedback from stakeholders occurs]

Also there are some specific reduced-accuracy requirements ("fast_*") in the OpenCL SPIR-V opcode spec; these would need to be included too.

Otherwise, separate opcodes would need to be added, just to support those SPIRV operations, which is against the principle of RISC.

> Default would be IEEE-compliant, encoded as 00.  This means that all current hardware would be compliant with the default mode.

This would be important!

>
> the unsupported modes would cause a trap to allow emulation where traps are supported. emulation of unsupported modes would be required for unix platforms.

Yes, agreed, this is very important. Embedded are on their own (as normal).


>
> As with frm, an implementation can choose to support any permutation of dynamic fam-instruction pairs. It will illegal-instruction trap upon executing an unsupported fam-instruction pair.  The implementation can then emulate the accuracy mode required.

I like it.

> there would be a mechanism for user mode code to detect which modes are emulated (csr? syscall?) (if the supervisor decides to make the emulation visible)

Hmmmm, if speed or power consumption of an implementation is compromised by that, it would be bad (and also Khronos nonconformant, see below).

> that would allow user code to switch to faster software implementations if it chooses to.
>  
> If the bits are in FCSR, then the switch itself would be exposed to user mode.  User-mode would not be able to detect emulation vs hardware supported instructions, however (by design).  That would require some platform-specific code.

Hmmm. 3D is quite different.

Look at software unaccelerated MesaGL. High end games are literally unplayable in software rendering, and the Games Studios will in some cases not even permit the game to run if certain hardware characteristics are not met, because it would bring the game into disrepute if it was permitted to run and looked substandard.

Bottom line is: Security be damned - the usermode software *has* to know *everything* about the actual hardware, and there are Standard APIs to list the hardware characteristics.

If those APIs "lie" about those characteristics, not only will the end users bitch like mad (justifiably), the ASIC will *FAIL* Khronos conformance and compliance and will not be permitted to be sold with the Vulkan and OpenGL badge on it (they're Trademarks).

There will be some designs where even the temperature sensors are fed back to userspace and the 3D rendering demands dialed back to not overheat the ASIC and still keep user response time expectations to acceptable levels.

(Gamers HATE lag. It can result in loss of a tournament).

>
> Now, which accuracy modes should be included is a question outside of my expertise and would require a literature review of instruction frequency in key workloads, PPA analysis of simple and advanced implementations, etc. 


Yes. It's a lot of work (that offline message had some links already), and my hope is that the stakeholders in the (yet to be formed/announced) 3D Open Graphics Alliance will have a vested interest in doing exactly that.

The point of raising the Ztrans and Zftrig* proposals at this early phase is to have some underpinnings in place so that the Alliance members can hit the ground running.



> emulation of unsupported modes would be required for unix platforms.

Yes.

>
> I don't see why Unix should be required to emulate some arbitrary reduced accuracy ML mode.

It's completely outside of my area of expertise to say, one way or the other. It will need a thorough review and some input from experienced 3D software developers.

>  My guess would be that Unix Platform Spec requires support for IEEE,

concur.

> whereas arbitrary ML platform requires support for Mode XYZ.  Of course, implementations of either platform would be free to support any/all modes that they find valuable. 

concur.

> Compiling for a specific platform means that support for required accuracy modes is guaranteed (and therefore does not need discovery sequences), while allowing portable code to execute discovery sequences to detect support for alternative accuracy modes.

The latter will be essential for detecting the "fast_*" capability.

Main point: I cannot emphasise enough how critical it is that userspace software get at the underlying hardware characteristics. This is for Khronos Standards Compliance.

Sensible proposal, Dan. Will write it up shortly.

L.

lkcl

unread,
Aug 9, 2019, 12:24:56 AM8/9/19
to RISC-V ISA Dev, program...@gmail.com, luke.l...@gmail.com, mitch...@aol.com, libre-r...@lists.libre-riscv.org
On Thursday, August 8, 2019 at 10:55:27 PM UTC+8, Allen Baum wrote:

> From my point of view, it needs to match the reference model for any ratified standard., else it won’t be labeled compliant.

To reiterate in this context, after reading Dan's post: Khronos Conformance (Vulkan, OpenGL, OpenCL) is absolutely critical for products to enter certain high profit markets.

The Khronos Group also has trademarks, and it is critical that their Industry Standard requirements be met.

It will be absolutely essential for RISCV Conformance / Compliance / Standards to *not* get in the way or impede Khronos Conformance / Compliance / Standards, in any way, shape or form.

This is why I suggested the [new] "3D UNIX Platform" as it makes it clear that its specialist requirements are completely separate and distinct from, and do not impact in any way, the [present] UNIX Platform Spec.

L.

Andrew Waterman

unread,
Aug 9, 2019, 1:39:53 AM8/9/19
to lkcl, Mitch...@aol.com, RISC-V ISA Dev, libre-r...@lists.libre-riscv.org
That’s a business justification, not an architectural one, and in any case it’s a justification for different instructions than the ones you’ve proposed. The market you’re describing isn’t populated with products that have ISA-level support for correctly rounded transcendentals; they favor less accurate operations or architectural support for approximations and iterative refinement.



L.



lkcl

unread,
Aug 9, 2019, 1:42:18 AM8/9/19
to RISC-V ISA Dev, Mitch...@aol.com, program...@gmail.com, luke.l...@gmail.com, libre-r...@lists.libre-riscv.org
On Friday, August 9, 2019 at 3:09:26 AM UTC+8, Allen Baum wrote:
> Regarding the statement:
>   As with frm, an implementation can choose to support any permutation of dynamic fam-instruction pairs.
> It will illegal-instruction trap upon executing an unsupported fam-instruction pair.
> Putting my compliance hat on (I have to do that a lot), this works only if 
>  - the reference model is capable of being configured to trapon any permutation of illegal opcodes, OR 
>  - the compliance framework can load and properly execute abitrary (vendor supplied) emulation routines 
>  -- and they get exactly the same answer as the reference model.
>
>
>
> This is all mot if you don't want to use the RISC-V trademark, or the platform doesn't requirement whatever is non-compliant, of course (which isn't flippant - its then a custom extension that may work perfectly well for some custom appllication).

Absolutely, agreed.

The weird thing here is that Zftrans/Ztrig*/Zfacc are for a wide and diverse range of implementors, covering:

* Hybrid CPU/GPUs. Aside from the Libre RISCV SoC, Pixilica's well attended SIGGRAPH 2019 BoF was a Hybrid proposal that had a huge amount of interest, because of the flexibility

https://s2019.siggraph.org/presentation/?id=bof_191&sess=sess590

Here, paradoxically, both IEEE754 compliance *and* Khronos conformance are required in order to meet customer expectations.

Just these requirements alone - the fact that the compiler toolchains (and software libraries such as GNU libm) will have enormous public distribution - exclude them from "custom" status.

* UNIX platform only. This platform ends up indirectly benefitting from a speed boost on hardware that has support for the proposed Zf exts.

* Low power low performance cost sensitive embedded 3D. Typically 1024x768 resolution. Even here, implementors can benefit hugely from collaboration and industry wide standardisation.

Again, even in Embedded 3D, "custom" is not appropriate as it sends the wrong message. They will use the exact same publicly available, widely distributed UPSTREAM compiler toolchains and software resources as the UNIX and 3D UNIX platforms, just dialled back with Zfacc.


What does *not* need standardisation is proprietary GPUs. Mass volume products with their own proprietary binary-only-distributed implementation of OpenGL, Vulkan, and OpenCL.

Such implementors would definitely qualify for "custom" status.

It is critical to understand that this proposal is NOT designed for their needs. They might benefit from it, however unless they come forward (with associated cheque book to cover the cost of including their needs in the proposals) I'm not going to spend time or energy on them.

I've had enough of proprietary GPUs and the adverse impact proprietary libraries have on software development, reliability and the environment [1].

L.

[1] http://www.h-online.com/open/news/item/Intel-and-Valve-collaborate-to-develop-open-source-graphics-drivers-1649632.html

Allen J. Baum

unread,
Aug 9, 2019, 1:47:18 AM8/9/19