As part of developing a Libre GPU that is intended for 3D, specialist Compute and Machine Learning, standard operations used in OpenCL are pretty much mandatory [1].
As they will end up in common public usage - upstream compilers with high volumes of downloads - it does not make sense for these opcodes to be relegated to "custom" status ["custom" status is suitable only for embedded proprietary usage that will never see the public light of day].
Also, they are not being proposed as part of RVV for the simple reason that as "scalar" opcodes, they can be used with *scalar* designs. It makes more sense that they be deployable in "embedded" designs (that may not have room for RVV, particularly as CORDIC seems to cover the vast majority of trigonometric algorithms and more [2]), or in NUMA parallel designs, where a cluster of shaders makes up for a lack of "vectorisation".
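To make the CORDIC point concrete, below is a minimal illustrative sketch (not part of the proposal, and not production code) of rotation-mode CORDIC computing sin/cos in Q16 fixed point. Each iteration costs one table lookup, two shifts and three add/subtracts, which is why it suits small embedded designs:

#include <math.h>

#define ITERS 16
static long atan_tab[ITERS];   /* atan(2^-i), Q16 */
static long cordic_gain;       /* K = prod 1/sqrt(1+2^-2i) ~ 0.6073, Q16 */

static void cordic_init(void) {
    double k = 1.0;
    for (int i = 0; i < ITERS; i++) {
        atan_tab[i] = (long)(atan(pow(2.0, -i)) * 65536.0);
        k /= sqrt(1.0 + pow(2.0, -2 * i));
    }
    cordic_gain = (long)(k * 65536.0);
}

/* theta in Q16 radians, |theta| <= pi/2; writes Q16 cos and sin. */
static void cordic_sincos(long theta, long *s, long *c) {
    long x = cordic_gain, y = 0, z = theta;
    for (int i = 0; i < ITERS; i++) {
        long dx = x >> i, dy = y >> i;
        if (z >= 0) { x -= dy; y += dx; z -= atan_tab[i]; }
        else        { x += dy; y -= dx; z += atan_tab[i]; }
    }
    *c = x;
    *s = y;
}

The same shift-and-add datapath, run in hyperbolic mode, also yields sinh/cosh/atanh and hence ln and exp, which is what the survey in [2] covers.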
In addition: as scalar opcodes, they fit into the (huge, sparsely populated) FP opcode brownfield, whereas the RVV major opcode is much more under pressure.
The list of opcodes is at an early stage, and participation in its development is open and welcome to anyone involved in 3D and OpenCL Compute applications.
Context, research, links and discussion are being tracked on the libre riscv bugtracker [3].
L.
[1] https://www.khronos.org/registry/spir-v/specs/unified1/OpenCL.ExtendedInstructionSet.100.html
[2] http://www.andraka.com/files/crdcsrvy.pdf
[3] http://bugs.libre-riscv.org/show_bug.cgi?id=127
Is this proposal going to <eventually> include::
a) statement on required/delivered numeric accuracy per transcendental ?
b) a reserve on the OpCode space for the double precision equivalents ?
c) a statement on <approximate> execution time ?
You may have more transcendentals than necessary::
1) for example, all of the inverse hyperbolics can be calculated to GRAPHICS numeric quality with short sequences of already-existing transcendentals..... ASINH( x ) = ln( x + SQRT(x**2+1) )
2) LOG(x) = LOGP1(x-1.0)... EXP(x) = EXPM1(x)+1.0
That is:: LOGP1 and EXPM1 provide greater precision (especially when the result is near zero) than their sister functions, and the compiler can easily add the additional instruction to the instruction stream where appropriate.
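For illustration, those substitutions written as plain C against libm (a hedged sketch, not proposal text; LOGP1 and EXPM1 correspond to the C99 names log1p and expm1):

#include <math.h>

double asinh_seq(double x) { return log(x + sqrt(x * x + 1.0)); } /* ASINH */
double ln_seq(double x)    { return log1p(x - 1.0); }  /* ln(x) via LOGP1 */
double exp_seq(double x)   { return expm1(x) + 1.0; }  /* e**x via EXPM1  */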
On Wed, Aug 7, 2019, 15:36 'MitchAlsup' via RISC-V ISA Dev <isa...@groups.riscv.org> wrote:
> Is this proposal going to <eventually> include::
> a) statement on required/delivered numeric accuracy per transcendental ?

From what I understand, they require correctly rounded results. We should eventually state that somewhere. The requirement for correctly rounded results is so the instructions can replace the corresponding functions in libm (they're not just for GPUs) and for reproducibility across implementations.
> b) a reserve on the OpCode space for the double precision equivalents ?

the 2 bits right below the funct5 field select from:
00: f32
01: f64
10: f16
11: f128
so f64 is definitely included. see Table 11.3 in Volume I: RISC-V Unprivileged ISA V20190608-Base-Ratified.

it would probably be a good idea to split the transcendental extensions into separate f32, f64, f16, and f128 extensions, since some implementations may want to only implement them for f32 while still implementing the D (f64 arithmetic) extension.

> c) a statement on <approximate> execution time ?

that would be microarchitecture-specific. since this is supposed to be an inter-vendor (icr the right term) specification, that would be up to the implementers. I would assume that they are at least faster than a soft-float implementation (since that's usually the whole point of implementing them).

For our implementation, I'd imagine something between 8 and 40 clock cycles for most of the operations. sin, cos, and tan (but not sinpi and friends) may require much more than that for large inputs, for range reduction to accurately calculate x mod 2*pi; hence why we are thinking of implementing sinpi, cospi, and tanpi instead (since they require calculating x mod 2, which is much faster and simpler).
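To make both points concrete, a hedged sketch in C (the fmt values follow Table 11.3; the function names are illustrative, and sinpi is shown with plain libm calls, assuming POSIX M_PI, rather than the eventual hardware kernel):

#include <math.h>

/* fmt field (bits 26-25, right below funct5 in bits 31-27):
 * 00=f32, 01=f64, 10=f16, 11=f128, per Table 11.3. */
enum fp_fmt { FMT_F32 = 0, FMT_F64 = 1, FMT_F16 = 2, FMT_F128 = 3 };

static enum fp_fmt decode_fmt(unsigned int insn) {
    return (enum fp_fmt)((insn >> 25) & 0x3);
}

/* Why sinpi is cheaper: its range reduction is x mod 2, which fmod
 * computes exactly for finite doubles, versus x mod 2*pi for sin. */
double sinpi_ref(double x) {
    double r = fmod(x, 2.0);  /* exact range reduction */
    return sin(M_PI * r);     /* stand-in for a real sinpi kernel */
}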
https://www.researchgate.net/publication/230668515_A_fixed-point_implementation_of_the_natural_logarithm_based_on_a_expanded_hyperbolic_CORDIC_algorithm

Since ln(a) = 2*atanh( (a-1) / (a+1) ), the function ln(a) is obtained by multiplying the final result by 2.
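The identity is easy to sanity-check in C (a hedged one-liner, valid for a > 0; the paper itself evaluates the atanh with an expanded hyperbolic CORDIC rather than libm):

#include <math.h>

/* ln(a) = 2 * atanh( (a-1) / (a+1) ), for a > 0 */
double ln_via_atanh(double a) {
    return 2.0 * atanh((a - 1.0) / (a + 1.0));
}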
On Wednesday, August 7, 2019 at 6:43:21 PM UTC-5, Jacob Lifshay wrote:
> On Wed, Aug 7, 2019, 15:36 'MitchAlsup' via RISC-V ISA Dev <isa...@groups.riscv.org> wrote:
> > Is this proposal going to <eventually> include::
> > a) statement on required/delivered numeric accuracy per transcendental ?
>
> From what I understand, they require correctly rounded results. We should eventually state that somewhere. The requirement for correctly rounded results is so the instructions can replace the corresponding functions in libm (they're not just for GPUs) and for reproducibility across implementations.

Correctly rounded results will require a lot more difficult hardware and more cycles of execution.

Standard GPUs today use 1-2 bits ULP for simple transcendentals and 3-4 bits for some of the harder functions. Standard GPUs today are producing fully pipelined results with 5-cycle latency for F32 (with 1-4 bits of imprecision).

Based on my knowledge of the situation, requiring IEEE 754 correct rounding will double the area of the transcendental unit, triple the area used for coefficients, and come close to doubling the latency.
I can point you at (and have) the technology to perform most of these to the accuracy stated above in 5 cycles F32.

I have the technology to perform LN2P1 and EXP1M in 14 cycles, SIN and COS including argument reduction in 19 cycles, and POW in 34 cycles, while achieving "faithful rounding" of the result in any of the IEEE 754-2008 rounding modes, using a floating point unit essentially the same size as an FMAC unit that can also do FDIV and FSQRT. SIN and COS have full Payne and Hanek argument reduction, which costs 4 cycles and allows "silly" arguments to be properly processed::

COS( 6381956970095103 × 2^797 ) = -4.68716592425462761112×10^-19
An old guy at IBM (a Fellow) made a long and impassioned plea in a paper from the late 1970s or early 1980s that whenever something is put "into the instruction set", the result should be as accurate as possible. Look it up, it's a good read.

At the time I was working for a mini-computer company where a new implementation was not giving binary-accurate results compared to an older generation. This was traced to an "enhancement" in the F32 and F64 accuracy of the new implementation. The customers all wanted binary equivalence, even if the math was worse.
My gut feeling tells me that the numericalists are perfectly willing to accept an error of 0.51 ULP RMS on transcendental calculations. My gut feeling also tells me that the numericalists are not willing to accept an error of 0.75 ULP RMS. I have no feeling at all on where, between those two, to draw the line.
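For concreteness, one common way to measure such errors (a hedged sketch that ignores zeros, infinities and NaNs): compare each result against a correctly rounded reference, express the difference in ULPs of the reference, and take the RMS over many sample points.

#include <math.h>

/* Error of 'got' against the correctly rounded 'want', in ULPs of
 * 'want'; the RMS of this over many inputs gives figures such as
 * the 0.51 / 0.75 ULP RMS thresholds above. */
double ulp_error(double got, double want) {
    double one_ulp = nextafter(want, INFINITY) - want;
    return fabs(got - want) / fabs(one_ulp);
}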
this tends to suggest that three platform specs are needed:

* Embedded Platform (where it's entirely up to the implementor, as there will be no interaction with public APIs)
* UNIX Platform (which would require strict IEEE754 accuracy, for use in GNU libm, OR repeatable numericalist-acceptable accuracy)
* a *NEW* 3D Platform, where accuracy is defined by strict conformance to a high-profile standard e.g. OpenCL / Vulkan.
Hello,

My $0.02 of contribution. Regarding the comment of 3 platforms:

> * Embedded Platform (where it's entirely up to the implementor, as there will be no interaction with public APIs)

No: IEEE. ARM is an embedded platform and they implement IEEE in all of them.

> * UNIX Platform (which would require strict IEEE754 accuracy, for use in GNU libm, OR repeatable numericalist-acceptable accuracy)

Standard IEEE; simply nothing 'new' in this sector.

> * a *NEW* 3D Platform, where accuracy is defined by strict conformance to a high-profile standard e.g. OpenCL / Vulkan.

No: simply use IEEE, that is all, and based on the IEEE standard you can measure the deviation in your system. Just adopt IEEE-754: it is a standard, and it is a standard for a reason. Anything outside IEEE-754 does not conform with IEEE, and for that you are on your own. However, you can still demonstrate your deviation from the standard.
> > * Embedded Platform (where it's entirely up to the implementor, as there will be no interaction with public APIs)
> No: IEEE. ARM is an embedded platform and they implement IEEE in all of them.
I can see the sense in that one. I just thought that some 3D implementors, particularly say in specialist markets, would want the choice.
Hmmmm, perhaps a 3D Embedded spec, separate from "just" Embedded.
> > * UNIX Platform (which would require strict IEEE754 accuracy, for use in GNU libm, OR repeatable numericalist-acceptable accuracy)
> Standard IEEE; simply nothing 'new' in this sector.
Makes sense. Cannot risk noninteroperability, even if it means a higher gate count or larger latency.
> > * a *NEW* 3D Platform, where accuracy is defined by strict conformance to a high-profile standard e.g. OpenCL / Vulkan.
>
> No: simply use IEEE, that is all, and based on the IEEE standard you can measure the deviation in your system.
Ok, this is where that's not going to fly. As Mitch mentioned, full IEEE754 compliance would result in doubling the gate count and/or increasing latency through longer pipelines.
In speaking with Jeff Bush from Nyuzi I learned that a GPU is insanely dominated by its FP ALU gate count: well over 50% of the entire chip.
If you double the gate count through the imposition of unnecessary accuracy (unnecessary because, thanks to 3D-standards compliance, all the shader algorithms are *designed* for lower accuracy requirements), the proposal will be flat-out rejected by adopters, because products based around it will come with a whopping 100% power-performance penalty compared to industry-standard alternatives.
So this is why I floated (ha ha) the idea of a new Platform Spec, to give implementors the space to meet industry-driven requirements and remain competitive.
Ironically, our own implementation will need to meet UNIX requirements: it is one of the quirks / penalties of a hybrid design.
L.
Hi folks,

We would seem to be putting the cart before the horse. ISA-level support for correctly rounded transcendentals is speciously attractive, but its utility is not clearly evident and is possibly negative. It does not make sense to allocate opcode space under these circumstances.
On Wed, Aug 7, 2019, 23:30 Andrew Waterman <wate...@eecs.berkeley.edu> wrote:
> Hi folks,
>
> We would seem to be putting the cart before the horse. ISA-level support for correctly rounded transcendentals is speciously attractive, but its utility is not clearly evident and is possibly negative. It does not make sense to allocate opcode space under these circumstances.

Since there are ways to implement transcendental functions in HW that are faster than anything possible in SW (I think Mitch mentioned a 5-cycle sin implementation), I would argue that having instructions for them is beneficial. And, since they would be useful on a large number of different implementations (GPUs, HPC, bigger desktop/server processors), it's worth standardizing the instructions, since otherwise the custom opcodes used for them would become effectively standardized (as mentioned by Luke) and no longer useful as custom opcodes on implementations that need fast transcendental functions.
I have no problem ending up with different encodings and/or semantics than those currently chosen, as long as that's done early enough and in a public manner, so that we can implement the chosen opcodes without undue delay and without being incompatible with the final spec.
Jacob Lifshay
On Thursday, August 8, 2019 at 3:33:25 PM UTC+8, waterman wrote:
> It's not obvious that the proposed opcodes justify standardization.
It's outside of your area of expertise, Andrew. Just as for Luis, all the "metrics" that you use will be screaming "this is wrong, this is wrong!"
Both Jacob and I have Asperger's. In my case, I think in such different conceptual ways that I use language that bit differently, such that it needs "interpretation". Rogier demonstrated that really well a few months back, by "interpreting" something on one of the ISAMUX/ISANS threads.
Think of what I write as being a bit like the old coal mine "canaries". You hear "tweet tweet croak", and you don't understand what the bird said before it became daisy-food but you know to run like hell.
There are several aspects to this proposal. It covers multiple areas - multiple platforms, with different (conflicting) implementation requirements.
It should be obvious that this is not going to fit the "custom" RISCV paradigm, as that's reserved for *private* (hard fork) toolchains and scenarios.
It should also be obvious that, as a public high profile open platform, the pressure on the compiler upstream developers could result in the Altivec SSE nightmare.
The RISCV Foundation has to understand that it is in everybody's best interests to think ahead, strategically on this one, despite it being well outside the core experience of the Founders.
Note, again, worth repeating: it is *NOT* intended or designed for EXCLUSIVE use by the Libre RISCV SoC. It is actually inspired by Pixilica's SIGGRAPH slides, where at the BoF there were over 50 attendees. The diversity of requirements of the attendees was incredible, and they're *really* clear about what they want.
Discussing this proposal as being a single platform is counterproductive and misses the point. It covers *MULTIPLE* platforms.
If you try to undermine or dismiss one area, it does *not* invalidate the other platforms's requirements and needs.
Btw some context, as it may be really confusing as to why we are putting forward a *scalar* proposal when working on a *vector* processor.
SV extends scalar operations. By proposing a mixed multi platform Trans / Trigonometric *scalar* proposal (suitable for multiple platforms other than our own), the Libre RISCV hybrid processor automatically gets vectorised [multi issue] versions of those "scalar" opcodes, "for free".
For a 3D GPU we still have yet to add Texture opcodes, Pixel conversion, Matrices, Z Buffer, Tile Buffer, and many more opcodes. My feeling is that RVV's major opcode brownfield space simply will not cope with all of those, and going to 48 bit and 64 bit is just not an option, particularly for embedded low power scenarios, due to the increased I-cache power penalty.
I am happy for *someone else* to do the work necessary to demonstrate otherwise on that one: we have enough to do already, if we are to keep within budget and on track.
L.
Just to emphasise, Luis, Andrew: "on their own" is precisely what each of the proprietary 3D GPU Vendors have done, and made literally billions of dollars by doing so.
Saying "we are on our own" and expecting that to mean that not conforming to IEEE754 would kill the proposal, this is false logic.
MALI (ARM), Vivante, the hated PowerVR, NVidia, AMD/ATI, Samsung's new GPU (with Mitch's work in it), and many more, they *all* went "their own way", hid the hardware behind a proprietary library, and *still made billions of dollars*.
This should tell you what you need to know, namely that a new 3D GPU Platform Spec which has specialist FP accuracy requirements to meet the specific needs of this *multi BILLION dollar market* is essential to the proposal's successful industry adoption.
If we restrict it to UNIX (IEEE754) it's dead.
If we restrict it to *not* require IEEE754, it's dead.
The way to meet all the different industry needs: new Platform Specs.
That does not affect the actual opcodes. They remain the same, no matter the Platform accuracy requirements.
Thus the software libraries and compilers all remain the same, as well.
L.
On Thursday, August 8, 2019 at 7:30:38 AM UTC+1, waterman wrote:
> ISA-level support for correctly rounded transcendentals is speciously attractive, but its utility is not clearly evident and is possibly negative.

wait... hang on: there are now *four* potential Platforms against which this statement has to be verified. are you saying that for a *UNIX* platform, correctly-rounded transcendentals are potentially undesirable?

> It does not make sense to allocate opcode space under these circumstances.

[reminder and emphasis: there are potentially *four* completely separate and distinct Platforms, all of which share / need these exact same opcodes]

l.
On Thu, Aug 8, 2019 at 1:36 AM lkcl <luke.l...@gmail.com> wrote:
> [mostly OT for the thread, but still relevant]
>
> On Thursday, August 8, 2019 at 3:33:25 PM UTC+8, waterman wrote:
> > It's not obvious that the proposed opcodes justify standardization.
>
> It's outside of your area of expertise, Andrew. Just as for Luis, all the "metrics" that you use will be screaming "this is wrong, this is wrong!"

Oh, man. This is great. "Andrew: outside his element in computer arithmetic" is right up there with "Krste: most feared man in computer architecture".
On Thursday, August 8, 2019 at 11:09:28 AM UTC+1, Jacob Lifshay wrote:
> maybe a solution would be to add an extra field to the fp control csr (or isamux?) to allow selecting one of several accurate or fast modes:

*thinks*... *puzzled*... hardware can't be changed, so you'd need to pre-allocate the gates to cope with e.g. the UNIX Platform spec (libm interoperability), so why would you need a CSR to switch "modes"?
ah, ok, i think i got it: it's [potentially] down to the way we're designing the ALU, to enter "recycling" of data through the pipeline to give better accuracy.

are you suggesting that implementors be permitted to *dynamically* alter the accuracy of the results that their hardware produces, in order to comply with *more than one* of the [four so far] proposed Platform Specs, *at runtime*?
thus, for example, our hardware would (purely as an example) be optimised to produce OpenCL-compliant results in "3D GPU Platform mode", and as such would need fewer gates to do so. HOWEVER, when that exact same hardware was used in the GNU libm library, it would set "UNIX Platform FP hardware mode", and consequently produce results accurate to UNIX Platform requirements (whatever was decided: IEEE754, 0.5 ULP precision, etc. etc.).

in this "more accurate" mode the latency would be increased... *and we wouldn't care* [other implementors might], because it's not performance-critical: the switch is just to get "compliance". that would allow us to remain price-performance-watt competitive with other GPUs, yet also meet UNIX Platform requirements.

something like that?
maybe a solution would be to add an extra field to the fp control csr (or isamux?) to allow selecting one of several accurate or fast modes:

- machine-learning-mode: fast as possible
  -- maybe need additional requirements such as monotonicity for atanh?
- GPU-mode: accurate to within a few ULP
  -- see Vulkan, OpenGL, and OpenCL specs for accuracy guidelines
- almost-accurate-mode: accurate to <1 ULP (would 0.51 or some other value be better?)
- fully-accurate-mode: correctly rounded in all cases
- ma
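A hypothetical sketch of what such a field could look like in C; the field name, width, encoding, and mode names below are all assumptions for discussion, not part of any spec:

/* Assumed 2-bit accuracy-mode field layered onto the fp control csr;
 * all names and values here are illustrative only. */
enum fp_acc_mode {
    FP_MODE_ML       = 0, /* machine-learning: as fast as possible   */
    FP_MODE_GPU      = 1, /* within a few ULP (Vulkan/OpenGL/OpenCL) */
    FP_MODE_FAITHFUL = 2, /* better than 1 ULP                       */
    FP_MODE_EXACT    = 3  /* correctly rounded in all cases          */
};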