As part of developing a Libre GPU that is intended for 3D, specialist Compute and Machine Learning, standard operations used in OpenCL are pretty much mandatory [1].
As they will end up in common public usage - upstream compilers with high volumes of downloads - it does not make sense for these opcodes to be relegated to "custom" status ["custom" status is suitable only for embedded proprietary usage that will never see the public light of day].
Also, they are not being proposed as part of RVV for the simple reason that as "scalar" opcodes, they can be used with *scalar* designs. It makes more sense that they be deployable in "embedded" designs (that may not have room for RVV, particularly as CORDIC seems to cover the vast majority of trigonometric algorithms and more [2]), or in NUMA parallel designs, where a cluster of shaders makes up for a lack of "vectorisation".
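To make the CORDIC point concrete, below is a minimal illustrative sketch (not part of the proposal, and not production code) of rotation-mode CORDIC computing sin/cos in Q16 fixed point. Each iteration costs one table lookup, two shifts and three add/subtracts, which is why it suits small embedded designs:

#include <math.h>

#define ITERS 16
static long atan_tab[ITERS];   /* atan(2^-i), Q16 */
static long cordic_gain;       /* K = prod 1/sqrt(1+2^-2i) ~ 0.6073, Q16 */

static void cordic_init(void) {
    double k = 1.0;
    for (int i = 0; i < ITERS; i++) {
        atan_tab[i] = (long)(atan(pow(2.0, -i)) * 65536.0);
        k /= sqrt(1.0 + pow(2.0, -2 * i));
    }
    cordic_gain = (long)(k * 65536.0);
}

/* theta in Q16 radians, |theta| <= pi/2; writes Q16 cos and sin. */
static void cordic_sincos(long theta, long *s, long *c) {
    long x = cordic_gain, y = 0, z = theta;
    for (int i = 0; i < ITERS; i++) {
        long dx = x >> i, dy = y >> i;
        if (z >= 0) { x -= dy; y += dx; z -= atan_tab[i]; }
        else        { x += dy; y -= dx; z += atan_tab[i]; }
    }
    *c = x;
    *s = y;
}

The same shift-and-add datapath, run in hyperbolic mode, also yields sinh/cosh/atanh and hence ln and exp, which is what the survey in [2] covers.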
In addition: as scalar opcodes, they fit into the (huge, sparsely populated) FP opcode brownfield, whereas the RVV major opcode is much more under pressure.
The list of opcodes is at an early stage, and participation in its development is open and welcome to anyone involved in 3D and OpenCL Compute applications.
Context, research, links and discussion are being tracked on the libre riscv bugtracker [3].
L.
[1] https://www.khronos.org/registry/spir-v/specs/unified1/OpenCL.ExtendedInstructionSet.100.html
[2] http://www.andraka.com/files/crdcsrvy.pdf
[3] http://bugs.libre-riscv.org/show_bug.cgi?id=127
Is this proposal going to <eventually> include::
a) statement on required/delivered numeric accuracy per transcendental ?
b) a reserve on the OpCode space for the double precision equivalents ?
c) a statement on <approximate> execution time ?
You may have more transcendentals than necessary::
1) for example, all of the inverse hyperbolics can be calculated to GRAPHICS numeric quality with short sequences of already-existing transcendentals..... ASINH( x ) = ln( x + SQRT(x**2+1) )
2) LOG(x) = LOGP1(x-1.0)... EXP(x) = EXPM1(x)+1.0
That is:: LOGP1 and EXPM1 provide greater precision (especially when the result is near zero) than their sister functions, and the compiler can easily add the additional instruction to the instruction stream where appropriate.
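For illustration, those substitutions written as plain C against libm (a hedged sketch, not proposal text; LOGP1 and EXPM1 correspond to the C99 names log1p and expm1):

#include <math.h>

double asinh_seq(double x) { return log(x + sqrt(x * x + 1.0)); } /* ASINH */
double ln_seq(double x)    { return log1p(x - 1.0); }  /* ln(x) via LOGP1 */
double exp_seq(double x)   { return expm1(x) + 1.0; }  /* e**x via EXPM1  */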
On Wed, Aug 7, 2019, 15:36 'MitchAlsup' via RISC-V ISA Dev <isa...@groups.riscv.org> wrote:
> Is this proposal going to <eventually> include::
> a) statement on required/delivered numeric accuracy per transcendental ?

From what I understand, they require correctly rounded results. We should eventually state that somewhere. The requirement for correctly rounded results is so the instructions can replace the corresponding functions in libm (they're not just for GPUs) and for reproducibility across implementations.
> b) a reserve on the OpCode space for the double precision equivalents ?

the 2 bits right below the funct5 field select from:
00: f32
01: f64
10: f16
11: f128
so f64 is definitely included. see Table 11.3 in Volume I: RISC-V Unprivileged ISA V20190608-Base-Ratified.

it would probably be a good idea to split the transcendental extensions into separate f32, f64, f16, and f128 extensions, since some implementations may want to only implement them for f32 while still implementing the D (f64 arithmetic) extension.

> c) a statement on <approximate> execution time ?

that would be microarchitecture-specific. since this is supposed to be an inter-vendor (icr the right term) specification, that would be up to the implementers. I would assume that they are at least faster than a soft-float implementation (since that's usually the whole point of implementing them).

For our implementation, I'd imagine something between 8 and 40 clock cycles for most of the operations. sin, cos, and tan (but not sinpi and friends) may require much more than that for large inputs, for range reduction to accurately calculate x mod 2*pi; hence why we are thinking of implementing sinpi, cospi, and tanpi instead (since they require calculating x mod 2, which is much faster and simpler).
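To make both points concrete, a hedged sketch in C (the fmt values follow Table 11.3; the function names are illustrative, and sinpi is shown with plain libm calls, assuming POSIX M_PI, rather than the eventual hardware kernel):

#include <math.h>

/* fmt field (bits 26-25, right below funct5 in bits 31-27):
 * 00=f32, 01=f64, 10=f16, 11=f128, per Table 11.3. */
enum fp_fmt { FMT_F32 = 0, FMT_F64 = 1, FMT_F16 = 2, FMT_F128 = 3 };

static enum fp_fmt decode_fmt(unsigned int insn) {
    return (enum fp_fmt)((insn >> 25) & 0x3);
}

/* Why sinpi is cheaper: its range reduction is x mod 2, which fmod
 * computes exactly for finite doubles, versus x mod 2*pi for sin. */
double sinpi_ref(double x) {
    double r = fmod(x, 2.0);  /* exact range reduction */
    return sin(M_PI * r);     /* stand-in for a real sinpi kernel */
}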
https://www.researchgate.net/publication/230668515_A_fixed-point_implementation_of_the_natural_logarithm_based_on_a_expanded_hyperbolic_CORDIC_algorithm

Since ln(a) = 2*atanh( (a-1) / (a+1) ), the function ln(a) is obtained by multiplying the final result by 2.
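The identity is easy to sanity-check in C (a hedged one-liner, valid for a > 0; the paper itself evaluates the atanh with an expanded hyperbolic CORDIC rather than libm):

#include <math.h>

/* ln(a) = 2 * atanh( (a-1) / (a+1) ), for a > 0 */
double ln_via_atanh(double a) {
    return 2.0 * atanh((a - 1.0) / (a + 1.0));
}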
On Wednesday, August 7, 2019 at 6:43:21 PM UTC-5, Jacob Lifshay wrote:
> On Wed, Aug 7, 2019, 15:36 'MitchAlsup' via RISC-V ISA Dev <isa...@groups.riscv.org> wrote:
> > Is this proposal going to <eventually> include::
> > a) statement on required/delivered numeric accuracy per transcendental ?
>
> From what I understand, they require correctly rounded results. We should eventually state that somewhere. The requirement for correctly rounded results is so the instructions can replace the corresponding functions in libm (they're not just for GPUs) and for reproducibility across implementations.

Correctly rounded results will require a lot more difficult hardware and more cycles of execution.

Standard GPUs today use 1-2 bits ULP for simple transcendentals and 3-4 bits for some of the harder functions. Standard GPUs today are producing fully pipelined results with 5-cycle latency for F32 (with 1-4 bits of imprecision).

Based on my knowledge of the situation, requiring IEEE 754 correct rounding will double the area of the transcendental unit, triple the area used for coefficients, and come close to doubling the latency.
I can point you at (and have) the technology to perform most of these to the accuracy stated above in 5 cycles F32.

I have the technology to perform LN2P1 and EXP1M in 14 cycles, SIN and COS including argument reduction in 19 cycles, and POW in 34 cycles, while achieving "faithful rounding" of the result in any of the IEEE 754-2008 rounding modes, using a floating point unit essentially the same size as an FMAC unit that can also do FDIV and FSQRT. SIN and COS have full Payne and Hanek argument reduction, which costs 4 cycles and allows "silly" arguments to be properly processed::

COS( 6381956970095103 × 2^797 ) = -4.68716592425462761112×10^-19
An old guy at IBM (a Fellow) made a long and impassioned plea in a paper from the late 1970s or early 1980s that whenever something is put "into the instruction set", the result should be as accurate as possible. Look it up, it's a good read.

At the time I was working for a mini-computer company where a new implementation was not giving binary-accurate results compared to an older generation. This was traced to an "enhancement" in the F32 and F64 accuracy of the new implementation. The customers all wanted binary equivalence, even if the math was worse.
My gut feeling tells me that the numericalists are perfectly willing to accept an error of 0.51 ULP RMS on transcendental calculations. My gut feeling also tells me that the numericalists are not willing to accept an error of 0.75 ULP RMS. I have no feeling at all on where, between those two, to draw the line.
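For concreteness, one common way to measure such errors (a hedged sketch that ignores zeros, infinities and NaNs): compare each result against a correctly rounded reference, express the difference in ULPs of the reference, and take the RMS over many sample points.

#include <math.h>

/* Error of 'got' against the correctly rounded 'want', in ULPs of
 * 'want'; the RMS of this over many inputs gives figures such as
 * the 0.51 / 0.75 ULP RMS thresholds above. */
double ulp_error(double got, double want) {
    double one_ulp = nextafter(want, INFINITY) - want;
    return fabs(got - want) / fabs(one_ulp);
}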
this tends to suggest that three platform specs are needed:

* Embedded Platform (where it's entirely up to the implementor, as there will be no interaction with public APIs)
* UNIX Platform (which would require strict IEEE754 accuracy, for use in GNU libm, OR repeatable numericalist-acceptable accuracy)
* a *NEW* 3D Platform, where accuracy is defined by strict conformance to a high-profile standard e.g. OpenCL / Vulkan.
Hello,

My $0.02 of contribution. Regarding the comment of 3 platforms:

> * Embedded Platform (where it's entirely up to the implementor, as there will be no interaction with public APIs)

No: IEEE. ARM is an embedded platform and they implement IEEE in all of them.

> * UNIX Platform (which would require strict IEEE754 accuracy, for use in GNU libm, OR repeatable numericalist-acceptable accuracy)

Standard IEEE; simply nothing 'new' in this sector.

> * a *NEW* 3D Platform, where accuracy is defined by strict conformance to a high-profile standard e.g. OpenCL / Vulkan.

No: simply use IEEE, that is all, and based on the IEEE standard you can measure the deviation in your system. Just adopt IEEE-754: it is a standard, and it is a standard for a reason. Anything outside IEEE-754 does not conform with IEEE, and for that you are on your own. However, you can still demonstrate your deviation from the standard.
> > * Embedded Platform (where it's entirely up to the implementor, as there will be no interaction with public APIs)
> No: IEEE. ARM is an embedded platform and they implement IEEE in all of them.
I can see the sense in that one. I just thought that some 3D implementors, particularly say in specialist markets, would want the choice.
Hmmmm, perhaps a 3D Embedded spec, separate from "just" Embedded.
> > * UNIX Platform (which would require strict IEEE754 accuracy, for use in GNU libm, OR repeatable numericalist-acceptable accuracy)
> Standard IEEE; simply nothing 'new' in this sector.
Makes sense. Cannot risk noninteroperability, even if it means a higher gate count or larger latency.
> > * a *NEW* 3D Platform, where accuracy is defined by strict conformance to a high-profile standard e.g. OpenCL / Vulkan.
>
> No: simply use IEEE, that is all, and based on the IEEE standard you can measure the deviation in your system.
Ok, this is where that's not going to fly. As Mitch mentioned, full IEEE754 compliance would result in doubling the gate count and/or increasing latency through longer pipelines.
In speaking with Jeff Bush from Nyuzi I learned that a GPU is insanely dominated by its FP ALU gate count: well over 50% of the entire chip.
If you double the gate count through the imposition of unnecessary accuracy (unnecessary because, thanks to 3D-standards compliance, all the shader algorithms are *designed* for lower accuracy requirements), the proposal will be flat-out rejected by adopters, because products based around it will come with a whopping 100% power-performance penalty compared to industry-standard alternatives.
So this is why I floated (ha ha) the idea of a new Platform Spec, to give implementors the space to meet industry-driven requirements and remain competitive.
Ironically, our own implementation will need to meet UNIX requirements: it is one of the quirks / penalties of a hybrid design.
L.
Hi folks,

We would seem to be putting the cart before the horse. ISA-level support for correctly rounded transcendentals is speciously attractive, but its utility is not clearly evident and is possibly negative. It does not make sense to allocate opcode space under these circumstances.
On Wed, Aug 7, 2019, 23:30 Andrew Waterman <wate...@eecs.berkeley.edu> wrote:
> Hi folks,
>
> We would seem to be putting the cart before the horse. ISA-level support for correctly rounded transcendentals is speciously attractive, but its utility is not clearly evident and is possibly negative. It does not make sense to allocate opcode space under these circumstances.

Since there are ways to implement transcendental functions in HW that are faster than anything possible in SW (I think Mitch mentioned a 5-cycle sin implementation), I would argue that having instructions for them is beneficial. And, since they would be useful on a large number of different implementations (GPUs, HPC, bigger desktop/server processors), it's worth standardizing the instructions, since otherwise the custom opcodes used for them would become effectively standardized (as mentioned by Luke) and no longer useful as custom opcodes on implementations that need fast transcendental functions.
I have no problem ending up with different encodings and/or semantics than those currently chosen, as long as that's done early enough and in a public manner, so that we can implement the chosen opcodes without undue delay and without being incompatible with the final spec.
Jacob Lifshay
On Thursday, August 8, 2019 at 3:33:25 PM UTC+8, waterman wrote:
> It's not obvious that the proposed opcodes justify standardization.
It's outside of your area of expertise, Andrew. Just as for Luis, all the "metrics" that you use will be screaming "this is wrong, this is wrong!"
Both Jacob and I have Asperger's. In my case, I think in such different conceptual ways that I use language that bit differently, such that it needs "interpretation". Rogier demonstrated that really well a few months back, by "interpreting" something on one of the ISAMUX/ISANS threads.
Think of what I write as being a bit like the old coal mine "canaries". You hear "tweet tweet croak", and you don't understand what the bird said before it became daisy-food but you know to run like hell.
There are several aspects to this proposal. It covers multiple areas - multiple platforms, with different (conflicting) implementation requirements.
It should be obvious that this is not going to fit the "custom" RISCV paradigm, as that's reserved for *private* (hard fork) toolchains and scenarios.
It should also be obvious that, as a public high profile open platform, the pressure on the compiler upstream developers could result in the Altivec SSE nightmare.
The RISCV Foundation has to understand that it is in everybody's best interests to think ahead, strategically on this one, despite it being well outside the core experience of the Founders.
Note, again, worth repeating: it is *NOT* intended or designed for EXCLUSIVE use by the Libre RISCV SoC. It is actually inspired by Pixilica's SIGGRAPH slides, where at the BoF there were over 50 attendees. The diversity of requirements of the attendees was incredible, and they're *really* clear about what they want.
Discussing this proposal as being a single platform is counterproductive and misses the point. It covers *MULTIPLE* platforms.
If you try to undermine or dismiss one area, it does *not* invalidate the other platforms's requirements and needs.
Btw some context, as it may be really confusing as to why we are putting forward a *scalar* proposal when working on a *vector* processor.
SV extends scalar operations. By proposing a mixed multi platform Trans / Trigonometric *scalar* proposal (suitable for multiple platforms other than our own), the Libre RISCV hybrid processor automatically gets vectorised [multi issue] versions of those "scalar" opcodes, "for free".
For a 3D GPU we still have yet to add Texture opcodes, Pixel conversion, Matrices, Z Buffer, Tile Buffer, and many more opcodes. My feeling is that RVV's major opcode brownfield space simply will not cope with all of those, and going to 48 bit and 64 bit is just not an option, particularly for embedded low power scenarios, due to the increased I-cache power penalty.
I am happy for *someone else* to do the work necessary to demonstrate otherwise on that one: we have enough to do already, if we are to keep within budget and on track.
L.
Just to emphasise, Luis, Andrew: "on their own" is precisely what each of the proprietary 3D GPU Vendors have done, and made literally billions of dollars by doing so.
Saying "we are on our own" and expecting that to mean that not conforming to IEEE754 would kill the proposal, this is false logic.
MALI (ARM), Vivante, the hated PowerVR, NVidia, AMD/ATI, Samsung's new GPU (with Mitch's work in it), and many more, they *all* went "their own way", hid the hardware behind a proprietary library, and *still made billions of dollars*.
This should tell you what you need to know, namely that a new 3D GPU Platform Spec which has specialist FP accuracy requirements to meet the specific needs of this *multi BILLION dollar market* is essential to the proposal's successful industry adoption.
If we restrict it to UNIX (IEEE754) it's dead.
If we restrict it to *not* require IEEE754, it's dead.
The way to meet all the different industry needs: new Platform Specs.
That does not affect the actual opcodes. They remain the same, no matter the Platform accuracy requirements.
Thus the software libraries and compilers all remain the same, as well.
L.
On Thursday, August 8, 2019 at 7:30:38 AM UTC+1, waterman wrote:
> ISA-level support for correctly rounded transcendentals is speciously attractive, but its utility is not clearly evident and is possibly negative.

wait... hang on: there are now *four* potential Platforms against which this statement has to be verified. are you saying that for a *UNIX* platform, correctly-rounded transcendentals are potentially undesirable?

> It does not make sense to allocate opcode space under these circumstances.

[reminder and emphasis: there are potentially *four* completely separate and distinct Platforms, all of which share / need these exact same opcodes]

l.
On Thu, Aug 8, 2019 at 1:36 AM lkcl <luke.l...@gmail.com> wrote:
> [mostly OT for the thread, but still relevant]
>
> On Thursday, August 8, 2019 at 3:33:25 PM UTC+8, waterman wrote:
> > It's not obvious that the proposed opcodes justify standardization.
>
> It's outside of your area of expertise, Andrew. Just as for Luis, all the "metrics" that you use will be screaming "this is wrong, this is wrong!"

Oh, man. This is great. "Andrew: outside his element in computer arithmetic" is right up there with "Krste: most feared man in computer architecture".
On Thursday, August 8, 2019 at 11:09:28 AM UTC+1, Jacob Lifshay wrote:
> maybe a solution would be to add an extra field to the fp control csr (or isamux?) to allow selecting one of several accurate or fast modes:

*thinks*... *puzzled*... hardware can't be changed, so you'd need to pre-allocate the gates to cope with e.g. the UNIX Platform spec (libm interoperability), so why would you need a CSR to switch "modes"?
ah, ok, i think i got it: it's [potentially] down to the way we're designing the ALU, to enter "recycling" of data through the pipeline to give better accuracy.

are you suggesting that implementors be permitted to *dynamically* alter the accuracy of the results that their hardware produces, in order to comply with *more than one* of the [four so far] proposed Platform Specs, *at runtime*?
thus, for example, our hardware would (purely as an example) be optimised to produce OpenCL-compliant results in "3D GPU Platform mode", and as such would need fewer gates to do so. HOWEVER, when that exact same hardware was used in the GNU libm library, it would set "UNIX Platform FP hardware mode", and consequently produce results accurate to UNIX Platform requirements (whatever was decided: IEEE754, 0.5 ULP precision, etc. etc.).

in this "more accurate" mode the latency would be increased... *and we wouldn't care* [other implementors might], because it's not performance-critical: the switch is just to get "compliance". that would allow us to remain price-performance-watt competitive with other GPUs, yet also meet UNIX Platform requirements.

something like that?
maybe a solution would be to add an extra field to the fp control csr (or isamux?) to allow selecting one of several accurate or fast modes:

- machine-learning-mode: fast as possible
  -- maybe need additional requirements such as monotonicity for atanh?
- GPU-mode: accurate to within a few ULP
  -- see Vulkan, OpenGL, and OpenCL specs for accuracy guidelines
- almost-accurate-mode: accurate to <1 ULP (would 0.51 or some other value be better?)
- fully-accurate-mode: correctly rounded in all cases
- ma
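A hypothetical sketch of what such a field could look like in C; the field name, width, encoding, and mode names below are all assumptions for discussion, not part of any spec:

/* Assumed 2-bit accuracy-mode field layered onto the fp control csr;
 * all names and values here are illustrative only. */
enum fp_acc_mode {
    FP_MODE_ML       = 0, /* machine-learning: as fast as possible   */
    FP_MODE_GPU      = 1, /* within a few ULP (Vulkan/OpenGL/OpenCL) */
    FP_MODE_FAITHFUL = 2, /* better than 1 ULP                       */
    FP_MODE_EXACT    = 3  /* correctly rounded in all cases          */
};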