As part of developing a Libre GPU intended for 3D, specialist compute and machine learning, the standard operations used in OpenCL are pretty much mandatory [1].
As they will end up in common public usage - upstream compilers with high volumes of downloads - it does not make sense for these opcodes to be relegated to "custom" status ["custom" status is suitable only for embedded proprietary usage that will never see the public light of day].
Also, they are not being proposed as part of RVV for the simple reason that as "scalar" opcodes, they can be used with *scalar* designs. It makes more sense that they be deployable in "embedded" designs (that may not have room for RVV, particularly as CORDIC seems to cover the vast majority of trigonometric algorithms and more [2]), or in NUMA parallel designs, where a cluster of shaders makes up for a lack of "vectorisation".
In addition: as scalar opcodes, they fit into the (huge, sparsely populated) FP opcode brownfield, whereas the RVV major opcode is much more under pressure.
The list of opcodes is at an early stage, and participation in its development is open and welcome to anyone involved in 3D and OpenCL Compute applications.
Context, research, links and discussion are being tracked on the libre riscv bugtracker [3].
L.
[1] https://www.khronos.org/registry/spir-v/specs/unified1/OpenCL.ExtendedInstructionSet.100.html
[2] http://www.andraka.com/files/crdcsrvy.pdf
[3] http://bugs.libre-riscv.org/show_bug.cgi?id=127
Is this proposal going to <eventually> include::
a) statement on required/delivered numeric accuracy per transcendental ?
b) a reserve on the OpCode space for the double precision equivalents ?
c) a statement on <approximate> execution time ?
You may have more transcendentals than necessary::
1) for example, all of the inverse hyperbolics can be calculated to GRAPHICS numeric quality with short sequences of already existing transcendentals..... ASINH( x ) = ln( x + SQRT(x**2+1) )
2) LOG(x) = LOGP1(x - 1.0)... EXP(x) = EXPM1(x) + 1.0
That is:: LOGP1 and EXPM1 provide greater precision (especially when the result is near zero) than their sister functions, and the compiler can easily add the additional instruction to the instruction stream where appropriate.
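A minimal C sketch of those identities, using only math.h (the wrapper names are mine, for illustration):

#include <math.h>

// graphics-quality asinh from already-existing transcendentals
double asinh_graphics(double x) { return log(x + sqrt(x*x + 1.0)); }

// recovering ln and exp from their higher-precision siblings:
//   log1p(y) = ln(1 + y)   =>   ln(x)  = log1p(x - 1.0)
//   expm1(y) = e^y - 1     =>   e^x    = expm1(x) + 1.0
double ln_via_log1p(double x)  { return log1p(x - 1.0); }
double exp_via_expm1(double x) { return expm1(x) + 1.0; }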
On Wed, Aug 7, 2019, 15:36 'MitchAlsup' via RISC-V ISA Dev <isa...@groups.riscv.org> wrote:
> Is this proposal going to <eventually> include::
> a) statement on required/delivered numeric accuracy per transcendental ?
From what I understand, they require correctly rounded results. We should eventually state that somewhere. The requirement for correctly rounded results is so the instructions can replace the corresponding functions in libm (they're not just for GPUs) and for reproducibility across implementations.
> b) a reserve on the OpCode space for the double precision equivalents ?
The 2 bits right below the funct5 field select from:
00: f32
01: f64
10: f16
11: f128
so f64 is definitely included. See table 11.3 in Volume I: RISC-V Unprivileged ISA V20190608-Base-Ratified.
It would probably be a good idea to split the transcendental extensions into separate f32, f64, f16, and f128 extensions, since some implementations may want to only implement them for f32 while still implementing the D (f64 arithmetic) extension.
> c) a statement on <approximate> execution time ?
That would be microarchitecture specific. Since this is supposed to be an inter-vendor (icr the right term) specification, that would be up to the implementers. I would assume that they are at least faster than a soft-float implementation (since that's usually the whole point of implementing them).
For our implementation, I'd imagine something between 8 and 40 clock cycles for most of the operations. sin, cos, and tan (but not sinpi and friends) may require much more than that for large inputs for range reduction to accurately calculate x mod 2*pi, hence why we are thinking of implementing sinpi, cospi, and tanpi instead (since they require calculating x mod 2, which is much faster and simpler).
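For reference, the fmt encoding being pointed at (Table 11.3 of the Unprivileged ISA), expressed as a C enum:

// 2-bit fmt field directly below funct5, per Table 11.3
enum fp_fmt {
    FMT_S = 0, // 00: f32
    FMT_D = 1, // 01: f64
    FMT_H = 2, // 10: f16
    FMT_Q = 3  // 11: f128
};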
https://www.researchgate.net/publication/230668515_A_fixed-point_implementation_of_the_natural_logarithm_based_on_a_expanded_hyperbolic_CORDIC_algorithm
Since ln(a) = 2·atanh( (a-1) / (a+1) ), the function ln(a) is obtained by multiplying the final CORDIC result by 2.
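A one-line C check of that identity (math.h only; valid for a > 0):

#include <math.h>
// ln(a) = 2 * atanh( (a-1) / (a+1) )
double ln_via_atanh(double a) { return 2.0 * atanh((a - 1.0) / (a + 1.0)); }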
On Wednesday, August 7, 2019 at 6:43:21 PM UTC-5, Jacob Lifshay wrote:
> On Wed, Aug 7, 2019, 15:36 'MitchAlsup' via RISC-V ISA Dev <isa...@groups.riscv.org> wrote:
> > Is this proposal going to <eventually> include::
> > a) statement on required/delivered numeric accuracy per transcendental ?
> From what I understand, they require correctly rounded results. We should eventually state that somewhere. The requirement for correctly rounded results is so the instructions can replace the corresponding functions in libm (they're not just for GPUs) and for reproducibility across implementations.
Correctly rounded results will require a lot more difficult hardware and more cycles of execution.
Standard GPUs today use 1-2 bits ULP for simple transcendentals and 3-4 bits for some of the harder functions.
Standard GPUs today are producing fully pipelined results with 5-cycle latency for F32 (with 1-4 bits of imprecision).
Based on my knowledge of the situation, requiring IEEE 754 correct rounding will double the area of the transcendental unit, triple the area used for coefficients, and come close to doubling the latency.
I can point you at (and have) the technology to perform most of these to the accuracy stated above in 5 cycles F32.
I have the technology to perform LN2P1, EXP1M in 14 cycles, SIN, COS including argument reduction in 19 cycles, POW in 34 cycles, while achieving "faithful rounding" of the result in any of the IEEE 754-2008 rounding modes, and using a floating point unit essentially the same size as an FMAC unit that can also do FDIV and FSQRT. SIN and COS have full Payne and Hanek argument reduction, which costs 4 cycles and allows "silly" arguments to be properly processed:: COS( 6381956970095103×2^797 ) = -4.68716592425462761112×10^-19
An old guy at IBM (a Fellow) made a long and impassioned plea in a paper from the late 1970s or early 1980s that whenever something is put "into the instruction set" the result should be as accurate as possible. Look it up, it's a good read.
At the time I was working for a mini-computer company where a new implementation was not giving binary-accurate results compared to an older generation. This was traced to an "enhancement" in the F32 and F64 accuracy of the new implementation. To a customer, they all wanted binary equivalence, even if the math was worse.
My gut feeling tells me that the numericalists are perfectly willing to accept an error of 0.51 ULP RMS on transcendental calculations.
My gut feeling tells me that the numericalists are not willing to accept an error of 0.75 ULP RMS on transcendental calculations.
I have no feeling at all on where to draw the line between those two points.
this tends to suggest that three platform specs are needed:
* Embedded Platform (where it's entirely up to the implementor, as there will be no interaction with public APIs)
* UNIX Platform (which would require strict IEEE754 accuracy, for use in GNU libm, OR repeatable numericalist-acceptable accuracy)
* a *NEW* 3D Platform, where accuracy is defined by strict conformance to a high-profile standard e.g. OpenCL / Vulkan.
Hello,
My $0.02 of contribution.
Regarding the comment of 3 platforms:
> * Embedded Platform (where it's entirely up to the implementor, as there will be no interaction with public APIs)
No, IEEE: ARM is an embedded platform and they implement IEEE in all of them.
> * UNIX Platform (which would require strict IEEE754 accuracy, for use in GNU libm, OR repeatable numericalist-acceptable accuracy)
Standard IEEE, simple, no 'new' on this sector.
> * a *NEW* 3D Platform, where accuracy is defined by strict conformance to a high-profile standard e.g. OpenCL / Vulkan.
No, simply use IEEE, that is all, and based on the IEEE standard you can measure the deviation in your system.
No, just adopt IEEE-754: it is a standard, and it is a standard for a reason. Anything out of IEEE-754 does not conform with IEEE, and for such you are on your own. However, you still can demonstrate your deviation from the standard.
> > * Embedded Platform (where it's entirely up to the implementor, as there will be no interaction with public APIs)
> No, IEEE, ARM is an embedded platform and they implement IEEE in all of them
I can see the sense in that one. I just thought that some 3D implementors, particularly say in specialist markets, would want the choice.
Hmmmm, perhaps a 3D Embedded spec, separate from "just" Embedded.
> > * UNIX Platform (which would require strict IEEE754 accuracy, for use in GNU libm, OR repeatable numericalist-acceptable accuracy)
> Standard IEEE, simple no 'new' on this sector.
Makes sense. Cannot risk non-interoperability, even if it means a higher gate count or larger latency.
> > * a *NEW* 3D Platform, where accuracy is defined by strict conformance to a high-profile standard e.g. OpenCL / Vulkan.
> No, simply use IEEE, that is all, and based on the IEEE standard you can measure the deviation in your system.
Ok, this is where that's not going to fly. As Mitch mentioned, full IEEE754 compliance would result in doubling the gate count and/or increasing latency through longer pipelines.
In speaking with Jeff Bush from Nyuzi I learned that a GPU is insanely dominated by its FP ALU gate count: well over 50% of the entire chip.
If you double the gate count due to the imposition of unnecessary accuracy (unnecessary because, thanks to 3D Standards compliance, all the shader algorithms are *designed* around lower accuracy requirements), the proposal will be flat-out rejected by adopters, because products based around it will come with a whopping 100% power-performance penalty compared to industry-standard alternatives.
So this is why I floated (ha ha) the idea of a new Platform Spec, to give implementors the space to meet industry-driven requirements and remain competitive.
Ironically, our implementation will need to meet UNIX requirements: it is one of the quirks / penalties of a hybrid design.
L.
Hi folks,
We would seem to be putting the cart before the horse. ISA-level support for correctly rounded transcendentals is speciously attractive, but its utility is not clearly evident and is possibly negative. It does not make sense to allocate opcode space under these circumstances.
On Wed, Aug 7, 2019, 23:30 Andrew Waterman <wate...@eecs.berkeley.edu> wrote:
> Hi folks,
> We would seem to be putting the cart before the horse. ISA-level support for correctly rounded transcendentals is speciously attractive, but its utility is not clearly evident and is possibly negative. It does not make sense to allocate opcode space under these circumstances.
Since there are ways to implement transcendental functions in HW that are faster than anything possible in SW (I think Mitch mentioned a 5-cycle sin implementation), I would argue that having instructions for them is beneficial, and, since they would be useful on a large number of different implementations (GPUs, HPC, bigger desktop/server processors), it's worth standardizing the instructions, since otherwise the custom opcodes used for them would become effectively standardized (as mentioned by Luke) and no longer useful as custom opcodes on implementations that need fast transcendental functions.
I have no problems ending up with different encodings and/or semantics than currently chosen, as long as that's done early enough and in a public manner, so that we can implement the chosen opcodes without undue delay and without being incompatible with the final spec.
Jacob Lifshay
On Thursday, August 8, 2019 at 3:33:25 PM UTC+8, waterman wrote:
> It's not obvious that the proposed opcodes justify standardization.
It's outside of your area of expertise, Andrew. Just as for Luis, all the "metrics" that you use will be screaming "this is wrong, this is wrong!"
Both Jacob and I have Asperger's. In my case, I think in such different conceptual ways that I use language a bit differently, such that it needs "interpretation". Rogier demonstrated that really well a few months back, by "interpreting" something on one of the ISAMUX/ISANS threads.
Think of what I write as being a bit like the old coal mine "canaries". You hear "tweet tweet croak", and you don't understand what the bird said before it became daisy-food but you know to run like hell.
There are several aspects to this proposal. It covers multiple areas - multiple platforms, with different (conflicting) implementation requirements.
It should be obvious that this is not going to fit the "custom" RISCV paradigm, as that's reserved for *private* (hard fork) toolchains and scenarios.
It should also be obvious that, for a public, high-profile open platform, the pressure on the upstream compiler developers could result in the Altivec / SSE nightmare.
The RISCV Foundation has to understand that it is in everybody's best interests to think ahead, strategically on this one, despite it being well outside the core experience of the Founders.
Note, again, worth repeating: it is *NOT* intended or designed for EXCLUSIVE use by the Libre RISCV SoC. It is actually inspired by Pixilica's SIGGRAPH slides, where at the BoF there were over 50 attendees. The diversity of requirements of the attendees was incredible, and they're *really* clear about what they want.
Discussing this proposal as being a single platform is counterproductive and misses the point. It covers *MULTIPLE* platforms.
If you try to undermine or dismiss one area, it does *not* invalidate the other platforms' requirements and needs.
Btw some context, as it may be really confusing as to why we are putting forward a *scalar* proposal when working on a *vector* processor.
SV extends scalar operations. By proposing a mixed multi platform Trans / Trigonometric *scalar* proposal (suitable for multiple platforms other than our own), the Libre RISCV hybrid processor automatically gets vectorised [multi issue] versions of those "scalar" opcodes, "for free".
For a 3D GPU we still have yet to add Texture opcodes, Pixel conversion, Matrices, Z Buffer, Tile Buffer, and many more opcodes. My feeling is that RVV's major opcode brownfield space simply will not cope with all of those, and going to 48 bit and 64 bit is just not an option, particularly for embedded low power scenarios, due to the increased I-cache power penalty.
I am happy for *someone else* to do the work necessary to demonstrate otherwise on that one: we have enough to do already, if we are to keep within budget and on track.
L.
Just to emphasise, Luis, Andrew: "on their own" is precisely what each of the proprietary 3D GPU Vendors have done, and made literally billions of dollars by doing so.
Saying "we are on our own" and expecting that to mean that not conforming to IEEE754 would kill the proposal, this is false logic.
MALI (ARM), Vivante, the hated PowerVR, NVidia, AMD/ATI, Samsung's new GPU (with Mitch's work in it), and many more, they *all* went "their own way", hid the hardware behind a proprietary library, and *still made billions of dollars*.
This should tell you what you need to know, namely that a new 3D GPU Platform Spec which has specialist FP accuracy requirements to meet the specific needs of this *multi BILLION dollar market* is essential to the proposal's successful industry adoption.
If we restrict it to UNIX (IEEE754) it's dead.
If we restrict it to *not* require IEEE754, it's dead.
The way to meet all the different industry needs: new Platform Specs.
That does not affect the actual opcodes. They remain the same, no matter the Platform accuracy requirements.
Thus the software libraries and compilers all remain the same, as well.
L.
On Thursday, August 8, 2019 at 7:30:38 AM UTC+1, waterman wrote:
> ISA-level support for correctly rounded transcendentals is speciously attractive, but its utility is not clearly evident and is possibly negative.
wait... hang on: there are now *four* potential Platforms against which this statement has to be verified. are you saying that for a *UNIX* platform, correctly-rounded transcendentals are potentially undesirable?
> It does not make sense to allocate opcode space under these circumstances.
[reminder and emphasis: there are potentially *four* completely separate and distinct Platforms, all of which share / need these exact same opcodes]
l.
On Thu, Aug 8, 2019 at 1:36 AM lkcl <luke.l...@gmail.com> wrote:
> [mostly OT for the thread, but still relevant]
> On Thursday, August 8, 2019 at 3:33:25 PM UTC+8, waterman wrote:
> > It's not obvious that the proposed opcodes justify standardization.
> It's outside of your area of expertise, Andrew. Just as for Luis, all the "metrics" that you use will be screaming "this is wrong, this is wrong!"
Oh, man. This is great. "Andrew: outside his element in computer arithmetic" is right up there with "Krste: most feared man in computer architecture".
On Thursday, August 8, 2019 at 11:09:28 AM UTC+1, Jacob Lifshay wrote:
> maybe a solution would be to add an extra field to the fp control csr (or isamux?) to allow selecting one of several accurate or fast modes:
*thinks*... *puzzled*... hardware can't be changed, so you'd need to pre-allocate the gates to cope with e.g. the UNIX Platform spec (libm interoperability), so why would you need a CSR to switch "modes"?
ah, ok, i think i got it, and it's [potentially] down to the way we're designing the ALU, to enter "recycling" of data through the pipeline to give better accuracy.
are you suggesting that implementors be permitted to *dynamically* alter the accuracy of the results that their hardware produces, in order to comply with *more than one* of the [four so far] proposed Platform Specs, *at runtime*?
thus, for example, our hardware would (purely as an example) be optimised to produce OpenCL-compliant results during "3D GPU Platform mode", and as such would need fewer gates to do so. HOWEVER, when that exact same hardware was used in the GNU libm library, it would set "UNIX Platform FP hardware mode", and consequently produce results accurate to UNIX Platform requirements (whatever was decided - IEEE754, 0.5 ULP precision, etc.).
in this "more accurate" mode, the latency would be increased... *and we wouldn't care* [other implementors might], because it's not performance-critical: the switch is just to get "compliance".
that would allow us to remain price-performance-watt competitive with other GPUs, yet also meet UNIX Platform requirements.
something like that?
maybe a solution would be to add an extra field to the fp control csr (or isamux?) to allow selecting one of several accurate or fast modes:
- machine-learning-mode: fast as possible
  -- maybe need additional requirements such as monotonicity for atanh?
- GPU-mode: accurate to within a few ULP
  -- see Vulkan, OpenGL, and OpenCL specs for accuracy guidelines
- almost-accurate-mode: accurate to <1 ULP (would 0.51 or some other value be better?)
- fully-accurate-mode: correctly rounded in all cases
- maybe more modes?
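Purely for illustration, the mode list above as a C encoding (the field position and the values are hypothetical, nothing here is specified yet):

// hypothetical accuracy-mode field in the FP control CSR
enum fp_acc_mode {
    ACC_ML    = 0, // machine-learning mode: fast as possible
    ACC_GPU   = 1, // within a few ULP, per Vulkan/OpenGL/OpenCL
    ACC_NEAR  = 2, // almost-accurate: < 1 ULP
    ACC_EXACT = 3  // fully-accurate: correctly rounded
};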
> are you suggesting that implementors be permitted to *dynamically* alter the accuracy of the results that their hardware produces, in order to comply with *more than one* of the [four so far] proposed Platform Specs, *at runtime*?
> yes.
Ok. I like it. It's kinda only something that hybrid CPU/GPU combinations would want; however, the level of interest that Pixilica got at SIGGRAPH 2019 in their hybrid CPU/GPU concept says to me that this is on the right track.
Also a dynamic switch stops any fighting over whether one Platform Spec should get priority preference to the exclusion of others.
Will update the page shortly.
>
> also, having explicit mode bits allows emulating more accurate operations when the HW doesn't actually implement the extra gates needed.
Oh, yes, good point, however it would only be mandatory for UNIX* Platforms to provide such traps.
> This allows greater software portability (allows converting a libm call into a single instruction without requiring hw that implements the required accuracy).
and brings the associated performance penalties of doing so (extra conditional tests) if the trap isn't there: the conditional tests which substitute for a lack of a trap adversely impact performance for *both* modes.
>
> I do think that there should be an exact-rounding mode even if the UNIX platform doesn't require that much accuracy, otherwise, HPC implementations (or others who need exact rounding) will run into the same dilemma of needing more instruction encodings again.
Hmm hmm.... well, you know what? If it's behind a CSR Mode flag, and traps activate on unsupported modes, I see no reason why there should not be an extra accuracy mode.
L.
>
> One more part: hw can implement a less accurate mode as if a more accurate mode was selected, so, for example, hw can implement all modes using hw that produces correctly-rounded results (which will be the most accurate mode defined) and just ignore the mode field since correct-rounding is not less accurate than any of the defined modes.
Hmm, don't know. Hendrik pointed out the Amdahl / IBM 370 mainframe problem that extra accuracy caused.
I don't know if that lesson from history matters [in 2019].
No clue. Don't know enough to offer an opinion either way. Anyone any recommendations?
L.
>
> maybe a solution would be to add an extra field to the fp control csr (or isamux?) to allow selecting one of several accurate or fast modes:
No, definitely not ISAMUX/ISANS: its purpose is for switching (paging in) actual opcodes. (Not quite true: LE/BE kinda flips in LE variants of LD/ST.)
An FP CSR (dedicated or fields) makes more sense I think because it's quite a few bits, and I can see some potential value in the same bits being applied to F, G, H and Q as well.
Hmmm
Hendrik's example was that Amdahl hardware had correct (accurate) FP, where the IBM 370 did not.
Applications writers ran into problems when running on *more accurate* hardware. Amdahl had to patch the OS, with an associated performance penalty, to *downgrade* the FP accuracy and to emulate IBM's *inaccurate* hardware precisely.
What I do not know is whether there was something unique about the 370 mainframe and applications being written for it, or, if now in 2019, this is sufficiently well understood such that all FP applications writers have properly taken *better* accuracy (not worse accuracy: *better* accuracy) into consideration in the design of their programs.
Not knowing the answer to that question - not knowing if it is a risky proposition or not - tends to suggest to me that erring on the side of caution and *not* letting implementors provide more accuracy than the FP Accuracy CSR requests is the "safer" albeit more hardware-burdensome option.
Hence why I said I have no clue what the best answer is, here.
L.
> What I do not know is whether there was something unique about the 370 mainframe and applications being written for it, or, if now in 2019, this is sufficiently well understood such that all FP applications writers have properly taken *better* accuracy (not worse accuracy: *better* accuracy) into consideration in the design of their programs.
I *think* this is what Andrew might have been trying to get across.
L.
On Wed, Aug 7, 2019, 22:20 lkcl <luke.l...@gmail.com> wrote:
> this tends to suggest that three platform specs are needed:
> * Embedded Platform (where it's entirely up to the implementor, as there will be no interaction with public APIs)
> * UNIX Platform (which would require strict IEEE754 accuracy, for use in GNU libm, OR repeatable numericalist-acceptable accuracy)
> * a *NEW* 3D Platform, where accuracy is defined by strict conformance to a high-profile standard e.g. OpenCL / Vulkan.
That wouldn't quite work for our GPU design: since it's supposed to be both a GPU and a CPU that conforms to the UNIX Platform, it would need to meet the requirements of both the UNIX Platform and the 3D Platform, which would still end up with correct rounding being needed.
> maybe a solution would be to add an extra field to the fp control csr (or isamux?) to allow selecting one of several accurate or fast modes:
[quoting the existing spec's dynamic rounding-mode mechanism, which this would mirror:]
"Floating-point operations use either a static rounding mode encoded in the instruction, or a dynamic rounding mode held in frm. Rounding modes are encoded as shown in Table 11.1. A value of 111 in the instruction's rm field selects the dynamic rounding mode held in frm. If frm is set to an invalid value (101-111), any subsequent attempt to execute a floating-point operation with a dynamic rounding mode will raise an illegal instruction exception."
the unsupported modes would cause a trap to allow emulation where traps are supported. emulation of unsupported modes would be required for unix platforms.
there would be a mechanism for user mode code to detect which modes are emulated (csr? syscall?) (if the supervisor decides to make the emulation visible) that would allow user code to switch to faster software implementations if it chooses to.
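A sketch of the user-space fallback this enables; the query function and soft-float routine are hypothetical, invented purely to show the dispatch shape:

// hypothetical: nonzero if the mode is trap-emulated, not native
extern int fp_mode_is_emulated(int mode);
extern double soft_sin_correct(double x); // hypothetical soft-float routine

double sin_dispatch(double x)
{
    if (fp_mode_is_emulated(3 /* fully-accurate */))
        return soft_sin_correct(x); // skip the trap round-trip entirely
    return __builtin_sin(x);        // assumed to compile to the hw opcode
}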
> maybe a solution would be to add an extra field to the fp control csr (or isamux?) to allow selecting one of several accurate or fast modes:
Preface: As Andrew points out, any ISA proposal must be associated with a quantitative evaluation to consider tradeoffs.
A natural place for a standard reduced-accuracy extension "Zfpacc" would be in the reserved bits of FCSR.
L.
> But unless it becomes a ratified standard, the RV Compliance Framework and Test Suite just won't deal with it.
I see the logic behind that. It costs money and time.
> If the customer is big enough, and the market is big enough, maybe that custom extension becomes a standard - at which point it all works.
Indeed. The downside: that unfortunately would require a wing and a prayer that the opcode space doesn't get used up for other [official] purposes in between those two events.
It's an extremely risky approach, as the implementor will, sure as hellfire exists, not want their product (one that cost them tens to hundreds of millions to develop and market) to be "relegated" to nonstandard status by an *incompatible* post-silicon, after-sales Standardisation effort, in the [highly likely] event of an opcode clash.
You can guarantee they'll fight that one, to the detriment of the entire RISCV community [cf: the Altivec / SSE nightmare].
On another note: I think, *deep breath*, sad to say it, the RISCV Foundation looks at our team, operating from the outside, excluded from participation due to the Membership Agreement being an NDA, and considers us to be a bit of a joke.
Our input - including warnings - can therefore be "safely ignored", just like Open Source contributors ideas and input can and have been ignored, for many years, now.
Because we don't come with a billion dollar Corporate cheque book automatically attached, our input and perspective cannot possibly have any impact, because how can such unrealistically stupid and deluded idealistic time wasters possibly get the money to actually deliver silicon, right?
Things will get a lot easier when that perspective changes. I hope and trust that that change occurs before it is too late.
L.
Couple of things about that:
* In the [new] 3D Embedded Platform, speed and performance are nonessential: even accuracy is nonessential. Cost savings on SDKs, power, etc are the higher priority.
CORDIC, for these profiles, is perfect, because of the huge number of operations it covers.
* For the Libre RISCV SoC, chances are high that we will use it (at least for a first revision), and will do so by treating each iteration as a combinatorial block.
Several of those blocks will be chained together *per pipeline stage*, which will increase gate latency; that is perfectly acceptable, as the clock rate target is only 800 MHz (not 4 GHz).
This trick is one that we have deployed in the FPDIV/SQRT/RSQRT pipeline, using high radix stages as well, to get the pipeline length down even further in that instance.
Whether CORDIC algorithm enhancements exist that will allow us to do more than one bit at a time? Haven't looked yet.
Given that CORDIC is at heart just a simple add and compare, I really do not expect the chaining of multiple iterations as combinatorial blocks to have that big an adverse effect (not at an 800 MHz target).
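For reference, a sketch of the per-iteration work each combinatorial block would implement (a textbook fixed-point rotation-mode CORDIC step; the names are mine):

// one circular CORDIC iteration, fixed point; atan_tab[i] holds
// atan(2^-i) in the same fixed-point format
void cordic_step(int i, long *x, long *y, long *z, const long *atan_tab)
{
    long dx = *x >> i, dy = *y >> i;                        // shifts,
    if (*z >= 0) { *x -= dy; *y += dx; *z -= atan_tab[i]; } // adds and
    else         { *x += dy; *y -= dx; *z += atan_tab[i]; } // a compare
}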
With the mantissa being 23 bits, chaining three iterations per stage would easily get us a 9-10 stage pipeline for FP32; four per stage would give a 7-8 stage pipeline.
This is a really good return on implementation time investment for such a huge number of operations being covered by such a ridiculously simple and elegant algorithm.
L.
> Whether CORDIC algorithm enhancements exist that will allow us to do more than one bit at a time? Haven't looked yet.
Pffh. Yes. *lots*.
On Friday, August 9, 2019 at 1:39:53 PM UTC+8, andrew wrote:
> On Thu, Aug 8, 2019 at 8:12 PM lkcl <luke.l...@gmail.com> wrote:
> > On Thursday, August 8, 2019 at 10:14:58 PM UTC+8, andrew wrote:
> > > As I mentioned earlier, an instruction being useful is not by itself a justification for adding it to the ISA. Where's the quantitative case for these instructions?
> > 3D is a Billion dollar mass volume market, proprietary GPUs are going after mass volume, inherently excluding the needs of (profitable) long-tail markets.
> That's a business justification,
Yes it is. That's the start of the logic chain.
> not an architectural one, and in any case it's a justification for different instructions than the ones you've proposed.
Andrew: I appreciate that you're busy: so am I. If you could give a little bit more detail, by spending the time describing a way forward instead of putting up barriers where we have to guess how to work around them, that would save a lot of time.
For example, in order to move forward with a solution, I would expect such a statement above to include some sort of description or hint as to what the alternative instructions might look like.
Then those can be formally evaluated to see if they meet first the business justification, then if that passes, the time can be spent on architectural evaluation.
> The market you’re describing isn’t populated with products that have ISA-level support for correctly rounded transcendentals; they favor less accurate operations or architectural support for approximations and iterative refinement.
Iterative pipelined refinements, in order to meet timing critical needs of [some] of the business requirements, yes.
Andrew: the proprietary vendors are "custom" profiles. Their design approach has to be excluded from consideration as something to follow.
The proprietary vendors typically have complete custom architectures. Custom ISAs. Even NVIDIA's new architecture fits the *custom* RISCV profile.
This has a software development penalty due to the need to "marshall" and "unmarshall" the OpenGL / OpenCL / Vulkan API parameters on the userspace (x86/ARM/MIPS) side, stream them over to the GPU's memory space (typically over a PCIe PHY), and unpack them.
The response to the API call goes through the exact same insane process. This so that, in the case of eg MALI, PowerVR, Vivante etc they can sell product independently of the main CPU ISA.
We are proposing something that runs DIRECTLY ON THE PROCESSOR. I apologise, I know you don't like capitals; I have written the above about eight or nine times now over the past year, and it's starting to get irritating that its significance is being ignored.
Hybrid CPU / GPUs such as that proposed by Pixilica (and independently by our team) are much simpler to implement, make debugging far easier, and save hugely on development time.
Hybrid CPU/GPU/VPUs therefore *need* - at the architectural level - a close tie-in to the host ISA. That's just how it has to be.
And as high-profile "open" products, the compiler tools also need full upstream support for them.
This is why they simply cannot and will not work as "custom" extensions. I have repeated this literally dozens of times for over 18 months. Eventually it will sink in.
L.
> Andrew: I appreciate that you're busy
Good point - I capitulate.
Okaay, thank you for explaining.
> If there is a measurable and significant improvement on some large
> body of code, such as SPEC for example, then that would be grounds for
> considering inclusion in a RISC-V Foundation standard extension.
It took Jeff Bush on Nyuzi about... 2 years to get to the point of being able to do that level of assessment.
We're a small team, only 2 full time engineers, total sponsorship: EUR 50,000.
The RISCV Foundation receives.. what... several million investment and membership fees?
So as a Libre Project, and as GPUs is a market that we know there is demand for... you get the general idea.
In the meantime I will reach out to my contact and explain the situation to him. He will then be in a better position to explain to new Alliance Members what's needed.
> If it improves only some narrow specialized task then that might
> justify a custom extension.
No, depending on how much silicon area is dedicated to FP, it'll be somewhere around one or two orders of magnitude performance improvement in both OpenCL and 3D.
For 3D embedded, the collaboration itself (that the opcodes exist at all and have software support) is the key benefit. Performance metrics in the 3D embedded space are actually completely and utterly misleading.
> But you haven't even shown that, other
> than "hardware good -- software bad". Is it even measurable, even on
> *your* workload?
Remember, there are several different platforms. They all have different requirements. Driving the entire proposal from a performance or "quantitative" perspective is detrimental and misleading, and misses the point.
> We sure don't know the answer.
It'll be about... estimated... six to eight months before we have RTL that can run anything.
In the meantime, *for those platforms that desire performance*, simple assessments of what libm currently does in s/w, compared against replacing that with *single cycle* hardware opcodes, give a clear idea of the performance gains.
Also: 3D requires *guaranteed* real time response times. Iterative blocking algorithms are absolutely unacceptable as they break the guaranteed pixel frame rate requirements.
For example, we have to do FPDIV as a pipeline: iterative N.R. is a no, and there is one FPDIV per pixel in Normalisation (and one RSQRT).
Jeff Bush's paper, nyuzi_pass2016, is also a good reference. It shows what happens in 3D if you *don't* have the right primitives.
Nyuzi, like Larrabee, is fantastic as a Vector Compute Engine. As a 3D GPU, it is a paltry 25% of modern GPU performance for the same silicon area / power budget.
The Larrabee team were not permitted to reveal that little fact in their original paper. whoops :)
> Potential market size is irrelevant The most it does is provide
> justification for doing the quantitative performance evaluation in the
> first place.
Funding. We're doing this project from charitable donations, from NLNet. Find the funding, we can do the evaluation.
Otherwise, someone else has to do it. We may be able to find someone, through our contacts, but there's definitely no budget available for our small team to do six to eight months of research, here.
Sorry.
Ok, so thank you for clarifying, Bruce: I'll ask around, see how this can be solved.
L.
On Thu, Aug 8, 2019 at 9:16 PM lkcl <luke.l...@gmail.com> wrote:
> Hmmmm, if speed or power consumption of an implementation is compromised by that, it would be bad (and also Khronos nonconformant, see below).
I know that for Vulkan/OpenGL compliance, it's based on whether it produces the correct results, not how long it takes to compute them.
> So i'm really sorry, but a "quantitative" analysis of a near-exact replica of the Khronos OpenCL opcodes is misleading, pointless, and a hopeless waste of everybody's time.
By contrast, one of our sponsors had a user ask, "hey POSITs are up and coming, and have greater accuracy. how about including them?"
I had to sheepishly and diplomatically explain to them that adoption of POSITs in hardware is just the start. With no Industry-standard adoption, as coalesced in the Khronos Vulkan API, it would be five man-years of effort, at the end of which *there would still be no user adoption*.
The reason is simple: shader engines are already compiled into SPIRV that meet the Vulkan Specification.
If POSITs are not part of that spec, the hardware don't get used, period.
Now, did you see, anywhere in this thread, any mention of a proposal to include POSITs? Answer: no, because (being undiplomatic) it's a waste of time.
I did not waste anyone's time even mentioning it because we *already did the analysis*.
By contrast, early in the thread, we began to do an analysis of which operations could be implemented in terms of others. ATAN2 can be used to implement ATAN by setting the second argument to 1.0 (ATAN2( y, 1.0 ) = ATAN( y )), and LOGP1 can be used to implement LOG, giving better accuracy anyway.
Hypotenuse likewise was kindly pointed out by Mitch as being very straightforward to implement in terms of ln and sqrt; however, for really high-performance applications the extra clock cycles may not be acceptable, so there is a Zhypot extension that lists them.
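A quick C illustration of those reductions (math.h only; the naive hypot is the textbook composition a Zhypot opcode would replace, with overflow-safe scaling omitted):

#include <math.h>
// atan recovered from atan2 by pinning the second operand to 1.0
double atan_via_atan2(double y) { return atan2(y, 1.0); }
// textbook hypotenuse
double hypot_naive(double x, double y) { return sqrt(x*x + y*y); }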
These decisions are the kinds that simply apply standard RISC ISA development rules, and will need careful review before finalising.
The actual set of opcodes themselves, there is simply no doubt at all that they are easily justifiable, as the level of adoption worldwide as Industry Standards as defined by the Khronos Group *unquestionably* proves their worth.
Now, unfortunately, because of the "giving up" wording, I have to use specific language here, in case the conversation ends. "If we do not hear otherwise, we will ASSUME that the above is reasonable, and that the approach being taken IN GOOD FAITH is acceptable for a Z RISC-V standard".
Apologies, I have to say that. It leaves the ball in the RISCV Foundation's court, should there be no further response, leading to prejudicial DELAY in our progress. You will have seen the forbes article by now, so will understand why it is becomes necessary to use this wording.
L.
> Compiling for a specific platform means that support for required accuracy modes is guaranteed (and therefore does not need discovery sequences), while allowing portable code to execute discovery sequences to detect support for alternative accuracy modes.
The latter will be essential for detecting the "fast_*" capability.
Main point, I cannot emphasise enough how critical it is that userspace software get at the underlying hardware characteristics. This for Khronos Standards Compliance.
Sensible proposal, Dan. Will write it up shortly.
Well, one way to get at the underlying HW characteristics is an ECALL to a higher priv mode; that will take a little longer, but I don't see why it would be used more than once during the run of an application, so that should be OK.
But a similar issue has come up in several forums, and I've proposed a standard discovery mechanism that will enable all kinds discovery using a CSR interface.
Writing the CSR initializes a pointer into a discovery data structure, and reading it will return the contents of the structure at that offset (with optional autoincrement).
The data structure is a linked list of capabilities (ID, offset of next capability), where the actual capability is implementation dependent (on the ID). So a config-string ID, debug capability ID, and ISA capability ID can be independently defined to be YAML, binary or ASCII formats, as desired.
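A sketch of how user code might walk that structure; the accessor and the sentinel values are invented here purely for illustration:

#include <stdint.h>

// hypothetical: one CSR write to set the offset, one read to fetch
extern uint32_t discovery_read(uint32_t offset);

// walk the linked capability list; return the offset of wanted_id
uint32_t find_capability(uint32_t wanted_id)
{
    uint32_t off = 0;
    for (;;) {
        uint32_t id   = discovery_read(off);     // capability ID
        uint32_t next = discovery_read(off + 1); // next-entry offset
        if (id == wanted_id) return off;
        if (next == 0) return UINT32_MAX;        // end of list
        off = next;
    }
}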
On Fri, Aug 9, 2019 at 1:46 AM lkcl <luke.l...@gmail.com> wrote:
> > If there is a measurable and significant improvement on some large
> > body of code, such as SPEC for example, then that would be grounds for
> > considering inclusion in a RISC-V Foundation standard extension.
> It took Jeff Bush on Nyuzi about... 2 years to get to the point of being able to do that level of assessment.
OK - but what's your point? Or, rather, why do you expect that you can or should be able to skip 2 years of work?
Or, why should you expect anyone to make a major adoption of a standard that might turn out to be fatally flawed because it was rushed to ratification without the requisite homework?
If you want to create a standard (make no mistake, that's what you're doing) that will be widely adopted, there is a lot of heavy lifting that can't be swept under the rug.
I'm sorry to say that your team may have the resources to design something pretty nifty - but not big enough to handle the other part of that, which is to demonstrate it is the right nifty thing that others will adopt (at the expense of adopting someone else's nifty standard).
You do have an advantage - SW developers will prefer an open source solution - but not at the expense of a flawed open source solution, and it is unfortunately up to you to show that it isn't flawed (to be clear: I'm not saying it is flawed, just that it needs evidence to back it up).
Atif Zafar asked me to post the following observations:
A couple of observations from reading the past set of emails:
So I agree with you that we should look at SPIR-V and the Vulkan ISA seriously. Now that ISA is very complex and many of the instructions may possibly be reconstructed from simpler ones. We need to thus perhaps look at a "minimized" subset of the Vulkan ISA instructions that truly define the atomic operations from which the full ISA can be constructed. So the instruction decode hardware can implement this "higher-level ISA" - perhaps in microcode - from the "atomic ISA" at runtime, while hardware support is only provided for the "atomic ISA".
From the SIGGRAPH BOF it was clear there are competing interests. Some people wanted explicit texture mapping instructions while others wanted HPC type threaded vector extensions. Although each of these can be accommodated, we need to adjudicate the location in the process pipeline where they belong - atomic ISA, higher-level ISA or higher-level graphics library.
Excellent - that saves on development effort.
> This reduces the pressure on the opcode space by a lot.
> > > From the SIGGRAPH BOF it was clear there are competing interests. Some people
> > > wanted explicit texture mapping instructions while others wanted HPC type
> > > threaded vector extensions.
> > interesting.
> Having texture mapping instructions in HW is a really good idea for
> traditional 3D, since Vulkan allows the texture mode to be dynamically
> selected; trying to implement it in software would require having the
> texture operations use dynamic dispatch, with maybe static checks for
> recently used modes to reduce the latency. See VkSampler's docs.
The one opcode that Mitch pointed out would be useful is the texture LD and interpolate. A 2D reference into a texture map plus an x scale and y scale.
> So, you overestimated the number of immediate bits needed by quite a lot.
I assumed an FMAC as worst case, plus 2 bits per element selector (X/Y/Z/W) to simplify decode: 4 operands, each a 4-vector with an 8-bit swizzle covering all 4 vector elements, is 32 bits.
Either way it is a mad number of extra immediate bits, meaning opcodes must be 48-bit, 64-bit or even greater.
We can solve that with a swizzle MV.
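To make the bit count concrete, a sketch of packing those selectors (the layout is my own invention, purely illustrative):

// 2 bits per element (X=0,Y=1,Z=2,W=3), 4 elements per operand = 8 bits;
// 4 FMAC operands = 32 immediate bits in total
unsigned pack_swizzle(const unsigned sel[4][4])
{
    unsigned imm = 0;
    for (int op = 0; op < 4; op++)
        for (int e = 0; e < 4; e++)
            imm |= (sel[op][e] & 3u) << (op * 8 + e * 2);
    return imm;
}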
L.
> Jacob
double ATAN2( double y, double x )
{ // IEEE 754-2008 quality ATAN2
// deal with NANs
if( ISNAN( x ) ) return x;
if( ISNAN( y ) ) return y;
// deal with infinities
if( x == +∞ && |y|== +∞ ) return copysign( π/4, y );
if( x == +∞ ) return copysign( 0.0, y );
if( x == -∞ && |y|== +∞ ) return copysign( 3π/4, y );
if( x == -∞ ) return copysign( π, y );
if( |y|== +∞ ) return copysign( π/2, y );
// deal with signed zeros
if( x == 0.0 && y != 0.0 ) return copysign( π/2, y );
if( x >=+0.0 && y == 0.0 ) return copysign( 0.0, y );
if( x <=-0.0 && y == 0.0 ) return copysign( π, y );
// calculate ATAN2 textbook style (copysign restores the sign of y,
// which the absolute value |y / x| would otherwise discard)
if( x > 0.0 ) return copysign( ATAN( |y / x| ), y );
if( x < 0.0 ) return copysign( π - ATAN( |y / x| ), y );
}
x ∈ [ -∞, -1.0 ]:: ATAN( x ) = -π/2 - ATAN( 1/x );
x ∈ ( -1.0, +1.0 ]:: ATAN( x ) = + ATAN( x );
x ∈ [ +1.0, +∞ ]:: ATAN( x ) = +π/2 - ATAN( 1/x );
I should point out that the add/sub of π/2 cannot lose significance, since the result of ATAN( 1/x ) is bounded 0..π/2.
The bottom line is that I think you are choosing to make too many of these into OpCodes, making the hardware
function/calculation unit (and sequencer) more complicated than necessary.
----------------------------------------------------------------------------------------------------------------------------------------------------
I might suggest that if there were a way for a calculation to be performed and the result of that calculation
chained to a subsequent calculation such that the precision of the result-becomes-operand is wider than
what will fit in a register, then you can dramatically reduce the count of instructions in this category while retaining
acceptable accuracy:
z = x / y
can be calculated as::
z = x × (1/y)
Where 1/y has about 26-to-32 bits of fraction. No, it's not IEEE 754-2008 accurate, but GPUs want speed and
1/y is fully pipelined (F32) while x/y cannot be (at reasonable area). It is also not "that inaccurate" displaying
0.625-to-0.52 ULP.
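In C the scheme looks as follows; the plain 1/y here stands in for a pipelined hardware reciprocal estimate with 26-to-32 fraction bits:

// divide via reciprocal multiply: fully pipelineable, roughly
// half-ULP-class error rather than correctly rounded
float div_via_recip(float x, float y)
{
    float r = 1.0f / y; // a hw frecip estimate in practice
    return x * r;
}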
Given that one has the ability to carry (and process) more fraction bits, one can then do high precision
multiplies of π or other transcendental radixes.
And GPUs have been doing this almost since the dawn of 3D.
The alternative is to designate a few OpCodes in a sequence as a single result producer, with the intermediate result kept larger than register width and fed back to the in-sequent instruction (preserving accuracy.)
If you include Tessellation and Geometry, both of which can generate a volcano of new primitives, there are significant performance gains (more than 2×) to be had by doing the above in HW function units.
data formats: https://www.khronos.org/registry/DataFormat/specs/1.1/dataformat.1.1.pdf
the only thing is, fitting a third context (beyond the vector table format and the predicate table format) will need a redesign, or to go into the 192+ bit RV opcode format.
// calculate ATAN2 high performance style
// Note: at this point x and y are finite, nonzero, and |x| != |y|;
// signs chosen so results fall in the standard atan2 range (-π, +π]
//
if( x > 0.0 )
{
if( y < 0.0 && |y| < |x| ) return - π/2 - ATAN( x / y );
if( y < 0.0 && |y| > |x| ) return + ATAN( y / x );
if( y > 0.0 && |y| < |x| ) return + ATAN( y / x );
if( y > 0.0 && |y| > |x| ) return + π/2 - ATAN( x / y );
}
if( x < 0.0 )
{
if( y < 0.0 && |y| < |x| ) return - π + ATAN( y / x );
if( y < 0.0 && |y| > |x| ) return - π/2 - ATAN( x / y );
if( y > 0.0 && |y| < |x| ) return + π + ATAN( y / x );
if( y > 0.0 && |y| > |x| ) return + π/2 - ATAN( x / y );
}
This way the adds and subtracts of the constant are not in a precision-precarious position.
Jacob Lifshay
Thx Mitch, have made a note so it's not lost in list noise.
Also appreciate the microcode suggestions, have to give that some serious thought, whether to do something truly microcode-like or whether to do just a mini SRAM that contains subroutines that reprogrammable RISCV opcodes can use.
Much lower level may prove more useful: microcode operations to do FP pre-normalisation, post-normalisation etc., all at expanded bit-widths.
L.
> Also appreciate the microcode suggestions, have to give that some serious thought, whether to do something truly microcode-like or whether to do just a mini SRAM that contains subroutines that reprogrammable RISCV opcodes can use.
Think of microcode as if it were a sequencer, just express that sequencer in such a way that the synthesizer can implement it with a table, with just gates, or with some kind of ROM. Then you don't have to decide if it is microcoded (or not) the synthesizer carries the load.
On Sunday, August 11, 2019 at 10:20:28 PM UTC-5, lkcl wrote:
I would like to point out that the general implementations of ATAN2 do a bunch of special-case checks and then simply call ATAN.
> if atan is going back in, it would be a good idea to include atanpi.
>
Done.
> We may want to put frecip back in as well, for those who implement division as multiply by reciprocal.
Khronos OpenCL spec has native_recip and half_recip. On the basis that both Mitch and the Khronos Group probably know what they're doing, I'd agree.
All these would fall under the Zfpacc FCSR field. Must post that in a new thread.
L.
Transcendental opcodes in ARM's MALI ISA include
E8 - fatan_pt2
F0 - frcp (reciprocal)
F2 - frsqrt (inverse square root, 1/sqrt(x))
F3 - fsqrt (square root)
F4 - fexp2 (2^x)
F5 - flog2
F6 - fsin
F7 - fcos
F9 - fatan_pt1
pt1 stands for "pi times 1" and pt2 should be obvious.
NVIDIA CUDA transcendentals
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#mathematical-functions-appendix
Further internal documentation is hard to find.
AMD VEGA
http://developer.amd.com/wordpress/media/2017/08/Vega_Shader_ISA_28July2017.pdf
sin, cos, exp, log, rcp, rsqrt, all in FP16/32/64.
Interestingly, no tan, atan, arc variants, or hypot.
If these are standard opcodes in commercial GPUs, requesting specific and individual quantitative analysis is puzzling in the extreme. Their inclusion is so obviously critical for commercial success in the field of 3D and HPC that it is the equivalent of asking for quantitative analysis of "integer add" or "mul" for a DSP.
L.